The effect of uncertainty in look-alike modeling (KNN): Examples from Amazon and in predicting well performance

Many methods and models from the broader data science community can be implemented in a range of industries and use cases – including oil and gas. One technique that gets used in a lot of places is look-alike modeling. We’ve experienced it on platforms like Netflix, Amazon, and Facebook whenever recommendations are made. While Amazon doesn’t know all the things you like, it knows what you purchased in the past and has information about what other people have bought. Amazon can use that information to make some inferences about you.

Look-alike modeling has a very strong parallel in the oil and gas industry. We often want to infer something about a well’s expected performance based on older analog wells. Historical performance of offset wells can be used to model different landing zones or different completion design parameters. That’s effectively look-alike modeling: “I have something, and I’m going to make some judgments, some predictions, based on things that look very similar.”

What is look-alike modeling? How is it actually done?

One of the methods typically used is a simple algorithm called K-nearest neighbors or KNN. The idea is to take a point you’re interested in and find the neighbors that are closest to it. That doesn’t necessarily mean in geographic space. You may be interested in attributes, like age, hobbies, or various other characteristics (e.g. frac intensity or rock porosity). However, whether looking at people’s preferences or the characteristics of a reservoir, we need to remember that there are uncertainties in these parameters. Our ability to estimate these features is not necessarily guaranteed.

Amazon Example: Fitness interest and budget

Suppose we have a number of existing Amazon users – we know what they’ve bought in the past. We have a new user and want to make recommendations. Which existing users should serve as the basis for our recommendations? There are many attributes we could choose to group users, but in this example, we’ll look at their interest in fitness and their monthly spend on Amazon.

  • Fitness: 0 (least interested in fitness) to 1 (most interested in fitness)
  • Monthly spend on Amazon: 0 (buy nothing) to 1 (spend $1000+/month)
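As a small illustration of the second scale, here is one way to map a dollar amount onto 0 to 1. The linear mapping and the function name `scale_monthly_spend` are assumptions for this sketch; the post only fixes the two endpoints.

```python
def scale_monthly_spend(dollars, cap=1000.0):
    """Map monthly spend in dollars onto 0..1.

    Assumes a linear scale: $0 maps to 0, and anything at or above
    the cap ($1000 here) maps to 1.
    """
    clipped = min(max(dollars, 0.0), cap)
    return clipped / cap


quarter = scale_monthly_spend(250)    # a user spending $250/month
big_spender = scale_monthly_spend(5000)  # capped at 1
```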

Here’s our plot of users. Every dot is a user and the big dot is our new user.

Let’s start by assuming that our measurements of the users’ interest in fitness and monthly spend are exactly right. In this case, we just look for the users who are as close to our new user as possible. Thus, if we wanted to take a sample of five existing users, we would pick the five green dots.
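The exact-measurement case can be sketched in a few lines of Python. The user coordinates below are made up for illustration, and plain Euclidean distance is an assumption; the post does not say which distance measure was used.

```python
import math


def k_nearest(new_point, points, k=5):
    """Return the indices of the k points closest to new_point.

    Uses straight-line (Euclidean) distance in attribute space.
    """
    ranked = sorted(range(len(points)),
                    key=lambda i: math.dist(new_point, points[i]))
    return ranked[:k]


# Hypothetical users as (fitness interest, monthly spend), both on a 0 to 1 scale.
users = [(0.10, 0.20), (0.45, 0.50), (0.50, 0.56),
         (0.90, 0.80), (0.52, 0.48), (0.30, 0.70)]
new_user = (0.50, 0.50)

nearest = k_nearest(new_user, users, k=3)
```

Here `nearest` holds the positions of the three users whose attributes sit closest to the new user, nearest first.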

Accounting for Measurement Error

If we’re really trying to make predictions, we should acknowledge that there may be some limitation to our ability to predict. In the case of our fitness example, let’s say our measurements of fitness and monthly budget aren’t 100% accurate.

  • Let’s assume we haven’t sold much fitness equipment yet. In that case, we’re probably systematically underestimating everyone’s fitness interest… asymmetric error (Figure 1)
  • The actual monthly budget could be a bit over or a bit under… symmetric error (Figure 2)

To account for these measurement errors, we must adjust our KNN algorithm. Whether an error is asymmetric or symmetric changes how our predictions behave. Our previous KNN algorithm assumed each point was exact. We can now treat each point as fuzzy: we’ll randomly pull a sample error from each of the distributions and use it to adjust the input to our KNN.

Figure 1: Fitness error is asymmetric as shown above (mean is shifted by 0.05 with a standard deviation of 0.05).

Figure 2: Budget error is symmetric and so distributed around zero (mean is at zero, standard deviation is 0.0025).

We can run our modified nearest neighbors algorithm with our adjustment, and we see that the K-nearest users are different. We haven’t shifted the location of the points, but have incorporated the measurement uncertainty into selecting the nearest neighbors.
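A minimal sketch of that adjustment follows. It makes several assumptions that go beyond the text: both errors are drawn from normal distributions using the means and standard deviations quoted in Figures 1 and 2, only the existing users are perturbed (the new user stays fixed), the draw is repeated many times, and the users selected most often are kept. The user coordinates and the name `fuzzy_k_nearest` are hypothetical.

```python
import math
import random
from collections import Counter

FITNESS_ERR = (0.05, 0.05)   # (mean, std dev) quoted in Figure 1
BUDGET_ERR = (0.0, 0.0025)   # (mean, std dev) quoted in Figure 2


def perturb(point, rng):
    """Add one sampled error to each attribute of a user."""
    fitness, budget = point
    return (fitness + rng.gauss(*FITNESS_ERR),
            budget + rng.gauss(*BUDGET_ERR))


def fuzzy_k_nearest(new_point, points, k=5, n_draws=1000, seed=0):
    """Return the k users picked most often across repeated noisy draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter()
    for _ in range(n_draws):
        noisy = [perturb(p, rng) for p in points]
        ranked = sorted(range(len(points)),
                        key=lambda i: math.dist(new_point, noisy[i]))
        counts.update(ranked[:k])
    return [i for i, _ in counts.most_common(k)]


# Hypothetical users as (fitness interest, monthly spend), both on a 0 to 1 scale.
users = [(0.10, 0.20), (0.45, 0.50), (0.50, 0.56),
         (0.90, 0.80), (0.52, 0.48), (0.30, 0.70)]
new_user = (0.50, 0.50)

result = fuzzy_k_nearest(new_user, users, k=3)
```

The stored coordinates in `users` are never changed; the noise only affects which neighbors get selected on each draw.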

Analog Well Identification

For a new or potential well, we can build a predictive model of future production based on similar wells. The model is very similar to the Amazon example. There are many attributes we could use as the basis for identifying analog wells. In this example, we’ll take porosity and pyrolysis as two features we can use to evaluate the potential of a particular well.

Amazon (~310 million active customers) and Netflix (~150 million accounts) have so many data points they can nail down their prediction algorithms. The world is different if we’re dealing with sparse data, as we do in oil and gas. Although a single well can generate terabytes of data over its lifetime, each well still represents only one data point, one unique combination of attributes that can be used as an input. With approximately 250,000 unconventional wells in the US, or even ~950,000 total wells, we have a much smaller pool from which to draw neighbors. We can’t simply let the data speak for itself, because there are biases. To account for uncertainty in these attributes, we will use random sampling from a normal error distribution on both porosity and pyrolysis.

As the figures above show, if we don’t apply an error adjustment, we end up selecting certain wells, but if we apply an error adjustment, we select very different wells. In an environment in which the data is sparse and the cost of each decision or prediction is high, taking measurement error into account is very important.
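One way to see how sensitive the analog set is to measurement error is to count how often each well is selected across many noisy draws. Everything numeric in the sketch below is hypothetical: the well values, the error standard deviations in `ERR_SD`, and the `SCALE` factors used to put porosity and pyrolysis on comparable footing. Only the use of normal errors on both features comes from the text.

```python
import math
import random

# Hypothetical wells: (porosity as a fraction, pyrolysis in arbitrary units).
wells = {
    "A": (0.060, 4.0),
    "B": (0.080, 6.5),
    "C": (0.082, 6.0),
    "D": (0.075, 7.0),
    "E": (0.110, 9.0),
}
candidate = (0.080, 6.4)   # the new or potential well

ERR_SD = (0.005, 0.5)  # assumed measurement std dev for each feature
SCALE = (0.02, 2.0)    # assumed typical spread, so neither feature dominates


def scaled_dist(a, b):
    """Euclidean distance after dividing each feature by its scale."""
    return math.hypot(*((x - y) / s for x, y, s in zip(a, b, SCALE)))


def selection_frequency(candidate, wells, k=2, n_draws=2000, seed=1):
    """Fraction of noisy draws in which each well lands among the k nearest."""
    rng = random.Random(seed)
    hits = dict.fromkeys(wells, 0)
    for _ in range(n_draws):
        # Zero-mean normal error on both porosity and pyrolysis.
        noisy = {name: tuple(rng.gauss(v, sd) for v, sd in zip(vals, ERR_SD))
                 for name, vals in wells.items()}
        ranked = sorted(noisy, key=lambda n: scaled_dist(candidate, noisy[n]))
        for name in ranked[:k]:
            hits[name] += 1
    return {name: count / n_draws for name, count in hits.items()}


freq = selection_frequency(candidate, wells)
```

A well with a frequency near 1 is a stable analog. Wells with middling frequencies are the ones where measurement error changes the answer, which is where the extra scrutiny belongs.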

This isn’t just about showing somebody the next bestseller or blockbuster they might enjoy. We’re going to invest $10 million, and we want to use the model to help. With so much riding on this decision, it’s worth taking time to account for uncertainty so we incorporate the right features when we’re doing the modeling. This yields richer models that bring statistical rigor to the data, so the decisions we make are better informed and robust regardless of the tools we’re using.

Key takeaways:

  1. Accounting for uncertainty is important – particularly when data is sparse, and predictions are high-value.
  2. Sampling/iteration is a standard technique for incorporating uncertainty in predictions.
  3. Accounting for measurement uncertainty hasn’t yet become common in O&G, but we should be taking it into account.

In case you’re wondering how this analysis and these plots were generated, don’t worry – I’ll soon be posting a Jupyter Notebook with all the code.
