Category: Data Science

The Art of Data Simulation

Most problems in the scientific world are about understanding different phenomena. We want to learn the characteristics and patterns of the systems we study so that we can anticipate and predict their behavior. As humans, we learn by observing these processes when they happen naturally or through controlled experiments. That might not be an option if we are studying a rare or dangerous event.
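As a toy illustration of why simulation helps, here is a small Monte Carlo sketch in R; the 0.1% event probability and 500 trials are invented numbers, not from the post:

    # Estimate the chance that a rare event (p = 0.001) occurs at least
    # once in 500 independent trials, without waiting to observe it.
    set.seed(1)
    hits <- replicate(10000, any(runif(500) < 0.001))
    mean(hits)               # simulated estimate
    1 - (1 - 0.001)^500      # exact answer for comparison, ~0.39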

Read More

Spatial Objects Using TERR

A key part of analytics in the oil and gas industry is evaluating opportunities at different locations. Space is always a factor when looking for profitable development projects. We usually look at wells already in production and try to find spatial trends. To stay competitive, we need better ways to access the data for different areas and their wells. For instance, we can transform the spatial information into compact objects that store the location and shape of each well and lease. These objects can be fed into different calculations and analyses as geometries. In Spotfire, this approach has another advantage: you can use the feature layers of the map chart. In this case, we can visualize the leases as polygons and the wells as lines.
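As a rough sketch of what such objects can look like, the base-R (TERR-compatible) snippet below encodes a well path as a WKT LINESTRING and a lease outline as a WKT POLYGON; the coordinates are invented, and WKT is used here as a common interchange format for illustration (Spotfire feature layers typically consume geometries as binary columns):

    # Encode a well path and a lease outline as WKT geometry strings.
    well_xy  <- data.frame(x = c(-101.90, -101.88, -101.85),
                           y = c(31.95, 31.95, 31.96))
    lease_xy <- data.frame(x = c(-101.92, -101.84, -101.84, -101.92, -101.92),
                           y = c(31.93, 31.93, 31.98, 31.98, 31.93))
    well_wkt  <- sprintf("LINESTRING (%s)",
                         paste(well_xy$x, well_xy$y, collapse = ", "))
    lease_wkt <- sprintf("POLYGON ((%s))",
                         paste(lease_xy$x, lease_xy$y, collapse = ", "))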

Read More

NFL: Predicting 2018 Win Totals with Data Science

With the Super Bowl just behind us, it’s time to predict wins for the 2018 NFL season.  At the start of the playoffs, we looked at a model that predicted how many games NFL teams should have won in 2017 and compared our results to Football Outsiders’ Pythagorean Win Expectancy.  We were able to improve on Pythagorean Win Expectancy for last year’s results, aka how many games a team should have won, but our backwards-looking models were unable to beat Pythagorean Win Expectancy in predicting next year’s wins.  Today, we will build some models trying specifically to predict how many games teams will win next year.

If you simply want to know how many games your team will win in 2018, strictly for recreational purposes of course, you can skim to the end or check out our Spotfire Template.  But, for Football Outsiders fans, those interested in what makes up wins and losses, or those interested in the Data Science process, read on.

Read More

Jason is a Junior Data Scientist at Ruths.ai with a Master’s degree in Predictive Analytics and Data Science from Northwestern University. He has experience with a multitude of machine learning techniques such as Random Forest, Neural Nets, and Hidden Markov Models. With a previous Master’s in Creative Writing, Jason is a fervent believer in the Oxford comma.

How many games should your NFL team have won this season?

How many games should your NFL team have won this season?  Everyone knows a lucky bounce here and a bad call there can have a significant impact on the win-loss bottom line.  Hardcore fans of sports analytics will recognize this factor as the driver behind Pythagorean Win Totals, a statistic derived to measure true performance.  Today, we are going to see if we can beat Pythagorean Win Totals as a predictor of how many games a team won in a given season, i.e., how many games your team should have won.
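For reference, here is a minimal R sketch of the Pythagorean calculation; the team's points and the NFL exponent of 2.37 (the value commonly attributed to Football Outsiders) are assumptions for illustration:

    # Pythagorean win expectancy from points scored (pf) and allowed (pa).
    pf <- 458; pa <- 371; games <- 16
    k  <- 2.37                                 # commonly cited NFL exponent
    pyth_wins <- games * pf^k / (pf^k + pa^k)
    pyth_wins                                  # ~10 expected wins here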

Spoiler:  we can make a better predictor, but in a way that makes us re-evaluate our understanding of Pythagorean Win Totals.

If you simply want to know how many games your team should have won, you can go straight to our Spotfire Template.  But, for Football Outsiders fans or those more interested in what makes up wins and losses, read on.

Read More

Jason is a Junior Data Scientist at Ruths.ai with a Master’s degree in Predictive Analytics and Data Science from Northwestern University. He has experience with a multitude of machine learning techniques such as Random Forest, Neural Nets, and Hidden Markov Models. With a previous Master’s in Creative Writing, Jason is a fervent believer in the Oxford comma.

Wrangling Data Science in Oil & Gas: Merging MongoDB and Spotfire

Data science in Oil and Gas is taking center stage as operators work in the new “lower for longer” price environment. Want to see what happens when you solve data science questions with the hottest new database and the powerful analytics of Spotfire? Read on to learn about our latest analytics module, the DCA Wrangler. If you want to see it in action, scroll down to watch the video.

Layering Data Science on General Purpose Data & Analytics

Ruths.ai is a startup focused on energy analytics and technical data science. We are both TIBCO and MongoDB partners, heavily leveraging these two platforms to solve real-world problems revolving around the application of data science at scale and within the enterprise environment. I started our plucky outfit a little under four years ago. We’ve done a lot of neat things with Spotfire, including analyzing seismic and well log data. Here, we’ll look at competitor/production data.

The Document model allows for flexible and powerful encoding of decline curve models.

MongoDB provides a powerful and scalable general-purpose database system. TIBCO provides tested and forward-thinking general-purpose analytics platforms for both streaming data and data at rest. They also provide great infrastructure products, which aren’t the focus of this blog.

Ruths.ai provides the domain knowledge: we infuse our proprietary algorithms and data structures for solving common analytics problems into products that leverage the TIBCO and MongoDB platforms.

We believe that these two platforms can be combined to solve innumerable problems in the technical industries represented by our readers. TIBCO provides the analytics and visualization while MongoDB provides the database. This is a powerful marriage for problems involving analytics, single view, or IoT.

In this blog, I want to dig into a specific and fundamental problem within oil and gas and how we leveraged TIBCO Spotfire and MongoDB to solve it — namely Autocasting.

What is Autocasting?

Oil reserves denote the amount of crude oil that can be technically recovered at a cost that is financially feasible at the present price of oil. Crude oil resides deep underground and must be extracted using wells and completion techniques. Horizontal wells can stretch two miles within a vertical window the height of most office floors.

For those with E&P experience, I’m going to elide some important details, like using “oil” for “hydrocarbons” and other technical nomenclature.

Because the geology of the subsurface cannot be examined directly, indirect techniques must be used to estimate the size and recoverability of the resource. One important indirect technique is called decline curve analysis (DCA): a mathematical model that we fit to historical production data to forecast reserves. DCA is so prevalent in oil and gas that we use it for auditing, booking, competitor analysis, workover screening, company growth, and many other important tasks. With the rise of analytics, it has therefore become a central piece in any multivariate workflow looking to find the key drivers of well and resource performance.
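For the curious, the workhorse here is the Arps family of models; below is a minimal sketch, on synthetic monthly data, of fitting the hyperbolic form q(t) = qi / (1 + b*Di*t)^(1/b) with base R's nls (the parameter values are invented):

    # Fit a hyperbolic Arps decline to synthetic production data.
    t <- 0:35                                    # months on production
    set.seed(7)
    q <- 1000 / (1 + 0.9 * 0.15 * t)^(1 / 0.9) * # "true" decline
         exp(rnorm(length(t), sd = 0.05))        # multiplicative noise
    fit <- nls(q ~ qi / (1 + b * Di * t)^(1 / b),
               start = list(qi = 800, Di = 0.1, b = 0.5))
    coef(fit)                                    # recovered qi, Di, b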

The DCA Wrangler provides fast autocasting and storage of decline curves. Actual data (solid) is modeled using best-fit optimization on mathematical models (dashed line forecast).

At the heart of any resource assessment model is a robust “autocasting” method. Autocasting is the automatic application of DCA to large ensembles of wells, rather than one at a time.

But there’s a problem. Incumbent technologies make the retrieval of decline curves and their parameters very difficult. Decline curve models are complex mathematical forecasts with many components and variations. Retrieving models from a SQL database often requires parsing text expressions and interacting with many tables within the database.
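To make the contrast concrete, here is a hedged sketch of storing a fitted decline curve as a single document with the mongolite R package; the collection, field names, and values are hypothetical, not the DCA Wrangler's actual schema:

    # Store one fitted decline model as a single MongoDB document.
    library(mongolite)               # assumes a reachable MongoDB instance
    library(jsonlite)
    declines <- mongo(collection = "declines", db = "dca")
    doc <- list(well_id = "W-001",
                model   = "arps_hyperbolic",
                params  = list(qi = 982.4, Di = 0.14, b = 0.87),
                fitted  = "2018-02-01")
    declines$insert(toJSON(doc, auto_unbox = TRUE))
    declines$find('{"well_id": "W-001"}')   # retrieval is one query, no joins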

Further, with the rise of unconventionals, the fundamental workflow of resource assessment through decline curves is being challenged. Spotfire has become a popular tool for revamping decline curve analysis and building next-generation solutions.

Autocasting in Action

What I am going to demonstrate is a new autocast workflow that would not be possible without the combined performance and capability of MongoDB and Spotfire. I’ll be demonstrating using our DCA Wrangler product – which is one of over 250 analytics workflows that we provide through a comprehensive subscription.

It’s important to note that software already exists to decline wells and store their results in a database. People have even declined wells in Spotfire before. What I hope you see in our new product is the step change in performance, ease of use, and enablement when you use MongoDB as the backend.

What’s Next?

First, we have a home-run solution for decline curves that requires a MongoDB backend. I expect that, in the near future, more vendor companies will be leveraging Mongo as their backend database.

Second, I hope you see the value in MongoDB for storing and retrieving technical data and analytic results, especially within powerful tools like Spotfire, and how easy it is to set up and use.

And lastly, I hope you get excited about the other problems that can be solved by marrying TIBCO with MongoDB – imagine using StreamBase as your IoT processor and MongoDB as your deposition environment, or storing models and sensor data in Mongo and using Spotfire to tweak model parameters and co-visualize the data.

If you’re interested in learning more about our subscription, get registered today.

Let’s make data great again.

You’ll conquer the present suspiciously fast if you smell of the future… and stink of the past.

TERR — Converting strings to date and time

This post explains my struggle to convert strings to Date or Time with TERR.  I recently spent so much time on this that I thought it deserved a blog post.  Here’s the story…

I was recently working on a TERR data function that calls a publicly available API and brings all the data into a table.  I used the as.data.frame function to parse out my row data.  In that function, I used the stringsAsFactors = FALSE argument, and as a result (the desired result), all of my data came back as strings.  This was fine because the API included column metadata with the data type.  As you can see in the script below, I planned on “sapplying” through the metadata with as.POSIXct and as.numeric.  This worked just fine in RStudio, and in the Spotfire data function it also worked for the numeric columns and for the DateTime columns.  However, it did not work for Date and Time columns.  I tried different syntax, functions (as.Date didn’t work either), packages, etc. to get it to work and NOTHING!  The struggle was very real.

Script: converting strings to Date or Time with TERR
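Since the original script was shared as an image, here is a hedged reconstruction of the pattern described above; the column names and metadata values are illustrative, not the original code:

    # Parse API rows returned as strings, then coerce each column using
    # the data type given in the API's column metadata.
    raw  <- data.frame(ts   = c("2018-02-05 13:45:00", "2018-02-06 09:10:00"),
                       rate = c("512.3", "498.7"),
                       stringsAsFactors = FALSE)
    meta <- c(ts = "DateTime", rate = "Numeric")
    raw[meta == "Numeric"]  <- sapply(raw[meta == "Numeric"], as.numeric)
    raw[meta == "DateTime"] <- lapply(raw[meta == "DateTime"],
                                      as.POSIXct, tz = "UTC")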


Finally, I Googled the right terms and came across a TIBCO knowledge base article with this information:

Spotfire data functions recognize TERR objects of class “POSIXct” as date/time information. As designed, the Spotfire/TERR data function interface for date/time information does the following:

– Converts a Spotfire value or column whose DataType is “Date”, “Time” or “DateTime” into a TERR object of class “POSIXct”.

– Converts a TERR object of class “POSIXct” into a Spotfire value or column with a DataType of “DateTime”, which can then be formatted in Spotfire to display only the date (or to display only the time) if needed.

This interface does not use any other TERR object classes (such as the “Date” class in TERR) to transfer date/time information between Spotfire and TERR.
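Concretely, that means only the first of these two objects would come back to Spotfire as a date/time column (a hypothetical pair of one-liners inside a data function):

    # Class "POSIXct" crosses the interface as a Spotfire DateTime...
    ok  <- as.POSIXct("2018-02-05 00:00:00", tz = "UTC")
    # ...while class "Date" is not transferred at all.
    bad <- as.Date("2018-02-05")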

That told me that all my effort was for naught, and it just wasn’t possible.  I contacted TIBCO just to make sure there wasn’t some other solution out there that the article was not addressing.  In the end, I just used a transformation on the Date and Time columns to change the data type.  I hope that you, dear Reader, find this post before you spend hours on the same small problem.  I did put in an enhancement request.  Fingers crossed.  Please let me know if you have a better method!


Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!

Data Science Toolkit Improvements

This week, I was able to test out the latest and greatest changes to the Ruths.ai Data Science Toolkit.  New options and features allow users to easily split test and training data sets prior to model building, as all good data scientists should!  This new functionality speeds up your analysis by making model build and evaluation faster and more efficient.  I worked up this video to demonstrate.
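The Toolkit handles the split through its UI; for reference, a plain-R equivalent of a 70/30 train/test split looks roughly like this (not the Toolkit's internal code):

    # Hold out 30% of rows for testing before any model is built.
    set.seed(42)
    idx   <- sample(nrow(iris), size = floor(0.7 * nrow(iris)))
    train <- iris[idx, ]
    test  <- iris[-idx, ]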

Data Science Toolkit for Spotfire

The Data Science Toolkit brings the power of advanced data science to Spotfire.  Ruths.ai designed it with simplicity and efficiency in mind to support a wide range of analytics applications. This extension is coupled with comprehensive training that provides both beginner and experienced users a strong foothold in data science analysis.  The Data Science Toolkit is available to Premium subscribers.  Once deployed on your Spotfire server, quickly and easily access the toolkit via the Tools menu as shown below.  Find out more, including videos, at this link.

Data Science Toolkit menu

Please feel free to reach out to me or anyone else on the Ruths.ai team to learn more about this amazing product.  We love to talk about it!


Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!

Using Spotfire’s Data Relationships Tool to check for Multicollinearity

When building a multiple linear regression model, we want to limit the correlation between predictor (X) variables.  Luckily, Spotfire has a tool that makes identifying this correlation (known as multicollinearity) effortless.  I will walk you through the tool, and you can see the resulting template here.
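Outside Spotfire, the same check takes a couple of lines of R; a small sketch using the built-in mtcars data (the predictors here are just examples):

    # Pairwise correlations among candidate predictors; values near +/-1
    # flag multicollinearity.
    round(cor(mtcars[, c("disp", "hp", "wt", "drat")]), 2)
    # Variance inflation factors are another common check, e.g. with the
    # car package: car::vif(lm(mpg ~ disp + hp + wt + drat, data = mtcars))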

Read More

Jason is a Junior Data Scientist at Ruths.ai with a Master’s degree in Predictive Analytics and Data Science from Northwestern University. He has experience with a multitude of machine learning techniques such as Random Forest, Neural Nets, and Hidden Markov Models. With a previous Master’s in Creative Writing, Jason is a fervent believer in the Oxford comma.

Real Estate Secrets: Hidden Trend Visualization

Everyone who has ever owned or lived in a house knows at least a little bit about the whims of the real estate market. Big houses cost more, neighborhood matters, proximity to basic services is great, age and style are important in some markets, you name it. But what is it that matters the most? This is a question that visualization can help us answer.

Read More

Using Support Vector Machines in Spotfire

(Image Source: opencv.org)

Support Vector Machines (SVMs) are among the most popular and most widely used machine learning algorithms today. The method is robust and allows us to tackle both classification and regression problems. In general, SVMs can be relatively easy to use, have good generalization performance, and often do not require much tuning. Follow this link for further information on support vector machines. To help illustrate the power of SVMs, we thought it would be useful to go through an example using a custom template we have created for SVMs.
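If you want to experiment in raw R before opening the template, a minimal classification example with the e1071 package looks like this (the template itself may use different data and settings):

    # Train an SVM classifier on the built-in iris data and inspect the
    # confusion matrix on the training set.
    library(e1071)                        # assumes e1071 is installed
    model <- svm(Species ~ ., data = iris, kernel = "radial")
    pred  <- predict(model, iris)
    table(predicted = pred, actual = iris$Species)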

Read More

Emanuel holds a Master of Science Degree in Data Science and Analytics from the University of Oklahoma and brings years of experience in production engineering.