How many games should your NFL team have won this season? Everyone knows a lucky bounce here and a bad call there can have a significant impact on the win-loss bottom line. Hardcore sports analytics fans will recognize this factor as the driver behind Pythagorean Win Totals, a statistic derived to measure true performance. Today, we are going to see whether we can beat Pythagorean Win Totals as a predictor of how many games a team won in a given season; that is, how many games your team should have won.
Spoiler: we can make a better predictor, but in a way that makes us re-evaluate our understanding of Pythagorean Win Totals.
If you simply want to know how many games your team should have won, you can go straight to our Spotfire Template. But for Football Outsiders fans, or those more interested in what makes up wins and losses, read on.
Data science in Oil and Gas is taking center stage as operators work in the new “lower for longer” price environment. Want to see what happens when you solve data science questions with the hottest new database and the powerful analytics of Spotfire? Read on to learn about our latest analytics module, the DCA Wrangler. If you want to see it in action, scroll down to watch the video.
Layering Data Science on General Purpose Data & Analytics
Ruths.ai is a startup focused on energy analytics and technical data science. We are both TIBCO and MongoDB partners, heavily leveraging these two platforms to solve real-world problems in applying data science at scale within the enterprise. I started our plucky outfit a little under four years ago. We’ve done a lot of neat things with Spotfire, including analyzing seismic and well log data. Here, we’ll look at competitor/production data.
MongoDB provides a powerful and scalable general-purpose database system. TIBCO provides tested and forward-thinking general-purpose analytics platforms for both streaming data and data at rest. They also provide great infrastructure products, which aren’t the focus of this blog.
Ruths.ai provides the domain knowledge, infusing our proprietary algorithms and data structures for common analytics problems into products that leverage the TIBCO and MongoDB platforms.
We believe these two platforms can be combined to solve innumerable problems in the technical industries our readers represent. TIBCO provides the analytics and visualization, while MongoDB provides the database. This is a powerful marriage for problems involving analytics, single-view, or IoT applications.
In this blog, I want to dig into a specific and fundamental problem within oil and gas, and how we leveraged TIBCO Spotfire and MongoDB to solve it: autocasting.
What is Autocasting?
Oil reserves denote the amount of crude oil that can be technically recovered at a cost that is financially feasible at the present price of oil. Crude oil resides deep underground and must be extracted using wells and completion techniques. Horizontal wells can stretch two miles within a vertical window the height of most office floors.
For those with E&P experience, I’m going to elide some important details, like using “oil” for “hydrocarbons” and other technical nomenclature.
Because the geology of the subsurface cannot be examined directly, indirect techniques must be used to estimate the size and recoverability of the resource. One important indirect technique is called decline curve analysis (DCA), a mathematical model that we fit to historical production data to forecast reserves. DCA is so prevalent in oil and gas that we use it for auditing, booking, competitor analysis, workover screening, company growth, and many other important tasks. With the rise of analytics, it has therefore become a central piece in any multivariate workflow looking to find the key drivers of well and resource performance.
At the heart of any resource assessment model is a robust “autocasting” method. Autocasting is the automatic application of DCA to large ensembles of wells, rather than one at a time.
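As a concrete illustration (not the DCA Wrangler's actual implementation), here is a minimal sketch in R of what one autocast iteration does: fit the classic Arps hyperbolic model to a single well's historical rates, then forecast forward. The data and parameter values below are synthetic.

```r
# Arps hyperbolic decline: q(t) = qi / (1 + b * Di * t)^(1 / b)
arps <- function(t, qi, Di, b) qi / (1 + b * Di * t)^(1 / b)

# Synthetic monthly rates for one well (illustration only)
set.seed(42)
t <- 0:35
q <- arps(t, qi = 1000, Di = 0.10, b = 0.9) * exp(rnorm(length(t), sd = 0.02))

# Fit the three Arps parameters to the historical data
fit <- nls(q ~ arps(t, qi, Di, b),
           start = list(qi = 900, Di = 0.08, b = 0.8))

# Forecast five more years; an autocast repeats this fit over every well
t_fc <- 36:95
q_fc <- arps(t_fc, coef(fit)[["qi"]], coef(fit)[["Di"]], coef(fit)[["b"]])
```

An autocast workflow wraps this single-well fit in a loop over an ensemble of wells and stores each well's fitted parameters and forecast.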
But there’s a problem: incumbent technologies make retrieving decline curves and their parameters difficult. Decline curve models are complex mathematical forecasts with many components and variations. Retrieving models from a SQL database often requires parsing text expressions and interacting with many tables.
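By contrast, a document database can store each fitted model as a single structured record, with no expression parsing or multi-table joins needed to retrieve it. A hypothetical decline-curve document (field names invented for illustration, not the DCA Wrangler's actual schema) might look like:

```json
{
  "wellId": "42-123-45678",
  "model": "arps_hyperbolic",
  "parameters": { "qi": 1250.0, "Di": 0.12, "b": 0.9 },
  "fitDate": "2018-06-01",
  "source": "autocast"
}
```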
Further, with the rise of unconventionals, the fundamental workflow of resource assessment through decline curves is being challenged. Spotfire has become a popular tool for revamping decline curve analysis and building next-generation solutions.
Autocasting in Action
What I am going to demonstrate is a new autocast workflow that would not be possible without the combined performance and capability of MongoDB and Spotfire. I’ll be demonstrating using our DCA Wrangler product – which is one of over 250 analytics workflows that we provide through a comprehensive subscription.
It’s important to note that software already exists to decline wells and database their results. People have even declined wells in Spotfire before. What I hope you see in our new product is the step change in performance, ease of use, and enablement when you use MongoDB as the backend.
First, we have a home run solution for decline curves that requires a MongoDB backend. In the near future, more vendor companies will be leveraging Mongo as their backend database.
Second, I hope you see the value in MongoDB for storing and retrieving technical data and analytic results, especially within powerful tools like Spotfire. Plus, how easy it is to set up and use.
Lastly, I hope you get excited about the other problems that can be solved by marrying TIBCO with MongoDB: imagine using StreamBase as your IoT processor and MongoDB as your deposition environment, or even storing models and sensor data in Mongo and using Spotfire to tweak model parameters and co-visualize data.
This post explains my struggle to convert strings to Date or Time with TERR. I recently spent so much time on this that I thought it deserved a blog post. Here’s the story…
I was recently working on a TERR data function that calls a publicly available API and brings all the data into a table. I used the as.data.frame function to parse out my row data with the stringsAsFactors = FALSE argument, and as a result (the desired result), all of my data came back as strings. This was fine because the API included column metadata with each data type. My plan was to “sapply” through the metadata with as.POSIXct and as.numeric. The script worked just fine in RStudio; in Spotfire, it worked for the numeric and DateTime columns, but not for the Date and Time columns. I tried different syntax, functions (as.Date didn’t work either), and packages to get it to work, and NOTHING! The struggle was very real.
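To make the setup concrete, here is a small sketch of the approach (the column names and metadata are illustrative, not the actual API's): walk the metadata and cast each string column to its intended type with as.numeric or as.POSIXct.

```r
# Everything arrives from the API as character; 'meta' records each
# column's intended type (illustrative names, not the real API's)
df <- data.frame(well  = c("A-1", "A-2"),
                 oil   = c("120.5", "98.3"),
                 start = c("2017-01-01 08:00:00", "2017-02-01 08:00:00"),
                 stringsAsFactors = FALSE)
meta <- c(well = "String", oil = "Numeric", start = "DateTime")

# Cast each column according to its metadata type
df[] <- lapply(names(df), function(col) {
  switch(meta[[col]],
         Numeric  = as.numeric(df[[col]]),
         DateTime = as.POSIXct(df[[col]], tz = "UTC"),
         df[[col]])
})
```

In RStudio this produces properly typed columns across the board; the trouble only appears at the Spotfire/TERR boundary.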
Eventually, I found a TIBCO article that explained the behavior. Spotfire data functions recognize TERR objects of class “POSIXct” as date/time information. As designed, the Spotfire/TERR data function interface for date/time information does the following:
– Converts a Spotfire value or column whose DataType is “Date”, “Time” or “DateTime” into a TERR object of class “POSIXct”.
– Converts a TERR object of class “POSIXct” into a Spotfire value or column with a DataType of “DateTime”, which can then be formatted in Spotfire to display only the date (or to display only the time) if needed.
This interface does not use any other TERR object classes (such as the “Date” class in TERR) to transfer date/time information between Spotfire and TERR.
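In practice, this means a TERR “Date” object has to be coerced to POSIXct before the data function returns, and the resulting DateTime column can then be formatted in Spotfire to display only the date. A minimal sketch:

```r
# A TERR "Date" object will not cross the Spotfire/TERR interface...
d <- as.Date(c("2018-03-01", "2018-03-15"))

# ...but converted to POSIXct (via character, to pin the time zone),
# it returns to Spotfire as a DateTime column that can be formatted
# to show only the date portion
out <- as.POSIXct(format(d), tz = "UTC")
```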
That told me that all my effort was for naught, and it just wasn’t possible. I contacted TIBCO just to make sure there wasn’t some other solution out there that the article was not addressing. In the end, I just used a transformation on the Date and Time columns to change the data type. I hope that you, dear Reader, find this post before you spend hours on the same small problem. I did put in an enhancement request. Fingers crossed. Please let me know if you have a better method!
This week, I was able to test out the latest and greatest changes to the Ruths.ai Data Science Toolkit. New options and features allow users to easily split test and training data sets prior to model building, as all good data scientists should! This new functionality speeds up your analysis by making model build and evaluation faster and more efficient. I worked up this video to demonstrate.
Data Science Toolkit for Spotfire
The Data Science Toolkit brings the power of advanced data science to Spotfire. Ruths.ai designed it with simplicity and efficiency in mind to support a wide range of analytics applications. This extension is coupled with comprehensive training that provides both beginner and experienced users a strong foothold in data science analysis. The Data Science Toolkit is available to Premium subscribers. Once deployed on your Spotfire server, quickly and easily access the toolkit via the Tools menu as shown below. Find out more, including videos, at this link.
Please feel free to reach out to me or anyone else on the Ruths.ai team to learn more about this amazing product. We love to talk about it!
When building a multiple linear regression model, we want to limit the correlation between predictor (X) variables. Luckily, Spotfire has a tool that makes identifying that correlation (a problem called multicollinearity) effortless. I will walk you through the tool, and you can see the resulting template here.
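Outside Spotfire, the same check takes a few lines of R. Here is a sketch using the built-in mtcars data (chosen purely for illustration), where weight and displacement are strongly correlated:

```r
# Pairwise correlations between candidate predictors; values near +/-1
# flag multicollinearity
predictors <- mtcars[, c("wt", "hp", "disp")]
round(cor(predictors), 2)
# wt and disp correlate above 0.85, so including both as predictors
# inflates the variance of the fitted coefficients
```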
Everyone who has ever owned or lived in a house knows at least a little bit about the whims of the real estate market. Big houses cost more, neighborhood matters, proximity to basic services is great, age and style are important in some markets, you name it. But what is it that matters the most? This is a question that visualization can help us answer.
Support Vector Machines (SVMs) are among the most popular and most widely used machine learning algorithms today. They are robust and can tackle both classification and regression problems. In general, SVMs are relatively easy to use, have good generalization performance, and often do not require much tuning. Follow this link for further information on support vector machines. To help illustrate the power of SVMs, we thought it would be useful to go through an example using a custom template we have created for SVMs.
Two weeks ago, I published a Linear and Logistic Regression template on Exchange.ai that can be found here. When I built the template, my process was as follows:
Add test and training data sets
Build model on training data set
Insert predicted column based on model in test data set
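The three steps above can be sketched in a few lines of base R (mtcars and a simple linear model stand in for the template's actual data and models):

```r
set.seed(1)

# 1. Split into training and test sets (70/30)
idx   <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# 2. Build the model on the training set only
model <- lm(mpg ~ wt + hp, data = train)

# 3. Insert a predicted column into the test set
test$Prediction <- predict(model, newdata = test)
```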
When following this process for the logistic regression model (a classification model), the template inserts two columns of data: ProbPrediction and ClassPrediction, a probability and a predicted class, respectively. I noticed that some records contained a value for ClassPrediction but not ProbPrediction, which seemed odd. This happened in records where one or more of my predictor columns were null, in which case neither column should have been populated.
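For reference, this is how a plain logistic regression behaves in R: a record with a null predictor yields NA for both the probability and the class (toy data, not the template's actual script):

```r
# Toy data: the last record has a null predictor
df  <- data.frame(x = c(1, 2, 3, 4, 5, 6, NA),
                  y = c(0, 0, 1, 0, 1, 1, 1))
fit <- glm(y ~ x, data = df, family = binomial)

# Predict on the full table: row 7 gets NA for BOTH outputs
prob       <- predict(fit, newdata = df, type = "response")
class_pred <- ifelse(prob > 0.5, 1, 0)
```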
It turns out that this is a bug that can be fixed with the steps below.
Go to the Tools menu and select TERR Tools
Click the Launch TERR Console button
Type q() to exit the console
Close Spotfire and relaunch it
See below for a screenshot of the console.
After I relaunched Spotfire and reran the model, I saw consistent population of the ProbPrediction and ClassPrediction columns. If you have any questions, feel free to contact me at firstname.lastname@example.org.