How many games should your NFL team have won this season?

How many games should your NFL team have won this season? Everyone knows a lucky bounce here and a bad call there can have a significant impact on the win-loss bottom line. Hardcore fans of sports analytics will recognize this factor as the driver behind Pythagorean Win Totals, a statistic derived to measure true performance. Today, we are going to see if we can beat Pythagorean Win Totals as a predictor of how many games a team won in a given season, i.e., how many games your team should have won.

Spoiler:  we can make a better predictor, but in a way that makes us re-evaluate our understanding of Pythagorean Win Totals.

If you simply want to know how many games your team should have won, you can go straight to our Spotfire Template.  But, for Football Outsiders fans or those more interested in what makes up wins and losses, read on.
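For readers who have not met the statistic before, here is a minimal sketch of the usual Pythagorean calculation. The post itself does not give the formula, so treat the details as assumptions: the exponent of 2.37 is the value commonly cited for the NFL, the 16-game season matches the schedule at the time, and the function name and sample point totals are made up for illustration.

```python
def pythagorean_wins(points_for, points_against, games=16, exponent=2.37):
    """Expected wins from points scored and allowed.

    Assumptions: exponent=2.37 is the commonly cited NFL value and
    games=16 is the regular-season length at the time of the post.
    """
    scored = points_for ** exponent
    allowed = points_against ** exponent
    return games * scored / (scored + allowed)


# Hypothetical team that scored 400 points and allowed 300.
expected = pythagorean_wins(400, 300)
print(f"Expected wins: {expected:.1f}")
```

A team that scores exactly as many points as it allows comes out at half its games, which is a quick sanity check on the formula.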


Memories from the Houston Astros World Series Championship

We interrupt this analytical, data-focused blog to attempt a little tug at the heartstrings. After all, Ruths.ai is a Houston-proud company, and we all went through Hurricane Harvey and the subsequent Astros World Series run that brought the city together. While this article might not delve into analytics, its subject–the 2017 World Series Champion Houston Astros–certainly serves as a model for how an analytically focused enterprise should run.

This article first appeared Friday, November 17 at Astros County, written by myself, our resident Astros fanatic.


Creating Information Links with Spatial Data

  • Do you have WKB (well-known binary) data that you want to bring into Spotfire?
  • Are you struggling with SQL geometry columns in Spotfire?
  • Do you want to understand more about how Spotfire processes or handles spatial data?
  • Do you want to import spatial data into Spotfire but don’t know how to configure the information link?


Air Traffic Delays during the Holiday Season

Happy Holi-delays!

Howdy! I’m going to be looking at US air traffic delays during the holiday season, specifically air travel trends in November and December. This is more of a for-fun analysis, as well as my first real dive into the world of Spotfire. If there’s anything wonky or weird about my analysis, just bear with me! I’ve posted this template on Exchange.ai, so feel free to download it here and follow along.

The data I’m using is from a few different sources which I’ve cited at the bottom of this post.

The data set that kicked off this idea was found on Kaggle, a site for data sets and data science competitions. The original set contained air traffic delays for the entire year of 2008, which then led me to an even larger data set with data going all the way back to 1987. My analysis only looks at November and December of 2006 and 2007, so the view is a little narrow. After concatenating the entire data set, the file was almost 11GB, which did result in some cool visualizations but was too large to make a template out of.

The Map

This data set has some fun properties to it. It allows you to get an idea of the different types of delays that occur such as weather or security. In Spotfire I joined another data set that I found which contains the latitudes and longitudes of each airport. This allowed me to get a map of each airport in the United States.
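If you’d rather prototype that join outside Spotfire, here is a rough sketch of the same idea in plain Python. It is not how the template was built. The field names (`Dest`, `DepDelay`, `IATA`, `Latitude`, `Longitude`), the sample rows, and the approximate coordinates are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; field names and values are assumptions.
flights = [
    {"Dest": "ATL", "DepDelay": 10.0},
    {"Dest": "ATL", "DepDelay": 30.0},
    {"Dest": "ORD", "DepDelay": 45.0},
]
airports = [
    {"IATA": "ATL", "Latitude": 33.64, "Longitude": -84.43},  # approximate
    {"IATA": "ORD", "Latitude": 41.98, "Longitude": -87.90},  # approximate
]


def summarize_by_airport(flights, airports):
    """Join flights to airport coordinates and aggregate per destination."""
    coords = {a["IATA"]: (a["Latitude"], a["Longitude"]) for a in airports}
    totals = defaultdict(lambda: [0, 0.0])  # code -> [flight count, delay sum]
    for flight in flights:
        entry = totals[flight["Dest"]]
        entry[0] += 1
        entry[1] += flight["DepDelay"]

    rows = []
    for code, (count, delay_sum) in sorted(totals.items()):
        if code not in coords:
            continue  # skip airports with no coordinates to plot
        lat, lon = coords[code]
        rows.append({
            "airport": code,
            "lat": lat,
            "lon": lon,
            "flights": count,                     # drives marker size
            "avg_dep_delay": delay_sum / count,   # drives marker color
        })
    return rows


for row in summarize_by_airport(flights, airports):
    print(row)
```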

 

In the visual above, marker size represents the number of people traveling to each city, while color represents the average departure delay. Right off the bat, you can see the highest-traffic airports, such as Atlanta, Chicago, and DFW. These guys have an absurd amount of traffic going through them. DFW, for example, has about 46k inbound flights, and Atlanta has a whopping 65k. For fun, I did some napkin math to get an idea of how many people flew into just Atlanta during these two months. I used a rough average of about 200 seats per plane [3], and about 80-85% of seats are usually filled [4]. At 83% full, that works out to about 166 people per flight, which means Atlanta handled somewhere around 10.8M arriving passengers across November and December of both years combined. That’s a lot of people for a single airport, and they seem to do a pretty good job of handling it! Atlanta’s average overall delay is about 23 minutes. Chicago (ORD), on the other hand, is a little worse off, with an average overall delay of about 40 minutes. The overall delay was calculated by adding the arrival delay and departure delay for each flight.
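Here is the napkin math written out, in case you want to plug in your own numbers. The inputs are the rough figures quoted above; the variable and function names are just for illustration.

```python
SEATS_PER_PLANE = 200   # rough average seat count [3]
LOAD_FACTOR = 0.83      # share of seats filled, within the 80-85% range [4]
ATL_FLIGHTS = 65_000    # flights into Atlanta, Nov/Dec of 2006 and 2007

# People per flight, rounded to a whole passenger.
passengers_per_flight = round(SEATS_PER_PLANE * LOAD_FACTOR)
atl_passengers = ATL_FLIGHTS * passengers_per_flight


def overall_delay(arrival_delay, departure_delay):
    """Overall delay for one flight: arrival delay plus departure delay."""
    return arrival_delay + departure_delay


print(f"{passengers_per_flight} people per flight")
print(f"{atl_passengers / 1e6:.1f}M passengers into Atlanta")
```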

The Graph

The above line graph shows the delay per day. It’s pretty obvious when the holidays occur and end, which was kind of a neat result of visualizing this. To me, the most interesting thing about this graph is the Half Dome peak right before Christmas Eve.

The similarities are striking!  I also enjoy how the middle of December sees a giant spike in delays and then a big dip back to normality before the climb to the top.

The Data

Interested in digging more into this data? Download the template and play around for some fun visualizations and neat stats. One thing to note is that this template only contains a small subset of the actual data set that I used. If you want the full set, you’ll have to go to this place and download the zip files. They’re bzip2-compressed (.bz2) files, so on Windows you’ll need a program such as 7-Zip or WinRAR to open them. If you’re on Linux, running the ole bzip2 -dk on the files from the shell in your data’s directory should be enough. For the latitudes and longitudes, I used this site, which provides a csv for all airports, not just those in the USA. One thing I found limiting was that the airport data doesn’t include the state for airports in the United States. Fortunately, I found a site that has this information in table form, so you would just need to scrape the site for the relevant information and mess with the data table properties in Spotfire.
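If you’d rather not install anything, Python’s standard library can also unpack the files. The sketch below mimics what bzip2 -dk does for a single file: it decompresses and keeps the original archive. The function name is my own and the file name in the comment is hypothetical.

```python
import bz2
import shutil
from pathlib import Path


def decompress_keep(src):
    """Decompress a .bz2 file next to the original and keep the archive."""
    src = Path(src)
    dst = src.with_suffix("")  # e.g. some_year.csv.bz2 -> some_year.csv
    with bz2.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)  # stream, so big files are fine
    return dst


# Example usage (hypothetical): unpack every archive in the current folder.
# for archive in Path(".").glob("*.bz2"):
#     decompress_keep(archive)
```

Streaming with `shutil.copyfileobj` avoids loading a multi-gigabyte file into memory at once.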

Hopefully this post was interesting to you or at the very least an insight into how busy Atlanta is. Thanks for reading and have some happy holidays!

 

Sources:

[1] http://stat-computing.org/dataexpo/2009/the-data.html

[2] https://openflights.org/data.html

[3] https://www.quora.com/What-is-the-average-amount-of-passengers-on-a-plane

[4] https://www.quora.com/How-many-empty-seats-are-there-on-the-average-US-domestic-flight

[5] https://www.kaggle.com/giovamata/airlinedelaycauses

 

Wrangling Data Science in Oil & Gas: Merging MongoDB and Spotfire

Data science in oil and gas is taking center stage as operators work in the new “lower for longer” price environment. Want to see what happens when you solve data science questions with the hottest new database and the powerful analytics of Spotfire? Read on to learn about our latest analytics module, the DCA Wrangler. If you want to see it in action, scroll down to watch the video.

Layering Data Science on General Purpose Data & Analytics

Ruths.ai is a startup focused on energy analytics and technical data science. We are both TIBCO and MongoDB partners, heavily leveraging these two platforms to solve real-world problems revolving around the application of data science at scale and within the enterprise environment. I started our plucky outfit a little under four years ago. We’ve done a lot of neat things with Spotfire, including analyzing seismic and well log data. Here, we’ll look at competitor/production data.

The Document model allows for flexible and powerful encoding of decline curve models.

MongoDB provides a powerful and scalable general-purpose database system. TIBCO provides tested and forward-thinking general-purpose analytics platforms for both streaming data and data at rest. They also provide great infrastructure products, which aren’t the focus of this blog.

Ruths.ai provides the domain knowledge and we infuse our proprietary algorithms and data structures for solving common analytics problems into products that leverage the TIBCO and MongoDB platforms.

We believe that these two platforms can be combined to solve innumerable problems in the technical industries represented by our readers. TIBCO provides the analytics and visualization while MongoDB provides the database. This is a powerful marriage for problems involving analytics, single view or IOT.

In this blog, I want to dig into a specific and fundamental problem within oil and gas and how we leveraged TIBCO Spotfire and MongoDB to solve it — namely Autocasting.

What is Autocasting?

Oil reserves denote the amount of crude oil that can be technically recovered at a cost that is financially feasible at the present price of oil. Crude oil resides deep underground and must be extracted using wells and completion techniques. Horizontal wells can stretch two miles within a vertical window the height of most office floors.

For those with E&P experience, I’m going to elide some important details, like using “oil” for “hydrocarbons” and other technical nomenclature.

Because the geology of the subsurface cannot be examined directly, indirect techniques must be used to estimate the size and recoverability of the resource. One important indirect technique is called decline curve analysis (DCA), which is a mathematical model that we fit to historical production data to forecast reserves. DCA is so prevalent in oil and gas that we use it for auditing, booking, competitor analysis, workover screening, company growth and many other important tasks. With the rise of analytics, it has therefore become a central piece in any multi-variate workflow looking to find the key drivers for well and resource performance.
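To make “fitting a mathematical model to historical production data” concrete, here is a minimal sketch using the classic Arps exponential decline, q(t) = qi * exp(-D * t), fitted by least squares on the logarithm of the rate. This is a simple stand-in I chose for illustration, not the method used in the DCA Wrangler, and the production history is synthetic.

```python
import math


def fit_exponential_decline(times, rates):
    """Fit q(t) = qi * exp(-d * t) by linear regression on ln(q)."""
    n = len(times)
    logs = [math.log(q) for q in rates]
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    slope = sxy / sxx
    qi = math.exp(y_mean - slope * t_mean)  # initial rate
    return qi, -slope                       # decline rate d is the negated slope


def forecast_rate(qi, d, t):
    """Forecast the production rate at time t from the fitted parameters."""
    return qi * math.exp(-d * t)


# Synthetic 12-month history (assumed values, not real well data).
months = list(range(12))
history = [1000.0 * math.exp(-0.05 * t) for t in months]

qi, d = fit_exponential_decline(months, history)
print(f"qi = {qi:.1f}, d = {d:.3f} per month")
print(f"Forecast rate at month 24: {forecast_rate(qi, d, 24):.1f}")
```

Real wells, especially unconventional ones, usually need hyperbolic or multi-segment models, but the workflow is the same: fit parameters to history, then extrapolate.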

The DCA Wrangler provides fast autocasting and storage of decline curves. Actual data (solid) is modeled using best-fit optimization on mathematical models (dashed line forecast).

At the heart of any resource assessment model is a robust “autocasting” method. Autocasting is the automatic application of DCA to large ensembles of wells, rather than one at a time.

But there’s a problem. Incumbent technologies make the retrieval of decline curves and their parameters very difficult. Decline curve models are complex mathematical forecasts with many components and variations. Retrieving models from a SQL database often requires parsing text expressions and interacting with many tables within the database.
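To illustrate the alternative, here is a sketch of a decline curve stored as a single nested document, the kind of structure a document database holds natively. The schema is hypothetical: the field names, the well identifier, and the parameter values are all made up, and this is not the DCA Wrangler’s actual data model. A plain Python dict stands in for the stored document so the example runs without a database.

```python
import json
import math

# Hypothetical decline-curve document; every field name is an assumption.
decline_doc = {
    "well_id": "WELL-001",
    "phase": "oil",
    "segments": [
        {"model": "exponential", "t_start": 0, "t_end": 24,
         "qi": 1000.0, "d": 0.05},
        {"model": "hyperbolic", "t_start": 24, "t_end": 120,
         "qi": 301.2, "d": 0.03, "b": 0.8},
    ],
}


def rate_at(doc, t):
    """Evaluate the stored forecast at time t directly from the document."""
    for seg in doc["segments"]:
        if seg["t_start"] <= t < seg["t_end"]:
            dt = t - seg["t_start"]
            if seg["model"] == "exponential":
                return seg["qi"] * math.exp(-seg["d"] * dt)
            if seg["model"] == "hyperbolic":
                return seg["qi"] / (1 + seg["b"] * seg["d"] * dt) ** (1 / seg["b"])
            raise ValueError(f"unknown model: {seg['model']}")
    return 0.0  # outside every segment


# Round-trip through JSON to mimic storing and retrieving the document.
restored = json.loads(json.dumps(decline_doc))
print(rate_at(restored, 12))
print(rate_at(restored, 36))
```

Because each segment carries its own named parameters, the forecast can be evaluated without parsing a text expression or joining several tables.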

Further, with the rise of unconventionals, the fundamental workflow of resource assessment through decline curves is being challenged. Spotfire has become a popular tool for revamping and making next generation decline curve analysis solutions.

Autocasting in Action

What I am going to demonstrate is a new autocast workflow that would not be possible without the combined performance and capability of MongoDB and Spotfire. I’ll be demonstrating using our DCA Wrangler product – which is one of over 250 analytics workflows that we provide through a comprehensive subscription.

It’s important to note that software exists to decline wells and database their results. People have even declined wells in Spotfire before. What I hope you see in our new product is the step change in performance, ease of use, and enablement when you use MongoDB as the backend.

What’s Next?

First, we have a home run solution for decline curves that requires a MongoDB backend. In the near future, more vendor companies will be leveraging Mongo as their backend database.

Second, I hope you see the value in MongoDB for storing and retrieving technical data and analytic results, especially within powerful tools like Spotfire, and how easy it is to set up and use.

And lastly, I hope you get excited about the other problems that can be solved by marrying TIBCO with MongoDB – imagine using StreamBase as your IoT processor and MongoDB as your deposition environment. Or you could even store models and sensor data within Mongo and use Spotfire to tweak model parameters and co-visualize data.

If you’re interested in learning more about our subscription, get registered today.

Let’s make data great again.

Part 6 – Visualization Properties

This is the sixth and final part of a series on Spotfire Properties. In previous posts, I discussed Document Properties, Data Table Properties, Column Properties, Data Connection Properties, and Data Function Properties. This week, we’ll take a look at Visualization Properties.

Visualization Properties

To begin, each and every Spotfire visualization has its own visualization properties dialog controlling what is possible. Basically, if it’s not in visualization properties, it can’t be done. As I am sure you have noticed, the dialog changes with each visualization based on the content and functionality of the vis. In the course of this post, I will explain which menus are common across all visualizations and provide a few “pro” tips.

Common Visualization Properties

When writing this blog post, I decided to create a matrix showing which submenus appear in each visualization properties dialog.  This seemed like a good idea when I started.  Halfway through the assembly, I started to question my motives and the utility of such a matrix.  In the end, the result surprised me. You can download the DXP with this matrix, and I have posted a screenshot below.

Visualization Properties Summary

As it turns out, only three menus are common across all visualization properties: General, Appearance, and Fonts. After these menus, Data, Legend, and Show/Hide Items are the most common.

Most Common Visualization Properties Menus

 

Pro Tips

Next, I promised a few pro tips.

  • First, if you ever wonder what’s possible in a given visualization, consult this matrix.  For example, if you want to put Labels on a visualization but don’t see a way to do that, check to see if there is a Labels menu.  If you don’t see a Labels menu, you can’t put Labels on the visualization.
  • Second, always check the Appearance menu for your visualizations, especially if they are new to you or you have gone through an upgrade.  The Appearance menu usually contains little gems for beautifying visualizations.  I have seen several new options appear there in the last few upgrades.
  • Third, don’t perform formatting in the Formatting menu.  Instead, format in Column Properties or Tools –> Options.  Formatting via the visualization’s Formatting menu is generally the least efficient way to apply formatting, unless you have one-off needs.
  • Fourth, if you aren’t familiar with the Line Connection, Error Bars, and Show/Hide Items menus, I highly recommend checking them out.  They are very useful.  I have a blog post on using the Line Connection, and I’ll update with posts on Error Bars and Show/Hide soon.
    • Line Connection — https://datashoptalk.com/8072-2/
    • Error Bars — Error bars are used to indicate the estimated error in a measurement or the uncertainty in a value. Bar charts and line charts can display vertical errors, as indicated in the matrix.
    • Show/Hide — Allows you to restrict content.  For example, if you have a bar chart with wells on the X-Axis and production on the Y-Axis, you can ask Spotfire to show only the top 10 producers.  Similarly, you could ask Spotfire to hide the bottom 10 producers.
  • Lastly, as with formatting, don’t set fonts in the Fonts menu.  Go through Themes instead.

Additional Settings

In conclusion, I want to point out that a few visualizations also contain Settings menus.  Settings menus are used when the vis has individual, configurable components.  For example, a map’s visualization properties contain a Settings menu for each layer.  Graphical Tables contain Settings menus for each element in the graphical table.  A summary of such visualizations appears below.

  • Maps — Layer Settings
  • KPI Tiles — KPI Settings
  • Graphical Tables — Icon/Bullet Graph/Sparkline/Calculated Value Settings

Conclusion

In order to wrap up the series, I want to revisit the original questions I posed in the beginning.

  1. What do all of these properties menus do?
  2. Where can I go to change <insert preference here>? I keep setting <insert preference here> over and over again.  There must be a better or faster way.

The six-part series has addressed the first question.  The second question can be answered with this post on user preferences and administration manager preferences.  I hope you found the series useful.