One of the most critical aspects of building Spotfire projects is choosing the optimal architecture. I recently had the “opportunity” to rebuild the same Spotfire project 3x. Let’s not worry about why that happened, but instead focus on the fact that I did arrive at the optimal architecture for the project.
In rebuilding the same thing three times, I came up with a few tips and tricks, especially as they relate to scalability. Now, before I dive into the details, I need to brief you on the project.
The purpose of the project was to compare (allegedly) the same data from three different sources, and I needed to be prepared for the following:
- Different wells in each data set
- Different date ranges in each data set (ex. Well 1 has production volumes reported for Jan 1, 2017 – Jan 1, 2018 in data set 1 but only for Feb 1, 2017 through Dec 1, 2017 in data set 2)
- Comparing oil, gas, and water volumes across the three data sets
- Comparing several dates across the three data sets (ex. spud date, completion date, first date of production)
- The possibility of broadening the analysis to include more points of comparison at a later date (remember this, it will be important).
Furthermore, the analysis needed to be efficient and interactive. One page per data point comparison was too many pages. Instead, the design called for one page comparing volumes and one page for comparing dates, with property controls for the user to specify the comparison they would like to see.
Now, if you have been through the discussion of insert rows versus insert columns, just skip to the end of the post where I have saved the best for last. If you aren’t certain whether you would use insert rows or insert columns, read on.
Insert Rows versus Insert Columns
What is the optimal method to combine data tables in Spotfire? Well, the answer is not always straightforward, and there are pros and cons to inserting rows versus inserting columns. Here are the basics…
Insert Rows Pros
- SUPER easy and fast. The easiest of all options. Just match up the columns.
- Changes? No problem! Insert rows operations are editable in the Data Panel.
- Scalable. Just insert more rows.
Insert Rows Cons
- Calculations are much more difficult. Comparing Gas/Oil from one data set to another is not as simple as Column A – Column B. You must use Over and Node Navigation.
- Calculations get extra complicated with the scenarios described above (different wells, different periods, missing wells, etc).
- Cannot remove an insert rows operation (yet).
Insert Columns Pros
- Calculations are much easier.
Insert Columns Cons
- Cannot edit the insert columns operation (yet).
- Scalable, but with the risk that IF the underlying tables get more columns, all of a sudden you’ll have columns in your tables that you don’t need or want.
- Joins can be messy or difficult.
- The biggest con of this method (and the ultimate reason I didn’t use insert columns) is that you wind up with multiple columns for each value (ex. multiple well name columns). If your data set contains a lot of header data, this can be particularly frustrating and time-consuming. The most common way to deal with it is to create calculated columns with IF logic (or case statements) that look at each column and return a value. This was a deal breaker.
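To make the trade-off concrete, here is a minimal sketch in plain Python (not Spotfire) of the two ways to combine tables. Everything in it is an assumption for illustration: the well names, the volumes, and the helper functions `insert_rows` and `insert_columns`, which only imitate the general shape of the Spotfire operations. The column-wise version is modeled as a full outer join, which is one way to end up with the duplicate well name columns described above.

```python
# Hypothetical oil volumes from two sources; all names and values are made up.
source_a = [{"well": "Well 1", "oil": 100.0}, {"well": "Well 2", "oil": 50.0}]
source_b = [{"well": "Well 1", "oil": 95.0}, {"well": "Well 3", "oil": 70.0}]


def insert_rows(tables):
    """Stack the tables and tag each row with the source it came from."""
    combined = []
    for source_name, rows in tables.items():
        for row in rows:
            combined.append({"source": source_name, **row})
    return combined


def insert_columns(left, right, key, suffix):
    """Full outer join: every right-hand column arrives as a suffixed copy."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    all_keys = list(left_by_key) + [k for k in right_by_key if k not in left_by_key]
    joined = []
    for k in all_keys:
        left_row = left_by_key.get(k, {})
        right_row = right_by_key.get(k, {})
        row = {col: left_row.get(col) for col in left[0]}
        row.update({f"{col}{suffix}": right_row.get(col) for col in right[0]})
        joined.append(row)
    return joined


long_table = insert_rows({"A": source_a, "B": source_b})
wide_table = insert_columns(source_a, source_b, key="well", suffix="_b")

# The wide table has two well-name columns, so IF-style logic is needed to get one.
well_names = [
    row["well"] if row["well"] is not None else row["well_b"] for row in wide_table
]
```

In the stacked result, each well simply appears once per source. In the joined result, Well 3 has an empty `well` column and a populated `well_b` column, which is exactly the situation that forces the IF or case logic.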
The Final Answer
So, which method did I choose? In the past, I have gone with insert columns. However, in the end, I went with an insert rows operation with a pivot transformation. The insert rows operation allowed me to easily combine the tables in a way that was editable and scalable. Then, by adding a pivot operation, I was able to quickly create more columns so that the calculations were easier, and I didn’t have multiple columns for a single value. WIN! If you aren’t familiar with pivot transformations, you can read about them here. Here is what it all looks like in a very simple one-well example.
(Remember, it’s one well, so there is one record)
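For readers who think better in code, the same one-well idea can be sketched in plain Python. This is only an analogy for the insert rows plus pivot pattern, not Spotfire's implementation: the `pivot` helper, the source labels A, B, and C, and all of the volumes are hypothetical. The pivot sums values per cell, as an aggregation method would.

```python
# Three sources stacked with insert rows: one well, one row per source.
# All values and source labels are made up for illustration.
long_table = [
    {"source": "A", "well": "Well 1", "oil": 100.0, "gas": 500.0},
    {"source": "B", "well": "Well 1", "oil": 95.0, "gas": 510.0},
    {"source": "C", "well": "Well 1", "oil": 102.0, "gas": 495.0},
]


def pivot(rows, row_id, column_title, values):
    """Produce one record per row_id with one column per value/source pair."""
    pivoted = {}
    for row in rows:
        record = pivoted.setdefault(row[row_id], {row_id: row[row_id]})
        for value in values:
            name = f"{value}_{row[column_title]}"
            # Sum values that land in the same cell.
            record[name] = record.get(name, 0.0) + row[value]
    return list(pivoted.values())


wide = pivot(long_table, row_id="well", column_title="source", values=["oil", "gas"])

# One well name column, and the comparison is back to Column A - Column B.
oil_diff_a_vs_b = wide[0]["oil_A"] - wide[0]["oil_B"]
```

The result is a single record for the well, with a single well name column and columns such as `oil_A`, `oil_B`, and `oil_C` sitting side by side.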
Saved the Best for Last
Of course, what I really wanted to share with you I saved for the end. One of the most critical elements of this build was the fact that it needed to be scalable. I needed to be able to add additional points of comparison at a later date. This can be problematic with insert columns operations. However, I found a great solution for scalability with insert columns. (Just ignore the fact that I ultimately went with insert rows. This is still good information).
Now, the first step in any project is adding the data tables. In this case, I used information links, and they were HUGE! They contained far more columns than I was going to use and a lot of data I wasn’t familiar with. Therefore, I added in an Exclude columns transformation to exclude all unused columns. This proved to be incredibly helpful. For the sake of expediency, I’m going to list out the steps so far….
1. Add all data tables to the project (completion data table, header table, production summary table)
2. Insert an Exclude columns transformation on each table to exclude the columns not being used
3. Create a “master” table by duplicating one of the tables (File > Add Data Table > From Current analysis). Learn more about the master table and why you should exclude columns in this post.
4. Insert columns from the other tables into the master table
At this point, I have a master table with all of my data. Now, if you have ever looked closely at the Data Table Source Information tab, you have noticed that the way Spotfire inserts columns is by building a list of columns to ignore rather than columns to insert. I have never understood why the application is built this way (surely just to frustrate users). In this case, it is EXTREMELY helpful!
Keep in mind, the objective is to be scalable. If I want to add in data points, all I have to do is modify the Exclude columns transformation and the columns flow in. Now, because Spotfire builds up a list of columns to ignore rather than a list of columns to add, the newly included columns are not in that list, and so they flow through step 4. There is no need to perform additional insert columns operations!!!! (I just got a little sheepish when I realized how excited I am about this). This made the analysis much more scalable. I was relieved to not have to perform another insert columns operation for each new data point. Tragedy averted. DONE.
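If the ignore-list behavior is hard to picture, here is a small Python model of it. This is a simplification built only from the behavior described above, not Spotfire's actual code, and every column name and function in it is hypothetical. It compares an insert that remembers which columns to ignore with an imagined design that remembers which columns to add.

```python
# Hypothetical columns on a source table, before any Exclude columns transformation.
SOURCE_COLUMNS = ["well", "spud_date", "completion_date", "lateral_length"]


def exclude_columns(columns, excluded):
    """Model of the Exclude columns transformation on the source table."""
    return [c for c in columns if c not in excluded]


def insert_by_ignore_list(master, incoming, ignored):
    """Add every incoming column that is not on the stored ignore list."""
    return master + [c for c in incoming if c not in ignored and c not in master]


def insert_by_include_list(master, incoming, included):
    """Imagined alternative: add only the columns stored at build time."""
    return master + [c for c in incoming if c in included and c not in master]


master = ["well", "oil", "gas", "water"]

# Initial build: lateral_length is excluded upstream, so it is never seen.
incoming = exclude_columns(SOURCE_COLUMNS, excluded={"lateral_length"})
ignored = set()            # nothing was unchecked when the insert was created
included = set(incoming)   # what an include-list design would have stored
before = insert_by_ignore_list(master, incoming, ignored)

# Later: edit only the Exclude columns transformation to let the column through.
incoming = exclude_columns(SOURCE_COLUMNS, excluded=set())
after = insert_by_ignore_list(master, incoming, ignored)
after_include_design = insert_by_include_list(master, incoming, included)
```

With the ignore list, `lateral_length` shows up in the master table without touching the insert. With the imagined include list, it would not, and a second insert columns operation would be needed.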
If you wondered what I was doing up in Step 3, just hang on until next week. I’ll explain that bit of architecture as well.
Guest Spotfire blogger residing in Whitefish, MT. Working for SM Energy’s Advanced Analytics and Emerging Technology team!