Month: October 2018

Part 3 – Decomposing Spotfire Data Wrangling

I am knee-deep in a 7-part series on decomposing Spotfire projects.  In case you missed it, here are the links to the previous posts.  This week, I am covering how to decompose data wrangling.  Building on an understanding of the data sources, we’ll look at how tables come together and how much data manipulation happens.

 Intro to the Series

Week 1 — Decomposing Data Tables & Connections

Week 2 — Decomposing Data Functions

Now, each post in the series will be broken down into four sections.

  1. Quick and Dirty (Q&D)
  2. The Extended Version
  3. Documentation
  4. Room for Improvement
First, the Q&D explains what to look for to get a general idea of what is going on.  Then, the Extended Version presents a more complete picture. Documentation provides a few examples of how to document the project if necessary. Lastly, Room for Improvement looks at the project from the standpoint of making it better.

Quick and Dirty

  1. How do the tables fit together?
  2. Did the project developer choose to merge tables or keep data in separate tables?
  3. Are all transformations intact?
  4. For insert columns, do the joins make sense? For insert rows, are columns mapped correctly?

In the first two posts, it was easy to explain the answers with captions on screenshots. Here, the answers are a bit more complicated and require more explanation.

How: Open the Data Panel > Select a data table from the drop-down > Click on the Cog > Click on Source View.

The Data Panel is the best place in the application to understand how a project comes together. It draws a picture of the flow of data, including data wrangling tasks. Look for numbers indicating a count of transformation steps, as shown below. Click on the block with the number to get more information, also shown. What you are looking to understand is how tables are (or are not) connected. Are tables built with insert columns or insert rows? Did the developer choose to merge data sources rather than keep them in separate tables? Is one table the input for another? I commonly start with a table and then pivot or unpivot it to get a different arrangement of data.
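If you are not sure what a pivot or unpivot does to a table, here is a minimal sketch in R (the well and month names are made up). Spotfire performs this through its own transformation dialog rather than code; the sketch just shows the shape change an Unpivot produces.

```r
# Hypothetical wide table: one row per well, one production column per month
wide <- data.frame(
  Well = c("A-1", "B-2"),
  Jan  = c(1200, 800),
  Feb  = c(1100, 760)
)

# Unpivot (wide to long): one row per well per month, which is the shape
# Spotfire's Unpivot transformation produces
long <- reshape(
  wide,
  direction = "long",
  varying   = c("Jan", "Feb"),
  v.names   = "Production",
  timevar   = "Month",
  times     = c("Jan", "Feb"),
  idvar     = "Well"
)

long   # four rows: each Well/Month combination with its Production value
```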
Pay particular attention to any red exclamation marks or yellow warning signs. A red exclamation mark indicates a transformation needs attention and isn’t working. A yellow warning sign signifies that a transformation may no longer apply. In the screenshot below, you also see a blue “i”.  This popped up when a change data type transformation no longer applied.  Originally, the columns were Currency type, so I changed them in the DXP to Real.  Then, they were updated in the information link, so this transformation was no longer relevant.  I recommend removing any transformations that are no longer necessary.  It saves on load time and makes the project easier to understand.
 
 
It’s not uncommon (especially in versions before 7.13) to see more than one join to the same table. For years, Spotfire did not allow the user to edit joins. This led to painful inefficiency and duplicate columns in projects. For example, let’s say you want to join Table A to Table B. You think you want 3 columns, so you insert them. Later, you realize you need more. At this point, there are two options — start over or add another join. No one ever chooses to start over…EVER. Consider consolidating all those joins to improve the project.
This can also add duplicate columns to your table. Spotfire builds tables based on a list of columns to ignore, not a list of columns to add. At first glance, you might think this doesn’t matter. It does matter; see the post The Most Important Thing You Don’t Know About How Spotfire Builds Tables for more information.  That’s how projects get messy.  Fortunately, in Spotfire 7.13 and higher, users have the ability to edit joins. HOORAY!!!! We waited for that feature for years.
 
Okay, back to the original questions. Do the joins make sense? What does that mean? Did the developer know what they were doing when they chose the join type or the key column? That’s really what you need to review. If they joined on a string column, such as well name, as opposed to API number, there is a risk the join won’t always work. Did they use an inner join when they should have used a left join? As a Spotfire developer, you MUST understand the different join methods.
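If the difference is fuzzy, here is a quick illustration in R with made-up well header and production tables. The same logic applies whether the join is an Insert Columns operation in Spotfire or happens upstream in the database.

```r
# Hypothetical well header and production tables; names and values are invented
headers <- data.frame(
  API  = c("4212345678", "4287654321", "4211112222"),
  Name = c("Smith 1H", "Jones 2H", "Brown 3H")
)

production <- data.frame(
  API  = c("4212345678", "4287654321"),
  Name = c("SMITH 1H", "Jones #2H"),   # same wells, but the names don't match exactly
  Oil  = c(1500, 2200)
)

# Join on the string name: inconsistent spelling means nothing matches
merge(headers, production[, c("Name", "Oil")], by = "Name")
# 0 rows - the join "works" but returns nothing useful

# Join on API (inner join): matches cleanly, but Brown 3H is dropped
merge(headers, production[, c("API", "Oil")], by = "API")

# Join on API (left join): every header row is kept; Oil is NA where there is no production
merge(headers, production[, c("API", "Oil")], by = "API", all.x = TRUE)
```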
 
If that’s all you have time for, thanks for playing. If you want to really get your hands dirty, keep reading.
 

Extended Version

  1. Are there any tables that aren’t used?
  2. Are all transformations necessary?
  3. Does the flow of data make sense?
  4. Are any data tables created by data functions?
It is very common for developers to add tables and not use them. Why would they do that, you ask? Well, here are a few potential scenarios…
  1. The developer found a better/different data source.
  2. The project began pointed at QA, then swapped to PROD, and the QA source was never deleted.
  3. They started with a spreadsheet and switched to a SQL table.
  4. The user was testing out something that didn’t work. When it didn’t work, they moved on.
The reasons are endless. As the new owner of the project, you should know what is and isn’t used.  The yellow warning signs are a good indicator of transformations that aren’t necessary. Also, there may be calculations that aren’t used in the project. The more excess you can remove from the project, the better it will be.
 
Next, does the flow of data make sense? This is really a broad architecture question, and one that will likely involve a good bit of study of the project. Here are a few examples to get your mind moving.
 
Example 1 — I was recently working on a project that needed to calculate a net number. Net is calculated by multiplying a value (production or a cost) by a working interest. The developer of the project merged three tables into a master table. Even though the net calculation was only needed in the master table, the working interest was inserted into all three tables. This is inefficient and makes the project slower.
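Sketched in R terms with hypothetical table and column names, the more efficient pattern looks like this: insert the working interest once, into the master table where the net calculation actually lives, instead of into all three source tables.

```r
# Hypothetical master table (already merged from the three source tables)
master <- data.frame(API   = c("4212345678", "4287654321"),
                     Gross = c(10000, 25000))

# Working interest lookup, one row per well
wi <- data.frame(API             = c("4212345678", "4287654321"),
                 WorkingInterest = c(0.75, 0.50))

# One join, in the one table that needs it...
master <- merge(master, wi, by = "API", all.x = TRUE)

# ...and the net calculation lives here too
master$Net <- master$Gross * master$WorkingInterest
```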
 
Example 2 — I built a LARGE project that sent data through several different tables. Each table performed a QAQC step. When the data was clean, it merged into a master table. The desired output was a set of calculations. I could have done those calculations at the beginning and passed them from table to table. Instead, I chose to place them only in the master table.
Lastly, just be aware of whether data functions are creating any of your tables.  If they are, do they run automatically, or does the user have to click something or interact with the analysis?  This is commonly a bottleneck in projects.  Depending on how much data you are working with, those data functions might run slowly.
 

Documentation

The screenshots below come from two different projects.  The PPT screenshots are from a slide deck whose purpose was to document the project.  The Excel shots are a breakdown I worked up to understand a project in order to modify it.
This screenshot has all of my tables on the horizontal and vertical axes. The green 1 indicates joined tables. The blue lines indicate stand-alone tables not connected to anything else.
This matrix shows all my tables, and the counts are summed to indicate how “deeply” a table is used in the project. DF = data function. Tables with a zero aren’t used anywhere and could be removed from the project.
In PPT, I created a flow to explain how tables work together to create an output.
 When working with large projects, it’s a good idea to document.  Six months later, you aren’t going to remember how you built the project. 

Improvement

  1. Is there any way to make the table build/project load more efficient?
  2. Is Spotfire the best place to perform the data wrangling?
  3. Would it be possible to speed up the project with a different architecture?
  4. Where are the bottlenecks?
 I could write forever on these broad questions. For the sake of time, I am going to link to a few previous blog posts on optimization to help answer the first question.
Next, I’m seeing more companies deploy Alteryx alongside BI tools like Spotfire and Tableau.  It’s definitely something to think about. I firmly believe there are use cases where data wrangling in Alteryx makes more sense than in Spotfire. If you are using Spotfire data wrangling to provide an input to visualizations and that data wrangling is time-consuming, do it in Alteryx.  Because Alteryx hands Spotfire a single, already-prepared input, the project will load faster.  Alternatively, leave the data wrangling in Spotfire when you are using dynamic or interactive capabilities like marking, filtering, or property controls to move data around.
Lastly, is it possible to speed up the project with a different architecture? Could you use a data connection instead of an information link for faster loading? Could you connect to a data set in the library instead of an Excel file on a shared drive? Would an ODBC connection load MS Access data faster than a File connection? Would it be possible to configure a table as an On Demand table rather than loading all the data?  I realize these are more questions than answers, but they should get your head in the right place to make solid improvements.
If you made it this far, gold star for you. Thanks for sticking with it. Next week, I’ll cover decomposing document properties.
 

Spotfire Version

 All content created with Spotfire 7.12.

 

Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!

Excel to Spotfire: Targeting Missing Values

In data science, prior to using any analytics, we always face the problem of missing values. Deciding when to delete cases or fill missing values is totally dependent on the data set and the target problem. A general practice consists of deleting variables and samples with more than 30% missing values and then using Multiple Imputation techniques to fill the remaining missing values. In Excel, we can use the replace tool or the filter to approach this problem, and even use Visual Basic to code a more customized solution. In Spotfire, we have the advantage of using more advanced methods by accessing R libraries that contain MCMC, Bayesian, and multivariate algorithms. Spotfire’s integrative tools really make a difference in how to approach the missing values problem: putting together advanced algorithms, amazing visualizations, and user interactivity.
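As a rough illustration of that general practice, here is a small R sketch with made-up data: drop variables and samples with more than 30% missing values, then fill what is left. Simple mean imputation stands in for the multiple-imputation step; a real workflow would hand that step to an R package such as mice from inside a Spotfire data function.

```r
df <- data.frame(
  a = c(1, NA, 3, 4, 5),          # 20% missing - kept
  b = c(NA, NA, NA, 2, 1),        # 60% missing - dropped
  c = c(10, 20, 30, 40, 50),      # complete
  d = c(2, 4, NA, 8, 10),         # 20% missing - kept
  e = c(100, 200, 300, 400, 500)  # complete
)

# 1. Delete variables (columns) with more than 30% missing values
df <- df[, colMeans(is.na(df)) <= 0.30, drop = FALSE]

# 2. Delete samples (rows) with more than 30% missing values
#    (none exceed the cutoff in this toy data)
df <- df[rowMeans(is.na(df)) <= 0.30, , drop = FALSE]

# 3. Fill the remaining gaps - mean imputation as a simple stand-in;
#    multiple imputation (e.g. the mice package) is the better practice
df[] <- lapply(df, function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x })
```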

Read More

Build Type Wells using Selected Wells Method in DCA Wrangler

Reserves evaluators often want to build a Percentile Type Well that represents a certain percentile of the population. It is desired to determine a “P90 Type Well”, “P50 Type Well”, or “P10 Type Well”. When expressed this way, the evaluator is inherently seeking a type well that results in a percentile EUR. The P90 Type Well will be a representative well where there is a 90% chance that the EUR will be that number or greater. There are two published methods for creating Percentile Type Wells: the Time Slice approach and the Selected Wells approach.

So, Percentile Type Wells are expected to provide a forecast with an EUR consistent with the target probability. This is not possible with the Time Slice method because that method is based on Initial Productivity (IP) and rates. In other words, the Time Slice method makes an implicit assumption of a strong correlation between IP and EUR, whereas in a real-world scenario the correlation between IP and EUR has wide scatter, resulting in a Type Well with an EUR that does not represent the desired percentile. Refer to SPE-162630 for a more technical discussion of the two methods.
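In plain statistical terms, the P90, P50, and P10 EURs are just the 10th, 50th, and 90th percentiles of the EUR distribution: a P90 value is met or exceeded by 90% of the wells. A quick R sketch with made-up EUR values:

```r
# Hypothetical EURs (MBOE) for the wells in an area of interest
eur <- c(120, 150, 180, 210, 240, 275, 310, 360, 430, 520)

# P90 / P50 / P10 in the reserves sense: the EUR exceeded with 90% / 50% / 10%
# probability, i.e. the 10th / 50th / 90th statistical percentiles
targets <- quantile(eur, probs = c(0.10, 0.50, 0.90))
names(targets) <- c("P90", "P50", "P10")
targets
```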

In this blog post, we will go through a workflow for creating Type Wells using the Selected Wells method in DCA Wrangler. We created a template that builds Type Wells using the Selected Wells Method, the Time Slice Method, and the Selected Wells Method applied to individual well forecasts.

Following is the workflow for the Selected Wells Method:

  1. Select wells in an Area of Interest (AOI)

  2. Create an Auto-Forecast for all the selected wells with the desired number of years using DCA Wrangler. While doing the Auto-Forecast, we will use a three-segment approach: a first segment with a constrained b-factor between 1 and 2 (this will take care of the characteristic steep initial decline present in most MFHWs in unconventionals), a second segment with a constrained b-factor between 0 and 1, and a third segment for terminal exponential decline.

  3. Generate Well DCA and Well DCA Time results in DCA Wrangler. The Well DCA Time table will have the forecast data for all the wells created using the fitted Arps Model. Remember to refresh these tables every time you change the wells in your AOI.

  4. Next, we will find wells for the target EUR probabilities on an EUR Probit plot generated using all the wells in our AOI. We can enter a threshold value (α) to find wells whose EUR falls within (1 ± α) × EUR at the target probabilities (see the sketch after this list). We can also quickly check the number of wells present within the threshold at each of the target probabilities. Adjust the threshold to get the minimum desired number of wells at each of the target probabilities.

  5. Now we can create Percentile Type Wells for our AOI by running DCA Wrangler in the Type Well mode using the wells we selected in our previous step.
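For reference, the well selection in step 4 boils down to a simple filter. Here is a sketch in R with made-up numbers: given the EUR read off the probit plot at each target probability, keep the wells whose own EUR lands within (1 ± α) of that value.

```r
# Hypothetical well EURs (MBOE) and target EURs read off the probit plot
well_eur   <- c(A1 = 130, A2 = 155, A3 = 190, A4 = 240, A5 = 300, A6 = 410, A7 = 515)
target_eur <- c(P90 = 140, P50 = 240, P10 = 480)

alpha <- 0.15   # threshold entered by the user

# Wells whose EUR falls between (1 - alpha) and (1 + alpha) times each target EUR
selected <- lapply(target_eur, function(t) {
  names(well_eur)[well_eur >= (1 - alpha) * t & well_eur <= (1 + alpha) * t]
})

sapply(selected, length)   # number of wells inside the threshold at each target
selected                   # the wells to feed into DCA Wrangler's Type Well mode
```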

Check out the template and try it with your production data.

Nitin is a Data Scientist at Ruths.ai working passionately towards helping companies realize the maximum potential of their data. He has experience with machine learning problems in clustering, classification and regression applying ensemble and Bayesian approaches with toolsets from R, Python, and Spotfire. He is currently pursuing his PhD in Petroleum Engineering at Texas A&M University, where his research is focused on applications of machine learning algorithms in petroleum engineering workflows. He enjoys cycling, running and overindulging in statistical blogs in his spare time.

The Most Important Thing You Don’t Know About How Spotfire Builds Tables

  • After you insert columns, the Spotfire Source Information tab describes the columns ignored rather than the columns added. Why is that?
  • Did Spotfire add columns to your project? Are you seeing columns in your project that you don’t recognize?
  • Do you have duplicate columns in a table, and you have no idea how they got there? Have you tried to delete them, and they keep coming back?

Read More

Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!

Part 2 – Decomposing Data Functions

This is part 2 of a 7-part series on how to decompose Spotfire projects.  Last week, I started off with a post on data tables and connections.  This week, I am covering how to decompose data functions.  Data functions are quite frequently used to create data tables and can create a lot of complexity in a project.  Thus, the series covers them early on.

Now, each post in the series will be broken down into four sections.

  1. Quick and Dirty (Q&D)
  2. The Extended Version
  3. Documentation
  4. Room for Improvement

The Q&D explains what to look for to get a general idea of what is going on.  The Extended version works to develop an in-depth knowledge of the project.  Each post will also include suggestions for documentation and improving the project.

Quick and Dirty

  1. What are the data functions doing?  Are they creating new tables, new columns, or just transforming data?
  2. What’s going into and out of them? What are the input and output parameters?
  3. Do they work?  It happens that someone attempts to write a data function and it doesn’t work out.  Rather than delete it, they just leave it.

How: Go to Edit > Data Function properties and review the script, input parameters, and output parameters.

 

First, go to Edit script. Then, Edit Parameters. Finally, hit Refresh to confirm the data function works.

 

 

Work thru each tab to understand the data function.

 

The Edit Parameters dialog provides information on how the parameters are connected to the data in the DXP.

Extended Version

  1. Are the data functions using property controls, filtering schemes, or marking?
  2. Is any code commented out?
  3. Do the data functions use packages?
Data limiting with filtering or marking is shown in the Edit Parameters dialog towards the bottom of the menu.

 

The # sign comments out a line of code.  When a line of code is commented out, it will not be executed. Sometimes developers comment out code while testing.  Other times, the # sign is used to add comments explaining what the code does.

 

Developers commonly attach a package from the library or install the package in the first few lines of code.
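For context, the first few lines of a TERR data function often look something like this minimal sketch (the parameter and column names are hypothetical, and prod.data would be supplied by Spotfire as an input parameter): a comment block describing the parameters, a package load, and perhaps a commented-out test line left over from development.

```r
# Data function: summarize monthly production
# Input:  prod.data     (data.frame)  <- "Production Data" table in the DXP
# Output: prod.summary  (data.frame)  -> new Spotfire table "Production Summary"

library(dplyr)               # package attached from the library
# install.packages("dplyr")  # sometimes left behind (commented out) from first-time setup

# Stand-in for the Spotfire input parameter, used while testing outside the DXP:
# prod.data <- data.frame(WellName = c("A-1", "A-1", "B-2"),
#                         OilVolume = c(500, 450, 700))

prod.summary <- prod.data %>%
  group_by(WellName) %>%
  summarise(TotalOil = sum(OilVolume, na.rm = TRUE))
```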

Documentation

The screenshots below were taken from the documentation of two different projects.  One of them I fully documented using a PowerPoint deck.  The slides shown below summarize the usage of data functions. The last screenshot is Excel documentation to show which tables were an input or an output of a data function.

Improvement

Now, here are a few suggestions for improving data functions in a project.

  1. Add commenting in the data function to improve ease of understanding, especially if it’s a long data function.
  2. Remove unused sections of code to clean up the data function.
  3. Combine data functions where possible. It’s very common for developers to build as they go, but that separation might make the project harder to understand.
  4. Add descriptions in the data functions if they don’t have them.  This will make them easier to understand 6 months from now.
This screenshot demonstrates adding commenting to translate the object names to table names in the DXP.

Spotfire Version

All content created with Spotfire 7.12.

Well, that wraps up how to decompose data functions.  Next week, we’ll look at how to understand table build and data wrangling.

Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!

Accessing Data Function Parameters

I had one of those moments recently, where a dialog box I’ve stared at for years presented something new to me.  In this case, it was the Settings button in the Data Table Properties dialog.  Now, this button will be grayed out for all tables except on-demand tables.  At least, that is what I thought.  As I scrolled thru my tables, the box appeared active for a table I knew was not configured on-demand.  Rather, a data function built the table.  Clicking on it then pulled up the Edit Parameters dialog.  When investigating tables, it’s helpful to be able to access data function parameters from Data Table properties.

 

Clicking on Settings pulls up the Edit parameters dialog for any table created by a data function.
This allows the user to access data function parameters without going thru the Data Functions dialogs.

Spotfire Version

Content created with Spotfire 7.12.

Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!

Part 1 – Decomposing Tables & Data Sources

Last week, I teed up a series on decomposing Spotfire projects.  This week, we are diving in with Part 1 — Decomposing tables and data sources.  I cover tables and data sources first because they are the foundation for the project. You must understand where data is coming from to understand the rest of the project.

Now, each post in the series will be broken down into four sections.

  1. Quick and Dirty (Q&D)
  2. The Extended Version
  3. Documentation
  4. Room for Improvement

The Q&D explains what to look for to get a general idea of what is going on.  The Extended version works to develop an in-depth knowledge of the project.  Each post will also include suggestions for documentation and improving the project.

Quick Word on Documentation

Before we dive in, I want to say a few words about documentation.  Decomposing Spotfire projects will most likely be messy and complex.  It may start off simple, leading you to believe you can keep track of how everything is built and structured in your head.  I’ll tell you right now, you can’t.  Any project that has more than 3 tables will get complicated quickly.  It’s also likely you’ll be interrupted or have to work intermittently.  It’s likely, even probable, that you’ll forget what you learned.  I highly recommend documenting as you go.  There is no perfect tool.  I use a combination of PowerPoint and Excel.  Ultimately, I would love to see some documentation functionality available in the application in the future.  Just pick a tool and stick with it through a single project.

Now, let’s dig into tables.

Quick and Dirty

  1. How many tables does the project contain?
  2. Are any tables embedded?
  3. Where do the tables originate from?

How: Go to  Edit > Data Table Properties > Source Information or open the Data Panel.  See specific details in the captions.  Click on one table at a time.

Click on one table at a time and view the first piece of information in the Source Information tab.

 

Select the table from the drop-down in the top right-hand corner of the panel. Then, click on the Cog. Finally, click Source View in the top right-hand corner of the expanded menu. Source Information is in the information tab.

 

Information on whether a table is linked or embedded is in the General tab.

 

Extended Version

  1. Are any of the data sources on demand?
  2. Are any of the data sources limited by prompts before entering the DXP?
  3. How large are the tables and how long do they take to load?
  4. Are there any data connections in the DXP?

How: Go to  Edit > Data Table Properties > Source Information or open the Data Panel.  See specific details in the captions.  Click on one table at a time.

If a table is either configured with on-demand settings or created with a data function, the Settings button will be available (i.e. not grayed out). Click to see on demand settings or data function parameters.

 

On demand settings are clearly visible from the Data Panel when you click on the first data source.

 

Unfortunately, the only way to know if an information link has prompts and what has been selected is to click Refresh Data With Prompt on all information links. It’s painful, but you’ll only do it once.

How: Go to Help > Support Diagnostic and Logging > Diagnostic Information

This menu shows the size and load time for every table in the DXP. This includes counts of columns and rows. Click the Copy Info button and paste into Excel for easier viewing.

How: Go to Edit > Data Connection Properties 

Go to Edit > Data Connection Properties to see if any data connections exist.

Lastly, another obvious question to ask and answer is — Are all of the data tables being used in the project?  A future post will address this question.

Documentation

As previously noted, documentation can be as simple or as complex as you want it to be.  This is an example of a simple PPT table containing information on the table source and some general properties.  I would also add a column to denote On Demand tables.

This is a PPT slide I put together to summarize the usage of different tables in a DXP.  This slide was part of a larger PPT deck to document a project.

Improvement

Finally, before wrapping up, I want to suggest a few ways to improve the project tables.  Here is a bit of Q&A for you to think about.

  1. Would renaming tables make them easier to understand?  Take a look at this post on naming data tables. 
  2. Would renaming data connections make them easier to understand?  Most people don’t name their data connections, and if there is more than one, it gets confusing quickly.
  3. If a table is embedded, would it load faster if transitioned to a data set? It is frequently the case that connecting to a data set is faster than connecting to a file.

That’s it for this post.  I hope you found it useful. Next week’s article will look at Data Functions.

Spotfire Version

Content created with Spotfire 7.12.

Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!

How to Decompose a Spotfire Project

  • Have you “inherited” someone else’s Spotfire project?
  • Has another group built a project for you and you would like to understand it better?
  • Would you like to familiarize yourself with other developers’ projects?
  • Are you in charge of modifying someone else’s project?

Read More

Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!