Part 3 – Decomposing Spotfire Data Wrangling

I am knee-deep in a 7-part series on decomposing Spotfire projects.  In case you missed them, here are the links to the previous posts.  This week, I am covering data wrangling.  With an understanding of data sources in hand, we’ll look at how they come together and how much data manipulation happens.

 Intro to the Series

Week 1 — Decomposing Data Tables & Connections

Week 2 — Decomposing Data Functions

Now, each post in the series will be broken down into four sections.

  1. Quick and Dirty (Q&D)
  2. The Extended Version
  3. Documentation
  4. Room for Improvement
First, the Q&D explains what to look for to get a general idea of what is going on.  Then, the Extended Version presents a more complex picture. Documentation provides a few examples of how to document if necessary. Lastly, Room for Improvement looks at a project from the standpoint of making it better.

Quick and Dirty

  1. How do the tables fit together?
  2. Did the project developer choose to merge tables or keep data in separate tables?
  3. Are all transformations intact?
  4. For insert columns, do the joins make sense? For insert rows, are columns mapped correctly?

In the first two posts, it was easy to explain the answers with captions on screenshots. Here, the answers are a bit more complicated and require more explanation.

How: Open the Data Panel > Select a data table from the drop-down > Click on the Cog > Click on Source View.

The Data Panel is the best place in the application to understand how a project comes together. It draws a picture of the flow of data, including data wrangling tasks. Look for numbers indicating a count of transformation steps as shown below. Click on the block with the number to get more information, also shown. What you are looking to understand is how tables are (or are not) connected. Are tables built with insert columns or insert rows? Did the developer choose to merge data sources rather than keep them in separate tables? Is one table the input for another? I commonly start with a table and then pivot or unpivot it to get a different arrangement of data.
Pay particular attention to any red exclamation marks or yellow warning signs. The red exclamation mark indicates a transformation needs attention and isn’t working. A yellow warning sign signifies that a transformation may no longer apply. In the screenshot below, you also see a blue “i”.  This popped up when a change data type transformation no longer applied.  Originally, the columns were Currency type, so I changed them in the DXP to Real.  Then, they were updated in the information link, so this transformation was no longer relevant.  I recommend removing any transformations that are not necessary. It saves on load time and makes the project easier to understand.
 
 
It’s not uncommon (especially in versions before 7.13) to see more than one join to the same table. For years, Spotfire did not allow the user to edit joins. This led to painful inefficiency and duplicate columns in projects. For example, let’s say you want to join Table A to Table B. You think you want 3 columns, so you insert them. Later, you realize you need two more. At this point, there are two options: start over or add another join. No one ever chooses to start over…EVER. Consider consolidating all those joins to improve the project.
This can also add duplicate columns to your table. Spotfire builds tables based on a list of columns to ignore, not a list of columns to add. At first glance, you might think this doesn’t matter. It does matter.  Click on that link for more information.  That’s how projects get messy.  Fortunately, in Spotfire 7.13 and higher, users have the ability to edit joins. HOORAY!!!! We waited for that feature for years.
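To make the ignore-list point concrete, here is a minimal Python sketch. It is not Spotfire's actual implementation; the function names and column names are hypothetical. It only contrasts selecting columns by exclusion with selecting by inclusion when the source later gains a column.

```python
def select_by_ignore(source_columns, ignored):
    """Keep every source column that is NOT on the ignore list."""
    return [c for c in source_columns if c not in ignored]


def select_by_include(source_columns, included):
    """Keep only the source columns that were explicitly requested."""
    return [c for c in source_columns if c in included]


# Hypothetical source table at the time the join was built.
original = ["API", "WellName", "Oil", "Gas"]
ignored = {"Gas"}                       # what an ignore-list model stores
included = {"API", "WellName", "Oil"}   # what an add-list model would store

# Later, someone adds a column to the source.
updated = original + ["Water"]

by_ignore = select_by_ignore(updated, ignored)
by_include = select_by_include(updated, included)

print(by_ignore)   # ['API', 'WellName', 'Oil', 'Water']
print(by_include)  # ['API', 'WellName', 'Oil']
```

Under the ignore-list model, the new column shows up in the table even though nobody asked for it, which is one way extra columns creep into a project.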
 
Okay, back to the original question: do the joins make sense? What does that mean? Did the developer know what they were doing when they chose the join type or the key column? That’s really what you need to review. If they joined on a string column, such as well name, as opposed to API number, there is the risk the join won’t always work. Did they use an inner join when they should have used a left join? As a Spotfire developer, you MUST understand the different join methods.
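If the join types are fuzzy, a small Python sketch may help. This is plain Python rather than anything Spotfire runs, the `join` helper is hypothetical, and the well data is made up. It assumes the key is unique in the right-hand table. It shows a join on an identifier matching every row, while a join on well name silently loses a row with an inner join and leaves a gap with a left join.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; `how` is 'inner' or 'left'.

    Assumes `key` is unique in `right`.
    """
    lookup = {row[key]: row for row in right}
    out = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            out.append({**row, **match})
        elif how == "left":
            out.append(dict(row))  # keep the unmatched left row as-is
    return out


# Made-up data: the second well name is spelled differently in each table.
wells = [
    {"api": "API-001", "well_name": "Smith 1-H"},
    {"api": "API-002", "well_name": "Jones 2H"},
]
production = [
    {"api": "API-001", "well_name": "Smith 1-H", "oil": 100},
    {"api": "API-002", "well_name": "JONES 2-H", "oil": 250},
]

by_api = join(wells, production, "api")
by_name_inner = join(wells, production, "well_name")
by_name_left = join(wells, production, "well_name", how="left")

print(len(by_api))         # 2
print(len(by_name_inner))  # 1
print(len(by_name_left))   # 2
```

The left join on well name keeps both wells, but the second row has no `oil` value, which is exactly the kind of quiet gap to look for when reviewing someone else's joins.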
 
If that’s all you have time for, thanks for playing. If you want to really get your hands dirty, keep reading.
 

Extended Version

  1. Are there any tables that aren’t used?
  2. Are all transformations necessary?
  3. Does the flow of data make sense?
  4. Are any data tables created by data functions?
It is very common for developers to add tables and not use them. Why would they do that, you ask? Well, here are a few potential scenarios.
  1. The developer found a better/different data source.
  2. The project began with QA and then swapped to PROD, and the developer didn’t delete QA.
  3. They started with a spreadsheet and switched to a SQL table.
  4. The user was testing out something that didn’t work. When it didn’t work, they moved on.
The reasons are endless. As the new owner of the project, you should know what is and isn’t used.  The yellow warning signs are a good indicator of transformations that aren’t necessary. Also, there may be calculations that aren’t used in the project. The more excess you can remove from the project, the better it will be.
 
Next, does the flow of data make sense? This is really a broad architecture question. It’s one that will likely involve a good bit of study on the project. Here are a few examples to get your mind moving.
 
Example 1: I was recently working on a project that needed to calculate a net number. Net is calculated by multiplying a value (production or a cost) by a working interest. The developer of the project merged three tables into a master table. Even though the net calculation was only needed in the master table, the working interest was inserted into all three tables. This is inefficient and makes the project slower.
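Here is a rough Python sketch of the leaner arrangement for the net calculation: merge the rows first, then bring in working interest once and calculate net once in the master table. The table names, well names, and values are all made up for illustration; this is not the project's actual logic.

```python
# Hypothetical rows from three source tables (made-up values).
table_a = [{"well": "W1", "gross": 1000.0}]
table_b = [{"well": "W2", "gross": 500.0}]
table_c = [{"well": "W1", "gross": 200.0}]

# Hypothetical working interest per well, as a fraction.
working_interest = {"W1": 0.5, "W2": 0.25}

# Merge first (the equivalent of insert rows), copying each row.
master = [dict(row) for row in table_a + table_b + table_c]

# Look up working interest once, in the master table only.
for row in master:
    row["net"] = row["gross"] * working_interest[row["well"]]

net_values = [row["net"] for row in master]
print(net_values)  # [500.0, 125.0, 100.0]
```

One lookup in the master table replaces three separate insert-column steps upstream, and the result is the same.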
 
Example 2: I built a LARGE project that sent data through several different tables. Each table performed a QAQC step. When the data was clean, it merged into a master table. The desired output was a set of calculations. I could have done those calculations in the beginning. Instead, I chose to place them only in the master table rather than passing them from table to table.
Lastly, be aware of whether data functions are creating any of your tables.  If they are, do they run automatically, or does the user have to click something or interact with the analysis?  This is commonly a bottleneck in projects.  Depending on how much data you are working with, those data functions might run slowly.
 

Documentation

The screenshots below come from two different projects.  The PPT screenshots are from a slide deck whose purpose was to document the project.  The Excel shots are a breakdown I worked up to understand a project in order to modify it.  
This screenshot has all of my tables on the horizontal and vertical axes. The green 1 indicates joined tables. The blue lines indicate stand-alone tables not connected to anything else.
This matrix shows all my tables, with usage counts summed to indicate how “deeply” a table is used in a project. DF = data function. Tables with a zero aren’t used anywhere and could be removed from the project.
In PPT, I created a flow to explain how tables work together to create an output.
 When working with large projects, it’s a good idea to document.  Six months later, you aren’t going to remember how you built the project. 
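If you would rather build that kind of table matrix in code than in Excel, here is a minimal Python sketch. The table names, join pairs, and usage counts are hypothetical, and you would still have to collect them by hand from the project. It marks joined tables with a 1, sums the counts, and flags any table with a total of zero.

```python
# Hypothetical inventory gathered by hand from a project.
tables = ["Wells", "Production", "Costs", "QA_Wells"]

# (source, target) pairs: "source" is inserted into "target".
joins = [("Production", "Wells"), ("Costs", "Wells")]

# Other places each table is used. DF = data function.
usage = {
    "Wells": {"visualizations": 4, "DF": 1},
    "Production": {"visualizations": 2, "DF": 0},
    "Costs": {"visualizations": 0, "DF": 0},
    "QA_Wells": {"visualizations": 0, "DF": 0},
}

# Table-by-table matrix: 1 means the two tables are joined.
matrix = {row: {col: 0 for col in tables} for row in tables}
for source, target in joins:
    matrix[source][target] = 1
    matrix[target][source] = 1

# Sum joins and other usage to see how "deeply" each table is used.
totals = {
    t: sum(matrix[t].values()) + sum(usage[t].values()) for t in tables
}
unused = [t for t in tables if totals[t] == 0]

print(totals)  # {'Wells': 7, 'Production': 3, 'Costs': 1, 'QA_Wells': 0}
print(unused)  # ['QA_Wells']
```

Anything that lands in `unused` is a candidate for removal, after you confirm it really isn't referenced anywhere.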

Improvement

  1. Is there any way to make the table build/project load more efficient?
  2. Is Spotfire the best place to perform the data wrangling?
  3. Would it be possible to speed up the project with a different architecture?
  4. Where are bottlenecks?
 I could write forever on these broad questions. For the sake of time, I am going to link to a few previous blog posts on optimization to help answer the first question.
Next, I’m seeing more companies deploy Alteryx alongside BI tools like Spotfire and Tableau.  It’s definitely something to think about. I firmly believe there are use cases where data wrangling in Alteryx makes more sense than in Spotfire. If you are using Spotfire data wrangling only to provide an input to visualizations, and that data wrangling is time-consuming, do it in Alteryx.  Because Alteryx provides a single input, the project will load faster.  Alternatively, leave the data wrangling in Spotfire when using dynamic or interactive capabilities like marking, filtering, or property controls to move data around. 
Lastly, is it possible to speed up the project with a different architecture? Could you use a data connection instead of an information link for faster loading? Could you connect to a data set in the library instead of an Excel file on a shared drive? Would an ODBC connection load MS Access data faster than a file connection? Would it be possible to configure a table as an on-demand table rather than loading all the data?  I realize these are more questions than answers, but they should get your head in the right place to make solid improvements.
If you made it this far, gold star for you. Thanks for sticking with it. Next week, I’ll cover decomposing document properties.
 

Spotfire Version

 All content created with Spotfire 7.12.

 

Guest Spotfire blogger residing in Whitefish, MT.  Working for SM Energy’s Advanced Analytics and Emerging Technology team!
