Merging In House and Public Data in Spotfire

  • Have you ever wanted/needed to merge data sets where the merge would create unwanted duplicates?
  • Have you ever attempted to merge public and private data and struggled with getting the desired output?
  • Have you ever wanted to know how to identify duplicate records in Spotfire?

Many oil and gas companies purchase subscriptions to public data through data distributors/data aggregators like Drilling Info or IHS.  It’s a common task to then merge in house data with public data.  This is done frequently with production data and can be easily accomplished in Spotfire.

Now, I want to note that there is more than one way to accomplish this task.  The method I am going to demonstrate below involves inserting rows from each data source into one table, identifying duplicate records, and then passing only the desired records to another table.  You could do it differently by inserting columns, but I believe that way to be more complex, so I’ll use insert rows.

Additionally, there are a few things that you’ll need to pay close attention to.

  1. Level of granularity — Public production data is reported as a monthly total.  All oil companies will have in house daily data and monthly data.  The user I helped with this task wanted to use daily data.  In order to get these two data sources to the same level of granularity, we had perform calculations on both data sets to get a daily average for each month.
  2. Dates  — Since you may be changing the granularity of the tables, pay close attention to date columns.  I needed to create a new date column indicative of the Year and Month of production.  My first attempt with Year-Month came out as a string, but the calculation required a date, so I had to create a dummy date column.  You will also need to remove time stamps.
  3. Duplicate records — Most users get stuck when trying to identify and deal with wells that have data in both the public data set and the in house data set for a given month.  This solution will use a calculated column to identify the duplicate and a data function to pass through only the desired records to another table that is the “final” table.

 

The data I am using in this example is a dummy data set.

Here are the detailed steps:

Step 1 — Add each data source to Spotfire (my example is a one well example). Remember, the in house data is daily and the public data is monthly.

Step 2 — In the in house data set, insert a transformation to create a new calculated column based on the production date.  In this case, I set the date to the 15th of the month, knowing that my next step would be to aggregate the data into a monthly number, and for that I will need both data sets to have a matching date.

Step 3 — Insert a pivot transformation to roll the data up to the monthly level.   In this case, I am using the average aggregation in order to get a monthly average.

 

Step 4 —  The public data set I was working with had two columns — Year and Month — for the date, so a calculated column is needed in order to get this to a date Spotfire can work with.  Insert a transformation to create a new calculated column that converts  the public data set date to what is needed to match the in house data.

Step 5 — In the public data set, insert a transformation to create a new calculated column (or replace the old column) that converts the monthly total to a daily average.  Make sure to do this for all oil, gas, and water columns.

Step 6 — Merge the two data sets using an insert rows operation.  Match the columns from each data set, and make sure to create a new column that specifies the data source.

Step 7 — Insert two calculated columns into the merged data set.  One will calculate how many records there are for each well and for each month.  The second is an IF statement that specifies what to do with it if a well has two records. Note that the IF statement gives preference to the in house data.  If that is available, it will be used.  Otherwise, it will be filtered out.

At this point, here is what the data table looks like.

Step 8 — Create a new filtering scheme, and use the c.Pass thru filter to filter out the records that aren’t needed.

Step 9 — Add a data function to pass only the desired records through to another table.

Go to the Edit menu, select Data Function Properties, and click the Register New button.

Name the data function “pass thru”, and enter the script as shown in this screenshot.

Add one input parameter and one output parameters as shown below.

Click the Run button

Map the parameters to the data in the tables as shown below.  Selecting “Search expression”, and entering “*” will always transfer all columns in the specified data table.  Make sure to scroll down and limit by the filtering scheme you created for this purpose. Mine is called “pass thru”.

Mapping input parameters

Mapping output parameters

For a new data function, you should select “Create new data table”.  This is what happens when you take screenshots after the fact!  🙂

Add a new data table, and select “final table”.  The output should contain only the filtered records.  This type of solution provides the ability to see both the public data set, as well as the in house data set, but then only work with a “gold standard” based on which records should take precedence.

 

7 thoughts on “Merging In House and Public Data in Spotfire

  1. Is there a way to graph daily data and monthly data on the same graph but keep the data tables separate? Even when I match columns and get everything lined up it will not show the second (not main) table correctly. I have tried using the monthly table as the main table as well as the daily table. Thanks a bunch!

    1. This is possible using column matches. Are you getting an error or just no data or the wrong data? Some things that could throw it off..

      …the date can be tricky. Make sure that you have a column match on the date that is on the x axis. That date must be the same data type in both tables. It won’t work if one is a datetime and the other is a date or if one is a date and the other is a string.
      …on the vis itself, you have to use the higher level of granularity (i.e. monthly). You should have month or a year > month hierarchy on the x axis.

      1. That did it! You were right it was a problem with my date column data type and match. It also worked for me using the daily data as my main table with the data as year-month-day of month aggregation and the monthly table divided by 30 as a second y-axis. I also figured out I had to use “Line by” Well ID so the second line would aggregate correctly. Otherwise it just used the entire data set even when I only had one well selected. Thank you!

          1. Is it possible to do this same thing but normalize the date so the X-axis is in units of number of months (month 1, 2, 3 etc.) And the daily data would be in fractions of months so day 1 is month 0.033, day 2 is month 0.066 etc. When I try to do this I have to bin the x-axis to get the second table to show up but it can’t match the normalized date column- it says “The expression cannot be matched because it contains functions that are not row based”.

Leave a Comment

Your email address will not be published. Required fields are marked *