• Have you ever wanted/needed to merge data sets where the merge would create unwanted duplicates?
  • Have you ever attempted to merge public and private data and struggled with getting the desired output?
  • Have you ever wanted to know how to identify duplicate records in Spotfire?

Many oil and gas companies purchase subscriptions to public data through data distributors/data aggregators like Drilling Info or IHS.  It’s a common task to then merge in house data with public data.  This is done frequently with production data and can be easily accomplished in Spotfire.

Now, I want to note that there is more than one way to accomplish this task.  The method I am going to demonstrate below involves inserting rows from each data source into one table, identifying duplicate records, and then passing only the desired records to another table.  You could do it differently by inserting columns, but I believe that way to be more complex, so I’ll use insert rows.

Additionally, there are a few things that you’ll need to pay close attention to.

  1. Level of granularity — Public production data is reported as a monthly total.  All oil companies will have in house daily data and monthly data.  The user I helped with this task wanted to use daily data.  In order to get these two data sources to the same level of granularity, we had perform calculations on both data sets to get a daily average for each month.
  2. Dates  — Since you may be changing the granularity of the tables, pay close attention to date columns.  I needed to create a new date column indicative of the Year and Month of production.  My first attempt with Year-Month came out as a string, but the calculation required a date, so I had to create a dummy date column.  You will also need to remove time stamps.
  3. Duplicate records — Most users get stuck when trying to identify and deal with wells that have data in both the public data set and the in house data set for a given month.  This solution will use a calculated column to identify the duplicate and a data function to pass through only the desired records to another table that is the “final” table.

 

The data I am using in this example is a dummy data set.

Here are the detailed steps:

Step 1 — Add each data source to Spotfire (my example is a one well example). Remember, the in house data is daily and the public data is monthly.

Step 2 — In the in house data set, insert a transformation to create a new calculated column based on the production date.  In this case, I set the date to the 15th of the month, knowing that my next step would be to aggregate the data into a monthly number, and for that I will need both data sets to have a matching date.

Step 3 — Insert a pivot transformation to roll the data up to the monthly level.   In this case, I am using the average aggregation in order to get a monthly average.

 

Step 4 —  The public data set I was working with had two columns — Year and Month — for the date, so a calculated column is needed in order to get this to a date Spotfire can work with.  Insert a transformation to create a new calculated column that converts  the public data set date to what is needed to match the in house data.

Step 5 — In the public data set, insert a transformation to create a new calculated column (or replace the old column) that converts the monthly total to a daily average.  Make sure to do this for all oil, gas, and water columns.

Step 6 — Merge the two data sets using an insert rows operation.  Match the columns from each data set, and make sure to create a new column that specifies the data source.

Step 7 — Insert two calculated columns into the merged data set.  One will calculate how many records there are for each well and for each month.  The second is an IF statement that specifies what to do with it if a well has two records. Note that the IF statement gives preference to the in house data.  If that is available, it will be used.  Otherwise, it will be filtered out.

At this point, here is what the data table looks like.

Step 8 — Create a new filtering scheme, and use the c.Pass thru filter to filter out the records that aren’t needed.

Step 9 — Add a data function to pass only the desired records through to another table.

Go to the Edit menu, select Data Function Properties, and click the Register New button.

Name the data function “pass thru”, and enter the script as shown in this screenshot.

Add one input parameter and one output parameters as shown below.

Click the Run button

Map the parameters to the data in the tables as shown below.  Selecting “Search expression”, and entering “*” will always transfer all columns in the specified data table.  Make sure to scroll down and limit by the filtering scheme you created for this purpose. Mine is called “pass thru”.

Mapping input parameters

Mapping output parameters

For a new data function, you should select “Create new data table”.  This is what happens when you take screenshots after the fact!  🙂

Add a new data table, and select “final table”.  The output should contain only the filtered records.  This type of solution provides the ability to see both the public data set, as well as the in house data set, but then only work with a “gold standard” based on which records should take precedence.

 

Written by Julie Schellberg of Big Mountain Analytics, LLC
Residing in Whitefish, MT, an analytics partner with Ruths.ai. Specializing in Spotfire analytics, dabbling in Power BI and R.