CRISP DM Data Preparation: Finding and Counting NULL Values in Spotfire

Hello, good friends.  The next step in our Analytics Journey takes us to the second iteration of Data Preparation.  This is the third step of the CRISP-DM method.

Today, we are going to look at one of the most common data quality issues in Spotfire:  the NULL, aka missing values.  While there are many ways to address NULL values like imputation (a lesson for another day), the first step is simply identifying them. We will walk through Spotfire’s built in NULL identifier and also a more advanced TERR based method.

Data Preparation:  Finding Null Values

So many times, we have clients bring us Spotfire dashboards that won’t work because the data has missing values.  Often, it is hard to find the culprit column…until now.

First, Spotfire does report missing values if one knows where to look and takes the time to find them.

We will use a made up dataset involving fruit to demonstrate.

We see the “Eaten” and “Price Per Unit” columns have missing values.  To use Spotfire’s built in functionality, we go to View > Data > select your Data Table > select your column > hit the cog icon.

Above, we see the column “Eaten” has 3 empty values, which can be replaced depending on different choices like a specific value or average.  This is fantastically useful.  However, this information also has severely limitations when it comes to scale.  No one wants to go through this process each time when many columns are present.  So, we turn to TERR.

Searching for NULLs Using TERR

Fear not, the R Code utilized here will be of the copy and paste variety, so you should be okay as long as you are a moderate Spotfire user.  You will only need to set up a data function by following the below steps.  Our goal in using this code will be to tell us extremely quickly if we have any missing values in the whole dataset and, if so, how many missing values are in each column even if we have hundreds of columns.

This code will work with any data table, not just our dummy data, if you follow the steps.

First, we need to set up a new Data Function:  Edit > Data Function Properties > Register New

Give the Function a Name, then in the script, copy and paste:


###

myInput <- InputData

#Nulls are called NAs here because of the R usage

na.function <- function(dataset){

  na <- c()
  columns <- c()
  missing <- c()
  for(i in 1:length(dataset)){
    na[i] <- sum(is.na(dataset[,i]))
    if (na[i] != 0){
      columns <- append(columns,colnames(dataset[i]))
      missing <- append(missing,na[i])
    }
  }
  miss <- data.frame(columns,missing)
  #rownames(miss) <- columns
  return(data.frame(miss))

}

nullTable <- na.function(myInput)

###

 

Do not hit OK but once the script has been entered go to > Input Parameters > Add

Above, give the input parameter name and display name “InputData”.  This is important to match the code.  Choose “Table” for Type and “All” for allowed Data Types.  Hit “OK” once.

Still in Register Data Functions, Go to Output Parameters > Add

Name the Result Parameter Name and Display Name “nullTable” to match the previous code.  Hit “OK”.

Save the Register Data Functions page.

Back at the Data Function Properties Page (Edit > Data Function Properties if you lost it), Insert your newly created function, then click Edit Parameters.

With “Input” Selected > choose “Columns” from Input Handler > the Data Table you want to check Nulls on as the Data Table.  This is where you replace your data table with our dummy one.

Then Select Columns and choose the columns you want to search through.  Do NOT hit “OK”.

Hit the “Output” tab, then Data Table for Output handler.  Create a New Data Table with whatever name you choose.

     

Hit “OK”.

Your Data Table with Null Values summed by Column should be ready.  Open a Table, switch the Data Source to your “Null Table” or whatever you chose to name it, and you should see it.

My Eaten column had three missing values, my Price Per Unit, one, which agrees with full data table.

There you have it.  We often give support on DXPs with over 100 columns, which is NOT a recommended best practice.  However, if you have tons of columns and are having an issue that might relate to missing data, this TERR function lets you search them all at once to find that missing data.

So, find those NULLs.  For now, treat them how you wish, whether using Spotfire’s Data panel replacement functionality or removing them altogether.  In subsequent emails, we will discuss more nuanced ways to address them.

Leave a Comment

Your email address will not be published. Required fields are marked *