In Data Science, prior to using any analytics we always face the problem of having missing values. Deciding when to delete cases or fill missing values is totally dependent on the data set and the target problem. A general practice consists on deleting variables and samples with more that 30% of missing values and then use Multiple Imputation techniques to fill the remaining missing values. In Excel, we can use the replace tool or the filter to approach this problem, and even use Visual Basic to code a more customized solution. In Spotfire, we have the advantage of using more advanced methods by accessing R libraries that contain MCMC, Bayesian and Multivariate Algorithms. Spotfire’s integrative tools really make a difference on how to approach the missing values problem: putting together advanced algorithms, amazing visualizations and user interactivity.
We’ll walk you through the “Multivariate Missing Values” template that accounts for the severity of missing values by rows and columns in a data frame while using the MICE R library to offer several Multivariate Imputation Methods for filling missing values.
1. Loading your data table
Your data needs an ID column for the samples called “PID” (this column shouldn’t have any missing values). The rest of the columns don’t need to have specific names. You can replace the data table called “Trend Data”. In the Home page of the Dashboard check that you loaded your data.
2. Moving to the “Missing Columns” tab
You will see a summary of the missing values per columns. The first data table shows the total number of non-missing values per column and that information is represented in the bar graph to the left bottom.The Horizontal line in the bar graph shows what variables have more than 30% of missing values. Similarly, the Waterfall Chart shows in red the variable with more that 30% missing values, in Yellow the variable with less that 30% absences that should be fill out with multiple imputation and Green represents the variables with non-missing values.
3. Missing Values
The “Missing Rows” tab allows the user to run a summary of the data. Click the Run Summary button to show:
- Number of complete cases
- Number of rows with less than 30% missing values
- Number of columns with less than 30% missing values
- Number of rows with more than 30% missing values
- Number of columns with more than 30% missing values
It also shows a tree map that visualizes rows with non-missing values (blue), rows with less than 30% missing values(yellow), and rows with more than 30% missing values(red).
4. Manage Column Output
Using the information from Missing Columns tab. Go to Edit > Column Properties > Delete columns with more than 30% missing values for the Ames Housing data table. In the Missing Rows tab go to the Filter in the text area and filter out rows with more than 30% missing values. Go to the Imputation Methods tab to fill out the remainder missing values. First choose the desired method and then run the R data function by clicking the button “Fill the missing Values”.
The method choices are:
- Pmm: Predictive mean matching
- Cart: Classification and regression trees
- Rf: Random Forest
- Sample: Random sample from the observed values
- lonly.pmm: Imputation at level-2 by Predictive mean matching
Once the missing values problem is solved you can add any data analysis and visualizations to the dashboard. Feel free to add your own calculations and make sense of your data remember to use the data table called XResults that contains the Multiple imputation results.
Msc. Dunia Giniebra