When building a Multiple Linear Regression model, we want to limit the correlation between predictor (X) variables. Luckily, Spotfire has a tool that makes identifying correlation between predictors (known as multicollinearity) effortless. I will walk you through the tool, and you can see the resulting template here.
After you have loaded your data and are ready to begin modeling, go to Tools > Data Relationships. Once you select your data table, choose all of your potential predictor variables (but not the target variable) in BOTH the Available Y AND X column boxes. Your target variable is what you are trying to predict; the predictor variables are the inputs the model uses to make that prediction.
After clicking OK, the Data Relationships tool loads two visualizations: a table that shows several statistics for each combination of variables, and a scatterplot comparing two variables. In the table, we are most interested in the “R” column, otherwise known as the correlation coefficient.
R ranges from -1 to 1, and a high absolute value means the variables are highly correlated, i.e., their movements up or down in value are closely related.
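To make the statistic concrete, here is a minimal sketch of how the correlation coefficient is computed (the function name `pearson_r` is just illustrative, not anything Spotfire exposes):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient R between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two series that move in perfect lockstep give R = 1;
# reversing one of them gives R = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```

Real predictors land somewhere between those extremes, which is exactly what the R column in the Data Relationships table reports.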
The table above shows the correlation between predictors in a housing dataset, ordered from highest to lowest. Many of these variables show a strong relationship, which can lead to instability in the model: two correlated predictors may be measuring the same effect, clouding our understanding of each individual variable’s effect on the ultimate target.
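If you wanted to reproduce that kind of table outside Spotfire, a few lines of pandas will do it; the toy data and column names below are illustrative, not the actual dataset:

```python
import pandas as pd

# Toy frame with hypothetical housing predictors (values are made up).
df = pd.DataFrame({
    "Overall Quality": [5, 6, 7, 8, 9],
    "Gr.Liv.Area":     [900, 1100, 1500, 1800, 2400],
    "Wood Deck SF":    [0, 200, 50, 120, 80],
})

# Pairwise R for every predictor combination, one row per unique pair,
# sorted by absolute value so the most collinear pairs float to the top.
corr = df.corr().stack()
corr = corr[corr.index.get_level_values(0) < corr.index.get_level_values(1)]
print(corr.reindex(corr.abs().sort_values(ascending=False).index))
```

Sorting by |R| rather than R matters, because (as we will see below) strong negative correlations are just as much of a problem as strong positive ones.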
Let’s take a look at the scatterplot for a visual display of what we see in the table. First, we’ll look at the highest correlated variables, Overall Quality and Gr.Liv.Area (sq. footage). All we have to do is mark them in the first table, and they populate in the scatter plot.
Looking at the points, we see an upward trend—as one variable grows, in general so does the other—emphasized by the regression line.
Now, let’s look at the least correlated variables in our table, Overall Condition and Wood Deck SF:
Above, we see zero correlation: a horizontal regression line. These variables have essentially no relationship to each other.
Finally, let’s look at variables with a negative R, Overall Condition and Year Built:
Above, we see what a negative correlation looks like. As Year Built increases, Overall Condition decreases, which makes perfect sense. Remember: for this exercise we are looking at the absolute value of R, so a strong negative relationship can hurt the model just as much as a strong positive one.
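A quick sketch of that point, using made-up Year Built and Overall Condition values (pandas' `Series.corr` computes the same R statistic):

```python
import pandas as pd

year_built = pd.Series([1920, 1950, 1975, 1990, 2010])
# Hypothetical condition ratings that decline for newer builds.
condition = pd.Series([8, 7, 6, 5, 4])

r = year_built.corr(condition)
print(r)        # close to -1: a strong negative relationship
print(abs(r))   # the magnitude is what we screen on, not the sign
```

A pair with R near -1 is every bit as collinear as a pair with R near +1, so both belong near the top of your watch list.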
So, What Now?
So, we have multicollinearity. What can we do about it? First, ask yourself whether the variables might be redundant, explaining the same thing in different ways. In that case, we can simply drop one of the variables.
You can also consider combining variables through addition, subtraction, multiplication, division, or a variety of other transformations. For example, Gr.Liv.Area and Garage Area have an R of .49: not the most egregious on our list, but up there. It makes sense that bigger houses tend to have bigger garages, but in some cases, a big house might not have a garage at all. While related, the variables are not completely redundant, so if we simply removed Garage Area, we would be removing information.
Instead, what if we combine them? We can eliminate the distinction between house and garage area by using a calculated column to add them into one number called Total Square Footage.
We can do so by going to Insert > Calculated Column and using the expression [Gr.Liv.Area] + [Garage.Area].
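The pandas equivalent of that Spotfire calculated column is a single vectorized addition (the values below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Gr.Liv.Area": [1500, 2000, 1200],
    "Garage.Area": [400, 600, 0],   # a house can have no garage at all
})

# Equivalent of the Spotfire expression [Gr.Liv.Area] + [Garage.Area]
df["Total Square Footage"] = df["Gr.Liv.Area"] + df["Garage.Area"]
print(df["Total Square Footage"].tolist())   # [1900, 2600, 1200]
```

Either way, the new column collapses two correlated predictors into one, which is the whole point of the transformation.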
If we run the Data Relationships tool again and include Total Square Footage, we see something that should be obvious: Total Square Footage is extremely highly correlated with Greater Living Area and Garage Area. Notably, Greater Living Area has more of a relationship to the total, indicating (as we know intuitively) that the house will dictate the total size more than the garage.
Of course, we no longer want to use Garage Area or Greater Living Area in our model; we now have Total Square Footage. This change might in fact hurt our predictions on the target variable because we still lost some information. However, removing the correlation between predictors gives our model much more stability, and we can more safely trust what our coefficients are telling us.
Now that we know how to identify multicollinearity and transform predictors, next steps would be to interpret those coefficients and learn when to transform the target variable.
Now, give it a try on your own data. What variables might you be able to combine? Good luck!
Today’s walkthrough can be found here in template form.
For more on Multiple Linear Regression, try this template.
Jason is a Data Scientist at Petro.ai with a master’s degree in Predictive Analytics and Data Science from Northwestern University. He has experience with a multitude of machine learning techniques such as Random Forest, Neural Nets, and Support Vector Machines. With a previous Master’s in Creative Writing, Jason is a fervent believer in the Oxford comma.