Welcome back everyone to our Analytics Journey series. Those of us in Houston have been through a trying time, and our thoughts are with the community. We will try to return to a semblance of normalcy by continuing where we left off in our journey.
With all of our hard work in understanding and preparing the data during previous steps of the CRISP-DM method (exploring data, choosing a model space, removing NULLs, removing multicollinearity), it’s time to have some fun with the Modeling stage. Today, we’ll look at an aspect of Multiple Linear Regression: Forward and Backward Selection.
Multiple Linear Regression means we will be using more than one predictor variable, so one of the first questions to ask is which variables to input into our model. Forward and Backward Selection allow us to include or exclude variables in an algorithmic way. Though more advanced algorithms exist to automate these methodologies in languages like R and Python, today we want to focus on the conceptual underpinnings of the selection methods.
We will use the Ames Housing data and Spotfire’s Regression Modeling tool to illustrate the process. We will try to predict Sale Price and use correlation to Sale Price as a crude indicator of promising predictors. Note: in reality, much more work should be spent exploring the predictor variables.
First, I will go to Tools > Regression Modeling and choose Sale Price for my Response column. In the Predictor columns list box, I will choose Overall Quality, my most promising predictor. When I select the response and predictor, they populate in a Formula expression box at the bottom of the window.
When I hit OK, a new page populates with model results, a table of coefficients, a Residuals vs Fitted plot, and Variable Importance plot. A full discussion of these results will have to wait for our next installment, the Evaluation stage. For now, we want to focus on choosing the variables.
For simplicity, our comparison metric will be Adjusted R-squared, a common measure of how well a model fits the data that penalizes each additional predictor. A general rule in modeling is that a simpler model is preferable to a more complex model with similar predictive ability. KISS: Keep It Simple, Stupid.
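To make the complexity penalty concrete, here is a minimal sketch of the standard Adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name and the numbers in the example are made up for illustration; they are not taken from the Ames results or from Spotfire.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Standard formula: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values: the same raw R-squared is worth less
# once it takes more predictors to reach it.
print(adjusted_r_squared(0.80, n_obs=100, n_predictors=1))
print(adjusted_r_squared(0.80, n_obs=100, n_predictors=10))
```

The second call returns the smaller value, which is the sense in which the metric rewards simpler models.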
Our Adjusted R-squared for our Simple Linear Regression model (1 predictor variable) is .6451. Now, we will move Forward, adding in our next most promising variable, Greater Living Area. To edit our model, we hit the calculator icon at the top of the Model Summary window.
With Greater Living Area included, our Adjusted R-squared improves to .7678. A higher Adjusted R-squared indicates an improvement, so we will keep both variables.
We will then systematically move Forward, adding variables and assessing their impact. We add Total Basement Square Footage then Garage Area and achieve a .8355 Adjusted R-squared. But, when we add Full Bath, the Adjusted R-squared decreases to .8354. Two Notes:
- Though this represents a tiny decrease, we stated earlier that we prefer model simplicity over complexity. Industry knowledge might trump this concern if the experts deem the variable important.
- Full Bath lowering the Adjusted R-squared doesn’t necessarily mean it has no predictive power for Sale Price. It means the small amount of fit it adds does not outweigh the penalty for an extra variable. Other, more powerful variables (like Greater Living Area) might already have captured the same effect, rendering Full Bath superfluous.
Since Full Bath had a negative effect, we will not include it in our model. Instead, when we once again edit our model using the calculator icon, we will make sure to only use previously included variables. Then, we will move Forward again, adding Total Rooms Above Ground. This change results in an improved Adjusted R-squared of .8366.
We can continue the process until no more variables help. Since we desire simplicity, we will only add variables that demonstrate a tangible improvement on the model.
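For readers who want to try the same loop outside Spotfire, the procedure above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the Regression Modeling tool's implementation: the helper names (`adj_r2`, `forward_select`), the `min_gain` threshold, and the synthetic data are all hypothetical, and the candidate order is supplied by hand, as we did with correlation to Sale Price.

```python
import numpy as np


def adj_r2(X, y):
    """Fit ordinary least squares with an intercept; return Adjusted R-squared."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def forward_select(X, y, names, order, min_gain=0.0):
    """Try candidates in the given order; keep a candidate only if it
    raises Adjusted R-squared by more than min_gain."""
    kept, best = [], float("-inf")  # the first candidate is always kept
    for name in order:
        trial = kept + [name]
        score = adj_r2(X[:, [names.index(c) for c in trial]], y)
        if score > best + min_gain:
            kept, best = trial, score
    return kept, best


# Synthetic stand-in data (NOT the Ames Housing data).
rng = np.random.default_rng(0)
n = 500
quality = rng.normal(size=n)
living = rng.normal(size=n)
noise = rng.normal(size=n)  # unrelated to the response
price = 3 * quality + 2 * living + rng.normal(size=n)

names = ["quality", "living", "noise"]
X = np.column_stack([quality, living, noise])
kept, score = forward_select(X, price, names, order=names, min_gain=0.005)
print(kept, round(score, 4))
```

The `min_gain` argument is one way to encode "tangible improvement": with the default of 0.0, any increase at all keeps the variable, while a small positive value demands more before accepting the added complexity.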
So, there you have it! The Forward selection method. What is Backward selection, then? Simply, the process in reverse: start with all potential variables in the model and remove them one at a time. If removing a variable lowers the Adjusted R-squared, return it to the model; otherwise, leave it out.
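A minimal Python sketch of that reverse process, again as an illustration rather than what Spotfire does internally. The function names, the `min_loss` tolerance, and the synthetic data are assumptions made for this example.

```python
import numpy as np


def adj_r2(X, y):
    """Fit ordinary least squares with an intercept; return Adjusted R-squared."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def backward_select(X, y, names, min_loss=0.0):
    """Start with every variable; remove each in turn and return it to the
    model only if Adjusted R-squared falls by more than min_loss."""
    kept = list(names)
    best = adj_r2(X[:, [names.index(c) for c in kept]], y)
    for name in names:
        if len(kept) == 1:
            break  # always leave at least one predictor
        trial = [c for c in kept if c != name]
        score = adj_r2(X[:, [names.index(c) for c in trial]], y)
        if score >= best - min_loss:
            kept, best = trial, score  # the removal did not hurt, so it stays out
    return kept, best


# Synthetic stand-in data (NOT the Ames Housing data).
rng = np.random.default_rng(0)
n = 500
quality = rng.normal(size=n)
living = rng.normal(size=n)
noise = rng.normal(size=n)  # unrelated to the response
price = 3 * quality + 2 * living + rng.normal(size=n)

names = ["quality", "living", "noise"]
X = np.column_stack([quality, living, noise])
kept, score = backward_select(X, price, names, min_loss=0.005)
print(kept, round(score, 4))
```

Here the two informative columns survive because dropping either one costs a large share of the fit, while the unrelated column is removed because the model is essentially no worse without it.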
A word of caution: don’t think that maximizing Adjusted R-squared is the be-all and end-all for a model. You should still check the normality of the residuals, check for multicollinearity, and validate the model with training and test sets. And overfitting remains a danger.
Still, this represents a basic method for variable selection. Give it a try yourself. See how different combinations of variables affect the model.
Next week, we will look to more advanced model selection methods in the Evaluation stage.
Jason is a Data Scientist at Ruths.ai with a master’s degree in Predictive Analytics and Data Science from Northwestern University. He has experience with a multitude of machine learning techniques such as Random Forest, Neural Nets, and Support Vector Machines. With a previous Master’s in Creative Writing, Jason is a fervent believer in the Oxford comma.