CRISP-DM Evaluation: Train and Test Set

Last week in our Analytics Journey, we worked on variable selection in the Modeling stage of the CRISP-DM method.  Having built a model, it’s time to see how it did in the Evaluation stage.  One of the most important parts of evaluating a model is properly constructing training and testing sets.

So, first, why do we need a train and test set?  At the risk of oversimplifying, I will give a one word answer:  overfitting.

When we build a model, we look at metrics like Adjusted R-squared as we did in our last installment.  We try to select variables and use techniques that will maximize the Adjusted R-squared.  However, we run the risk of building a model too specific to the dataset it was trained on.  When this happens, the model won’t generalize well to unseen data.  So, instead, we set aside a test set on which we will double-check our model.

General best practices are to use somewhere between 70 and 90% of your data for training and the rest for testing.  This number depends on how many observations you have.  If you have a ton of data, you might lean towards the 70% training figure to ensure a robust testing set.  However, if you have less data, you might need 90% in the train set to properly find a nuanced model.

So, how do we go about dividing our train and test sets?  First, the selection should be random.  We don’t want our training set drawn from one slice of the data, like only the earliest records.  Further, if we have a categorical target variable, we want to make sure our sampling results in a similar percentage of each category in both the train and test sets.  In other words, if our target variable is either yes or no and 70% of our observations are yes, we want that ratio to hold in both the train and test sets.
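The split logic itself is tool-agnostic.  Here is a minimal sketch of a stratified random split in plain Python (the function name and row layout are illustrative assumptions, not Spotfire features):

```python
import random

def stratified_split(rows, target_key, train_frac=0.9, seed=123):
    """Randomly split rows into train/test sets, preserving the target's
    class ratio in both sets (stratified sampling)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    train, test = [], []
    for cls_rows in by_class.values():
        rng.shuffle(cls_rows)                      # random order within each class
        cut = round(len(cls_rows) * train_frac)    # e.g. 80% of this class to train
        train.extend(cls_rows[:cut])
        test.extend(cls_rows[cut:])
    return train, test

# Example: 70% "yes" in the full data stays 70% "yes" in both splits
data = [{"target": "yes"} for _ in range(70)] + [{"target": "no"} for _ in range(30)]
train, test = stratified_split(data, "target", train_frac=0.8)
```

Splitting within each class before recombining is what keeps the 70/30 yes/no ratio intact in both sets.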

We can use Spotfire’s “Rand” function to create calculated columns for both the train and test sets.  First, we will create an index.  Go to Insert > Calculated Column.  Call the column Train.  Use the expression if(Rand(123) < 0.9, True, False) as our index.  The 123 in the function is called the seed, which allows for reproducibility in the random sampling.  It can be any number.

Next, we will create columns for the train and test set target variables using that index.  Go to Insert > Calculated Column.  Name one Target_Train and put in the expression if([Train], [Target], null).  Name the other Target_Test and put in the expression if([Train], null, [Target]).  In short, the Train column acts as an index: when it is true, Target_Train holds the value; when it is false, Target_Test does.
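The effect of these calculated columns can be mimicked outside Spotfire.  A hedged Python sketch of the same idea (the table-of-dicts layout is an assumption; the column names match the article):

```python
import random

def add_split_columns(table, target_col="Target", train_frac=0.9, seed=123):
    """Mimic the Spotfire calculated columns: a boolean Train index,
    plus Target_Train / Target_Test columns masked by that index."""
    rng = random.Random(seed)  # like Rand(123): the seed makes the split reproducible
    for row in table:
        row["Train"] = rng.random() < train_frac                       # ~90% flagged for training
        row["Target_Train"] = row[target_col] if row["Train"] else None
        row["Target_Test"] = None if row["Train"] else row[target_col]
    return table

table = [{"Target": v} for v in range(1000)]
add_split_columns(table)
```

Every row ends up with its target value in exactly one of the two masked columns, which is what lets the modeling step see only the training rows.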

Now, we will use Target_Train as the target variable to build our model.  The model will ignore the test set observations because Target_Train is null for those rows.

Evaluation:  RMSE

After the train/test sets have been created and model building has finished, we will see how the model performed on the train and test sets.  We can use one of many metrics, but one of the most popular is Root Mean Squared Error (RMSE), which squares the differences between predicted and actual values, averages them, then takes the square root.

First, use your model to predict both the Target_Train and Target_Test variables.  Then, create calculated columns Train_RMSE and Test_RMSE using Sqrt(Avg(([Target_Train] - [Train_Predict]) ^ 2)) (replacing Train with Test for the test set).
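Outside Spotfire, the same calculation is a few lines of Python.  This sketch skips rows where the actual value is null, mirroring how the masked Target_Train / Target_Test columns behave (the function and variable names are illustrative):

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: sqrt(mean((actual - predicted)^2)),
    ignoring rows where the actual value is missing (None)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a is not None]
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Toy example with hypothetical predictions; the None row is a
# test-set observation masked out of Target_Train
train_rmse = rmse([3.0, None, 5.0, 4.0], [2.5, 9.9, 5.5, 4.0])
```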

Ideally, our model building has left us with 3 or 4 competing models.  Perhaps they use different algorithms (Linear Regression, Random Forest, Neural Network, etc.).  Perhaps they are the same algorithm using different variables or variable transformations.  Repeat the RMSE calculation above for each model in order to compare them.

There is no general guideline for a good RMSE since the metric depends on the scale and variance of the data.  What we are interested in is comparing the models to each other, on both the train and test sets.  A lower RMSE on the test set is ideal.  However, if a model shows a big difference between the train and test sets, that reveals potential overfitting.  So, we might prefer a model with a slightly higher error (RMSE) on the test set if it shows more stability from train to test.  One best practice is to also calculate the percentage change from train to test.
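That train-to-test comparison can be computed directly.  A small sketch with hypothetical RMSE values (the numbers are made up for illustration):

```python
def rmse_pct_change(train_rmse, test_rmse):
    """Percentage increase in error from train to test; a large jump
    suggests overfitting even if the test RMSE itself looks low."""
    return (test_rmse - train_rmse) / train_rmse * 100

# Hypothetical comparison: model B has the lower test RMSE, but
# model A is far more stable from train to test
model_a = rmse_pct_change(10.0, 10.5)   # small jump: stable
model_b = rmse_pct_change(6.0, 9.9)     # large jump: likely overfit
```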

Setting up a train and test set is one of the most important steps in avoiding overfitting during model building.  Unfortunately, there are no hard and fast rules for what makes a good error metric or how much change between train and test is permissible.  Ironically, in an analytic field, modeling still contains a bit of artistry.

We can only begin to understand the nuances by doing, so get out there and build yourself a train and test set to see how it works!

Also, check out our Data Science Toolkit extension, which automates much of the train/test process highlighted above.