Welcome to the next installment of our Analytics Journey, which explores how we at Ruths.ai apply the CRISP-DM method to our Data Science process. Previously, we looked at an overview of the methodology as a whole as well as the Business Understanding and Data Understanding stages. Next, we examine the stage of Data Preparation.

Data Preparation

As mentioned in earlier posts, the Data Preparation stage consists mainly of three parts:  selecting the data, cleaning the data, and constructing the data.  We want to determine the dataset we will be working with (Data Selection), clean errors and missing values (Data Cleaning), and manipulate the data into the proper format (Data Construction).  Today, we will specifically discuss Data Selection and the need to clearly define our goals and model space.

In the age of Big Data, we have seen a trend towards data worship. We now have the ability to obtain and store massive amounts of data affordably, and media hype abounds with use cases of brilliant (and sometimes dangerous) discoveries, all credited to Big Data. Have you heard about the Strawberry Pop-Tarts? Or the Target pregnancy controversy? Big Data, at times, has become a Holy Grail, in both the positive and negative sense. While our new data capabilities can undoubtedly bring great value, one can also suffer great misfortune in the quest for N = All, i.e., the scenario where we obtain every data point that exists.

While high-level Machine Learning problems do in fact benefit from massive amounts of data, such quantities can also obscure trends in more traditional data projects. Do not chase more data for the sake of data. Instead, strategically define your modeling objective. Only when we know what we want to achieve in a project can we efficiently start the process of Data Selection.

For example, in one of our training classes, we provide a housing dataset and task the students with predicting typical single-family home sale prices. Quickly, outliers become obvious. Upon further inspection, we see that the outliers occur in homes with lots over 50,000 square feet. These are not typical single-family homes. In this instance, if the students feel obligated to leverage all of the data provided, they will introduce a great deal of bias. However, by setting limits, they can home in on the task at hand.
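As a rough illustration of that kind of limit, here is a minimal Python sketch. The records, the field names (lot_sqft, sale_price), and the helper select_typical_homes are all hypothetical and made up for this example; they are not the class dataset. Only the 50,000 square foot cutoff comes from the example above.

```python
# Hypothetical sales records; every value here is invented for illustration.
sales = [
    {"lot_sqft": 7_500, "sale_price": 210_000},
    {"lot_sqft": 9_200, "sale_price": 245_000},
    {"lot_sqft": 12_000, "sale_price": 280_000},
    {"lot_sqft": 64_000, "sale_price": 1_150_000},
    {"lot_sqft": 215_000, "sale_price": 2_400_000},
]

# Cutoff from the classroom example: lots over 50,000 sq ft are not typical.
MAX_LOT_SQFT = 50_000


def select_typical_homes(records, max_lot_sqft=MAX_LOT_SQFT):
    """Keep only records whose lot size is at or below the cutoff."""
    return [r for r in records if r["lot_sqft"] <= max_lot_sqft]


def mean_price(records):
    """Average sale price across the given records."""
    return sum(r["sale_price"] for r in records) / len(records)


typical = select_typical_homes(sales)
mean_all = mean_price(sales)
mean_typical = mean_price(typical)

print(f"Records kept: {len(typical)} of {len(sales)}")
print(f"Mean price, all records:   {mean_all:,.0f}")
print(f"Mean price, typical homes: {mean_typical:,.0f}")
```

With these made-up numbers, the two large-lot sales pull the overall average far above anything a typical home sells for, which is the sort of distortion the students run into. In a real project the cutoff itself should follow from the modeling objective rather than from whatever makes the numbers look tidy.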

When we first look at data, we can easily get lost down the rabbit hole.  Do not feel the need to search every ounce of your data to the ends of the earth.  Instead, first define your problem.  Decide what data remains relevant.  Then, select your data accordingly.

Weed out the noise.

The Data Selection portion of Data Preparation can save you from many hours starving in the data wilderness and instead prepare you to drink from the cup made possible by a focused expedition.

Written by Jason May
Jason is a Junior Data Scientist at Ruths.ai with a Master’s degree in Predictive Analytics and Data Science from Northwestern University. He has experience with a multitude of machine learning techniques such as Random Forest, Neural Nets, and Hidden Markov Models. With a previous Master’s in Creative Writing, Jason is a fervent believer in the Oxford comma.