Welcome to the next installment of our Analytics Journey, which explores how we at Ruths.ai apply the CRISP-DM method to our Data Science process. Previously, we looked at an overview of the methodology as a whole as well as the Business Understanding, Data Understanding, and Data Preparation stages. Next, we examine the Modeling stage.
We have finally reached the fun part: the step where we move from a descriptive look back to a predictive look forward. In the Data Understanding phase, we discussed the 80-20 rule, which states that data professionals spend 80% of their time cleaning data, akin to the tedious hours of practice preparing for the big game. Hopefully, we have shown through phenomena like Simpson’s Paradox that even the Data Understanding and Data Preparation stages can yield intriguing insight; however, the Modeling stage offers the greatest opportunity and is the most intellectually stimulating phase.
What is Modeling?
Still, the Modeling stage might intimidate those only beginning to apply analytics. Some might feel comfortable exploring the data but lose faith when they have to make decisions on anything more complex. We will save the more nuanced and complex model discussion for later posts and today simply ask, “What is a model?”
Einstein’s famous E = mc² is a model: a compact mathematical statement of how mass and energy relate, and one that has held up remarkably well against observation.
My two-year-old daughter’s plea of “Pleeeeeeeeease” is also a model, one built around my behavior and threshold for denying her cuteness. Actually, “Pleeeeeeeeease” is the action she takes based on the model she has built in her head that says, “If I yell please, I will get what I want.”
Yeah, that whole teaching of manners thing backfired.
The larger point remains: a model is simply an artificial representation of something that occurs in the world. Models can be complex, they can be simple; they can be very accurate, they can be very wrong.
But, if you were Rip Van Winkle and woke up to discover a traffic light for the first time, would you not want to build some sort of model in your head that says that the light turns red after yellow and green after red so that you do not get squashed by a car? Modeling simply quantifies that decision-making process rather than leaving it to the guessing gut.
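To make the traffic light example concrete, here is a minimal sketch in Python of what "quantifying" that mental model might look like. It simply counts which color has followed which in a list of observations and predicts the most common successor. The function names (`fit_transitions`, `predict_next`) and the observed sequence are made up for illustration; this is a toy, not a method we use in production.

```python
from collections import Counter, defaultdict


def fit_transitions(observations):
    """Count how often each light color is followed by each other color."""
    counts = defaultdict(Counter)
    for current, following in zip(observations, observations[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts, current):
    """Return the most frequently observed successor, or None if never seen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]


# Hypothetical observations Rip Van Winkle might jot down at the intersection.
observed = ["green", "yellow", "red", "green", "yellow", "red", "green"]
model = fit_transitions(observed)

print(predict_next(model, "yellow"))  # the color seen most often after yellow
print(predict_next(model, "red"))     # the color seen most often after red
```

Even a model this small shows the trade-off discussed above: it is easy to understand, but it only knows what it has observed, so a color it has never seen returns no prediction at all.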
As a Data Scientist, I certainly do not advocate making decisions based on one-touch, cookie-cutter model creation, even when it comes from trusted platforms. Responsible use comes from understanding at least the basics of how models work.
However, the basics need not be inaccessible. Further, good Data Scientists and Citizen Data Scientists can make understanding even more attainable by effectively and efficiently communicating model results.
Models do not have to be complex.
My two-year-old has already figured that out.
Jason is a Data Scientist at Petro.ai with a master’s degree in Predictive Analytics and Data Science from Northwestern University. He has experience with a multitude of machine learning techniques such as Random Forest, Neural Nets, and Support Vector Machines. With a previous master’s in Creative Writing, Jason is a fervent believer in the Oxford comma.