Welcome to the next installment of our Analytics Journey, which explores how we at Ruths.ai apply the CRISP-DM method to our Data Science process. Previously, we looked at an overview of the methodology as a whole as well as the first step, Business Understanding. Next, we examine the stage of Data Understanding.

**Data Understanding**

For some people, the Data Understanding stage does not hold the allure of modeling or implementation; however, expending the effort now can lead to unexpected insight and prevent misleading results later. A common rule of thumb, the 80/20 rule, says that a Data Scientist spends roughly 80% of their time in this stage (and the next stage, Data Preparation), while only 20% goes to actual modeling. Think of the Data Understanding stage as the hours of practice before the big game.

The Data Understanding stage involves data collection, description, and exploration. We must collect the data in an orderly and verifiable way. We must use summary statistics to describe the nature and distributions of the data. And we must perform an Exploratory Data Analysis (EDA) to discover first insights.

The necessary heavy lifting of Data Understanding has a reputation for being tedious, and I want to challenge that conventional wisdom. To do so, I will introduce Simpson’s Paradox and the thrilling, counter-intuitive, head-scratching truths that we can discover with a commitment to rigorous Exploratory Data Analysis.

**Simpson’s Paradox**

Simpson’s Paradox describes the phenomenon that occurs when an apparent trend in aggregated data reverses once the data is split into groups. The paradox appears in a wide range of statistical settings, from gender studies to baseball batting averages. So, why should we care? In layman’s terms, Simpson’s Paradox occurs when summary statistics mislead us, implying the exact opposite of the true nature of the data. For decision makers, acting on such data could be disastrous. For the astute data professional, identifying such a situation could be a game changer.

One of the more famous instances of Simpson’s Paradox involves the survival rates from the sinking of the Titanic. 3rd Class passengers survived at a 24.2% rate while crew members survived at a 23.6% rate.

Upon hearing these statistics, one would of course hope to be a 3rd Class passenger given the choice (if an evil time traveler kidnapped said person and made them board the ship). At the very least, someone might say they do not care because the percentages are so close.

However, digging into the data more closely tells a different story.

Broken down by gender, 50% of 3rd Class females survived as opposed to 76.5% of the female Crew, and 13.5% of 3rd Class males survived as opposed to 22% of the male Crew. So, both male and female Crew members had a much better chance of survival than their 3rd Class counterparts of the same gender, even though overall the 3rd Class survived at a higher rate. How is this possible?
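The reversal described above can be checked mechanically. The following is a minimal Python sketch, not part of the original analysis, that uses only the percentages quoted in this article. The helper names `leader` and `is_reversal` are hypothetical and chosen for illustration.

```python
# Survival rates (percent) as quoted in the article.
overall = {"3rd Class": 24.2, "Crew": 23.6}
by_gender = {
    "female": {"3rd Class": 50.0, "Crew": 76.5},
    "male": {"3rd Class": 13.5, "Crew": 22.0},
}


def leader(rates):
    """Return the name of the group with the highest rate."""
    return max(rates, key=rates.get)


def is_reversal(overall_rates, subgroup_rates):
    """True when no subgroup agrees with the overall leader."""
    top = leader(overall_rates)
    return all(leader(rates) != top for rates in subgroup_rates.values())


print("Overall leader:", leader(overall))
for gender, rates in by_gender.items():
    print(f"Leader among {gender}s:", leader(rates))
print("Simpson's reversal:", is_reversal(overall, by_gender))
```

The 3rd Class leads overall, yet the Crew leads within each gender, so the check flags a reversal.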

To understand the reversal, we need to look at how many of the females versus males survived as well as what percentage of each were in the 3rd Class and Crew.

In the above graphs, we see that 52.8% of females survived versus 18.8% of males. In the aftermath of the collision, the people on the Titanic honored a true women-and-children-first policy in disembarking to safety. Further, we see that the 3rd Class consisted of 29.3% females while the Crew consisted of only 2.9% females. Thus, the 3rd Class had a much higher percentage of the group that survived at a much higher rate.

This disparity in group composition produced the Simpson’s Paradox. While it is true that a higher percentage of 3rd Class members survived, this largely occurred because the 3rd Class contained more people with a higher chance of survival. We call a variable that causes such a reversal, in this case gender, a confounding or lurking variable.
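To make the arithmetic behind the reversal concrete, here is a small Python sketch (an illustration, not the original analysis) that rebuilds each group's overall survival rate as a weighted average of its gender-specific rates. It uses only the percentages quoted in this article and assumes every person is counted as either female or male. The function name `overall_rate` is hypothetical.

```python
def overall_rate(female_share, female_rate, male_rate):
    """Overall survival rate as a weighted average of the gender rates.

    Assumes the group splits entirely into females and males,
    so the male share is 1 - female_share.
    """
    male_share = 1.0 - female_share
    return female_share * female_rate + male_share * male_rate


# Shares and rates as quoted in the article, expressed as fractions.
third_class = overall_rate(female_share=0.293, female_rate=0.50, male_rate=0.135)
crew = overall_rate(female_share=0.029, female_rate=0.765, male_rate=0.22)

print(f"3rd Class overall: {third_class:.1%}")
print(f"Crew overall:      {crew:.1%}")
```

Under these assumptions the weighted averages come out at about 24.2% and 23.6%, matching the summary statistics quoted earlier. The Crew has the better rate within each gender, but the 3rd Class puts roughly ten times as much weight on the high-survival female group, which is enough to tip the aggregate.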

The Simpson’s Paradox involving the Titanic survivors clearly demonstrates the vital nature of EDA. Had we accepted the summary statistics at face value, we could have made faulty assumptions about the events of the Titanic disaster. We might have concluded that something about being in 3rd Class led to a greater survival rate.

Spotfire’s dynamic Marking and Filtering system makes searching for unexpected data trends and lurking variables as easy as the click of the mouse. We can act on a hunch, use industry knowledge, or put suspect results to the test.

So, always remember to explore your data thoroughly. Things are not always what they seem, and a rigorous commitment to the Data Understanding process can reveal deeper truths about the data, preventing errant assumptions that can sabotage the modeling process before it starts.

Jason is a Data Scientist at Petro.ai with a master’s degree in Predictive Analytics and Data Science from Northwestern University. He has experience with a multitude of machine learning techniques such as Random Forest, Neural Nets, and Support Vector Machines. With a previous Master’s in Creative Writing, Jason is a fervent believer in the Oxford comma.
