CRISP-DM Data Understanding: Marking and Filtering

For the first Data Understanding installment in our Analytics Journey, we explored Simpson’s Paradox in the Titanic survival statistics to show why the Data Understanding stage is so important in the CRISP-DM process. This week, we use the same dataset to demonstrate how Spotfire’s Marking and Filtering capabilities make the Data Understanding stage more efficient and powerful.

Data Understanding:  Marking and Filtering

Remember that Simpson’s Paradox is the phenomenon in which a trend that appears in the data as a whole reverses when the data is split into groups. In addition to the survivor example from the last Data Understanding installment, the Titanic data contains a smaller example of the paradox involving age and fares. We will show one way of setting up visualizations to uncover it with only a few clicks.

Looking at the Titanic data as a whole, a positive correlation exists between passenger age and fares. As passenger age goes up, so do the fares.

We added a regression line to represent the relationship between age and fares. This can be done by going to Properties > Lines and Curves and adding a Straight Line Fit. Although the slope is slight, the regression line rises, indicating a positive correlation.
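For readers who want to see what a straight line fit computes, here is a minimal sketch outside Spotfire. It assumes an ordinary least-squares fit; the function name `straight_line_fit` and the sample numbers are ours, invented for illustration, and are not Titanic records.

```python
def straight_line_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept.

    Pairs with a missing value (None) are skipped.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up ages and fares for illustration only.
ages = [22, 35, 41, 58, None]
fares = [8.0, 12.0, 11.0, 20.0, 15.0]

slope, intercept = straight_line_fit(ages, fares)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A positive slope corresponds to a line that rises from left to right, which is what we see in the scatter plot.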

So, why was that?  Were the people who determined ticket prices ageist?  Or was there a lurking variable, a confounding factor, at play?  Marking and Filtering will help us explore the problem.

To prepare our exploratory data analysis (EDA), we created three visualizations: one to show the global correlation, one in which to mark the values of a variable, and one to show the effect of that marking.

The top visualization uses the green “Marking” as its marking (Properties > Data > Marking). The bottom left visualization uses the same marking. For the bottom right visualization, we checked the box to “Limit data using markings.” This means that the third visualization shows only the data points that are marked in one of the first two visualizations.

Note that the bottom left visualization is a Bar Chart of the passenger classes on the Titanic. Now, when we mark a class by clicking it, the bottom right visualization shows the Scatter Plot with the regression line for only the marked passengers.
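Conceptually, limiting a visualization by a marking is a subset operation: only the rows selected elsewhere are passed along. The sketch below mimics that idea in plain Python. It is an analogy, not Spotfire's implementation; the field names (`pclass`, `age`, `fare`), the helper `limit_to_marked`, and the rows themselves are assumptions made up for illustration.

```python
# Made-up rows for illustration only; these are not Titanic records.
passengers = [
    {"pclass": 1, "age": 40, "fare": 80.0},
    {"pclass": 1, "age": 52, "fare": 70.0},
    {"pclass": 2, "age": 30, "fare": 20.0},
    {"pclass": 3, "age": 19, "fare": 8.0},
    {"pclass": 3, "age": 25, "fare": 7.5},
]


def limit_to_marked(rows, marked_classes):
    """Return only the rows whose class is in the marked set."""
    return [row for row in rows if row["pclass"] in marked_classes]


# "Clicking" the 3rd Class bar corresponds to marking pclass == 3.
marked_rows = limit_to_marked(passengers, marked_classes={3})
for row in marked_rows:
    print(row["age"], row["fare"])
```

Marking more than one bar would simply add more values to `marked_classes`.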

With only a few clicks of the mouse, we can explore the effect the classes have on the age / fare relationship.  And with minimal time investment, we realize that something strange occurs for each class.

When we examine the classes individually, the regression line reverses and shows a negative relationship between age and fare.  As age goes up, fares go down, most notably in the 3rd Class.
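The same kind of reversal can be reproduced with a toy example. The sketch below fits one line to all rows and then one line per class, assuming a least-squares fit. The data is invented so that the reversal is easy to see; it is not the Titanic data, and the names `slope` and `rows` are ours.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Made-up (class, age, fare) rows: the pricier class also has older passengers.
rows = [
    (1, 20, 100.0), (1, 40, 90.0), (1, 60, 80.0),
    (3, 10, 20.0), (3, 20, 15.0), (3, 30, 10.0),
]

# Slope over all rows together.
overall_slope = slope([(age, fare) for _, age, fare in rows])

# Slope within each class separately.
class_slopes = {}
for cls in sorted({c for c, _, _ in rows}):
    class_slopes[cls] = slope([(age, fare) for c, age, fare in rows if c == cls])

print("overall:", overall_slope)
for cls, s in class_slopes.items():
    print("class", cls, ":", s)
```

In this toy data the overall slope is positive while every per-class slope is negative, because class drives both age and fare. That is the pattern the marking setup lets us spot with a click.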

How could that be?  We refer you to the last installment of Data Understanding involving Simpson’s Paradox.

For now, we want to focus on the larger point: Spotfire’s marking feature has allowed us to dig into a surprising initial result (higher age goes with higher fares) and realize that things are not what they seem.

With a click of the mouse, we see that the data behaves differently depending on subsets of the data.

Setting up comparative visualizations using marking allows us to achieve deeper insights.  And, of course, similar insights could be found by filtering the classes using the filter panel.

Don’t take your data at face value. Explore it. Challenge the assumptions. You can even work on hunches. Because this stage is exploratory, “hunch” is not a bad word, as long as you are willing to accept seemingly contradictory results.

Understand your data.
