CRISP-DM Modeling: Forward and Backward Selection

Welcome back everyone to our Analytics Journey series. We will try to return to a semblance of normalcy by continuing where we left off in our journey.

With all of our hard work in understanding and preparing the data during previous steps of the CRISP-DM method (exploring data, choosing a model space, removing NULLs, removing multicollinearity), it’s time to have some fun with the Modeling stage. Today, we’ll look at an aspect of Multiple Linear Regression: Forward and Backward Selection.

CRISP-DM Data Preparation: Finding and Counting NULL Values in Spotfire

Hello, good friends.  The next step in our Analytics Journey takes us to the second iteration of Data Preparation.  This is the third step of the CRISP-DM method.

Today, we are going to look at one of the most common data quality issues in Spotfire: NULLs, also known as missing values. While there are many ways to address NULL values, such as imputation (a lesson for another day), the first step is simply identifying them. We will walk through Spotfire’s built-in NULL identifier and also a more advanced TERR-based method.

CRISP-DM Data Understanding: Marking and Filtering

For the first Data Understanding stage installment in our Analytics Journey, we explored Simpson’s Paradox in the survival statistics from the Titanic to highlight why the Data Understanding stage proves so important in the CRISP-DM process.  This week, we will use the same dataset and demonstrate how Spotfire’s unique Marking and Filtering capabilities make the Data Understanding stage much more efficient and powerful.

Linear Regression, the simplest Machine Learning Model

Linear Regression models are among the simplest models in the statistical literature. While the assumptions of linearity and normality seem to restrict the practical use of this model, it is surprisingly successful at capturing basic relationships and making predictions in many scenarios. The idea behind the model is to fit a line that mimics the relationship between a target variable and a combination of predictors (also called independent variables). Multiple regression refers to the case of one target variable and several predictors. These models are popular not only for solving prediction tasks but also as a model selection tool that helps find the most important predictors and eliminate redundant variables from the analysis.
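
To make "fit a line" concrete, here is a minimal sketch of ordinary least squares with a single predictor, written in plain Python. The function name `fit_line` and the small size/price dataset are made up for illustration; a multiple regression would add more predictors, but the idea is the same.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor. Returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: predictor (e.g. size) and target (e.g. price)
sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(sizes, prices)
predicted = intercept + slope * 6.0  # prediction for a new, unseen size
print(f"price = {intercept:.2f} + {slope:.2f} * size")
```

The slope says how much the target changes per unit of the predictor, which is what makes these models useful for judging how much a predictor matters.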

Real Estate Secrets: Hidden Trend Visualization

Everyone who has ever owned or lived in a house knows at least a little bit about the whims of the real estate market. Big houses cost more, neighborhood matters, proximity to basic services is great, age and style are important in some markets, you name it. But what is it that matters the most? This is a question that visualization can help us answer.

Normal Distribution Curve on a Visualization

I received an interesting request from a user that deserves sharing.  The user requested a visualization showing the curve of a normal distribution of data points.  Now, just to be clear, a visualization that shows the distribution of data points is a histogram, which looks like this:



The histogram might vary a little bit if you change the number of bins being used, but it always has the continuous value along the X-Axis and (Row Count) on the Y-Axis.  However, the user didn’t want to see the bars of the histogram, just a curve that represented the histogram, which would look like this:

[Image: normal distribution curve only]

This type of visualization is simple and easy to create in Spotfire using the following steps.

Creating the Visualization

  1. Add a bar chart
  2. Configure the X-Axis with the continuous value and the Y-Axis with (Row Count)
  3. On the X-Axis, click the down arrow on the axis selector and make sure the “Auto-Bin” box is checked.
  4. If needed, right-click on the axis selector and choose “Number of Bins” to set the desired number of bins.
  5. In the legend, click on the color circle and color the bars the same color as the background (probably white).
  6. Go to Properties > Lines & Curves > Add > Gaussian Curve fit

BAM!  Done!  The Gaussian Curve fit draws a normal (bell-shaped) curve fitted to the bar heights, so it represents the histogram as a curve.  If you combine the curve and the histogram, it looks like this:

[Image: curve and histogram]
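
For anyone curious about what sits behind a picture like this, here is a rough sketch in plain Python. It is not Spotfire's fitting procedure: it bins the values the way a histogram does, then draws a normal curve from the sample mean and standard deviation, scaled so it is comparable to the row counts. The helper names `histogram` and `normal_curve` and the simulated data are assumptions made for illustration.

```python
import math
import random


def histogram(values, n_bins):
    """Bin values into n_bins equal-width bins. Returns (centers, counts, width)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, counts, width


def normal_curve(values, centers, width):
    """Normal density at each bin center, scaled to expected rows per bin."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    scale = n * width  # turns a density into an approximate row count per bin
    return [
        scale * math.exp(-0.5 * ((c - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for c in centers
    ]


random.seed(0)  # simulated data, for illustration only
data = [random.gauss(50.0, 10.0) for _ in range(1000)]

centers, counts, width = histogram(data, 20)
curve = normal_curve(data, centers, width)

for center, count, height in zip(centers, counts, curve):
    print(f"{center:7.2f}  rows={count:4d}  curve={height:7.2f}")
```

Changing the number of bins changes both the bar heights and the scaling of the curve, which is why the curve keeps tracking the bars as you adjust “Number of Bins”.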


In the end, Spotfire had the functionality to quickly and easily meet the user’s needs!