- Do you use Chrome for all of your Internet browsing and hate having to switch back to IE for Spotfire?
- Would you like to be able to open Spotfire Web Player files in Chrome?
When building a Multiple Linear Regression model, we want to limit the correlation between predictor (X) variables. Luckily, Spotfire has a tool that makes identifying the correlation (called multicollinearity) effortless. I will walk you through the tool, and you can see the resulting template here.
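To make the idea concrete outside of Spotfire, here is a minimal sketch of a pairwise correlation check between predictors. It is not Spotfire's tool; the function names, the 0.8 threshold, and the toy housing values are all assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold.

    The threshold is an illustrative assumption, not a Spotfire default.
    """
    names = sorted(predictors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(predictors[a], predictors[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Made-up toy values: "rooms" rises in lockstep with "sqft".
predictors = {
    "sqft": [1000, 1500, 2000, 2500, 3000],
    "rooms": [2, 3, 4, 5, 6],
    "age": [30, 5, 22, 12, 40],
}
flagged = correlated_pairs(predictors)
print(flagged)
```

In this toy data only the sqft/rooms pair is flagged, which suggests dropping one of the two before fitting a regression.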
Hello, good friends. The next step in our Analytics Journey takes us to the second iteration of Data Preparation. This is the third step of the CRISP-DM method.
Today, we are going to look at one of the most common data quality issues in Spotfire: NULL (missing) values. There are many ways to address NULL values, such as imputation (a lesson for another day), but the first step is simply identifying them. We will walk through Spotfire’s built-in NULL identifier and also a more advanced TERR-based method.
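As a rough illustration of what "identifying NULLs" means, here is a small sketch that counts missing values per column. It is neither Spotfire's identifier nor the TERR method; the `count_missing` helper, the choice to treat `None`, empty strings, and NaN as missing, and the sample rows are all assumptions.

```python
import math

def is_missing(value):
    """Treat None, empty strings, and NaN as missing (an assumption)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)

def count_missing(rows):
    """Return {column: number of missing values} for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts.setdefault(column, 0)
            if is_missing(value):
                counts[column] += 1
    return counts

# Made-up sample rows for illustration only.
rows = [
    {"well": "A-1", "depth": 5200.0, "operator": "Acme"},
    {"well": "A-2", "depth": None, "operator": ""},
    {"well": "A-3", "depth": float("nan"), "operator": "Acme"},
]
missing_counts = count_missing(rows)
print(missing_counts)
```

A per-column count like this is usually the first thing to look at before deciding whether to impute, filter, or leave the gaps alone.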
For the first Data Understanding stage installment in our Analytics Journey, we explored Simpson’s Paradox in the survival statistics from the Titanic to highlight why the Data Understanding stage proves so important in the CRISP-DM process. This week, we will use the same dataset and demonstrate how Spotfire’s unique Marking and Filtering capabilities make the Data Understanding stage much more efficient and powerful.
Linear Regression models are among the simplest models in the statistical literature. While the assumptions of linearity and normality seem to restrict the practical use of this model, it is surprisingly successful at capturing basic relationships and making predictions in many scenarios. The idea behind the model is to fit a line that mimics the relationship between a target variable (also called the dependent variable) and a combination of predictors (called independent variables). Multiple regression refers to a model with one target variable and multiple predictors. These models are popular not only for solving prediction tasks but also as model selection tools that help find the most important predictors and eliminate redundant variables from the analysis.
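For readers who want to see the fitting step spelled out, here is a minimal sketch of multiple linear regression by ordinary least squares, solved through the normal equations with plain Python. The function name and the toy data are assumptions, and the sketch assumes the predictors are not perfectly collinear (otherwise the system has no unique solution).

```python
def fit_linear_regression(X, y):
    """Ordinary least squares; returns [intercept, b1, ..., bk]."""
    # Prepend a column of ones so the first coefficient is the intercept.
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Augmented matrix for the normal equations (A^T A) beta = A^T y.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(aug[idx][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for other in range(k):
            if other != col:
                f = aug[other][col]
                aug[other] = [a - f * b for a, b in zip(aug[other], aug[col])]
    return [aug[i][k] for i in range(k)]

# Toy data generated exactly from y = 1 + 2*x1 + 3*x2 (illustrative only).
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]
coefficients = fit_linear_regression(X, y)
print([round(c, 6) for c in coefficients])
```

Because the toy target was built without noise, the fit recovers the generating coefficients up to floating-point error; on real data the coefficients would only approximate the underlying relationship.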
Everyone who has ever owned or lived in a house knows at least a little bit about the whims of the real estate market. Big houses cost more, neighborhood matters, proximity to basic services is great, age and style are important in some markets, you name it. But what is it that matters the most? This is a question that visualization can help us answer.
I received an interesting request from a user that deserves sharing. The user requested a visualization showing the curve of a normal distribution of data points. Now, just to be clear, a visualization that shows the distribution of data points is a histogram, which looks like this:
The histogram might vary a little bit if you change the number of bins being used, but it always has the continuous value along the X-Axis and (Row Count) on the Y-Axis. However, the user didn’t want to see the bars of the histogram, just a curve that represented the histogram, which would look like this:
Normal Distribution Curve
This type of visualization is simple and easy to create in Spotfire using the following steps.
Creating the Visualization
- Add a bar chart
- Configure the X-Axis with the continuous value and the Y-Axis with (Row Count)
- On the X-Axis, click the down arrow on the axis selector and make sure the “Auto-Bin” box is checked.
- If needed, right-click on the axis selector and choose “Number of Bins” to set the desired number of bins.
- In the legend, click on the color circle and color the bars the same color as the background (probably white).
- Go to Properties > Lines & Curves > Add > Gaussian Curve fit
BAM! Done! The Gaussian Curve fit draws a normal distribution curve fitted to the binned counts, so it represents the histogram as a curve. If you combine the curve and the histogram, it looks like this:
In the end, Spotfire had the functionality to quickly and easily meet the user’s needs!
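For anyone curious about what sits behind a chart like this, the steps above can be sketched outside Spotfire. This is a rough approximation under stated assumptions, not Spotfire's fitting algorithm: it bins the values, then scales a normal density (using the sample mean and population standard deviation) by row count times bin width so the curve is on the same scale as the bars. The function names and sample values are hypothetical.

```python
import math
import statistics

def histogram(values, bins):
    """Equal-width binning; returns (bin centers, counts, bin width)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return centers, counts, width

def normal_curve(values, centers, width):
    """Normal density at each bin center, scaled to row-count units."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    scale = len(values) * width
    return [
        scale * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
        / (sigma * math.sqrt(2 * math.pi))
        for x in centers
    ]

# Made-up, roughly bell-shaped sample for illustration.
values = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7]
centers, counts, width = histogram(values, bins=7)
curve = normal_curve(values, centers, width)
for c, n, h in zip(centers, counts, curve):
    print(f"{c:5.2f}  count={n}  curve={h:.2f}")
```

Plotting `counts` as bars and `curve` as a line over `centers` gives the combined view; hiding the bars leaves only the curve.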
As Rustin Cohle said in True Detective, “Time is a flat circle,” so welcome back to the beginning of our Analytics Journey! Previously, we cycled through the CRISP-DM process from beginning to end, explaining the stages as well as the way we approach our Data Science life cycle at Ruths.ai. We have strived to demonstrate the importance of melding the human element with quantitative rigor. Now, we will iterate through the steps again, as all good analytics processes do, looking for ways to strengthen our model. This time through, we will move from the theoretical to the practical with an eye towards enacting the stages in the real world.