Linear Regression, the Simplest Machine Learning Model

Linear regression is the simplest linear model in the statistical literature. While its assumptions of linearity and normality might seem to restrict its practical use, it is surprisingly successful at capturing basic relationships and producing good predictions in many scenarios. The idea behind the model is to fit a line that mimics the relationship between a target variable and a combination of predictors (also called independent variables). Multiple regression refers to the case of one target variable and several predictors. These models are popular not only for prediction but also as model selection tools, helping to identify the most important predictors and eliminate redundant variables from the analysis.
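
As a concrete illustration (not part of the template), fitting such a line by ordinary least squares takes only a few lines of Python on toy data:

```python
import numpy as np

# Toy data: a noisy linear relationship y = 2*x + 1 + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Fit y = slope*x + intercept by ordinary least squares
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

print(round(slope, 2), round(intercept, 2))
```

The fitted slope and intercept land close to the true values (2 and 1) used to simulate the data.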

Using the Template

To fit the model we find the line closest to all the training points by minimizing the least-squares criterion. Other objective functions can be used to achieve a better bias-variance trade-off; a very popular approach combines the least-squares term with a penalty on the coefficients.
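
Here is a minimal numpy sketch of both objectives, taking ridge regression (an L2 penalty) as the penalized variant; the data and penalty strength are our own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Ordinary least squares: minimize ||y - X b||^2
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Ridge regression: add a penalty lam * ||b||^2 to trade a little
# bias for lower variance (closed-form solution)
lam = 1.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The penalty shrinks the coefficients toward zero
print(np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))
```

Shrinking the coefficients is exactly the bias-variance mechanism: the ridge estimates are slightly biased but less sensitive to noise in the training data.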

The Multiple Linear Regression template offers a great introduction to this method and the concepts behind predictive models. Let’s look at all the features:

Step 1

First, let’s load in some new data (presumably your own data).

1. Select File > Replace Data Table…
2. Choose the data table you wish to replace.
3. Click the Select button and choose the source you want to replace it with.
4. Click OK.
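
The template handles this through Spotfire’s menus; purely for illustration, the equivalent step in plain Python with pandas might look like the following (the inline CSV and column names are placeholders standing in for your own file, which you would load with `pd.read_csv("your_file.csv")`):

```python
import io
import pandas as pd

# A small inline CSV stands in for your own data file
csv_text = """target,x1,x2
10.1,1.0,3.2
12.4,2.0,2.9
14.2,3.0,3.5
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```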

Step 2

The “Exploratory Analysis” tab shows the distribution of the independent variables in a box plot, along with a summary statistics table and hypothesis tests for mean comparisons.

Step 3

Simple linear regressions are performed between the target variable and each of the independent variables, y ~ Xi, where Xi represents each predictor. This analysis lets us select the variables that are highly correlated with the target.

Step 4

Now we will need further analysis to make sure we don’t include redundant variables in the study.

1. Go to Tools > Data Relationships, select the predictor variables in both windows, and choose Spearman Correlations.

A simple way to eliminate redundant variables is:

1. Rank all the predictors by the r2 value from Step 3, then add the variable with the highest r2 to the model.
2. Next, keep adding variables to the model in that order, using the Spearman correlation coefficient as a second criterion for inclusion. If a variable is highly correlated with predictors already included in the analysis, we recommend excluding it.
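
This selection heuristic can be sketched in Python; the toy data and the 0.8 correlation cutoff below are our own choices for illustration, not part of the template:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly redundant copy of x1
x3 = rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x3 + rng.normal(scale=0.5, size=n)
predictors = {"x1": x1, "x2": x2, "x3": x3}

def r_squared(x, y):
    """r2 of the simple regression y ~ x (squared Pearson correlation)."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks (no ties here)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Step 3: rank predictors by their simple-regression r2 against the target
ranked = sorted(predictors, key=lambda k: r_squared(predictors[k], y), reverse=True)

# Step 4: add predictors in that order, skipping any that is highly
# Spearman-correlated (cutoff 0.8, an arbitrary choice) with one
# already selected
selected = []
for name in ranked:
    if all(abs(spearman(predictors[name], predictors[s])) <= 0.8 for s in selected):
        selected.append(name)

print(selected)
```

On this toy data, one of the near-duplicate pair x1/x2 is kept and the other is dropped as redundant, while the independent predictor x3 survives.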

Step 5

Now that the independent variables have been selected, we can perform the multiple regression analysis in the “Regression Summary” tab.

1. Go to Edit > Data Function Properties > Edit Parameters and select the columns for the “my.df” data input, adding the target column first and then the independent variables. Click OK and close the window.
2. In the text area on the left, run the analysis by clicking the Regression button.
3. The Statistics data table shows the results of the model performance.
4. The summary data table gives location statistics for the residuals, effects and predictors.
5. Next, we compute pairwise variances across the predictors and the intercept.
6. We visualize the predicted values and their confidence intervals using a line chart.

Step 6

In the “Prediction Errors” tab we visualize the target variable, the predicted values and the residuals.

Step 7

Finally, the “Factors Summary” tab shows a summary of the effects, the standard errors and the confidence intervals per factor. Now you have taken your first step toward becoming a Machine Learning expert! More sophisticated methods and algorithms build on the objective-function and linear-model concepts introduced in this template. Feel free to dig deeper into linear models now that you have the basics.
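
For readers curious about the mechanics behind these tabs, here is a minimal numpy sketch of the quantities reported in Steps 5 through 7 (effects, residuals, predictions, variances of the estimates, standard errors and confidence intervals) on simulated data; this is illustrative only, not the template’s actual data function:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 150
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 1.2 * x1 - 0.7 * x2 + rng.normal(scale=0.4, size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])

# Effects (coefficient estimates) by least squares
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Predicted values and residuals
predicted = X @ beta
residuals = y - predicted

# Variances and covariances of the estimates (intercept included),
# then standard errors and approximate 95% confidence intervals
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov_beta))
ci_low, ci_high = beta - 1.96 * std_err, beta + 1.96 * std_err

print(np.round(beta, 2))
```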

Here is the link to our template.

Enjoy!