Support Vector Machines (SVMs) are among the most popular and widely used machine learning algorithms today. They are robust and can be applied to both classification and regression problems. In general, SVMs are relatively easy to use, generalize well, and often do not require much tuning. Follow this link for further information regarding support vector machines. To help illustrate the power of SVMs, we thought it would be useful to go through an example using a custom template we have created for SVMs.
For this example, we will use the template and the iris data set to build a model that predicts the Species of a flower given its Sepal Length, Sepal Width, Petal Length and Petal Width. The data that we used is embedded in the SVM dxp and is also available here through the UCI machine learning repository.
Setup and Data Load
We start by installing the packages required for full functionality in this tool. To do so, simply click the “Install TERR Packages” button located on the left side of your screen.
Next, we load our data into the analysis by following the instructions provided on the ‘Data Load’ page. Here we replace the master table with the data set that will be used to train the model. SVM is flexible and can handle both numeric and categorical attributes. (Note: The iris data is pre-loaded in the dxp, so this step can be skipped if you are using the provided data set.)
Once the data has been loaded, we can proceed to the “Train SVM” tab. Here, we must first complete a couple more setup steps that only need to be done when new data is loaded. Click the “Get Column Names” button to update the column attributes, then enter the path to R.exe, which is typically found in “C:\\Program Files\\R\\R-3.3.2\\bin\\R”. Remember to use a double backslash ‘\\’ instead of a single one, because ‘\’ is the escape character in R.
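The escaping rule is easy to check in code. The sketch below uses Python rather than R, on the assumption that both languages read ‘\\’ inside an ordinary string literal as one literal backslash; the variable names are illustrative and are not part of the template.

```python
# The path as it is typed into the template, with every backslash doubled.
typed_path = "C:\\Program Files\\R\\R-3.3.2\\bin\\R"

# Each "\\" in the literal is stored as a single backslash character.
backslash_count = typed_path.count("\\")

# A raw string literal (a Python feature) holds the same path without doubling.
raw_path = r"C:\Program Files\R\R-3.3.2\bin\R"

print(typed_path)
print(backslash_count)
print(typed_path == raw_path)
```

Printing the doubled version shows the path with single backslashes, which is what the program on the receiving end actually sees.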
Now we can begin setting up the inputs and parameters to train our model. Select the predictor variables for your model. For this example, we will leave all the variables selected. Next, select the response variable. This is the variable that you want to predict. A classification or regression SVM will be run automatically depending on the type of response variable you select.
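How the template decides between the two modes is not shown in this post. A plausible rule, and the one assumed in this Python sketch, is that a numeric response leads to regression and anything else is treated as class labels. The function name choose_svm_task is hypothetical.

```python
def choose_svm_task(response_values):
    """Return "regression" for an all-numeric response, else "classification".

    Assumption: this mirrors the idea described in the text, not the
    template's actual code.
    """
    def is_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    present = [v for v in response_values if v is not None]
    if present and all(is_number(v) for v in present):
        return "regression"
    return "classification"


species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
sepal_length = [5.1, 7.0, 6.3]

print(choose_svm_task(species))       # classification
print(choose_svm_task(sepal_length))  # regression
```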
Now we must choose how to deal with missing values in the data set, if they exist. (Note: Select na.omit if the data set has no missing values.) The first option is to remove all rows that contain missing values using the na.omit function. The second option is to impute the missing values using random forests via missForest. The missForest imputation can handle mixed-type data and can be used to impute continuous and/or categorical variables.
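The first option is simple enough to sketch. The Python function below imitates what na.omit does to a table, dropping every row that has at least one missing value. The function name, the use of None as the missing-value marker, and the sample rows are assumptions made for illustration.

```python
def drop_incomplete_rows(rows):
    """Keep only the rows (dicts) in which no value is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Illustrative rows; two of the three contain a missing value.
rows = [
    {"SepalLength": 5.1, "SepalWidth": 3.5, "Species": "Iris-setosa"},
    {"SepalLength": None, "SepalWidth": 3.0, "Species": "Iris-setosa"},
    {"SepalLength": 6.3, "SepalWidth": 3.3, "Species": None},
]

complete = drop_incomplete_rows(rows)
print(len(complete))  # 1
```

Row removal shrinks the training set, which is the trade-off to weigh against imputation when many rows have gaps.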
Lastly, we must set up the SVM tuning parameters: Kernel, Cost, and Gamma. The kernel is essentially a similarity function that the SVM algorithm uses to compare observations. The radial kernel is a good default if you are unsure which one to choose. The cost parameter is the penalty associated with the incorrect classification of each training example. Setting the cost value too high may cause overfitting. A good default value for cost is 1. The gamma parameter defines how far the influence of a single training example reaches. A good default value for gamma is 1/(number of columns). For the purposes of this example, the Kernel, Cost, and Gamma values will be set to Radial, 1, and 0.01 respectively. Once all the tuning parameters are set, we can click the “Run SVM” button to train the model.
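To make the role of gamma concrete, here is a Python sketch of the radial kernel in its usual form, exp(-gamma * squared distance). That formula is the standard definition of the radial (RBF) kernel and is assumed, not confirmed, to be what the template computes. The function names and the two sample flowers are illustrative.

```python
import math


def default_gamma(n_columns):
    """Default suggested in the text: 1 / (number of columns)."""
    return 1.0 / n_columns


def radial_kernel(x, y, gamma):
    """Similarity of two observations: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two observations with four measurements each (illustrative values).
flower_a = [5.1, 3.5, 1.4, 0.2]
flower_b = [7.0, 3.2, 4.7, 1.4]

# Compare the gamma used in this example with the suggested default.
for gamma in (0.01, default_gamma(4)):
    print(gamma, radial_kernel(flower_a, flower_b, gamma))
```

A smaller gamma makes the similarity fall off more slowly with distance, so each training example influences a wider neighborhood.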
After running the model, you will see a pairs plot visualization (or a Predicted vs. Response plot if regression is performed) on the left-hand side of your screen. In addition, you will see evaluation metrics, including model accuracy and error rate, and a confusion matrix (not available for regression). Our results are shown below.
Here we see that our model predicts with an accuracy of 91%, and we can also observe that there were 6 and 5 misclassifications in the Iris-versicolor and Iris-virginica classes, respectively. You can repeat the steps above and modify the different parameters until you are satisfied with the accuracy of your trained model. The “Training Results” tab can be used to explore the training details of your model further.
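Accuracy and error rate follow directly from the confusion matrix. The Python sketch below shows the arithmetic on a made-up matrix. The counts are illustrative only and are not the results reported above, and the orientation (rows as actual classes, columns as predicted classes) is an assumption.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified (diagonal) / all observations."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Rows: actual class, columns: predicted class (illustrative counts).
confusion = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]

accuracy = accuracy_from_confusion(confusion)
error_rate = 1 - accuracy  # the share of observations that were misclassified
print(accuracy, error_rate)
```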
Once you have trained a model, you can use it to predict on new data sets. To do so, go to the “Predict” tab and follow the instructions to bring in new data by replacing the “PredictTable” data table. Remember that the new data must contain all the same predictor columns that you used to train your model. Once this step is complete, click the “Run SVM Predict” button to predict on the new data set. The visualizations on the right can be used to explore the predicted values in detail.
This concludes the example. You are now using Support Vector Machines to make predictions on new data! The same analysis workflow can be applied to any data set, making this a very powerful tool.
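The column requirement can be checked before running a prediction. This Python sketch lists the training predictors that are missing from a new table. The function name and the column names are hypothetical, and the template may perform its own validation.

```python
def missing_predictors(training_columns, new_columns):
    """Return the training predictors that the new data does not contain."""
    available = set(new_columns)
    return [name for name in training_columns if name not in available]


training_columns = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
new_columns = ["SepalLength", "SepalWidth", "PetalLength", "Id"]

print(missing_predictors(training_columns, new_columns))  # ['PetalWidth']
```

Extra columns in the new data, such as the Id column here, are ignored by this check; only absent predictors are reported.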
I hope that you have found this blog post to be informative and useful in getting you started using Support Vector Machines. If you have any questions, feel free to comment below.
Credits: The SVM template and blog were created by Emanuel Vela (Co-author) & Nitin Chaudhary (Co-author).
Emanuel holds a Master of Science Degree in Data Science and Analytics from the University of Oklahoma and brings years of experience in production engineering.