Support Vector Machine (SVM) with R

A support vector machine (SVM) is a supervised learning model, with associated learning algorithms, that can perform both classification and regression analysis. It is mostly used for classification problems.

A support vector machine constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space. These hyperplanes can be used for classification, regression, or even outlier detection. A hyperplane that leaves a larger separation between the closest points of the two classes separates them better than one that leaves a smaller separation.

Support Vectors

The data points that lie closest to the decision surface (the hyperplane) are known as support vectors. These are the data points that are hardest to classify, and they have a direct effect on the optimum location of the hyperplane. If these support vectors are removed, the position of the hyperplane may change.

In image (a) below, we can see that many possible hyperplanes could divide the dataset into different categories.

(a)

SVM finds the optimal hyperplane, the one that maximizes the margin around the separating hyperplane (shown in Fig. (b)).

(b)

The hyperplane is defined by a weight vector “W” and a constant “b”.

Equation for the linearly separable case: W · X + b = 0

To define an optimal hyperplane, we need to maximize the width of the margin. Support vectors are the critical elements of the training set: they alone determine the hyperplane, so removing any other training point leaves it unchanged.
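To make the equation concrete, here is a minimal sketch in R. The values of W and b are made up for illustration and are not taken from any fitted model. It classifies a point by the sign of W · X + b and computes the margin width as 2 / ||W||, the standard distance between the planes W · X + b = +1 and W · X + b = -1.

```r
# Hypothetical 2-D hyperplane parameters, chosen only for illustration
W <- c(2, -1)   # weight vector (normal to the hyperplane)
b <- -1         # constant (offset) term

# Value of W . X + b for a point X; zero means X lies on the hyperplane
decision_value <- function(X) sum(W * X) + b

# The sign of the decision value tells us which side of the hyperplane X is on
classify <- function(X) sign(decision_value(X))

# Width of the margin between W . X + b = +1 and W . X + b = -1
margin_width <- 2 / sqrt(sum(W^2))

classify(c(2, 1))   # 2*2 - 1*1 - 1 =  2, so the class is +1
classify(c(0, 1))   # 2*0 - 1*1 - 1 = -2, so the class is -1
margin_width        # 2 / sqrt(5)
```

A smaller ||W|| gives a wider margin, which is why maximizing the margin amounts to minimizing the length of W.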

Implementation of SVM in R

To implement an SVM classifier in R using the caret package, we are going to examine the tidy iris dataset (available in R). Our goal is to classify the flowers into 3 species (setosa, versicolor and virginica).

Importing and dividing the dataset into train and test data

The code below imports the dataset (line 8) and divides it into train and test data (lines 13-17). The data is randomly divided into 2 datasets: iris_train with 70% of the records and iris_test with 30% of the records. We then look at the summary of all three datasets (lines 20-21) to check whether the data is divided evenly.

In the summary we can see that iris is a dataset with 150 rows and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
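The original listing is not reproduced in this text, so here is a rough sketch of how such a 70/30 split could be done. It uses base R's sample() rather than whatever the original code used, and the seed value is an arbitrary assumption made only so the split can be repeated.

```r
# iris ships with base R (datasets package), so nothing needs to be downloaded
data(iris)

set.seed(123)  # arbitrary seed, assumed here only for reproducibility

# Draw 70% of the row indices at random for the training set
train_idx  <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

# Compare class counts to see whether the split is roughly even
summary(iris$Species)
summary(iris_train$Species)
summary(iris_test$Species)

dim(iris_train)  # 105 rows, 5 columns
dim(iris_test)   # 45 rows, 5 columns
```

Because the rows are drawn at random, the three species will usually not be split in exactly equal numbers, which is why checking the summaries is worthwhile.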

Creating model with library e1071

The e1071 package can be used for latent class analysis, short-time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, and naive Bayes classifiers. We install the package with the command install.packages("e1071"). Once the package is installed, we load it into our code with the command library(e1071).

svm(dependent_variable ~ independent_variables, data = train_dataset, type = 'C-classification', kernel = 'linear')

We’ll use the formula Species ~ . which indicates that we want to classify the dependent variable Species using the 4 other independent variables in the dataset. We also specify the type of usage we’d like for svm() with type = "C-classification" for classification (the default). svm() can also be used for regression problems.

The summary() function for SVM provides useful information regarding how the model was trained. We see that the model found 40 support vectors distributed across the classes: 7 for setosa, 17 for versicolor, and 16 for virginica.

The kernel argument accepts four options: linear, polynomial, radial and sigmoid. We will use kernel = "radial" (the default) for our multi-class classification problem.
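Putting the pieces above together, a model fit might look like the sketch below. The split and the seed are assumptions (base R sample() with an arbitrary seed), so the support vector counts you get will depend on your own split and need not match the numbers quoted above.

```r
library(e1071)  # run install.packages("e1071") first if it is not installed

data(iris)
set.seed(123)  # arbitrary seed (assumption), used only for reproducibility
train_idx  <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_idx, ]

# Species ~ . means: predict Species from all the other columns
model <- svm(Species ~ ., data = iris_train,
             type = "C-classification", kernel = "radial")

summary(model)   # kernel, cost, and support vectors per class
model$tot.nSV    # total number of support vectors
model$nSV        # number of support vectors in each class
```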

You can fine-tune the operation of svm() with two additional arguments, gamma and cost: gamma is a parameter of the kernel function, and cost specifies the cost of a violation of the margin. When cost is small, the margins will be wide, resulting in many support vectors. We can try different values of gamma and cost to find the best classification accuracy.
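One way to vary gamma and cost systematically is the tune.svm() helper from e1071, which runs a cross-validated grid search. The sketch below is only an illustration: the grid values, the seed and the split are assumptions, not the settings used for the results in this post.

```r
library(e1071)

data(iris)
set.seed(123)  # arbitrary seed (assumption); it also fixes the CV folds
train_idx  <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_idx, ]

# Try every combination of the candidate gamma and cost values
# (an illustrative grid, not a recommendation)
tuned <- tune.svm(Species ~ ., data = iris_train,
                  gamma = 10^(-3:0), cost = 10^(0:2))

tuned$best.parameters   # the gamma/cost pair with the lowest CV error
tuned$best.performance  # the cross-validated error for that pair

# The model refitted on the training data with the best pair
best_model <- tuned$best.model
```

A wider or finer grid may find a better pair, at the price of a longer search.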

Plot SVM model

The e1071 package also provides a special plot() function for svm() models, which we can use to visualize the support vectors, the decision boundary, and the margin of the model. The plot below shows a two-dimensional projection of the data (using the Petal.Width and Petal.Length predictors) with the Species classes (shown in different colours) and the support vectors. The slice argument is used to specify a list of named values for the dimensions that are held constant.
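A call of this kind could look like the sketch below. The slice values (Sepal.Width = 3, Sepal.Length = 4) are assumed for illustration, as are the seed and the split, so the picture will not be identical to the one in this post.

```r
library(e1071)

data(iris)
set.seed(123)  # arbitrary seed (assumption)
train_idx  <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_idx, ]

model <- svm(Species ~ ., data = iris_train,
             type = "C-classification", kernel = "radial")

# Project onto the two petal measurements; the two sepal measurements
# are held constant at the (assumed) values given in slice
plot(model, iris_train, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))
```

The model uses four predictors, so the plot needs fixed values for the two that are not on the axes. That is the role of slice.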

Prediction

We will use the predict function on the SVM model to make predictions on the test data.

We will use the table function to create the confusion matrix and calculate the accuracy of the model.

predict(model, newdata=dataset)

In the confusion matrix, the values off the diagonal are the wrong predictions.

For example, one versicolor is wrongly predicted as virginica. The confusion matrix is explained in the previous post, Logistic Regression with R.

We can calculate the accuracy of the model as 97.78% using the formula: correctly predicted values divided by total predicted values.
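The prediction steps can be sketched end to end as follows. With 45 test records and one wrong prediction, the formula gives 44 / 45 = 0.9778. The split and seed below are assumptions, so the confusion matrix and accuracy you get may differ from the figures quoted here.

```r
library(e1071)

data(iris)
set.seed(123)  # arbitrary seed (assumption)
train_idx  <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]

model <- svm(Species ~ ., data = iris_train,
             type = "C-classification", kernel = "radial")

# Predict the species of the held-out test records
pred <- predict(model, newdata = iris_test)

# Confusion matrix: rows are predictions, columns are the true species
conf_mat <- table(Predicted = pred, Actual = iris_test$Species)
conf_mat

# Accuracy = correctly predicted (the diagonal) / total predicted
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```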

Conclusion

Support Vector Machines are a subclass of supervised classifiers that attempt to partition a feature space into two or more groups. They achieve this by finding an optimal means of separating such groups based on their known class. SVM was originally proposed by Boser, Guyon, and Vapnik in 1992. When classes are nearly separable, we can use SVM instead of a logistic model. To learn about the parameters that have not been covered here, you can type “?svm” in the console.

Let me know your views on this post or the topics you would like me to post by commenting below.

About the author

Gaurav Tiwari

My name is Gaurav Tiwari. I have been working in the IT industry for over 3.5 years. I completed my B.E. at Mumbai University in 2015, and since then I have been working with Accenture Solutions PVT. LTD. as a data analyst.
I’ve started writing blogs as a hobby.
