A support vector machine (SVM) is a supervised learning model, with associated learning algorithms, that can perform both classification and regression analysis. It is mostly used for classification problems.
A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. These hyperplanes can be used for classification, regression, or outlier detection. A hyperplane with a larger separation (margin) between the closest points of the two classes generally separates the classes better than one with a smaller margin.
The data points that lie closest to the decision surface (the hyperplane) are known as support vectors. These are the data points whose class is hardest to identify, and they have a direct effect on the optimum location of the hyperplane. If the support vectors were removed, the position of the hyperplane would change.
In image (a) below, we can see that there can be many possible hyperplanes that divide the dataset into different categories.
SVM finds the optimal hyperplane, the one that maximizes the margin around the separating hyperplane (shown in Fig. (b)).
The hyperplane is defined by a weight vector “W” and a constant “b”.
Equation of the hyperplane for linearly separable data: W*X + b = 0, where W*X is the dot product of W and the input vector X.
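As a minimal sketch of how this equation is used, the snippet below scores a few points with W*X + b and reads the class off the sign of the score. The values of W and b and the three points are hypothetical, chosen only for illustration; they are not fitted to any data.

```r
# Hypothetical weight vector W and constant b for a 2-D feature space
W <- c(2, -1)
b <- -3

# Signed score W*X + b for each row of X; the hyperplane is where the score is 0
decision_value <- function(X, W, b) as.vector(X %*% W + b)

# Three illustrative points, one per row
X <- rbind(c(3, 1), c(1, 2), c(2, 1))

scores <- decision_value(X, W, b)
labels <- sign(scores)  # +1 on one side, -1 on the other, 0 exactly on the hyperplane

scores  # 2 -3 0
labels  # 1 -1 0
```

The third point has a score of 0, so it lies exactly on the hyperplane.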
To define the optimal hyperplane, we need to maximize the width of the margin. Support vectors are the critical elements of the training set: they alone determine where the hyperplane lies.
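Maximizing the margin can be written as an optimization problem. The sketch below is the standard hard-margin formulation for linearly separable data. It assumes class labels y_i coded as +1 or -1 and training points X_i, notation that does not appear in the post. Under the scaling where the closest points satisfy W*X + b = +1 or -1, the margin width is 2/||W||, so maximizing the margin is the same as minimizing ||W||.

```latex
\[
\text{margin width} = \frac{2}{\lVert W \rVert}
\]
\[
\min_{W,\,b}\ \frac{1}{2}\lVert W \rVert^{2}
\quad \text{subject to} \quad
y_i\,(W \cdot X_i + b) \ge 1, \quad i = 1, \dots, n
\]
```

The training points for which the constraint holds with equality are the support vectors.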
Implementation of SVM in R
For the SVM classifier implementation in R using the caret package, we will examine the tidy iris dataset (available in R). Our goal is to classify the flowers into 3 species (setosa, versicolor, and virginica).
Importing and dividing the dataset into train and test data
The code below imports the dataset (line 8) and divides it into train and test data (lines 13-17). The data is randomly divided into 2 datasets: iris_train with 70% of the records and iris_test with 30% of the records. We then look at the summary of all three datasets (lines 20-21) to check that the data is divided evenly.
In the summary we can see that iris is a dataset with 150 rows and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
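The import and split described above might look like the following sketch. It assumes the split is done with caret's createDataPartition(), which samples within each Species level. The seed and the name train_index are illustrative choices, not taken from the original listing, so line numbers here will not match the ones quoted above.

```r
library(caret)   # provides createDataPartition()

data(iris)       # iris ships with R's built-in datasets package

set.seed(123)    # arbitrary seed, used only to make the split reproducible

# Row indices for about 70% of the data, sampled within each Species level
train_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)

iris_train <- iris[train_index, ]
iris_test  <- iris[-train_index, ]

# Compare the three datasets to check that the classes are divided evenly
summary(iris)
summary(iris_train)
summary(iris_test)
```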
Creating model with library e1071
The e1071 package provides functions for latent class analysis, short-time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, and the naive Bayes classifier. We install the package with the command
install.packages("e1071"). Once the package is installed, we load it in our code with the command library(e1071).
The general form of the call is:
svm(dependent_variable ~ independent_variables, data = train_dataset, type = 'C-classification', kernel = 'linear')
We’ll use the formula Species ~ ., which indicates that we want to classify the Species dependent variable using the 4 other independent variables in the dataset. We also specify the type of usage we’d like for svm() with type="C-classification" for classification (the default when the response is a factor). svm() can also be used for regression problems.
The summary() function for an SVM model provides useful information about how the model was trained. We see that the model found 40 support vectors distributed across the classes: 7 for setosa, 17 for versicolor, and 16 for virginica.
The kernel argument accepts four options: linear, polynomial, radial, and sigmoid. We will use kernel="radial" (the default) for our multi-class classification problem.
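Putting the formula, type, and kernel together, a model fit could be sketched as follows. To keep the snippet runnable on its own, it recreates a 70/30 split with base R's sample() as a stand-in for the split described earlier. The seed is arbitrary, so the support-vector counts printed by summary() will generally differ from the figures quoted above.

```r
library(e1071)

data(iris)
set.seed(123)  # arbitrary seed

# Stand-in 70/30 split with base R so this snippet runs on its own
train_rows <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_rows, ]

# Classify Species from the 4 other columns with a radial kernel
svm_model <- svm(Species ~ ., data = iris_train,
                 type = "C-classification", kernel = "radial")

# Reports the kernel, cost, and number of support vectors per class
summary(svm_model)
```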
You can fine-tune the operation of svm() with two additional arguments, gamma and cost:
the kernel function (every kernel except linear) uses the gamma argument, and the cost of a violation of the margin is specified by the cost argument. When cost is small, the margins will be wide, resulting in many support vectors. We can try different values of gamma and cost to find the best classification accuracy.
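One way to try several gamma and cost values is e1071's tune() function, which runs a grid search with cross-validation by default. This is only a sketch: the post does not say how the values were varied, and the grid, the seed, and the base-R split below are assumptions made for illustration.

```r
library(e1071)

data(iris)
set.seed(123)  # arbitrary seed; it also affects the cross-validation folds

# Stand-in 70/30 split with base R so this snippet runs on its own
train_rows <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_rows, ]

# Illustrative grid of gamma and cost values (not taken from the post)
tuned <- tune(svm, Species ~ ., data = iris_train,
              ranges = list(gamma = c(0.01, 0.1, 1),
                            cost  = c(0.1, 1, 10)))

tuned$best.parameters           # gamma/cost pair with the lowest CV error
best_model <- tuned$best.model  # svm model refit with those parameters
```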
Plot SVM model
A model fitted with svm() also has a special plot() method that we can use to visualize the support vectors and the decision boundary (class regions) of the model. The plot below shows a two-dimensional projection of the data (using the Petal.Width and Petal.Length predictors) with the Species classes (shown in different colours) and the support vectors. The slice argument specifies a list of named values for the dimensions held constant.
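A plot call along these lines could look like the sketch below. The slice values for Sepal.Width and Sepal.Length are illustrative constants, not values from the post, and the split and seed are again stand-ins.

```r
library(e1071)

data(iris)
set.seed(123)  # arbitrary seed

# Stand-in 70/30 split with base R so this snippet runs on its own
train_rows <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_rows, ]

svm_model <- svm(Species ~ ., data = iris_train,
                 type = "C-classification", kernel = "radial")

# Project onto Petal.Width vs Petal.Length; the two sepal dimensions
# are held constant at illustrative values through slice
plot(svm_model, iris_train, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 3, Sepal.Length = 4))
```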
We will use the predict() function on the SVM model to make predictions on the test data. We will then use the table() function to create the confusion matrix and calculate the accuracy of the model.
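The prediction and confusion-matrix steps can be sketched as below. The split and seed are stand-ins, so the counts in the table will not necessarily match the results discussed in this post.

```r
library(e1071)

data(iris)
set.seed(123)  # arbitrary seed

# Stand-in 70/30 split with base R so this snippet runs on its own
train_rows <- sample(nrow(iris), size = round(0.7 * nrow(iris)))
iris_train <- iris[train_rows, ]
iris_test  <- iris[-train_rows, ]

svm_model <- svm(Species ~ ., data = iris_train,
                 type = "C-classification", kernel = "radial")

# Predict the species of the held-out flowers
pred <- predict(svm_model, newdata = iris_test)

# Cross-tabulate predictions against the true species
conf_matrix <- table(Predicted = pred, Actual = iris_test$Species)
conf_matrix
```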
In the confusion matrix, the values off the diagonal are wrong predictions; for example, one versicolor flower is wrongly predicted as virginica. The confusion matrix is explained in the previous post, Logistic Regression with R.
We can calculate the accuracy of the model as 97.78% with the formula: correctly predicted values divided by total predicted values.
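The accuracy formula is the sum of the diagonal of the confusion matrix divided by the sum of all its cells. The matrix below is a hypothetical reconstruction, not the post's actual output: it assumes 45 test records (30% of 150) with 15 per class and a single versicolor flower predicted as virginica, which is consistent with the 97.78% figure.

```r
classes <- c("setosa", "versicolor", "virginica")

# Hypothetical confusion matrix: rows are actual classes, columns are predictions
conf_matrix <- matrix(c(15,  0,  0,
                         0, 14,  1,
                         0,  0, 15),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(Actual = classes, Predicted = classes))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

round(accuracy * 100, 2)  # 97.78
```

Here 44 of the 45 predictions are correct, and 44 / 45 rounds to 97.78%.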
Support Vector Machines are a subclass of supervised classifiers that attempt to partition a feature space into two or more groups. They achieve this by finding an optimal means of separating such groups based on their known class. SVM was originally proposed by Boser, Guyon, and Vapnik in 1992. When classes are nearly separable, we can use SVM instead of a logistic model. To learn about the parameters that have not been covered here, you can type “?svm” in the console.
If you have not read my previous post, find the link below:
Let me know your views on this post or the topics you would like me to post by commenting below.