## 3. Support vector machine (SVM)

SVM is a supervised learning algorithm used mostly for classification tasks.

SVM generates a decision boundary to separate the classes. Before creating the decision boundary, each observation (or row) is plotted in n-dimensional space (n is the number of features). For instance, if we use “length” and “width” to classify different “cells”, observations are plotted in a 2-dimensional space and the decision boundary is a line.

Support vectors are the data points closest to the decision boundary. The decision boundary is drawn so that its distance to the support vectors is maximized.
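To make this concrete, here is a minimal sketch using scikit-learn's `SVC` with a linear kernel on a made-up two-feature dataset (the “length”/“width” values below are illustrative, not real measurements). The fitted model exposes the support vectors directly:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable clusters of "cells"
# described by length and width (made-up values).
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [3.0, 3.0], [3.5, 3.2], [3.2, 2.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The support vectors are the training points closest to the boundary.
print(clf.support_vectors_)
print(clf.predict([[1.1, 1.0], [3.3, 3.0]]))
```

Only the support vectors determine where the boundary lies; the remaining points could move around (without crossing the margin) and the boundary would not change.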

If the decision boundary is too close to a support vector, it will be highly sensitive to noise and will not generalize well: even very small changes in the features may change the prediction.

If the data points are not linearly separable, SVM uses the kernel trick, which measures the similarity (or closeness) of data points in a higher-dimensional space. SVM with a kernel does not actually transform the data points into that higher-dimensional space, which would be inefficient; it only computes their similarity as if they had been transformed.
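The classic illustration is two concentric circles: no straight line can separate them, but an RBF kernel handles them easily. A sketch, using scikit-learn's `make_circles` as a stand-in dataset:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in 2-D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # kernel trick: similarity as if in a higher-dimensional space

print("linear accuracy:", linear.score(X, y))
print("rbf accuracy:", rbf.score(X, y))
```

The linear model does little better than guessing, while the RBF model separates the classes almost perfectly, without ever materializing the higher-dimensional representation.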

A standard (hard margin) SVM tries to separate all data points into different classes and does not allow any point to be misclassified. This results in an overfit model, or in some cases no decision boundary can be found at all.

A better alternative is the soft margin SVM, which solves an optimization problem with the following goals:

- Increase the distance of decision boundary to classes (or support vectors)
- Maximize the number of points that are correctly classified in training set

The algorithm allows for misclassification of some data points in order to determine a more robust and generalized decision boundary.

There is a trade-off between the two objectives listed above. This trade-off is controlled by the C parameter, which adds a penalty for each misclassified data point. If C is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications. If C is large, SVM tries to minimize the number of misclassified examples due to the high penalty, which results in a decision boundary with a smaller margin. The penalty is not the same for all misclassified examples; it is directly proportional to the distance to the decision boundary.
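The effect of C on the margin can be sketched as follows. For a linear SVM the margin width is 2/‖w‖, so we can compare it directly for a small and a large C on overlapping clusters (a synthetic `make_blobs` dataset, chosen here for illustration):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters: some misclassification is unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

clf_small = SVC(kernel="linear", C=0.01).fit(X, y)   # low penalty for mistakes
clf_large = SVC(kernel="linear", C=100.0).fit(X, y)  # high penalty for mistakes

# Margin width is 2 / ||w||: a smaller ||w|| means a wider margin.
margin_small = 2.0 / np.linalg.norm(clf_small.coef_[0])
margin_large = 2.0 / np.linalg.norm(clf_large.coef_[0])

print(f"C=0.01  -> margin {margin_small:.2f}")
print(f"C=100.0 -> margin {margin_large:.2f}")
```

The small-C model tolerates more training mistakes in exchange for a wider, more robust margin; the large-C model squeezes the margin to classify as many training points correctly as possible, which can hurt generalization.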