Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks. However, it is mainly used for classification problems.
SVM works by finding the best possible boundary (hyperplane) that separates data points of different classes. The goal is to maximize the margin between the classes while correctly classifying the data points.
A hyperplane is a decision boundary that separates different classes.
Support Vectors are the data points closest to the hyperplane. These points are crucial because they define the position and orientation of the decision boundary.
The margin is the distance between the hyperplane and the nearest support vectors from each class.
If the data is linearly separable, a straight hyperplane is used.
If the data is not linearly separable, SVM uses kernel functions to map the data into a higher-dimensional space where a linear separator can be applied.
The equation of the hyperplane is:
w ⋅ x - b = 0
where \( w \) is the weight vector (normal to the hyperplane), \( x \) is the input feature vector, and \( b \) is the bias term.
For classification, a point \( x_i \) with label \( y_i \in \{-1, +1\} \) is correctly classified with margin when \( y_i (w \cdot x_i - b) \geq 1 \).
The optimization goal is: Maximize \( \frac{1}{\|w\|} \) subject to \( y_i (w \cdot x_i - b) \geq 1 \)
SVM is a powerful and versatile algorithm, especially for classification problems. Its ability to find the optimal decision boundary makes it a popular choice in machine learning.
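As a quick illustration, here is a minimal sketch (assuming scikit-learn and NumPy are installed; the data points below are made up for the example) that fits a linear SVM and reads back the learned hyperplane parameters and support vectors:

```python
# Fit a linear SVM on a tiny, hypothetical 2-D dataset and inspect w, b,
# and the support vectors that define the decision boundary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.5], [3.0, 3.0], [2.5, 3.5],     # class +1
              [0.0, 0.5], [-1.0, 0.0], [0.5, -0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("w:", clf.coef_[0])            # weight vector of the hyperplane
print("b:", clf.intercept_[0])       # bias term
print("support vectors:", clf.support_vectors_)
print("prediction for [2, 2]:", clf.predict([[2.0, 2.0]]))
```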
In Support Vector Machine (SVM), support vectors are the data points closest to the decision boundary (hyperplane). These points are crucial because they determine the position and orientation of the hyperplane and influence the margin between classes.
The decision boundary in SVM is represented as:
w · x - b = 0
where \( w \) is the weight vector, \( x \) is the input vector, and \( b \) is the bias.
The support vectors satisfy:
w · xᵢ - b = ±1
The planes below define the decision boundary, the margin, and the support vectors:
Let,
$$\pi : W^T X + b = 0$$
$$\pi^+ : W^T X + b = 1$$
$$\pi^- : W^T X + b = -1$$
$$\text{distance}(\pi^+, \pi^-) = \frac{|(b - 1) - (b + 1)|}{\|W\|} = \frac{|-2|}{\|W\|} = \frac{2}{\|W\|}$$
$$\text{margin}(d) = \frac{2}{\|W\|}$$
$$W^*, b^* = \arg\max_{W, b} \frac{2}{\|W\|}$$
$$\text{subject to } y_i (W^T x_i + b) \geq 1 \quad \forall\, x_i$$
| yᵢ | Correctly classified? | Wᵀ xᵢ + b | yᵢ (Wᵀ xᵢ + b) |
|---|---|---|---|
| +1 | Yes (support vector when equality holds) | ≥ 1 | ≥ 1 |
| -1 | Yes (support vector when equality holds) | ≤ -1 | ≥ 1 |
It is not always possible for all points to lie on the correct side of their margin plane; some points may fall between π⁺ and π⁻, or even on the wrong side of π. Allowing such violations via slack variables is known as Soft Margin SVM.
For every point xᵢ, we introduce a variable ζᵢ such that:
yᵢ (Wᵀ xᵢ + b) ≥ 1 - ζᵢ, ζᵢ ≥ 0
| xᵢ | yᵢ | Wᵀ xᵢ + b | yᵢ (Wᵀ xᵢ + b) | ζᵢ | Remark |
|---|---|---|---|---|---|
| x₁ | +1 | -0.5 | -0.5 | 1.5 | Misclassified |
| x₂ | +1 | 1.6 | 1.6 | 0 | Correct |
| x₃ | +1 | 0.5 | 0.5 | 0.5 | Correct, but inside the margin |
| x₄ | -1 | -1.5 | 1.5 | 0 | Correct |
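These slack values follow directly from \( \zeta_i = \max(0, 1 - y_i (W^T x_i + b)) \). A small NumPy check (assuming NumPy is available; the scores are the ones from the table) reproduces the column:

```python
# Recompute the slack variables zeta_i for the four example points above.
import numpy as np

scores = np.array([-0.5, 1.6, 0.5, -1.5])   # W^T x_i + b for x1..x4
labels = np.array([+1, +1, +1, -1])         # y_i

margins = labels * scores                   # y_i (W^T x_i + b)
slack = np.maximum(0.0, 1.0 - margins)      # zeta_i

print(margins)  # [-0.5  1.6  0.5  1.5]
print(slack)    # [1.5  0.   0.5  0. ]
```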
The hard-margin objective
$$W^*, b^* = \arg\max_{W, b} \frac{2}{\|W\|}$$
becomes, once the slack variables are added,
$$W^*, b^* = \arg\min_{W, b} \left( \frac{\|W\|^2}{2} + C \sum_{i=1}^{n} \zeta_i \right)$$
Here \( \frac{\|W\|^2}{2} \) is the regularizer, \( C \) is the hyperparameter that balances the two terms, and \( \sum_i \zeta_i \) is the loss term, with
$$\frac{1}{n} \sum_{i=1}^{n} \zeta_i = \text{the average distance by which points violate the margin.}$$
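To make the objective concrete, the following sketch (NumPy assumed; the weight vector, bias, points, and \( C \) are all hypothetical) evaluates \( \frac{\|W\|^2}{2} + C \sum_i \zeta_i \) for a toy setup:

```python
# Evaluate the soft-margin objective for a hypothetical w, b, C, and points.
import numpy as np

w = np.array([1.0, -0.5])   # hypothetical weight vector
b = 0.2                     # hypothetical bias
C = 1.0                     # regularization hyperparameter

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, 0.5], [-2.0, 1.0]])
y = np.array([+1, +1, -1, -1])

slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # zeta_i per point
objective = 0.5 * np.dot(w, w) + C * slack.sum()
print("slack:", slack)
print("objective:", objective)
```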
In SVM, we try to find a hyperplane that maximizes the margin.
Assume three parallel hyperplanes \( \pi, \pi^+, \pi^- \). Let the central hyperplane \( \pi \) be \( W^T X + b = 0 \), where \( W \) is perpendicular (normal) to it.
For \( \pi^+ \): \( W^T X + b = 1 \) \( \Rightarrow W^T X + b - 1 = 0 \)
For \( \pi^- \): \( W^T X + b = -1 \) \( \Rightarrow W^T X + b + 1 = 0 \)
Margin Calculation:
\[
\text{Margin} = \frac{|(W^T X + b - 1) - (W^T X + b + 1)|}{\|W\|} = \frac{|-2|}{\|W\|} = \frac{2}{\|W\|}
\]
Objective: Find \( (W^*, b^*) = \arg \max \frac{2}{||W||} \)
Constraints: \( y_i (W^T X_i + b) \geq 1 \) for every training point \( X_i \).
If the points are linearly separable, the hyperplane can distinguish them. However, if an outlier is introduced, the hyperplane may fail to separate them. This type of SVM is called a Hard Margin SVM because it has strict constraints to correctly classify each and every data point.
Kernel Function in SVM
A kernel function is a way of computing the dot product of two vectors \( x \) and \( y \) in some (possibly very high-dimensional) feature space. This is why kernel functions are sometimes called "generalized dot products."
The kernel trick is a method of using a linear classifier to solve a non-linear problem by transforming non-linear data into a higher-dimensional space where it becomes linearly separable.
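A small demonstration of this idea (assuming scikit-learn is installed; the dataset and parameters are illustrative) compares a linear kernel with an RBF kernel on data that is not linearly separable in its original space:

```python
# Concentric circles have no linear separator in 2-D, but an RBF kernel
# implicitly maps them into a space where a linear separator exists.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))
# Typically the RBF model scores far higher here, since the kernel handles
# the non-linear structure without any explicit feature engineering.
```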
As per the soft margin SVM formulation, the goal is to maximize the margin while allowing some misclassifications:
\[ w^*, b^* = \arg\max_{w, b} \frac{2}{\|w\|} = \arg\min_{w, b} \frac{\|w\|^2}{2} \]
The optimization problem with slack variables \( \zeta_i \) is:
\[ w^*, b^* = \arg\min_w \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \zeta_i, \] subject to \( y_i (w^T x_i + b) \geq 1 - \zeta_i \) for all points, where \( \zeta_i \geq 0 \).
Here, \( C \) is a hyperparameter that controls the trade-off between maximizing the margin and minimizing misclassification: a small \( C \) permits a wider margin with more margin violations, while a large \( C \) penalizes violations heavily and yields a narrower margin that fits the training data more tightly.
The alternative method is the dual form of SVM, which uses Lagrange multipliers to solve the constrained optimization problem:
\[ \max_{\alpha_i} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i^T x_j) \]
Subject to: \( \sum_{i=1}^{n} \alpha_i y_i = 0 \) and \( 0 \leq \alpha_i \leq C \).
Key observations:
- For every \( x_i \), there is a corresponding \( \alpha_i \).
- \( x_i \) appears only in the form of the dot product \( x_i^T x_j \), where \( x_i \) and \( x_j \) are data points from the training dataset.
- Only points with \( \alpha_i > 0 \) are support vectors, while non-support vectors have \( \alpha_i = 0 \).
Since the dual form depends only on the dot product \( x_i^T x_j \), we can replace it with a kernel function \( K(x_i, x_j) \) to enable SVMs to handle non-linearly separable data.
Some common kernel functions include the linear, polynomial, RBF (Gaussian), and sigmoid kernels; these are described in more detail later in these notes.
The kernel trick effectively transforms the data into a higher-dimensional space where it becomes linearly separable, without explicitly computing the transformation.
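To see that a kernel really is a dot product in some feature space, the sketch below (NumPy assumed; the two points and the degree-2 polynomial kernel \( K(x, y) = (x \cdot y + 1)^2 \) are chosen purely for illustration) compares the kernel value with an explicit feature-map dot product:

```python
# The degree-2 polynomial kernel (x.y + 1)^2 equals an ordinary dot product
# after an explicit feature map phi; the kernel trick skips computing phi.
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel_value = (x @ z + 1) ** 2      # kernel trick: stays in 2-D
explicit_value = phi(x) @ phi(z)     # same number via the 6-D feature map
print(kernel_value, explicit_value)  # both equal 4.0 for these points
```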
Suppose we are given a dataset containing two classes and asked to build a classifier. This dataset can be perfectly separated using a hyperplane. However, there are infinitely many such separating hyperplanes. We need a reasonable way to decide which hyperplane to use as a classifier.
Support Vector Machine (SVM) is a type of classifier that finds the optimal hyperplane that best separates the positive and negative classes. The key idea behind SVM is to maximize the margin between the two classes while maintaining classification accuracy.
The margin is the distance between the separating hyperplane and the closest data points from either class (these closest points are called support vectors).
The optimal hyperplane is the one that maximizes this margin, ensuring the classifier is more robust to noise and better able to generalize to unseen data.
Given a dataset \( (x_i, y_i) \) where \( y_i \in \{-1, 1\} \), the decision boundary is defined by:
\[ w \cdot x + b = 0 \]
The margin is defined as:
\[ \frac{2}{\|w\|} \]
To find the optimal hyperplane, we solve the following optimization problem:
\[ \min_{w, b} \frac{\|w\|^2}{2} \] Subject to: \[ y_i (w \cdot x_i + b) \geq 1, \quad \forall i \]
This ensures that the margin is maximized while correctly classifying all data points.
Since SVM explicitly aims to maximize the margin while maintaining correct classification, it is called a Maximum Margin Classifier.
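Once a linear SVM is fitted, the margin \( \frac{2}{\|w\|} \) can be read off the learned coefficients. A brief sketch (scikit-learn assumed; the blob dataset and the large \( C \) are illustrative choices):

```python
# Fit a near-hard-margin linear SVM and compute the geometric margin 2/||w||.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.8)
clf = SVC(kernel="linear", C=1e3).fit(X, y)  # large C approximates a hard margin

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print("||w|| =", np.linalg.norm(w), " margin =", margin)
```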
Yes, Support Vector Machine (SVM) is affected by outliers, but its impact depends on the type of SVM used.
🚫 Hard Margin SVM is highly affected by outliers.
✅ Soft Margin SVM is less affected by outliers compared to Hard Margin SVM.
⚖ Effect of outliers depends on the kernel and regularization parameters.
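The sketch below (scikit-learn assumed; the points, the injected outlier, and the two \( C \) values are made up for illustration) shows how the regularization parameter changes the model's reaction to an outlier:

```python
# Compare a near-hard-margin SVM (large C) with a soft-margin SVM (small C)
# on data containing one mislabeled outlier.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0],
              [2.8, 2.8]])                  # last point is an outlier...
y = np.array([1, 1, 1, -1, -1, -1, -1])     # ...labeled -1 inside the +1 cluster

for C in (1000.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: w={clf.coef_[0]}, b={clf.intercept_[0]:.3f}, "
          f"support vectors={len(clf.support_)}")
# With a large C the optimizer pays a heavy price for the outlier's slack and
# the boundary typically shifts toward it; with a small C the outlier is
# absorbed as a slack violation and the boundary stays more stable.
```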
| Aspect | KNN | Kernel SVM |
|---|---|---|
| Type | Instance-based (Lazy learning) | Model-based (Optimized decision boundary) |
| Computational Cost | High at test time (O(n)) | High at training time (O(n²) or more) |
| Decision Boundary | Flexible but noisy for small k | Smooth and well-regularized |
| Handling High-Dimensional Data | Struggles with high dimensions (curse of dimensionality) | Performs well with appropriate kernels |
| Outlier Sensitivity | Highly affected | Can be tuned using soft margin |
Kernel SVM can be seen as a more sophisticated and optimized version of KNN, where instead of directly using distances, we transform the space using kernels to define better decision boundaries.
In fact, some research suggests that Kernel SVM with an RBF kernel behaves similarly to KNN but in a more controlled manner by learning an optimal distance metric rather than relying on raw Euclidean distances.
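For a rough side-by-side comparison (scikit-learn assumed; the dataset and hyperparameters are illustrative, and exact scores will vary), the following sketch trains both classifiers on the same data:

```python
# KNN and an RBF-kernel SVM on the same non-linear dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_train, y_train)

print("KNN accuracy:    ", knn.score(X_test, y_test))
print("RBF-SVM accuracy:", rbf_svm.score(X_test, y_test))
# Both typically do well here; the SVM keeps only its support vectors, while
# KNN must store and scan the entire training set at prediction time.
```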
Stochastic Gradient Descent (SGD) is an optimization algorithm used for training machine learning models, especially in linear classifiers like SVM (Support Vector Machine). Hinge Loss is the loss function used in SVM, which penalizes misclassified points.
Hinge loss is defined as:
\( L(y, f(x)) = \max(0, 1 - y f(x)) \)
where \( y \in \{-1, +1\} \) is the true class label and \( f(x) \) is the raw score output by the model (for a linear model, \( f(x) = w^T x + b \)).
Key observations: the hinge loss is zero whenever \( y f(x) \geq 1 \), i.e. the point is correctly classified with a margin of at least 1, and it increases linearly as \( y f(x) \) falls below 1, penalizing points that are misclassified or inside the margin.
SGD with hinge loss is commonly used in SVM and linear classifiers. The choice of hyperparameters, such as learning rate and regularization, plays a crucial role in model performance. Hinge loss helps in maximizing the margin and improving classification robustness.
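A minimal sketch of this setup (scikit-learn assumed; the synthetic dataset and hyperparameters are illustrative) trains a linear model with hinge loss via SGD:

```python
# Train a linear SVM-style classifier with SGDClassifier and hinge loss.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for SGD; alpha is the regularization strength.
model = make_pipeline(StandardScaler(),
                      SGDClassifier(loss="hinge", alpha=1e-4,
                                    max_iter=1000, random_state=0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```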
Kernel SVM is computationally expensive because training requires building the \( n \times n \) kernel matrix (roughly \( O(n^2) \) or worse), and every prediction must evaluate the kernel against all support vectors.
The SGD Classifier with Hinge Loss has lower latency because it learns a single weight vector \( w \), so a prediction is one dot product (\( O(d) \)), and training processes one sample (or mini-batch) at a time.
SGD Classifier with Hinge Loss is preferred for large-scale, linear problems due to its lower latency, while Kernel SVM is better for non-linear problems but is computationally expensive.
For an SVM classifier with a linear kernel, feature importance can be determined from the weight vector \( w \): features with larger absolute weights \( |w_j| \) have a stronger influence on the decision boundary, and the sign of \( w_j \) indicates which class the feature pushes predictions toward.
Feature selection techniques like Forward Feature Selection can also be used to identify important features iteratively, adding at each step the feature that most improves validation performance.
For SVM with non-linear kernels (e.g., RBF, polynomial), the weight vector exists only in the implicit feature space, so feature importance is not directly available and indirect methods must be used instead.
Feature importance in SVM is straightforward for a linear kernel (using weights \( w \)), but for non-linear kernels, indirect methods like PCA, permutation importance, or SHAP are required.
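The sketch below (scikit-learn and NumPy assumed; the dataset and the choice of top-3 features are illustrative) contrasts the two situations: reading \( |w_j| \) from a linear SVM versus using permutation importance for an RBF SVM:

```python
# Feature importance: |w_j| for a linear kernel, permutation importance
# as a kernel-agnostic fallback for an RBF kernel.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

# Standardize so the learned weights are comparable across features.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Linear kernel: importance can be read from |w_j|.
linear_svm = SVC(kernel="linear").fit(X_train, y_train)
top_linear = np.argsort(np.abs(linear_svm.coef_[0]))[-3:]
print("top linear-SVM features:", [data.feature_names[i] for i in top_linear])

# RBF kernel: no coef_ in input space, so fall back to permutation importance.
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
perm = permutation_importance(rbf_svm, X_test, y_test,
                              n_repeats=5, random_state=0)
top_rbf = np.argsort(perm.importances_mean)[-3:]
print("top RBF-SVM features:", [data.feature_names[i] for i in top_rbf])
```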
In Support Vector Machines (SVM), the alpha values (\(\alpha_i\)) are the Lagrange multipliers that determine the contribution of each data point to the decision boundary.
In the dual formulation of the SVM optimization problem, the objective to maximize is:
\( L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \)
Subject to:
\( \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C \)
Only the support vectors have nonzero \(\alpha_i\), while non-support vectors have \(\alpha_i = 0\), meaning they do not contribute to the final decision function.
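In scikit-learn this can be inspected directly (assuming scikit-learn is available; the dataset is illustrative). Note that `SVC` exposes the multipliers only in combined form, as \( y_i \alpha_i \), through `dual_coef_`:

```python
# Inspect which training points became support vectors and their y_i * alpha_i.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=1, cluster_std=1.2)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("support vector indices:", clf.support_)
print("support vectors per class:", clf.n_support_)
print("y_i * alpha_i for each support vector:", clf.dual_coef_[0])
# Every other training point has alpha_i = 0 and does not appear here.
```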
Kernels are mathematical functions that transform data into higher dimensions to make it linearly separable. The choice of the kernel function significantly affects the performance of an SVM model.
Linear Kernel: The simplest kernel, used when the data is already linearly separable.
Formula: \( K(x, y) = x \cdot y \)
Polynomial Kernel: Maps input features into a higher-dimensional space using a polynomial transformation.
Formula: \( K(x, y) = (x \cdot y + c)^d \), where \( d \) is the degree of the polynomial.
RBF (Radial Basis Function) Kernel: The most commonly used kernel. It implicitly maps data into an infinite-dimensional space.
Formula: \( K(x, y) = \exp(-\gamma ||x - y||^2) \)
Sigmoid Kernel: Inspired by neural networks, it behaves like an activation function.
Formula: \( K(x, y) = \tanh(\alpha x \cdot y + c) \)
Custom Kernels: Custom similarity functions can be designed based on domain knowledge.
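As a sketch of a custom kernel (scikit-learn assumed; the intersection-style similarity function and the use of the iris dataset are illustrative choices, not a recommendation), `SVC` accepts any callable that returns the Gram matrix:

```python
# Pass a custom similarity function to SVC as the kernel.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """K(a, b) = sum_j min(a_j, b_j), computed for all pairs of rows."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

X, y = load_iris(return_X_y=True)
clf = SVC(kernel=intersection_kernel).fit(X, y)
print("training accuracy with a custom kernel:", clf.score(X, y))
```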
Mathematical Representation:
\( Z_i = y_i f(x_i) = y_i (W^T x_i + b) \)
0-1 Loss: assigns a loss of 1 when a point is misclassified (\( Z_i < 0 \)) and 0 otherwise; it is non-convex and non-differentiable, which makes it hard to optimize directly.
Hinge Loss: a convex surrogate for the 0-1 loss that also penalizes points that are correctly classified but fall inside the margin (\( 0 \leq Z_i < 1 \)).
Final Expression for Hinge Loss:
\( \xi_i = \max(0, 1 - Z_i) \)
Plotted against \( Z_i \), hinge loss upper-bounds the 0-1 loss: both are zero once a point clears the margin (\( Z_i \geq 1 \)), but hinge loss grows linearly as \( Z_i \) decreases below 1.
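A tiny numeric check (NumPy assumed; the \( Z_i \) values are arbitrary) makes the comparison explicit:

```python
# Hinge loss vs. 0-1 loss evaluated on a few values of Z_i = y_i * f(x_i).
import numpy as np

Z = np.array([-2.0, -0.5, 0.5, 1.0, 2.0])
hinge = np.maximum(0.0, 1.0 - Z)      # penalizes everything below the margin
zero_one = (Z < 0).astype(float)      # penalizes only misclassifications

for z, h, zo in zip(Z, hinge, zero_one):
    print(f"Z={z:+.1f}  hinge={h:.1f}  0-1={zo:.0f}")
```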
Hinge loss is not differentiable at \( Z_i = 1 \) because it has a sharp corner at that point. The hinge loss function is given by:
\[ \text{Hinge loss} = \max(0, 1 - Z_i) \]
This function has a kink (nondifferentiable point) at \( Z_i = 1 \), where the derivative changes abruptly. However, it is still subdifferentiable, meaning we can define a subgradient, which allows it to be used in optimization.
Yes, Stochastic Gradient Descent (SGD) can be used with hinge loss because it relies on subgradients instead of strict differentiability.
A subgradient of the hinge loss, taken with respect to the score \( f(x_i) \), is:
\[ \frac{\partial L}{\partial f(x_i)} = \begin{cases} 0, & \text{if } y_i f(x_i) \geq 1 \\ -y_i, & \text{if } y_i f(x_i) < 1 \end{cases} \]
This piecewise function allows SGD to update the weights even though hinge loss is not fully differentiable.
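A bare-bones training loop built on this subgradient might look as follows (NumPy assumed; the learning rate, regularization strength `lambda_reg`, synthetic data, and epoch count are all made-up illustrative choices):

```python
# Manual SGD for a linear SVM: hinge-loss subgradient plus L2 regularization.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.array([2.0, 2.0])
X[:100] -= 4.0                          # shift half the points to the other side
y = np.array([-1] * 100 + [1] * 100)

w, b = np.zeros(2), 0.0
lr, lambda_reg, epochs = 0.01, 0.01, 20

for _ in range(epochs):
    for i in rng.permutation(len(X)):
        margin = y[i] * (X[i] @ w + b)
        if margin < 1:                   # inside the margin or misclassified
            w -= lr * (lambda_reg * w - y[i] * X[i])
            b -= lr * (-y[i])
        else:                            # correctly classified with margin
            w -= lr * (lambda_reg * w)

print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```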
Scikit-learn's `SGDClassifier` uses hinge loss by default (`loss='hinge'`), making it a common way to train linear SVMs.