# Support Vector Machine
Slides: [[lecture6_svm.pdf|SVM1]], [[lecture7_svm_dual_and_kernel.pdf|SVM2]]
## 1 Hard and Soft SVM
![[Pasted image 20230425080158.png|300]]
### 1.1 Hard SVM
Intuition:
- *Maximum Margin*: The distance from the decision boundary to the nearest data points of each class, measured along the direction of the normal vector of the boundary
- *Support vectors*: The data points that lie exactly on the margin (the points closest to the boundary)
- *Optimal hyperplane*: The hyperplane with maximum margin is the best choice (defined solely by the support vectors -> robust to noise).
**Formulation**: Let $w^Tx-b=0$ be the optimal hyperplane.
- *Constraint*: Want every data point on the correct side of the margin, i.e. $y_i(w^Tx_i-b)\geq1,\;\forall i$
- Equivalent to maximizing the margin $2/\Vert w \Vert_2$
- Equivalent to minimizing $\Vert w \Vert_2^2$
- Requires the data to be *linearly separable* (a code sketch follows this list).
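A minimal sketch of this setup; assumptions: sklearn has no dedicated hard-margin mode, so a very large `C` is used to approximate it, and the toy data are illustrative:
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data with labels in {-1, +1}
X, y = make_blobs(n_samples=40, centers=[(-3, -3), (3, 3)], cluster_std=0.8, random_state=0)
y = np.where(y == 0, -1, 1)

# A very large C approximates the hard-margin SVM
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_.ravel()
print(2 / np.linalg.norm(w))                        # the maximized margin 2 / ||w||_2
print(clf.decision_function(clf.support_vectors_))  # support vectors sit on the margin: values ~ +/-1
```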
### 1.2 Soft SVM with Hinge Loss
***Hinge Loss***: $L_{\text{hinge}}=\max\big(0, 1-y_i(w^Tx_i-b)\big)$
- Similar to ReLU activation
- $L_{\text{hinge}}=0$ when *hard constraint* is met
**Formulation**:
$\begin{align*}
\min_{w,b} \lambda \Vert w \Vert_2^2+\sum\limits_{i=1}^n\max\big(0, 1-y_i(w^Tx_i-b)\big)
\end{align*}$
- First term maximizes the margin size
- Second term encourages the data to lie on the correct side of the margin
- As **$\lambda \rightarrow 0$ it behaves like the hard SVM** (counter-intuitive): the hinge term dominates, forcing every point to satisfy the hard constraint; a subgradient sketch of this objective appears at the end of this section
#question Is this equivalent to the *Lagrange dual problem* of the optimization problem for *Hard SVM*, if we think of the first term as the objective function and the second term as the constraint?
#example
```python
from sklearn.svm import SVC  # C in SVC scales the hinge term (inversely related to λ above)
model = SVC(kernel='linear')
```
![[4_svm 2023-04-25 08.29.11.excalidraw.svg|600]]
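The soft-SVM objective above can also be minimized directly by subgradient descent. A minimal numpy sketch, assuming labels in $\{-1,+1\}$; the step size, $\lambda$, and epoch count are illustrative choices:
```python
import numpy as np

def soft_svm_subgradient(X, y, lam=0.01, lr=0.01, epochs=500):
    """Minimize  lam * ||w||^2 + sum_i max(0, 1 - y_i (w.x_i - b))  by subgradient descent."""
    d = X.shape[1]
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        violated = margins < 1                      # inside the margin or misclassified
        grad_w = 2 * lam * w - (y[violated, None] * X[violated]).sum(axis=0)
        grad_b = y[violated].sum()                  # subgradient of the hinge terms w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = soft_svm_subgradient(X, y)
print(np.mean(np.sign(X @ w - b) == y))             # training accuracy
```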
## 2 Dual Form
### 2.1 Dual of Hard SVM
*Primal formulation* for hard SVM:
$\begin{align*}
\min_{w,b} \frac{1}{2}\Vert w \Vert_2^2 \quad\text{s.t. } y_i(w^Tx_i-b)\ge1\;\forall i
\end{align*}$
*Lagrange dual formulation for hard SVM*:
$\begin{align*}
\min_{w,b}\max_{\boldsymbol{\alpha},\,\alpha_i\geq0}& \frac{1}{2}\Vert w \Vert_2^2-\sum\limits_{i=1}^n\alpha_i\big(y_i(w^Tx_i-b)-1\big)\\
\underset{\text{Strong duality}}{\equiv} \max_{\boldsymbol{\alpha},\,\alpha_i\geq0}\min_{w,b}& \frac{1}{2}\Vert w \Vert_2^2-\sum\limits_{i=1}^n\alpha_i\big(y_i(w^Tx_i-b)-1\big)
\end{align*}$
- The Lagrangian is convex in $(w,b)$, so the inner minimum is attained where its gradient vanishes (first-order condition):
$\begin{align*}
\frac{\partial L}{\partial w}&=w-\sum\limits_{i=1}^n\alpha_iy_i\mathbf{x}_i=0 &\rightarrow w=\sum\limits_{i=1}^n \alpha_iy_i\mathbf{x}_i\\
\frac{\partial L}{\partial b}&= \sum\limits_{i=1}^n\alpha_iy_i=0 &\rightarrow \sum\limits_{i=1}^n\alpha_iy_i=0
\end{align*}$
- Plugging $w$ back in, we have
$\begin{align*}
& \boldsymbol{w}^T \boldsymbol{x}_{\boldsymbol{i}}-b=\left(\sum_{j=1}^n \alpha_j y_j \boldsymbol{x}_{\boldsymbol{j}}\right)^T \boldsymbol{x}_{\boldsymbol{i}}-b=\sum_{j=1}^n \alpha_j y_j\left(\boldsymbol{x}_{\boldsymbol{j}}^T \boldsymbol{x}_{\boldsymbol{i}}\right)-b \\
& \frac{1}{2}\Vert\boldsymbol{w}\Vert_2^2=\frac{1}{2} \boldsymbol{w}^T \boldsymbol{w}=\frac{1}{2}\left(\sum_{j=1}^n \alpha_j y_j \boldsymbol{x}_{\boldsymbol{j}}\right)^T\left(\sum_{k=1}^n \alpha_k y_k \boldsymbol{x}_{\boldsymbol{k}}\right) \\
& =\frac{1}{2} \sum_{j, k} \alpha_j \alpha_k y_j y_k\left(\boldsymbol{x}_j^T \boldsymbol{x}_k\right)
\end{align*}$
- Reformulating the original objective, we have
$\begin{align*}
&\max _{\boldsymbol{\alpha}, \alpha_i \geq 0} \min _{\boldsymbol{w}, \boldsymbol{b}} \frac{1}{2}\Vert\boldsymbol{w}\Vert_2^2-\sum_{i=1}^n \alpha_i\left(y_i\left(\boldsymbol{w}^T \boldsymbol{x}_{\boldsymbol{i}}-b\right)-1\right)\\
\equiv&\max _{\alpha, \alpha_i \geq 0} \frac{1}{2} \sum_{j, k} \alpha_j \alpha_k y_j y_k\left(\boldsymbol{x}_j^T \boldsymbol{x}_k\right)-\sum_{i=1}^n \alpha_i\left(y_i\left(\sum_{j=1}^n \alpha_j y_j\left(\boldsymbol{x}_{\boldsymbol{j}}^T \boldsymbol{x}_{\boldsymbol{i}}\right)-b\right)-1\right)\\
\equiv&\max _{\boldsymbol{\alpha}, \alpha_i \geq 0} \frac{1}{2} \sum_{j, k} \alpha_j \alpha_k y_j y_k\left(\boldsymbol{x}_j^T \boldsymbol{x}_k\right)-\sum_{i, j=1}^n \alpha_i y_i \alpha_j y_j\left(\boldsymbol{x}_{\boldsymbol{j}}^T \boldsymbol{x}_{\boldsymbol{i}}\right)+\sum_{i=1}^n \alpha_i\\
\equiv&\boxed{\max _{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j\left(\boldsymbol{x}_i^T \boldsymbol{x}_j\right) \quad\text{s.t. } \alpha_i\geq0,\text{ and } \sum\limits _{i=1}^n \alpha_iy_i=0}
\end{align*}$
- The $b$ term vanishes between the second and third lines because $\sum_{i=1}^n\alpha_iy_i=0$ (first-order condition).
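The stationarity conditions can be checked numerically. A minimal sketch against sklearn's fitted attributes, assuming a very large `C` to approximate the hard margin; `dual_coef_` stores $\alpha_i y_i$ for the support vectors and `coef_` is the primal $w$ for a linear kernel:
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=[(-3, -3), (3, 3)], cluster_std=0.8, random_state=0)
y = np.where(y == 0, -1, 1)                      # labels in {-1, +1}

clf = SVC(kernel='linear', C=1e6).fit(X, y)      # large C ~ hard margin

# w = sum_i alpha_i y_i x_i : only support vectors carry nonzero alpha_i
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))

# sum_i alpha_i y_i = 0
print(np.isclose(clf.dual_coef_.sum(), 0, atol=1e-6))
```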
### 2.2 Dual of Soft SVM
*Primal*:
$\begin{align*}
\min_{\boldsymbol{w},b}\;\lambda\Vert\boldsymbol{w}\Vert_2^2+\frac{1}{n} \sum_{i=1}^n \max \left(0,1-y_i\left(\boldsymbol{w}^T \boldsymbol{x}_{\boldsymbol{i}}-b\right)\right)
\end{align*}$
*Dual*:
$\begin{align*}
\max _{\boldsymbol{\alpha}} \sum_i \alpha_i-\frac{1}{2} \sum_{i, j} y_i \alpha_i\left(\boldsymbol{x}_{\boldsymbol{i}} \cdot \boldsymbol{x}_{\boldsymbol{j}}\right) y_j \alpha_j \\\quad\text{s.t. } 0 \leq \alpha_i \leq \frac{1}{2 n \lambda},\;\sum_i \alpha_i y_i=0
\end{align*}$
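In sklearn's parameterization ($\tfrac{1}{2}\Vert w\Vert_2^2 + C\sum_i \text{hinge}$), `C` plays the role of $\frac{1}{2n\lambda}$, so the fitted dual variables should respect the box constraint $0\le\alpha_i\le C$. A small check; the correspondence and the toy data are my reading of the formulation above, not taken from the slides:
```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=1)
y = np.where(y == 0, -1, 1)

C = 0.5                                           # corresponds to 1 / (2 n lambda)
clf = SVC(kernel='linear', C=C).fit(X, y)

alpha_times_y = clf.dual_coef_.ravel()            # y_i * alpha_i for each support vector
print(np.all(np.abs(alpha_times_y) <= C + 1e-9))  # box constraint 0 <= alpha_i <= C
print(np.isclose(alpha_times_y.sum(), 0, atol=1e-6))  # sum_i alpha_i y_i = 0
```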
## 3 Kernel SVM
Recall the dual for soft SVM: $\max _{\boldsymbol{\alpha}} \sum_i \alpha_i-\frac{1}{2} \sum_{i, j} y_i \alpha_i\left(\boxed{\boldsymbol{x}_{\boldsymbol{i}} \cdot \boldsymbol{x}_{\boldsymbol{j}}}\right) y_j \alpha_j$. The data enter only through the boxed inner products.
- *Kernels* are functions of the pair $(x_i,x_j)$ that compute an inner product in a feature space: $k(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)$
- e.g. Polynomial kernel $k\left(\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{x}_{\boldsymbol{j}}\right)=\left(\boldsymbol{x}_{\boldsymbol{i}} \cdot \boldsymbol{x}_{\boldsymbol{j}}\right)^d$
- e.g. *Gaussian RBF kernel* $k\left(\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{x}_{\boldsymbol{j}}\right)=\exp \left(-\gamma\Vert\boldsymbol{x}_{\boldsymbol{i}}-\boldsymbol{x}_{\boldsymbol{j}}\Vert_2^2\right)$
- Can rewrite the dual by replacing the inner product with the kernel ($\phi$ maps the data into the kernel's feature space); a usage sketch follows:
$\begin{align*}
x_i^Tx_j&\;\longrightarrow\; k(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)\\
\boldsymbol{w}&= \sum_i \alpha_iy_i \phi(\boldsymbol{x_i})
\end{align*}$
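A minimal usage sketch with sklearn's built-in kernels on data that is not linearly separable; the dataset and hyperparameters are illustrative choices:
```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: no linear separator exists in the original 2D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

lin = SVC(kernel='linear').fit(X, y)                      # linear kernel, for comparison
rbf = SVC(kernel='rbf', gamma=2.0).fit(X, y)              # Gaussian RBF kernel
poly = SVC(kernel='poly', degree=2, coef0=1.0).fit(X, y)  # (gamma * x_i.x_j + coef0)^degree

print(lin.score(X, y), rbf.score(X, y), poly.score(X, y))  # kernels separate the circles; linear cannot
```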
### 3.1 Kernel vs. explicit transformation
Suppose $x_i \in \mathbb{R}^{d}$ with $d=2$, then:
$\begin{align*}
x_i&= [x_{i,1},x_{i,2}]^T\\
\phi(x_i)&= [x_{i,1}^2, x_{i,2}^2,\sqrt{2} x_{i, 1} x_{i, 2}, \sqrt{2} x_{i, 1}, \sqrt{2} x_{i, 2}, 1]^T\\
k\left(\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{x}_{\boldsymbol{j}}\right)&= \left(\boldsymbol{x}_{\boldsymbol{i}} \cdot \boldsymbol{x}_{\boldsymbol{j}}+1\right)^2
\end{align*}$
- $\phi(x_i)\in \mathbb{R}^{e}$ where $e=d+\frac{d(d-1)}{2}+d+1$ (here $e=6$)
- Computing $k\left(\boldsymbol{x}_{\boldsymbol{i}}, \boldsymbol{x}_{\boldsymbol{j}}\right)$ directly in $\mathbb{R}^d$ is much more efficient than forming $\phi$ explicitly (see the sketch below)!
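A quick numerical check that the explicit map and the kernel trick agree, i.e. $\phi(x_i)\cdot\phi(x_j)=(x_i\cdot x_j+1)^2$; the random test points are arbitrary:
```python
import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel (x.z + 1)^2 in 2D
    return np.array([x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     1.0])

rng = np.random.default_rng(0)
xi, xj = rng.normal(size=2), rng.normal(size=2)

k_explicit = phi(xi) @ phi(xj)           # dot product in R^6
k_kernel = (xi @ xj + 1) ** 2            # kernel trick: stays in R^2
print(np.isclose(k_explicit, k_kernel))  # True
```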