# Support Vector Machine

Slides: [[lecture6_svm.pdf|SVM1]], [[lecture7_svm_dual_and_kernel.pdf|SVM2]]

## 1 Hard and Soft SVM

![[Pasted image 20230425080158.png|300]]

### 1.1 Hard SVM

Intuition:
- *Maximum margin*: the shortest distance between data points of the two classes, measured along the normal vector of the decision boundary.
- *Support vectors*: the data points closest to the decision boundary; they lie exactly on the margin.
- *Optimal hyperplane*: the hyperplane with the maximum margin is the best (defined solely by the support vectors -> robust to noise).

**Formulation**: Let $w^Tx-b=0$ be the optimal hyperplane.
- *Constraint*: want all data points classified correctly with margin at least 1, i.e. $y_i(w^Tx_i-b)\geq1,\;\forall i$
- Equivalent to maximizing the margin $2/\Vert w \Vert_2$
- Equivalent to minimizing $\Vert w \Vert_2^2$
- Requires the points to be *linearly separable*.

### 1.2 Soft SVM with Hinge Loss

***Hinge loss***: $L_{\text{hinge}}=\max\big(0, 1-y_i(w^Tx_i-b)\big)$
- Similar in shape to the ReLU activation
- $L_{\text{hinge}}=0$ when the *hard constraint* is met

**Formulation** (a minimal training sketch follows at the end of this section):
$\begin{align*} \min_{w,b}\; \lambda \Vert w \Vert_2^2+\sum\limits_{i=1}^n\max\big(0, 1-y_i(w^Tx_i-b)\big) \end{align*}$
- The first term maximizes the margin size
- The second term encourages data to lie on the correct side of the margin
- If **$\lambda \rightarrow 0$, the solution behaves like hard SVM** (counter-intuitive)

#question Is this equivalent to the *Lagrange dual problem* of the optimization problem for *Hard SVM*, if we think of the first term as the objective function and the second term as the constraint?

#example
```python
from sklearn.svm import SVC

model = SVC(kernel='linear')
```

![[4_svm 2023-04-25 08.29.11.excalidraw.svg|600]]
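To make the soft-SVM objective above concrete, here is a minimal subgradient-descent sketch in NumPy; the toy data, `lam`, `lr`, and the number of epochs are illustrative assumptions, not from the slides:

```python
import numpy as np

# Toy two-class data: two Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(20, 2)),
               rng.normal(loc=+2.0, size=(20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])

lam, lr, epochs = 0.01, 0.01, 500   # regularization weight, step size, iterations
w = np.zeros(2)
b = 0.0

for _ in range(epochs):
    margins = y * (X @ w - b)
    violated = margins < 1            # points with non-zero hinge loss
    # Subgradient of  lam*||w||^2 + sum_i max(0, 1 - y_i (w^T x_i - b))
    grad_w = 2 * lam * w - (y[violated][:, None] * X[violated]).sum(axis=0)
    grad_b = y[violated].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("training accuracy:", np.mean(np.sign(X @ w - b) == y))
```

`sklearn.svm.LinearSVC(loss='hinge', C=...)` optimizes an equivalent objective, with `C` inversely proportional to $\lambda$.

## 2 Dual Form

### 2.1 Dual of Hard SVM

*Primal formulation* for hard SVM:
$\begin{align*} \min_{w,b} \frac{1}{2}\Vert w \Vert_2^2 \quad\text{s.t. } y_i(w^Tx_i-b)\ge1\;\forall i \end{align*}$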
*Lagrange dual formulation* for hard SVM:
$\begin{align*} \min_{w,b}\max_{\boldsymbol{\alpha},\alpha_i\geq0}& \frac{1}{2}\Vert w \Vert_2^2-\sum\limits_{i=1}^n\alpha_i\big(y_i(w^Tx_i-b)-1\big)\\ \underset{\text{Strong duality}}{\equiv} \max_{\boldsymbol{\alpha},\alpha_i\geq0}\min_{w,b}& \frac{1}{2}\Vert w \Vert_2^2-\sum\limits_{i=1}^n\alpha_i\big(y_i(w^Tx_i-b)-1\big) \end{align*}$
- The inner objective is convex in $(w,b)$, so setting the first-order conditions to zero gives:
$\begin{align*} \frac{\partial L}{\partial w}&=w-\sum\limits_{i=1}^n\alpha_iy_i\boldsymbol{x}_i=0 &\rightarrow w=\sum\limits_{i=1}^n \alpha_iy_i\boldsymbol{x}_i\\ \frac{\partial L}{\partial b}&= \sum\limits_{i=1}^n\alpha_iy_i=0 &\rightarrow \sum\limits_{i=1}^n\alpha_iy_i=0 \end{align*}$
- Plugging $w$ back in, we have
$\begin{align*} & \boldsymbol{w}^T \boldsymbol{x}_{i}-b=\left(\sum_{j=1}^n \alpha_j y_j \boldsymbol{x}_{j}\right)^T \boldsymbol{x}_{i}-b=\sum_{j=1}^n \alpha_j y_j\left(\boldsymbol{x}_{j}^T \boldsymbol{x}_{i}\right)-b \\ & \frac{1}{2}\Vert\boldsymbol{w}\Vert_2^2=\frac{1}{2} \boldsymbol{w}^T \boldsymbol{w}=\frac{1}{2}\left(\sum_{j=1}^n \alpha_j y_j \boldsymbol{x}_{j}\right)^T\left(\sum_{k=1}^n \alpha_k y_k \boldsymbol{x}_{k}\right)=\frac{1}{2} \sum_{j, k} \alpha_j \alpha_k y_j y_k\left(\boldsymbol{x}_j^T \boldsymbol{x}_k\right) \end{align*}$
- Reformulating the original objective (the $b$ term drops out because $\sum_{i}\alpha_i y_i=0$), we have
$\begin{align*} &\max _{\boldsymbol{\alpha}, \alpha_i \geq 0} \min _{\boldsymbol{w}, b} \frac{1}{2}\Vert\boldsymbol{w}\Vert_2^2-\sum_{i=1}^n \alpha_i\left(y_i\left(\boldsymbol{w}^T \boldsymbol{x}_{i}-b\right)-1\right)\\ \equiv&\max _{\boldsymbol{\alpha}, \alpha_i \geq 0} \frac{1}{2} \sum_{j, k} \alpha_j \alpha_k y_j y_k\left(\boldsymbol{x}_j^T \boldsymbol{x}_k\right)-\sum_{i=1}^n \alpha_i\left(y_i\left(\sum_{j=1}^n \alpha_j y_j\left(\boldsymbol{x}_{j}^T \boldsymbol{x}_{i}\right)-b\right)-1\right)\\ \equiv&\max _{\boldsymbol{\alpha}, \alpha_i \geq 0} \frac{1}{2} \sum_{j, k} \alpha_j \alpha_k y_j y_k\left(\boldsymbol{x}_j^T \boldsymbol{x}_k\right)-\sum_{i, j=1}^n \alpha_i y_i \alpha_j y_j\left(\boldsymbol{x}_{j}^T \boldsymbol{x}_{i}\right)+\sum_{i=1}^n \alpha_i\\ \equiv&\boxed{\max _{\boldsymbol{\alpha}} \sum_{i=1}^n \alpha_i-\frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j\left(\boldsymbol{x}_i^T \boldsymbol{x}_j\right) \quad\text{s.t. } \alpha_i\geq0,\text{ and } \sum\limits _{i=1}^n \alpha_iy_i=0} \end{align*}$
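The stationarity condition $w=\sum_i \alpha_i y_i \boldsymbol{x}_i$ can be checked empirically: scikit-learn's `SVC` exposes $y_i\alpha_i$ for the support vectors via `dual_coef_` and the support vectors via `support_vectors_`. A minimal sketch, assuming an illustrative synthetic two-class dataset and `C` value:

```python
import numpy as np
from sklearn.svm import SVC

# Small, (almost) separable two-class dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2.0, size=(30, 2)),
               rng.normal(loc=+2.0, size=(30, 2))])
y = np.concatenate([-np.ones(30), np.ones(30)])

clf = SVC(kernel='linear', C=1e3).fit(X, y)   # large C approximates the hard margin

# dual_coef_ stores y_i * alpha_i for the support vectors only, so
# w = sum_i alpha_i y_i x_i = dual_coef_ @ support_vectors_
w_from_dual = clf.dual_coef_ @ clf.support_vectors_

print(np.allclose(w_from_dual, clf.coef_))    # True: matches the primal weight vector
```

Only points with $\alpha_i>0$ appear as support vectors, which is why the boundary depends on the support vectors alone (cf. §1.1).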
### 2.2 Dual of Soft SVM

*Primal*:
$\begin{align*} \min_{w,b}\;\lambda\Vert\boldsymbol{w}\Vert_2^2+\frac{1}{n} \sum_{i=1}^n \max \left(0,1-y_i\left(\boldsymbol{w}^T \boldsymbol{x}_{i}-b\right)\right) \end{align*}$

*Dual*:
$\begin{align*} \max _{\boldsymbol{\alpha}} \sum_i \alpha_i-\frac{1}{2} \sum_{i, j} y_i \alpha_i\left(\boldsymbol{x}_{i} \cdot \boldsymbol{x}_{j}\right) y_j \alpha_j \quad\text{s.t. } 0 \leq \alpha_i \leq \frac{1}{2 n \lambda},\;\sum_i \alpha_i y_i=0 \end{align*}$

## 3 Kernel SVM

Recall the dual for soft SVM:
$\max _{\boldsymbol{\alpha}} \sum_i \alpha_i-\frac{1}{2} \sum_{i, j} y_i \alpha_i\left(\boxed{\boldsymbol{x}_{i} \cdot \boldsymbol{x}_{j}}\right) y_j \alpha_j$
- *Kernels* are functions of pairs $(x_i,x_j)$ that compute an inner product in a feature space: $k(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)$
    - e.g. polynomial kernel $k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\left(\boldsymbol{x}_{i} \cdot \boldsymbol{x}_{j}\right)^d$
    - e.g. *Gaussian RBF kernel* $k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)=\exp \left(-\gamma\Vert\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\Vert_2^2\right)$
- Can rewrite the dual by replacing the boxed inner product with the kernel ($\phi$ is the transformation into the kernel's feature space):
$\begin{align*} x_i^Tx_j&\;\rightarrow\; k(x_i,x_j)=\phi(x_i)\cdot\phi(x_j)\\ \boldsymbol{w}&= \sum_i \alpha_iy_i \phi(\boldsymbol{x}_i) \end{align*}$

### 3.1 Kernel vs. explicit transformation

Suppose $x_i \in \mathbb{R}^{d}$ with $d=2$; then for the degree-2 polynomial kernel with a $+1$ offset:
$\begin{align*} x_i&= [x_{i,1},x_{i,2}]^T\\ \phi(x_i)&= [x_{i,1}^2, x_{i,2}^2,\sqrt{2} x_{i, 1} x_{i, 2}, \sqrt{2} x_{i, 1}, \sqrt{2} x_{i, 2}, 1]^T\\ k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)&= \left(\boldsymbol{x}_{i} \cdot \boldsymbol{x}_{j}+1\right)^2 \end{align*}$
- $\phi(x_i)\in \mathbb{R}^{e}$ where $e=d+\frac{d(d-1)}{2}+d+1$
- Computing $k\left(\boldsymbol{x}_{i}, \boldsymbol{x}_{j}\right)$ directly is much more efficient than forming $\phi(x_i)$ and $\phi(x_j)$ explicitly; a numerical check of the identity follows below.
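A quick sanity check of the equivalence above in NumPy; the two vectors are arbitrary illustrative inputs:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for x in R^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def k(xi, xj):
    """Degree-2 polynomial kernel with +1 offset."""
    return (xi @ xj + 1.0) ** 2

xi = np.array([0.5, -1.2])
xj = np.array([2.0, 0.3])

print(phi(xi) @ phi(xj))   # explicit feature map in R^6
print(k(xi, xj))           # kernel evaluated directly in R^2 — same value (2.6896)
```

In scikit-learn, this is the same kernel as `SVC(kernel='poly', degree=2, gamma=1, coef0=1)`, since its polynomial kernel is $(\gamma\,x_i\cdot x_j+\text{coef0})^{\text{degree}}$.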