# Norm, Nearest Neighbor, and Optimization

## Minkowski Distance ($L_p$ norm)

$d(i, j)=\sqrt[p]{\left|x_{i 1}-x_{j 1}\right|^p+\left|x_{i 2}-x_{j 2}\right|^p+\cdots+\left|x_{i n}-x_{j n}\right|^p}$

where $i=(x_{i1},\cdots, x_{in}), j=(x_{j1}, \cdots, x_{jn})$ are two $n$-dimensional data objects and $p\geq 1$ is the order.

```start-multi-column
ID: ID_kxhn
Number of Columns: 2
Largest Column: standard
```

Properties of a *metric*:
- *Positivity*: $d(i,j)>0, \forall i\neq j$, and $d(i,j)=0 \text{ iff } i=j$
- *Symmetry*: $d(i,j)=d(j,i)$
- *Triangle inequality*: $d(i,j)\leq d(i,k)+d(k,j)$

--- column-end ---

$L_p$ norms:
- $L_0$: Number of non-zero entries
- $L_1$: Manhattan distance
- $L_2$: Euclidean distance
- $L_\infty$: Max of the absolute values

=== end-multi-column

### Cosine similarity

Cosine similarity is equivalent to $L_2$ distance after normalization: if $\Vert x\Vert_2=\Vert y\Vert_2=1$, then $\Vert x-y\Vert_2^2=2-2x^Ty$, so ranking by $L_2$ distance on normalized vectors gives the same order as ranking by cosine similarity.

$\begin{align*} \text{Cosine similarity: }&\frac{x^Ty}{\Vert x \Vert_2\cdot \Vert y\Vert_2}\\ L_2 \text{ distance: }&\Vert x-y \Vert_2^2=x^Tx+y^Ty-2x^Ty \end{align*}$

## Gradient

The **gradient** is the derivative of $f(\mathbf{x})$ w.r.t. $\mathbf{x}$:

$\begin{align*} \nabla_\mathbf{x} f(\mathbf{x})=\frac{\partial f}{\partial \mathbf{x}}&= \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2},\cdots,\frac{\partial f}{\partial x_d}\right]^T \end{align*}$

#example $\frac{\partial }{\partial \mathbf{x}}\frac{1}{2}(\mathbf{w}^\top \mathbf{x}-b)^2=\mathbf{w}(\mathbf{w}^\top \mathbf{x}-b)$:

$\begin{align*} f(\mathbf{x})&= \frac{1}{2}\left(\sum\limits_{i}w_ix_i-b\right)^2=\frac{1}{2}\left(w_1x_1+\sum\limits_{i=2}^d w_ix_i-b\right)^2\\ \frac{\partial f(\mathbf{x})}{\partial x_1}&= w_1(\mathbf{w}^\top \mathbf{x}-b), \text{ and likewise for each } x_i \;\rightarrow\; \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}= \boxed{\mathbf{w}(\mathbf{w}^\top \mathbf{x}-b)} \end{align*}$

## Convexity

A function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}$ is **convex** if $\forall \mathbf{a,b}\in \mathbb{R}^{d}$, $\theta\in[0,1]$:

$f(\theta \mathbf{a}+ (1-\theta)\mathbf{b})\leq \theta f(\mathbf{a}) + (1-\theta)f(\mathbf{b})$

**First-order (gradient) condition** for convexity:

$\nabla f(x)= \begin{bmatrix}\partial f / \partial x_1\\\vdots\\\partial f / \partial x_d\end{bmatrix}, \quad f(y) \geq f(x)+\nabla f(x)^T(y-x) \quad \forall x,y$

**Second-order (Hessian) condition** for convexity:

$\begin{align*} \nabla^2 f(x) =\begin{bmatrix} \frac{\partial^2 f}{\partial x_{1} \partial x_{1}}&\cdots &\frac{\partial^2 f}{\partial x_{1} \partial x_{d}} \\ \vdots & \ddots &\vdots\\ \frac{\partial^2 f}{\partial x_{d} \partial x_{1}} &\cdots &\frac{\partial^2 f}{\partial x_{d} \partial x_{d}} \end{bmatrix} \succcurlyeq0 \end{align*}$

NOTE: From now on, vectors such as $x$ are no longer bolded; assume every variable is a vector unless stated otherwise.

#example Prove $f(x)=x^Tx$ is convex.

$\begin{align*} f(x)&= x^Tx=x^TIx \quad\rightarrow\quad \nabla^2 f(x)= 2I \succcurlyeq0 \end{align*}$
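As a quick numerical sanity check of the two worked examples above (not part of the original notes), here is a minimal NumPy sketch: it compares the claimed gradient $\mathbf{w}(\mathbf{w}^\top \mathbf{x}-b)$ against central finite differences and confirms that the Hessian $2I$ of $f(x)=x^Tx$ is positive semidefinite. The dimension, random values, and variable names are arbitrary illustrative choices.

```python
# Minimal sketch (illustrative, not from the notes): verify the worked examples numerically.
import numpy as np

rng = np.random.default_rng(0)
d = 5
w, x, b = rng.normal(size=d), rng.normal(size=d), 0.3

f = lambda v: 0.5 * (w @ v - b) ** 2        # f(x) = 1/2 (w^T x - b)^2
grad_analytic = w * (w @ x - b)             # claimed gradient: w (w^T x - b)

# Independent check via central finite differences
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(d)])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)

# Second-order convexity check for g(x) = x^T x: Hessian is 2I, all eigenvalues >= 0
hessian = 2 * np.eye(d)
assert np.all(np.linalg.eigvalsh(hessian) >= 0)
```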
## Stochastic Gradient Descent (SGD)

Let $\{(\mathbf{x}^{(i)}, y^{(i)})\}$ be data points and labels, and let $\mathbf{\theta}$ be the learnable model parameters.

**Goal:** Minimize the loss function

$\begin{align*} \min_{\mathbf{\theta}} L\big(\mathbf{\theta}\;;\; \mathbf{x}^{(i)},y^{(i)}\big)=\min_{\mathbf{\theta}} \sum\limits_{i=1}^N l\Big(f\big(\mathbf{x}^{(i)}, \mathbf{\theta}\big), y^{(i)}\Big) \end{align*}$

**GD optimization procedure**:
1. Initialize $\mathbf{\theta}^{(0)}$ randomly at $t=0$
2. While $\nabla L(\mathbf{\theta}^{(t)})\neq \mathbf{0}$:
    - $\mathbf{\theta}^{(t+1)}=\mathbf{\theta}^{(t)}-\eta^{(t)} \nabla L(\mathbf{\theta}^{(t)})$ ($\eta^{(t)}$ is the learning rate)
    - $t \coloneqq t+1$

**GD cons**:
- Can converge to a local minimum when $L$ is non-convex
- Computing $\nabla L(\mathbf{\theta}^{(t)})$ bottlenecks the procedure: every training data point is needed at every step.

**SGD**: Sample a mini-batch of training data at each step and use it to estimate $\nabla L(\mathbf{\theta}^{(t)})$ (see the sketch below).
- In expectation, the mini-batch estimate equals the actual gradient $\nabla L(\mathbf{\theta}^{(t)})$.
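To make the procedure concrete, here is a minimal mini-batch SGD sketch in NumPy, assuming a least-squares linear model $f(\mathbf{x},\mathbf{\theta})=\mathbf{\theta}^\top\mathbf{x}$ with squared loss; the synthetic data and hyperparameters (`eta`, `batch_size`, the fixed iteration budget) are illustrative choices, not from the notes.

```python
# Minimal SGD sketch (illustrative assumptions): least-squares linear model,
#   L(theta) = sum_i 1/2 (theta^T x_i - y_i)^2
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.01 * rng.normal(size=N)

theta = np.zeros(d)                       # step 1: initialize theta^(0)
eta, batch_size = 0.1, 32
for t in range(500):                      # step 2: iterate (fixed budget instead of grad == 0)
    idx = rng.choice(N, size=batch_size, replace=False)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / batch_size          # mini-batch gradient estimate
    theta -= eta * grad                   # theta^(t+1) = theta^(t) - eta * grad

print(np.linalg.norm(theta - theta_true))  # should be small
```

Dividing the mini-batch gradient by `batch_size` keeps its expectation equal to the average full-data gradient, which is what the last bullet above refers to.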