# Norm, Nearest Neighbor, and Optimization
## Minkowski Distance ($L_p$ norm)
$d(i, j)=\sqrt[p]{\left|x_{i 1}-x_{j 1}\right|^p+\left|x_{i 2}-x_{j 2}\right|^p+\cdots+\left|x_{i n}-x_{j n}\right|^p}$
where $i=(x_{i1},\cdots, x_{in})$ and $j=(x_{j1}, \cdots, x_{jn})$ are two $n$-dimensional data objects and $p$ is the order.
Properties for a *metric*:
- *Positivity*: $d(i,j)>0, \forall i\neq j$, and $d(i,j)=0 \text{ iff } i=j$
- *Symmetry*: $d(i,j)=d(j,i)$
- *Triangle inequality*: $d(i,j)\leq d(i,k)+d(k,j)$
$L_p$ norms:
- $L_0$: Number of non-zero entries
- $L_1$: Manhattan distance
- $L_2$: Euclidean distance
- $L_\infty$: Maximum of the absolute differences (Chebyshev distance)
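A minimal NumPy sketch of these special cases (the vectors `x` and `y` below are arbitrary examples, not from the notes):
```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski (L_p) distance: the p-th root of the sum of |x_k - y_k|^p."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

print(minkowski(x, y, 1))        # L1 (Manhattan): 3 + 2 + 0 = 5
print(minkowski(x, y, 2))        # L2 (Euclidean): sqrt(9 + 4) ~ 3.61
print(np.max(np.abs(x - y)))     # L_inf: max(3, 2, 0) = 3
print(np.count_nonzero(x - y))   # "L0": number of non-zero entries of x - y = 2
```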
### Cosine similarity
Cosine similarity is equivalent to $L_2$ distance after normalization: if $x$ and $y$ are rescaled to unit $L_2$ norm, then $\Vert x-y\Vert_2^2=2-2x^Ty$, so squared Euclidean distance is a monotone decreasing function of cosine similarity.
$\begin{align*}
\text{Cos similarity: }&\frac{x^Ty}{\Vert x \Vert_2\cdot \Vert y\Vert_2}\\
\text{L2 distance: }&\Vert x-y \Vert_2^2=x^Tx+y^Ty-2x^Ty \overset{\Vert x\Vert_2=\Vert y\Vert_2=1}{=} 2-2x^Ty
\end{align*}$
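A quick numerical check of this relation, as a sketch with arbitrary random vectors:
```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)   # arbitrary vectors

# Normalize to unit L2 norm
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)

cos_sim = xn @ yn
sq_l2 = np.sum((xn - yn) ** 2)

# For unit vectors: ||x - y||_2^2 = 2 - 2 * cos(x, y)
print(np.isclose(sq_l2, 2 - 2 * cos_sim))   # True
```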
## Gradient
The **gradient** of $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is the vector of partial derivatives of $f(\mathbf{x})$ w.r.t. $\mathbf{x}$:
$\begin{align*}
\nabla_x f(x)=\frac{\partial f}{\partial \mathbf{x}}&= [\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2},\cdots,\frac{\partial f}{\partial x_d}]^T
\end{align*}$
#example $\frac{\partial }{\partial \mathbf{x}}\frac{1}{2}(\mathbf{w}^\top \mathbf{x}-b)^2=\mathbf{w}(\mathbf{w}^\top \mathbf{x}-b)$:
$\begin{align*}
f(\mathbf{x})&= \frac{1}{2}\left(\sum\limits_{j}w_jx_j-b\right)^2\\
\frac{\partial f(\mathbf{x})}{\partial x_i}&= w_i\left(\sum\limits_{j}w_jx_j-b\right)=w_i(\mathbf{w}^\top \mathbf{x}-b) \;\rightarrow\;
\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}= \boxed{\mathbf{w}(\mathbf{w}^\top \mathbf{x}-b)}
\end{align*}$
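A finite-difference sanity check of the boxed gradient (a sketch; `w`, `b`, and the evaluation point are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w, b = rng.normal(size=d), 0.7           # arbitrary parameters
x = rng.normal(size=d)                   # arbitrary evaluation point

f = lambda v: 0.5 * (w @ v - b) ** 2

analytic = w * (w @ x - b)               # boxed gradient from the derivation above

# Central finite differences as an independent check
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(d)])

print(np.allclose(analytic, numeric))    # True
```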
## Convexity
A function $f: \mathbb{R}^{d} \rightarrow \mathbb{R}$ is **convex** if $\forall \mathbf{a,b}\in \mathbb{R}^{d}$, $\theta\in[0,1]$:
$f(\theta \mathbf{a}+ (1-\theta)\mathbf{b})\leq \theta f(\mathbf{a}) + (1-\theta)f(\mathbf{b})$
**First-order (gradient) condition** for convexity (for differentiable $f$, required for all $x, y$):
$\nabla f(x)= \begin{bmatrix}\partial f / \partial x_1\\\vdots\\\partial f / \partial x_d\end{bmatrix}, \quad f(y) \geq f(x)+\nabla f(x)^T(y-x)$
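A numerical spot-check of the first-order condition, as a sketch that reuses the convex quadratic $f(\mathbf{x})=\frac{1}{2}(\mathbf{w}^\top\mathbf{x}-b)^2$ and its gradient from the example above (all sampled points are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
w, b = rng.normal(size=d), 0.5                 # arbitrary parameters

f = lambda v: 0.5 * (w @ v - b) ** 2           # convex in v
grad = lambda v: w * (w @ v - b)               # gradient derived in the example above

# f(y) >= f(x) + grad(x)^T (y - x) should hold for every pair (x, y)
holds = all(
    f(y) >= f(x) + grad(x) @ (y - x) - 1e-9    # tiny tolerance for rounding
    for x, y in (rng.normal(size=(2, d)) for _ in range(1000))
)
print(holds)   # True
```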
**Second-order (Hessian) condition** for convexity: the Hessian must be positive semidefinite at every $x$:
$\begin{align*}
\nabla^2 f(x) =\begin{bmatrix}
\frac{\partial^2 f}{\partial x_{1} \partial x_{1}}&\cdots &\frac{\partial^2 f}{\partial x_{1} \partial x_{d}} \\
\vdots & \ddots &\vdots\\
\frac{\partial^2 f}{\partial x_{d} \partial x_{1}} &\cdots &\frac{\partial^2 f}{\partial x_{d} \partial x_{d}}
\end{bmatrix} \succcurlyeq0
\end{align*}$
NOTE: From now on, function arguments are assumed to be vectors, so vectors are no longer written in bold.
#example Prove $f(x)=x^Tx$ is convex.
$\begin{align*}
f(x)&= x^Tx=x^TIx \quad\rightarrow\nabla^2 f(x)= 2I \succcurlyeq0
\end{align*}$
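A numerical counterpart of this proof, as a sketch: estimate the Hessian of $f(x)=x^Tx$ by finite differences at an arbitrary point and confirm it is (approximately) $2I$, hence positive semidefinite:
```python
import numpy as np

f = lambda v: v @ v                              # f(x) = x^T x
d, eps = 3, 1e-4
x0 = np.random.default_rng(2).normal(size=d)     # arbitrary point
E = np.eye(d)

# Finite-difference estimate of the Hessian at x0
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        H[i, j] = (f(x0 + eps*E[i] + eps*E[j]) - f(x0 + eps*E[i] - eps*E[j])
                   - f(x0 - eps*E[i] + eps*E[j]) + f(x0 - eps*E[i] - eps*E[j])) / (4 * eps**2)

print(np.allclose(H, 2 * np.eye(d), atol=1e-4))      # Hessian is 2I
print(bool(np.all(np.linalg.eigvalsh(H) >= -1e-6)))  # all eigenvalues >= 0 => PSD
```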
## Stochastic Gradient Descent (SGD)
Let $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^N$ be the training data points with labels and $\mathbf{\theta}$ be the learnable model parameters.
**Goal:** Minimize the loss function
$\begin{align*}
\min_{\mathbf{\theta}} L\big(\mathbf{\theta}\;;\;\{\mathbf{x}^{(i)},y^{(i)}\}_{i=1}^N\big)=\min_{\mathbf{\theta}} \sum\limits_{i=1}^Nl\Big(f\big(\mathbf{x}^{(i)}, \mathbf{\theta}\big), y^{(i)}\Big)
\end{align*}$
**GD Optimization procedure** (see the code sketch below):
1. Initialize $\mathbf{\theta}^{(0)}$ randomly at $t=0$
2. While $\nabla L(\mathbf{\theta}^{(t)})\neq0$:
- $\mathbf{\theta}^{(t+1)}=\mathbf{\theta}^{(t)}-\eta^{(t)} \nabla L(\mathbf{\theta}^{(t)})$ ($\eta^{(t)}$ is the learning rate)
- $t \coloneqq t+1$
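A minimal GD sketch on the convex quadratic from the gradient example (the parameters, initialization, and step-size rule $\eta = 0.5/\Vert\mathbf{w}\Vert_2^2$ are illustrative assumptions):
```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
w, b = rng.normal(size=d), 1.0                 # arbitrary convex quadratic from the example

L = lambda theta: 0.5 * (w @ theta - b) ** 2
grad = lambda theta: w * (w @ theta - b)

theta = rng.normal(size=d)                     # step 1: random initialization
eta = 0.5 / (w @ w)                            # fixed learning rate below the stability bound
for _ in range(200):                           # step 2: iterate until the gradient vanishes
    g = grad(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    theta = theta - eta * g                    # theta^{(t+1)} = theta^{(t)} - eta * grad L

print(L(theta))   # ~0: any theta with w @ theta = b minimizes L
```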
**GD Cons**:
- Can converge to a local (rather than global) minimum when $L$ is non-convex
- Computing $\nabla L(\mathbf{\theta}^{(t)})$ bottlenecks the procedure: every update needs a pass over all training data points.
**SGD**: Sample a batch of training data to estimate $\nabla L(\mathbf{\theta}^{(t)})$ (see the sketch below)
- In expectation, the batch estimate equals the true gradient $\nabla L(\mathbf{\theta}^{(t)})$
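A minimal mini-batch SGD sketch on synthetic linear-regression data (the data, batch size, and learning rate are illustrative assumptions, not from the notes):
```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 5
theta_true = rng.normal(size=d)                     # hypothetical ground-truth parameters
X = rng.normal(size=(N, d))
y = X @ theta_true + 0.01 * rng.normal(size=N)      # synthetic linear-regression data

def batch_grad(theta, idx):
    """Mini-batch estimate of the gradient of the mean squared loss."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ theta - yb) / len(idx)

theta, eta, B = np.zeros(d), 0.05, 32
for _ in range(2000):
    idx = rng.choice(N, size=B, replace=False)      # sample a batch of training data
    theta = theta - eta * batch_grad(theta, idx)    # unbiased estimate of the full gradient

print(np.linalg.norm(theta - theta_true))           # small: SGD approaches the minimizer
```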