# Distance, Norm, Metric, DSA
## 1 Distance in $\mathbb{R}^{m}$
### 1.1 $L_p$ Norm
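For $x\in\mathbb{R}^{m}$ and $p\geq 1$,
$\begin{align*}
\Vert x \Vert_p=\left(\sum\limits_{i=1}^m |x_i|^p\right)^{1/p},\qquad \Vert x \Vert_\infty=\max_i |x_i|
\end{align*}$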
#example Consider $x=[1,1,1]^T\in \mathbb{R}^{3}$
$\begin{align*}
\Vert x \Vert_1=3,\;\Vert x \Vert_2=\sqrt{3},\; \Vert x \Vert_\infty=1
\end{align*}$
#example Unit norm ball in $\mathbb{R}^{2}$
![[2_proximity 2023-04-11 12.39.31.excalidraw.svg|200]]
### 1.2 Weighted $L_p$ norm
$\begin{align*}
\Vert x-x' \Vert_p=\left(\sum\limits_{i=1}^n \boxed{w_i}\;|x_i-x_i'|^{p}\right)^{1/p}
\end{align*}$
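A minimal NumPy check of the example in 1.1 together with the weighted distance above (the weights are an arbitrary illustration):
```python
import numpy as np

x = np.array([1.0, 1.0, 1.0])

# L_p norms of x = [1, 1, 1]^T from the example above
print(np.linalg.norm(x, 1))       # 3.0
print(np.linalg.norm(x, 2))       # sqrt(3) ~ 1.732
print(np.linalg.norm(x, np.inf))  # 1.0

# Weighted l_p distance: (sum_i w_i |x_i - x'_i|^p)^(1/p)
def weighted_lp(x, x2, w, p):
    return (w @ np.abs(x - x2) ** p) ** (1.0 / p)

w = np.array([0.5, 0.3, 0.2])     # hypothetical weights
print(weighted_lp(x, np.zeros(3), w, p=2))  # 1.0
```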
#example *Metric learning*: learn the weights $w_i$ from data so that the induced distance reflects task-specific similarity
## 2 Metrics (a family of distance functions)
A distance function $d:\mathcal{X}\times\mathcal{X}\rightarrow \mathbb{R}$ is a **metric** iff:
- *Positivity*: $d(i,j)>0, \forall i\neq j$, and $d(i,j)=0 \text{ iff } i=j$
- *Symmetry*: $d(i,j)=d(j,i)$
- *Triangle inequality*: $d(i,j)\leq d(i,k)+d(k,j)$
#lemma: *KL divergence* is NOT a metric. It violates: 1) *symmetry*, 2) the *triangle inequality*
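A quick numerical check of the symmetry violation, using $\text{KL}(p\Vert q)=\sum_i p_i\log(p_i/q_i)$ on two arbitrary distributions:
```python
import numpy as np

def kl(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    return np.sum(p * np.log(p / q))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
print(kl(p, q))  # ~0.368
print(kl(q, p))  # ~0.511 -> KL(p||q) != KL(q||p)
```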
#lemma: All *norms* can induce metrics, i.e. $d(x,y)=\Vert x-y \Vert$ is always a metric if $\Vert \cdot \Vert$ is a norm. ([[cse291_unsup_note/hw/hw_2022/hw1_submission#7 Norm of Difference as Metric|proof]])
#lemma: *Edit distance* (Levenshtein distance) is a metric. $d(x,y)$ = min # of insertions, deletions, and substitutions needed to get from $x$ to $y$. ([[cse291_unsup_note/hw/hw_2022/hw1_submission#6 Metric|proof]])
- E.g. let $x=$ [A,G,T,C,G,T], $y=$ [G,A,C,G,T]: 1) delete the leading A in $x$, 2) substitute the resulting second character T with A. ($\text{dist}(x,y)=2$)
- What if some substitutions are more costly? -> Then $d(x,y)$ is still a metric iff the *cost matrix* is symmetric! (A unit-cost sketch follows below.)
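A minimal dynamic-programming sketch of unit-cost edit distance, verifying the example above:
```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions, substitutions from x to y."""
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete all of x[:i]
    for j in range(m + 1):
        D[0][j] = j                      # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[n][m]

print(edit_distance("AGTCGT", "GACGT"))  # 2, matching the example
```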
## 3 Similarity Functions
**Jaccard similarity**
$\begin{align*}
s(A,B)=\frac{|A\cap B|}{|A\cup B|}\; \in[0,1]
\end{align*}$
Widely used in information retrieval (e.g. web search). Maximized ($s=1$) when $B=A$
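A one-function sketch (the word sets are arbitrary illustrations):
```python
def jaccard(A, B):
    # |A ∩ B| / |A ∪ B|, always in [0, 1]
    return len(A & B) / len(A | B)

print(jaccard({"the", "quick", "fox"}, {"the", "lazy", "fox"}))  # 2/4 = 0.5
```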
**Cosine similarity**
$\begin{align*}
s(x,z)=\frac{x^Tz}{\Vert x \Vert_2\Vert z \Vert_2}=\cos(\theta)\;\in[-1,1]
\end{align*}$
Maximized when $z=cx$ for some $c\in\mathbb{R}_{+}$
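A minimal NumPy sketch (the vectors are arbitrary illustrations):
```python
import numpy as np

def cosine_sim(x, z):
    return (x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))

x = np.array([1.0, 0.0])
print(cosine_sim(x, np.array([3.0, 0.0])))  # 1.0: z = c*x with c > 0
print(cosine_sim(x, np.array([0.0, 1.0])))  # 0.0: orthogonal vectors
```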
**Dot product**
$\begin{align*}
s(x,z)=x^Tz\;\in(-\infty,\infty)
\end{align*}$
- Issue: $s(x,z)$ can be larger than $s(x,x)$ if $\Vert z \Vert$ is large enough.
- Solution: Force $x,z$ to be unit-length (this recovers cosine similarity); see the sketch below
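A quick numerical illustration of the issue and the unit-length fix (vectors chosen arbitrarily):
```python
import numpy as np

x = np.array([1.0, 0.0])
z = np.array([5.0, 5.0])

# Issue: s(x, z) can exceed s(x, x) when ||z|| is large
print(x @ x)  # 1.0
print(x @ z)  # 5.0 > s(x, x)

# Fix: normalize to unit length first (recovers cosine similarity)
xn, zn = x / np.linalg.norm(x), z / np.linalg.norm(z)
print(xn @ xn, xn @ zn)  # 1.0, ~0.707
```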
**Kernel functions** (generalized dot product)
$\begin{align*}
k(x,z)=\phi(x)\cdot \phi(z)
\end{align*}$
where $\phi:\mathcal{X}\rightarrow \mathbb{R}^{d}$, $1\leq d\leq \infty$, is an embedding function.
- Issue: as with the plain dot product, one needs the embedding to produce unit vectors, i.e. $\Vert \phi(x) \Vert_2=1$.
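One standard instance (an illustration, not specific to these notes) is the degree-2 polynomial kernel on $\mathbb{R}^2$, whose embedding $\phi$ can be written out explicitly:
```python
import numpy as np

def phi(x):
    # Explicit embedding for the degree-2 polynomial kernel on R^2
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    return (x @ z) ** 2  # same value, without ever forming phi

x, z = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(phi(x) @ phi(z))  # 25.0
print(k(x, z))          # 25.0
```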
## 4 DSA for Proximity Search (spatial data structures)
[slides](https://cseweb.ucsd.edu/~dasgupta/251u/proximity-structures-handout.pdf)
**Problem**: Given $n$ data points in $\mathbb{R}^{d}$, computing the NN of a query $q$ by brute force costs $O(nd)$. Can we do better? A smarter way for $d=1$ (sketch below):
- 1) Sort the data once: $O(n\log n)$. 2) Binary search per query: $O(\log n)$. 3) So preprocessing costs $O(n\log n)$, and each subsequent query costs only $O(\log n)$.
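A minimal sketch of this $d=1$ strategy using Python's standard `bisect` module; the data values and query are arbitrary illustrations:
```python
import bisect

data = [7.0, 2.0, 9.0, 4.0, 1.0]
data.sort()  # O(n log n) preprocessing, done once

def nn_1d(q):
    i = bisect.bisect_left(data, q)            # O(log n) per query
    # The NN must be one of the two values flanking the insertion point
    candidates = data[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda v: abs(v - q))

print(nn_1d(5.0))  # 4.0
```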
**Takeaway**: Build a data structure once so that NN queries for arbitrary $q$ become cheap.
- DS building takes roughly $O(n)$ time (up to log factors). Each query then takes $o(n)$
- Some options: *K-d tree*, *Ball tree*, *Locality-sensitive hashing*
### 4.1 $k$-d tree for vectors in a $d$-dimensional vector space
A hierarchical, *rectilinear spatial partition*
#algorithm For dataset $S\subset \mathbb{R}^{d}$
- Pick a coordinate to split on (a common choice: the one with the largest spread) and split $S$ at its median value
- Recurse on the two halves; stop when a cell contains at most $n_0$ points
*DS building* time complexity:
- Height of the tree: $\log(\frac{n}{n_0})$
- Median-finding: $O(n)$ total per level of the tree (linear-time selection)
- Final time for building the tree: $O(n\log(\frac{n}{n_0}))$
*DS querying* time complexity (*defeatist* search for query $q$; a code sketch follows this list):
- Move $q$ down to its leaf-cell
- Return the closest point in the cell
- Final time: $O(\log(n/n_0))+O(d\cdot n_0)$
- *Problem*: Might not return the actual NN when the actual NN is in the adjacent box.
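A minimal sketch of the build and defeatist-search procedures above, assuming a hypothetical leaf size `N0 = 4`, a dict-based tree, and the max-spread/median splitting rule (illustrative choices):
```python
import numpy as np

N0 = 4  # hypothetical leaf size n_0

def build(points):
    """Recursively split at the median of the max-spread coordinate."""
    if len(points) <= N0:
        return {"leaf": points}
    dim = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
    med = float(np.median(points[:, dim]))
    left, right = points[points[:, dim] <= med], points[points[:, dim] > med]
    if len(left) == 0 or len(right) == 0:   # degenerate split: make a leaf
        return {"leaf": points}
    return {"dim": dim, "med": med, "left": build(left), "right": build(right)}

def defeatist_nn(node, q):
    """Descend to q's leaf cell; return the closest point in that cell only."""
    while "leaf" not in node:
        node = node["left"] if q[node["dim"]] <= node["med"] else node["right"]
    pts = node["leaf"]
    return pts[np.argmin(np.linalg.norm(pts - q, axis=1))]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
tree = build(X)
print(defeatist_nn(tree, np.array([0.1, -0.2])))  # fast, but may miss the true NN
```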
*DS querying* time complexity (*comprehensive* search for query $q$):
- Go to $q$'s leaf cell as in defeatist search, then backtrack: also examine sibling cells whose boundary lies closer to $q$ than the current best candidate, so the true NN is always found (at higher cost)