# Principal Component Analysis
Slides: [[lecture10_pca.pdf]]
## PCA
- *Input*: $X\in \mathbb{R}^{d\times n}$. $n$ data points in $d$-dimensional space
- *Output*: $Z\in \mathbb{R}^{k\times n}$. $n$ data points in $k$-dimensional space. $k\ll d$
*Example*. 2D to 1D PCA. Project $X\mapsto p_1^TX$ and choose $p_1\in \mathbb{R}^{d}$ such that the variance along the projection is maximized:
$\begin{align*}
p_1&= \arg \max_{|w|=1}\sum\limits_{i=1}^n\big(w^TX_i-\mathbb{E}(w^TX_i)\big)^2
= \arg\max_{|w|=1}\sum\limits_{i=1}^n(w^TX_i-w^T\bar{X})^2\\
&= \arg\max_{|w|=1}\sum\limits_{i=1}^nw^T(X_i-\bar{X})(X_i-\bar{X})^Tw\\
&= \arg\max_{|w|=1}w^T(X-\bar{X})(X-\bar{X})^Tw\\
&= \arg\max_{|w|=1}w^T\text{Cov}(X)w
\end{align*}$where $\bar{X}$ is the sample mean and $\text{Cov}(X)=\frac{1}{n}(X-\bar{X})(X-\bar{X})^T$ (the constant factor does not change the maximizer). Since $\text{Cov}(X)$ is symmetric PSD, we can solve the constrained problem via its Lagrangian:
$\begin{align*}
L&= w^T\text{Cov}(X)w-\lambda (w^Tw-1)\\
\frac{\partial L}{\partial w}&= 2\text{Cov}(X)w-2\lambda w=0
\;\Rightarrow\; \text{Cov}(X)w=\lambda w
\end{align*}$At a stationary point the objective equals $w^T\text{Cov}(X)w=\lambda$, so $p_1$ is the eigenvector of $\text{Cov}(X)$ with the largest eigenvalue. The remaining components $p_2,\dots,p_k$ are the next eigenvectors, giving $Z=P^TX$ with $P=[p_1,\dots,p_k]$.
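A minimal NumPy sketch of this recipe (center, form the covariance, project onto the top-$k$ eigenvectors); function and variable names are illustrative, not from the slides:
```python
import numpy as np

def pca(X, k):
    """PCA on X with columns as data points: X is (d, n), returns Z of shape (k, n)."""
    d, n = X.shape
    X_bar = X.mean(axis=1, keepdims=True)      # per-feature mean
    Xc = X - X_bar                             # center the data
    cov = Xc @ Xc.T / n                        # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues (symmetric PSD)
    P = eigvecs[:, ::-1][:, :k]                # top-k eigenvectors as columns, (d, k)
    Z = P.T @ Xc                               # project to k dimensions, (k, n)
    return Z, P

# Toy usage: 100 points in 5-D reduced to 2-D
X = np.random.randn(5, 100)
Z, P = pca(X, k=2)
print(Z.shape)  # (2, 100)
```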
## t-SNE: t-Distributed Stochastic Neighbor Embedding
[Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing Data using t-SNE." Journal of Machine Learning Research 9 (2008): 2579-2605.]
- *Input*: A set of points in a high-dimensional space
- *Goal*: Find a faithful non-linear dimensionality reduction.
- Nonlinearity -> the mapping adapts to the underlying data
- Different transformations applied to different regions
Model the similarity of points in the high-dimensional space as the *conditional probability* that point $i$ would choose point $j$ as its neighbor, if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$:
$\begin{align*}
\begin{array}{c|c}
\text{Similarity in original space}&\text{Similarity in lower-dim space}\\\hline
p_{j \mid i}=\frac{\exp \left(-\left\|x_i-x_j\right\|^2 / 2 \sigma_i^2\right)}{\sum_{k \neq i} \exp \left(-\left\|x_i-x_k\right\|^2 / 2 \sigma_i^2\right)} &q_{j \mid i}=\frac{\exp \left(-\left\|y_i-y_j\right\|^2\right)}{\sum_{k \neq i} \exp \left(-\left\|y_i-y_k\right\|^2\right)}
\end{array}
\end{align*}$
t-SNE minimizes the mismatch between $p$ and $q$, measured by the Kullback-Leibler divergence:
$\begin{align*}
C=\sum_i K L\left(P_i|| Q_i\right)=\sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}
\end{align*}$
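A direct NumPy transcription of the two similarity definitions and the KL cost above (the low-dimensional kernel is Gaussian exactly as written in the table; the published t-SNE replaces it with a Student-t kernel). Names are illustrative:
```python
import numpy as np

def conditional_p(X, sigmas):
    """p_{j|i}: Gaussian conditional probabilities in the original space.
    X is (n, d) with rows as points; sigmas is a length-n vector of bandwidths."""
    D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)  # squared distances, (n, n)
    P = np.exp(-D / (2 * sigmas[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                               # a point never picks itself
    return P / P.sum(axis=1, keepdims=True)                # normalize row i over k != i

def conditional_q(Y):
    """q_{j|i}: same construction in the embedding, exponent -||y_i - y_j||^2 as above."""
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)
    Q = np.exp(-D)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum(axis=1, keepdims=True)

def kl_cost(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i), summed over all pairs (i, j)."""
    return np.sum(P * np.log((P + eps) / (Q + eps)))
```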
Hyperparameters:
- Num iterations for optimization algorithm
- $\sigma_i$ for each data point (way too many hyperparameters) -> Observation:
- Any particular value of $\sigma_i$ induces a probability distribution, $P_i$, over all of the other datapoints
- When $\sigma_i$ increases, the entropy $\mathrm{H}\left(P_i\right)$ increases
	- Workaround: t-SNE performs a *binary search for the value of $\sigma_i$ that produces a $P_i$ with a user-specified $\mathrm{H}\left(P_i\right)$* (see the sketch after this list)
	- *Perplexity*: $2^{\mathrm{H}\left(P_i\right)}$; the user specifies a single perplexity instead of $n$ separate $\sigma_i$
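A minimal sketch of that binary search, assuming the target is given as a perplexity value and working on one row of squared distances at a time; all names are illustrative:
```python
import numpy as np

def sigma_for_perplexity(dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary search for sigma_i so the row distribution P_i hits the target perplexity.
    dists_i: squared distances from point i to all other points (self excluded)."""
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        p = np.exp(-dists_i / (2 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))   # H(P_i) in bits
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:           # too spread out -> shrink sigma
            hi = sigma
            sigma = (lo + hi) / 2
        else:                                        # too peaked -> grow sigma
            lo = sigma
            sigma = (lo + hi) / 2 if hi < 1e10 else sigma * 2
    return sigma
```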