# Principal Component Analysis
Slides: [[lecture10_pca.pdf]]

## PCA
- *Input*: $X\in \mathbb{R}^{d\times n}$. $n$ data points in $d$-dimensional space
- *Output*: $Z\in \mathbb{R}^{k\times n}$. $n$ data points in $k$-dimensional space, $k\ll d$

*Example*. 2D to 1D PCA. Project $X\mapsto p_1^TX$ and choose $p_1\in \mathbb{R}^{d}$ such that the variance along the projection is maximized.
$\begin{align*}
p_1&= \arg\max_{\|w\|=1}\sum\limits_{i=1}^n\big(w^TX_i-\mathbb{E}(w^TX_i)\big)^2 = \arg\max_{\|w\|=1}\sum\limits_{i=1}^n(w^TX_i-w^T\bar{X})^2\\
&= \arg\max_{\|w\|=1}\sum\limits_{i=1}^nw^T(X_i-\bar{X})(X_i-\bar{X})^Tw\\
&= \arg\max_{\|w\|=1}w^T(X-\bar{X})(X-\bar{X})^Tw\\
&= \arg\max_{\|w\|=1}w^T\text{Cov}(X)\,w
\end{align*}$
Since $\text{Cov}(X)$ is PSD, we can solve this constrained problem via its Lagrangian:
$\begin{align*}
L&= w^T\text{Cov}(X)\,w-\lambda (w^Tw-1)\\
\frac{\partial L}{\partial w}&= 2\,\text{Cov}(X)\,w-2\lambda w=0 \;\Rightarrow\; \text{Cov}(X)\,w=\lambda w\\
p_1&= \text{eigenvector of Cov}(X)\text{ with the largest eigenvalue}
\end{align*}$
(A NumPy sketch of this eigendecomposition view is at the end of this note.)

## t-SNE: t-Distributed Stochastic Neighbor Embedding
[Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing Data using t-SNE." Journal of Machine Learning Research 9 (Nov 2008): 2579-2605.]
- *Input*: A set of points in a high-dimensional space
- *Goal*: Find a faithful non-linear dimensionality reduction
	- Non-linearity -> it adapts to the underlying data
	- Different transformations are applied to different regions

Model the similarity of points in the high-dimensional space as the *conditional probability* that point $i$ would choose point $j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$:
$\begin{align*}
\begin{array}{c|c}
\text{Similarity in original space}&\text{Similarity in lower-dim space}\\\hline
p_{j \mid i}=\frac{\exp \left(-\left\|x_i-x_j\right\|^2 / 2 \sigma_i^2\right)}{\sum_{k \neq i} \exp \left(-\left\|x_i-x_k\right\|^2 / 2 \sigma_i^2\right)} &q_{j \mid i}=\frac{\exp \left(-\left\|y_i-y_j\right\|^2\right)}{\sum_{k \neq i} \exp \left(-\left\|y_i-y_k\right\|^2\right)}
\end{array}
\end{align*}$
t-SNE minimizes the mismatch between $p$ and $q$, measured by the Kullback-Leibler divergence:
$\begin{align*}
C=\sum_i KL\left(P_i \,\|\, Q_i\right)=\sum_i \sum_j p_{j \mid i} \log \frac{p_{j \mid i}}{q_{j \mid i}}
\end{align*}$
Hyperparameters:
- Number of iterations of the optimization algorithm
- $\sigma_i$ for each data point (far too many hyperparameters)

-> Observation:
- Any particular value of $\sigma_i$ induces a probability distribution $P_i$ over all of the other data points
- When $\sigma_i$ increases, the entropy $\mathrm{H}\left(P_i\right)$ increases
- Workaround: t-SNE performs a *binary search for the value of $\sigma_i$ that yields a given $\mathrm{H}\left(P_i\right)$* (sketch below)
	- *Perplexity*: $2^{\mathrm{H}\left(P_i\right)}$
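
Returning to the PCA derivation above, here is a minimal NumPy sketch (not the lecture's code) of the eigendecomposition view: center the data, build $\text{Cov}(X)$, and project onto the top-$k$ eigenvectors. The function name `pca` and the random example data are illustrative.

```python
import numpy as np

def pca(X, k):
    """Project a d x n data matrix X onto its top-k principal components."""
    n = X.shape[1]
    X_centered = X - X.mean(axis=1, keepdims=True)  # subtract the mean from every column
    cov = X_centered @ X_centered.T / n             # d x d covariance matrix Cov(X)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: symmetric PSD input, ascending eigenvalues
    P = eigvecs[:, ::-1][:, :k]                     # top-k eigenvectors as columns (d x k)
    Z = P.T @ X_centered                            # k x n low-dimensional representation
    return Z, P

# Toy usage: 100 points in d = 5 dimensions reduced to k = 2
X = np.random.randn(5, 100)
Z, P = pca(X, k=2)
print(Z.shape)  # (2, 100)
```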
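
For the t-SNE cost, a minimal NumPy sketch, assuming the $d\times n$ data layout used above, of the conditional probabilities $p_{j\mid i}$ and the KL objective $C$; the names `conditional_probabilities` and `kl_cost` are illustrative. The same function gives the low-dimensional $q_{j\mid i}$ by fixing the bandwidth so that $2\sigma^2=1$.

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """P[i, j] = p_{j|i} for a d x n data matrix X and per-point bandwidths sigma_i."""
    D = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # n x n squared distances ||x_i - x_j||^2
    P = np.exp(-D / (2.0 * sigmas[:, None] ** 2))             # Gaussian centered at x_i with bandwidth sigma_i
    np.fill_diagonal(P, 0.0)                                  # a point never picks itself (sum over k != i)
    return P / P.sum(axis=1, keepdims=True)                   # normalize each row

def kl_cost(P, Q, eps=1e-12):
    """C = sum_i KL(P_i || Q_i) = sum_i sum_j p_{j|i} log(p_{j|i} / q_{j|i})."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))

# Low-dimensional similarities q_{j|i}: constant bandwidth with 2*sigma^2 = 1, i.e.
# Q = conditional_probabilities(Y, np.full(Y.shape[1], 1.0 / np.sqrt(2.0)))
```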
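
Finally, a sketch of the binary search over $\sigma_i$ described above: find the bandwidth whose induced distribution $P_i$ has entropy $\log_2$ of the user-chosen perplexity. `sigma_for_perplexity` is an illustrative name, not the reference implementation.

```python
import numpy as np

def sigma_for_perplexity(dist_sq_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search sigma_i so that 2^{H(P_i)} matches the target perplexity.

    dist_sq_i: squared distances from point i to all other points (self excluded).
    """
    target_entropy = np.log2(target_perplexity)    # perplexity = 2^{H(P_i)}
    lo, hi = 0.0, np.inf
    sigma = 1.0
    for _ in range(max_iter):
        p = np.exp(-dist_sq_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # H(P_i) in bits
        if abs(entropy - target_entropy) < tol:
            break
        if entropy > target_entropy:   # sigma too large: distribution too flat
            hi = sigma
            sigma = (lo + hi) / 2.0
        else:                          # sigma too small: distribution too peaked
            lo = sigma
            sigma = 2.0 * sigma if np.isinf(hi) else (lo + hi) / 2.0
    return sigma
```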