# Projection and SVD (spectral methods)

Slides: [[lec7_informative_projection.pdf|informative projections]], [[lec8_svd.pdf|svd]]

## 1 Intro: Linear Projection and PCA

*Motivation*: Compress MNIST: eliminate pixels with low variance. We can discard ~300 pixels while losing very little of the overall variance $=\sum\limits_{i=1}^{784}\text{var}(x_i)$.

![[4_svd 2023-04-25 12.36.39.excalidraw.svg|400]]

The ***projection*** of $x \in \mathbb{R}^d$ in the direction $u \in \mathbb{R}^d$ ($\|u\|_2=1$) is the dot product:
$\begin{align*}
x \cdot u=u \cdot x=u^T x=\sum_{i=1}^d u_i x_i
\end{align*}$

A one-dimensional projection has the form
$\begin{align*}
x \mapsto u \cdot x \quad \text { for some unit direction } u \in \mathbb{R}^d,
\end{align*}$
and the best direction to project onto is the *direction of maximum variance*.

#lemma 1) The *variance of $X$ in direction $u$* is given by $u^T\Sigma u$, where $\Sigma\in \mathbb{R}^{d\times d}$ is the *covariance matrix* of $X$. ($\Sigma$ is always PSD.)

*Proof*.
$\begin{align*}
\Sigma&= \mathbb{E}[(X-\mu)(X-\mu )^T]\\
u^T\Sigma u&= \mathbb{E}[u^T(X-\mu)(X-\mu)^Tu]\\
&= \mathbb{E}\left[(u^T(X-\mu))^2\right]=\text{var}(u^TX)
\end{align*}$

#lemma 2) $u^T\Sigma u$ is maximized when $u$ is the 1st *eigenvector* of $\Sigma$ (i.e. the 1st *principal component*), and the maximum variance is the corresponding *eigenvalue* (i.e. the *variance* in the direction of the 1st principal component).

### 1.1 Projection onto $k$-dimensional subspace

Projecting $x \in \mathbb{R}^d$ onto the $k$-dimensional subspace defined by vectors $u_1, \ldots, u_k \in \mathbb{R}^d$ is easiest when the $u_i$'s are orthonormal:
- They have length one.
- They are at right angles to each other: $u_i \cdot u_j=0$ when $i \neq j$.
- Let $U$ be the $d\times k$ matrix with columns $u_1,\cdots,u_k$.

$\begin{align*}
\text{Projection}&= \left(x \cdot u_1, x \cdot u_2, \ldots, x \cdot u_k\right)=\underbrace{\left(\begin{array}{c} \longleftarrow u_1 \longrightarrow \\ \vdots \\ \longleftarrow u_k \longrightarrow \end{array}\right)}_{\text {call this } U^T}\left(\begin{array}{l} \uparrow \\ x \\ \downarrow \end{array}\right)&=U^Tx\\
\text{Reconstruction}&= \left(x \cdot u_1\right) u_1+\left(x \cdot u_2\right) u_2+\cdots+\left(x \cdot u_k\right) u_k&=U U^T x
\end{align*}$

### 1.2 PCA

![[4_svd 2023-04-25 13.10.30.excalidraw.svg|700]]

Let $\Sigma$ be the $d \times d$ covariance matrix of $X$. In $O\left(d^3\right)$ time, we can compute its *eigendecomposition*, consisting of
- Real *eigenvalues* $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$
- Corresponding *eigenvectors* $u_1, \ldots, u_d \in \mathbb{R}^d$ that are orthonormal (unit length and at right angles to each other)

***Principal component analysis***: Suppose we want to map data $X \in \mathbb{R}^d$ to just $k$ dimensions, while capturing as much of the variance of $X$ as possible. The best choice of projection is
$
x \mapsto\left(u_1 \cdot x, u_2 \cdot x, \ldots, u_k \cdot x\right)=U^Tx,
$
where $u_1, \ldots, u_k$ are the top $k$ eigenvectors above and $U$ is the $d \times k$ matrix with these columns.
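As a concrete companion to the projection and reconstruction formulas above, here is a minimal NumPy sketch of PCA via eigendecomposition of the empirical covariance matrix. It is not from the slides: the function name `pca_project`, the centering step (which matches the affine formula in §4.3 below), and the synthetic-data usage are my own illustration.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n x d) onto the top-k principal components.

    Returns the k-dimensional projections U^T (x - mu) and the
    reconstructions U U^T (x - mu) + mu, one row per data point.
    """
    mu = X.mean(axis=0)
    Xc = X - mu                             # center the data
    Sigma = (Xc.T @ Xc) / X.shape[0]        # d x d empirical covariance
    evals, evecs = np.linalg.eigh(Sigma)    # eigenvalues ascending, orthonormal eigenvectors as columns
    U = evecs[:, ::-1][:, :k]               # top-k eigenvectors as columns (d x k)
    Z = Xc @ U                              # projections (n x k)
    X_hat = Z @ U.T + mu                    # reconstructions (n x d)
    return Z, X_hat

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, X_hat = pca_project(X, k=2)
```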
## 2 Eigendecomposition: review

![[Pasted image 20230425132431.png|200]]

Let $M$ be any $d \times d$ matrix.
- $M$ defines a linear function, $x \mapsto M x$. This maps $\mathbb{R}^d$ to $\mathbb{R}^d$.
- We say $u \in \mathbb{R}^d$ is an *eigenvector* of $M$ if $M u=\lambda u$ for some scaling constant $\lambda$. This $\lambda$ is the *eigenvalue* associated with $u$.
- $M$ maps an eigenvector $u$ to a vector in the same direction, scaled by $\lambda$.

#lemma 3) Let $M$ be any real symmetric $d \times d$ matrix. Then $M$ has
- $d$ eigenvalues $\lambda_1, \ldots, \lambda_d$
- corresponding eigenvectors $u_1, \ldots, u_d \in \mathbb{R}^d$ that are orthonormal

We can think of $u_1, \ldots, u_d$ as the *axes of the natural coordinate system* for $M$.

## 3 Spectral (Eigen) Decomposition

Let $M$ be any real symmetric $d \times d$ matrix. Then $M$ has orthonormal eigenvectors $u_1, \ldots, u_d \in \mathbb{R}^d$ and corresponding eigenvalues $\lambda_1, \ldots, \lambda_d$.

***Spectral decomposition***: Another way to write $M$:

![[Pasted image 20230425133429.png|500]]

Thus $M x=U \Lambda U^T x$:
- $U^T$ rewrites $x$ in the $\left\{u_i\right\}$ coordinate system
- $\Lambda$ is a simple coordinate scaling in that basis
    - when $M$ is a covariance matrix, $\Lambda$ is the covariance matrix of the projected data
- $U$ sends the scaled vector back into the usual coordinate basis

## 4 PCA Optimality Properties

### 4.1 Linear vs. affine subspace

- A *linear* subspace $L$ has to go through the origin.
- An *affine* subspace $A$ doesn't have to go through the origin.
- We can always find $p_0\in \mathbb{R}^{d}$ such that $A=L+p_0$.

### 4.2 Best approximating linear subspace

Given: $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$ and $k<d$.
- Choose a $k$-dimensional linear subspace "close to the data".
- Approximate each $x_i$ by its projection $\tilde{x}_i$ onto this subspace.

*Goal*: Minimize the distortion $\sum_{i=1}^n\left\|x_i-\tilde{x}_i\right\|^2$.

*Claim: **PCA finds the optimal subspace***

*Proof*. Say we project to a $k$-dimensional subspace $S$ spanned by orthonormal vectors $p_1,\cdots,p_k$. Let $p=\begin{bmatrix}p_1&\cdots&p_k\end{bmatrix}$, where $p^\top p=I_k$ (i.e. the columns of $p$ are orthonormal). We aim to find $p$ which minimizes $\sum_{i=1}^n\left\|x_i-pp^\top x_i\right\|^2$. WTS: $p_1,\cdots,p_k$ are the top $k$ eigenvectors of the (uncentered) covariance matrix $\sum\limits_{i=1}^nx_ix_i^\top$.

### 4.3 Best approximating affine subspace

Pick any $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$ and any $k<d$.
- Let $\mu$ be the empirical average of the $\left\{x_i\right\}$ and $\Sigma$ the empirical covariance matrix.
- Let $u_1, \ldots, u_k$ be the top $k$ eigenvectors of $\Sigma$. Make these the columns of a $d \times k$ matrix $U$.

Projection onto the best approximating *affine subspace*:
$
x \longmapsto U U^{\top}(x-\mu)+\mu
$

## 5 Random Projection (Johnson-Lindenstrauss Lemma)

*Claim*: Any set of $n$ points is approximately embeddable in $O(\log n)$ dimensions.
- Pick any $0<\epsilon \leq 1 / 2$ and set $k=\left(4 / \epsilon^2\right) \log n$.
- Any $n$ points in $\mathbb{R}^d$ can be embedded into $\mathbb{R}^k$, such that *each* of the interpoint (Euclidean) distances is distorted by at most a multiplicative factor of $1 \pm \epsilon$.
- Moreover, a projection onto a random $k$-dimensional subspace will achieve this with probability close to 1:
$\begin{align*}
\forall i,j\quad \sqrt{\frac{d}{k}}\Vert R^\top x_i-R^\top x_j \Vert_2=(1\pm \epsilon)\Vert x_i-x_j\Vert_2
\end{align*}$

**Choosing a random matrix**: We want to project onto a random $k$-dimensional subspace of $\mathbb{R}^{d}$.
1. Pick $k$ random unit vectors that are orthogonal:
    - Pick $k$ vectors at random from the unit sphere.
    - (Optional) Orthogonalize them using Gram-Schmidt.
2. Let $R$ be the $d\times k$ matrix with these columns.
3. Projection: $x\mapsto R^\top x$ (see the sketch below).
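To make the recipe above concrete, here is a minimal NumPy sketch of a random projection. It is my own illustration, not from the slides: Gaussian columns followed by a QR factorization stand in for "random unit vectors + Gram-Schmidt", and the $\sqrt{d/k}$ rescaling matches the distance statement above.

```python
import numpy as np

def random_projection(X, k, seed=None):
    """Project the rows of X (n x d) onto a random k-dimensional subspace.

    The columns of R form an orthonormal basis of a random subspace; the
    sqrt(d/k) factor rescales so interpoint distances are roughly preserved.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    G = rng.normal(size=(d, k))        # random Gaussian directions
    R, _ = np.linalg.qr(G)             # orthonormalize the k columns (Gram-Schmidt equivalent)
    return np.sqrt(d / k) * (X @ R)    # one projected row per point: sqrt(d/k) * R^T x

# Illustrative check of one interpoint distance
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 1000))
Y = random_projection(X, k=200, seed=2)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(Y[0] - Y[1]))  # should be close
```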
## 6 SVD: Generalizing Spectral Decomposition

For symmetric matrices (e.g. covariance matrices), we have seen:
- Results about existence of eigenvalues and eigenvectors
- Eigenvectors form an alternative basis
- The resulting spectral decomposition, used in PCA

What about arbitrary matrices $M \in \mathbb{R}^{p \times q}$ (assume $p\le q$ without loss of generality)? Two symmetric matrices can be built from $M$:
$\begin{align*}
M\rightarrow \begin{cases}MM^\top \quad(p\times p)\\M^\top M \quad(q\times q)\end{cases}
\end{align*}$

Any $p\times q\;(p\le q)$ matrix has a (thin) ***singular value decomposition***:

![[Pasted image 20230503112221.png|500]]

- $u_1, \ldots, u_p$ are orthonormal vectors in $\mathbb{R}^p$
- $v_1, \ldots, v_p$ are orthonormal vectors in $\mathbb{R}^q$
- $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0$ are the singular values

We can approximate $M$ concisely by keeping only the first $k$ columns of $U$, the first $k$ singular values, and the first $k$ rows of $V^\top$:

![[Pasted image 20230503112456.png|500]]

### 6.1 Optimal property of SVD

Let $M$ be any $p \times q$ matrix. We want to approximate $M$ by a $p \times q$ matrix $\widehat{M}$ of the form $U W$:
- $U$ is $p \times k$ and $W$ is $k \times q$
- $k \leq p, q$ is of our choosing

![[Pasted image 20230503112838.png|300]]

*Claim*: SVD yields the best such approximation $\widehat{M}$, minimizing the squared error
$
\sum_{i, j}\left(M_{i j}-\widehat{M}_{i j}\right)^2
$
(a small code sketch appears at the end of these notes). Use cases:
1. A more compact way to store $M$
2. Gives an embedding of the data
3. Gives a simple solution to the *matrix completion problem*

## 7 Examples

#example Compute the projection of $x=[2,3]^T$ along:
- the $x_1$-axis: $\begin{bmatrix}1\\0\end{bmatrix}^T x=2$
- the direction of $[1,-1]^T$: $[2,3]^T\cdot [1,-1]^T / \sqrt{2}=-1/\sqrt{2}$

#example What are the eigenvectors and eigenvalues of:
$\begin{align*}
M&= \left(\begin{array}{ccc} 2 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 10 \end{array}\right) ;\; u_1,u_2,u_3 = \begin{bmatrix}1\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\end{bmatrix},\begin{bmatrix}0\\0\\1\end{bmatrix};\;\lambda_1,\lambda_2,\lambda_3=2,-1,10
\end{align*}$

#example
$\begin{align*}
M=\left(\begin{array}{cc}1 & -2 \\-2 & 1 \end{array}\right) \text { has eigenvectors } u_1=\frac{1}{\sqrt{2}}\left(\begin{array}{l} 1 \\ 1 \end{array}\right), u_2=\frac{1}{\sqrt{2}}\left(\begin{array}{c} -1 \\ 1 \end{array}\right)
\end{align*}$
1. $u_1,u_2$ are orthonormal
2. $Mu_1=\frac{1}{\sqrt{2}}[-1,-1]^T=-u_1,\;\lambda_1=-1$. $Mu_2=\frac{1}{\sqrt{2}}[-3,3]^T=3u_2,\;\lambda_2=3$
3. $M=\frac{1}{{2}}\begin{bmatrix}1&-1\\1&1\end{bmatrix}\begin{bmatrix}-1&0\\0&3\end{bmatrix}\begin{bmatrix}1&1\\-1&1\end{bmatrix}$

### 7.1 Case study: quantifying personality

*Lexical hypothesis*: the most important personality characteristics have become encoded in natural language.
- Allport and Odbert (1936): identified 4500 words describing personality traits.
- Group these words into (approximate) synonyms by manual clustering, e.g. Norman (1967).
- Data collection: ask subjects whether these words describe them.

*PCA*: A single PCA dimension would account for both "trust" and "generosity", which are highly correlated.

![[4_svd 2023-04-25 13.45.02.excalidraw.svg|500]]

- Apply PCA to the rows of this matrix.
- This yields the "big five" taxonomy: Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness.
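As promised in §6, here is a minimal NumPy sketch of the rank-$k$ approximation: keep the first $k$ columns of $U$, the first $k$ singular values, and the first $k$ rows of $V^\top$. The function name `low_rank` and the random test matrix are my own illustration, not from the slides.

```python
import numpy as np

def low_rank(M, k):
    """Rank-k approximation of M built from the first k columns of U,
    the first k singular values, and the first k rows of V^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # thin SVD
    return (U[:, :k] * s[:k]) @ Vt[:k, :]              # U_k diag(sigma_1..k) V_k^T

# Illustrative usage
rng = np.random.default_rng(3)
M = rng.normal(size=(20, 50))
M_hat = low_rank(M, k=5)
squared_error = np.sum((M - M_hat) ** 2)   # the squared error that the SVD truncation minimizes
```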