# Projection and SVD (spectral methods)
Slides: [[lec7_informative_projection.pdf|informative projections]], [[lec8_svd.pdf|svd]]
## 1 Intro: Linear Projection and PCA
*Motivation*: Compress MNIST: eliminate pixels with low variance. We can discard ~300 of the 784 pixels while losing almost none of the overall variance $=\sum\limits_{i=1}^{784}\text{var}({x_i})$
![[4_svd 2023-04-25 12.36.39.excalidraw.svg|400]]
The ***projection*** of $x \in \mathbb{R}^d$ in the direction $u \in \mathbb{R}^d$ ($\|u\|_2=1$) is the dot product:
$\begin{align*}
x \cdot u=u \cdot x=u^T x=\sum_{i=1}^d u_i x_i
\end{align*}$
The *direction of maximum variance* is the best direction to project:
$\begin{align*}
x \mapsto u \cdot x \quad \text { for some unit direction } u \in \mathbb{R}^d
\end{align*}$
#lemma 1) The *variance of X in direction $u$* is given by $u^T\Sigma u$, where $\Sigma\in \mathbb{R}_{}^{d\times d}$ is the *covariance matrix* of $X$. ($\Sigma$ is always PSD).
*Proof*.
$\begin{align*}
\Sigma&= \mathbb{E}[(X-\mu)(X-\mu )^T]\\
u^T\Sigma u&= \mathbb{E}[u^T(X-\mu)(X-\mu)^Tu]\\
&= \mathbb{E}\left[(u^T(X-\mu))^2\right]=\text{var}(u^TX)
\end{align*}$
#lemma 2) $u^T\Sigma u$ is maximized when $u=$ 1st *eigenvector* of $\Sigma$ (i.e. the 1st *principal component*), and the maximum variance value is the corresponding *eigenvalue* (i.e. the *variance* in the direction of the 1st principal component).
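A minimal numpy sketch (synthetic data; all names here are illustrative) checking both lemmas: the empirical variance of the projected data matches $u^\top\Sigma u$, and the top eigenvector of $\Sigma$ attains the largest such variance.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 2]], size=5000)  # n x d sample

Sigma = np.cov(X, rowvar=False)          # empirical covariance matrix (d x d)
u = np.array([1.0, 1.0]) / np.sqrt(2)    # some unit direction

# Lemma 1: var(u^T X) == u^T Sigma u
print(np.var(X @ u, ddof=1), u @ Sigma @ u)

# Lemma 2: the top eigenvector of Sigma maximizes u^T Sigma u
lam, U = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
u1 = U[:, -1]                            # first principal component
print(u1 @ Sigma @ u1, lam[-1])          # both equal the largest eigenvalue
```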
### 1.1 Projection onto $k$-dimensional subspace
Projecting $x \in \mathbb{R}^d$ into the $k$-dimensional subspace defined by vectors $u_1, \ldots, u_k \in \mathbb{R}^d$.
Easiest when the $u_i$ 's are orthonormal:
- They have length one.
- They are at right angles to each other: $u_i \cdot u_j=0$ when $i \neq j$
- $U$ is the $d\times k$ matrix w/ columns $u_1,\cdots,u_k$
$\begin{align*}
\text{Projection}&= \left(x \cdot u_1, x \cdot u_2, \ldots, x \cdot u_k\right)=\underbrace{\left(\begin{array}{c}
\longleftarrow u_1 \longrightarrow \\
\vdots \\
\longleftarrow u_k \longrightarrow
\end{array}\right)}_{\text {call this } U^T}\left(\begin{array}{l}
\uparrow \\
x \\
\downarrow
\end{array}\right)&=U^Tx\\
\text{Reconstruction}&= \left(x \cdot u_1\right) u_1+\left(x \cdot u_2\right) u_2+\cdots+\left(x \cdot u_k\right) u_k&=U U^T x
\end{align*}$
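A small sketch of the projection and reconstruction formulas (a hypothetical $d=3$, $k=2$ example with hand-picked orthonormal directions):
```python
import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)
U = np.column_stack([u1, u2])    # d x k matrix with orthonormal columns: U.T @ U = I_k

x = np.array([2.0, 3.0, 1.0])
projection = U.T @ x             # k-dimensional coordinates (x . u_1, ..., x . u_k)
reconstruction = U @ projection  # = U U^T x, back in R^d
print(projection, reconstruction)
```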
### 1.2 PCA
![[4_svd 2023-04-25 13.10.30.excalidraw.svg|700]]
Let $\Sigma$ be the $d \times d$ covariance matrix of $X$. In $O\left(d^3\right)$ time, we can compute its *eigendecomposition*, consisting of
- Real *eigenvalues* $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$
- Corresponding *eigenvectors* $u_1, \ldots, u_d \in \mathbb{R}^d$ that are orthonormal (unit length and at right angles to each other)
***Principal component analysis***: Suppose we want to map data $X \in \mathbb{R}^d$ to just $k$ dimensions, while capturing as much of the variance of $X$ as possible. The best choice of projection is:
$
x \mapsto\left(u_1 \cdot x, u_2 \cdot x, \ldots, u_k \cdot x\right)=U^Tx
$
where $u_i$ are the eigenvectors described above.
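A hedged numpy sketch of this PCA map (synthetic data; `k` and the variable names are chosen here for illustration):
```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))     # n x d data matrix
k = 3

Xc = X - X.mean(axis=0)            # center the data
Sigma = np.cov(Xc, rowvar=False)   # d x d covariance; its eigendecomposition is O(d^3)
lam, V = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
U = V[:, ::-1][:, :k]              # top-k eigenvectors as columns (d x k)

Z = Xc @ U                         # n x k: row i is (u_1 . x_i, ..., u_k . x_i)
print(Z.shape, lam[::-1][:k])      # captured variances = top-k eigenvalues
```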
## 2 Eigendecomposition: review
![[Pasted image 20230425132431.png|200]]
Let $M$ be any $d \times d$ matrix.
- $M$ defines a linear function, $x \mapsto M x$. This maps $\mathbb{R}^d$ to $\mathbb{R}^d$.
- We say $u \in \mathbb{R}^d$ is an *eigenvector* of $M$ if $M u=\lambda u$ for some scaling constant $\lambda$. This $\lambda$ is the *eigenvalue* associated with $u$.
- $M$ maps an eigenvector $u$ to a vector in the same direction, scaled by $\lambda$.
#lemma 3) Let $M$ be any real symmetric $d \times d$ matrix. Then $M$ has
- $d$ eigenvalues $\lambda_1, \ldots, \lambda_d$
- corresponding eigenvectors $u_1, \ldots, u_d \in \mathbb{R}^d$ that are orthonormal
- Can think of $u_1, \ldots, u_d$ as the *axes of the natural coordinate system* for $M$.
## 3 Spectral (Eigen) Decomposition
Let $M$ be any real symmetric $d \times d$ matrix. Then $M$ has orthonormal eigenvectors $u_1, \ldots, u_d \in \mathbb{R}^d$ and corresponding eigenvalues $\lambda_1, \ldots, \lambda_d$.
***Spectral decomposition***: Another way to write $M$ :
![[Pasted image 20230425133429.png|500]]
Thus $M x=U \Lambda U^T x$ :
- $U^T$ rewrites $x$ in the $\left\{u_i\right\}$ coordinate system
- $\Lambda$ is a simple coordinate-wise scaling in that basis (when $M$ is a covariance matrix $\Sigma$, $\Lambda$ is the *covariance matrix* of the projected data $U^\top X$)
- $U$ sends the scaled vector back into the usual coordinate basis (see the numerical check below)
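A quick numerical check of $M = U\Lambda U^\top$ and of the rewrite-scale-send-back reading of $Mx$ (a random symmetric matrix is assumed here):
```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
M = (A + A.T) / 2                  # a real symmetric matrix

lam, U = np.linalg.eigh(M)         # orthonormal eigenvectors in the columns of U
print(np.allclose(M, U @ np.diag(lam) @ U.T))              # spectral decomposition

x = rng.normal(size=4)
print(np.allclose(M @ x, U @ (np.diag(lam) @ (U.T @ x))))  # rewrite, scale, send back
```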
## 4 PCA Optimality Properties
### 4.1 Linear vs. affine subspace
- A *linear* subspace $L$ has to go through the origin
- An *affine* subspace $A$ doesn't have to go through the origin
- We can always find $p_0\in \mathbb{R}^{d}$ such that $A=L+p_0$
### 4.2 Best approximating linear subspace
Given: $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$ and $k<d$.
- Choose a $k$-dimensional linear subspace "close to the data"
- Approximate each $x_i$ by its projection $\tilde{x}_i$ onto this subspace
*Goal*: Minimize the distortion $\sum_{i=1}^n\left\|x_i-\tilde{x}_i\right\|^2$
*Claim: **PCA finds the optimal subspace***
*Proof*. Say we project onto a $k$-dimensional subspace $S$ spanned by orthonormal vectors $p_1,\cdots,p_k$. Let $p=\begin{bmatrix}p_1&\cdots&p_k\end{bmatrix}$, where $p^\top p=I_k$ (i.e. the columns of $p$ are orthonormal). We aim to find the $p$ which minimizes $\sum_{i=1}^n\left\|x_i-pp^\top x_i\right\|^2$. WTS: $p_1,\cdots,p_k$ are the top $k$ eigenvectors of the (uncentered) covariance matrix $\sum\limits_{i=1}^nx_ix_i^\top$.
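The proof is not carried out here, but a hedged numerical illustration (random correlated data; names chosen for this sketch) shows the top-$k$ eigenvectors of $\sum_i x_i x_i^\top$ beating a random orthonormal $p$ on the distortion objective:
```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))  # n x d, correlated columns
k = 3

def distortion(P):                      # P: d x k with orthonormal columns
    Xhat = X @ P @ P.T                  # each x_i replaced by P P^T x_i
    return np.sum((X - Xhat) ** 2)

# top-k eigenvectors of sum_i x_i x_i^T (no centering: the subspace is linear)
_, V = np.linalg.eigh(X.T @ X)
P_pca = V[:, ::-1][:, :k]

# a random k-dimensional subspace for comparison (orthonormalized via QR)
P_rand, _ = np.linalg.qr(rng.normal(size=(8, k)))

print(distortion(P_pca), "<=", distortion(P_rand))
```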
### 4.3 Best approximating affine subspace
Pick any $n$ points $x_1, \ldots, x_n \in \mathbb{R}^d$ and any $k<d$.
- Let $\mu$ be the empirical average of the $\left\{x_i\right\}$ and $\Sigma$ the empirical covariance matrix.
- Let $u_1, \ldots, u_k$ be the top $k$ eigenvectors of $\Sigma$. Make these the columns of a $d \times k$ matrix $U$.
Projection onto the best approximating *affine subspace*:
$
x \longmapsto U U^{\top}(x-\mu)+\mu
$
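A sketch of this affine projection (synthetic off-origin data; variable names are assumptions of this example):
```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, size=(300, 6))  # data whose mean is far from the origin
k = 2

mu = X.mean(axis=0)                     # empirical mean
Sigma = np.cov(X, rowvar=False)         # empirical covariance
_, V = np.linalg.eigh(Sigma)
U = V[:, ::-1][:, :k]                   # top-k eigenvectors, d x k

X_proj = (X - mu) @ U @ U.T + mu        # x -> U U^T (x - mu) + mu, row by row
print(np.sum((X - X_proj) ** 2))        # distortion of the best affine fit
```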
## 5 Random Projection (Johnson-Lindenstrauss Lemma)
*Claim*: Any set of $n$ points is approximately embeddable in $O(\log n)$ dimensions.
- Pick any $0<\epsilon \leq 1 / 2$ and set $k=\left(4 / \epsilon^2\right) \log n$
- Any $n$ points in $\mathbb{R}^d$ can be embedded into $\mathbb{R}^k$, such that *each* of the interpoint (Euclidean) distances is distorted by at most a multiplicative factor of $1 \pm \epsilon$.
- Moreover, a projection onto a random $k$-dimensional subspace will achieve this with probability close to 1.
$\begin{align*}
\forall i,j\quad \sqrt{\frac{d}{k}}\Vert R^\top x_i-R^\top x_j \Vert_2=(1\pm \epsilon)\Vert x_i-x_j\Vert_2
\end{align*}$
**Choosing a random matrix**: We want to project onto a random $k$-dimensional subspace of $\mathbb{R}^{d}$ (see the sketch below).
1. Pick $k$ random, mutually orthogonal unit vectors:
    - Pick $k$ vectors at random from the unit sphere
    - (Optional) Orthogonalize them using Gram-Schmidt
2. Let $R$ be the $d\times k$ matrix with these vectors as columns.
3. Projection: $x\mapsto R^\top x$
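A hedged sketch of the whole procedure and of the distance-preservation claim (Gaussian random directions; the optional Gram-Schmidt step is skipped):
```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 50, 1000, 0.5
k = int(4 / eps**2 * np.log(n))   # k = (4 / eps^2) log n

X = rng.normal(size=(n, d))       # n arbitrary points in R^d

R = rng.normal(size=(d, k))
R /= np.linalg.norm(R, axis=0)    # k random unit vectors as columns

def pairwise_dists(A):            # all pairwise Euclidean distances
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))[np.triu_indices(len(A), 1)]

ratios = np.sqrt(d / k) * pairwise_dists(X @ R) / pairwise_dists(X)
print(ratios.min(), ratios.max())  # typically within 1 +/- eps
```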
## 6 SVD: Generalizing Spectral Decomposition
For symmetric matrices (e.g. covariance matrices), we have seen:
- Results about existence of eigenvalues and eigenvectors
- Eigenvectors form an alternative basis
- Resulting spectral decomposition, used in PCA
What about arbitrary matrices $M \in \mathbb{R}^{p \times q}$ (assume $p\le q$ without loss of generality)?
$\begin{align*}
M\rightarrow \begin{cases}MM^\top \quad(p\times p)\\M^\top M \quad(q\times q)\end{cases}
\end{align*}$
Both $MM^\top$ and $M^\top M$ are symmetric and PSD, so the spectral decomposition above applies; their orthonormal eigenvectors become the left and right singular vectors below.
Any $p\times q$ ($p\le q$) matrix has a (thin) ***singular value decomposition***:
![[Pasted image 20230503112221.png|500]]
- $u_1, \ldots, u_p$ are orthonormal vectors in $\mathbb{R}^p$
- $v_1, \ldots, v_p$ are orthonormal vectors in $\mathbb{R}^q$
- $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0$ are singular values
We can approximate $M$ concisely by keeping only the first $k$ columns of $U$, the first $k$ singular values, and the first $k$ rows of $V^\top$:
![[Pasted image 20230503112456.png|500]]
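A numpy sketch of the thin SVD and its rank-$k$ truncation (a random $p\times q$ matrix; shapes follow the $p\le q$ convention above):
```python
import numpy as np

rng = np.random.default_rng(5)
p, q, k = 5, 8, 2
M = rng.normal(size=(p, q))

U, s, Vt = np.linalg.svd(M, full_matrices=False)  # thin SVD: U (p x p), s (p,), Vt (p x q)
print(np.allclose(M, U @ np.diag(s) @ Vt))        # exact reconstruction

M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # keep first k columns / values / rows
print(M_k.shape)                                  # still p x q, but rank k
```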
### 6.1 Optimal property of SVD
Let $M$ be any $p \times q$ matrix. Want to approximate $M$ by a $p \times q$ matrix $\widehat{M}$ of the form $U W$ :
- $U$ is $p \times k$ and $W$ is $k \times q$
- $k \leq p, q$ is of our choosing
![[Pasted image 20230503112838.png|300]]
*Claim*: SVD yields the best such approximation $\widehat{M}$, minimizing the squared error
$
\sum_{i, j}\left(M_{i j}-\widehat{M}_{i j}\right)^2
$
Use cases:
1. A more compact way to store $M$
2. Gives an embedding of data
3. Gives a simple solution to the *matrix completion problem*
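A hedged check of the optimality claim above (random matrices; the competing $UW$ factorization here is arbitrary): the rank-$k$ SVD truncation attains squared error $\sum_{i>k}\sigma_i^2$, no worse than any other rank-$k$ factorization.
```python
import numpy as np

rng = np.random.default_rng(6)
p, q, k = 6, 10, 2
M = rng.normal(size=(p, q))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k SVD truncation

# an arbitrary alternative U W factorization of the same rank, for comparison
A = rng.normal(size=(p, k))
W = np.linalg.lstsq(A, M, rcond=None)[0]           # best W for this particular A

err = lambda Mhat: np.sum((M - Mhat) ** 2)
print(err(M_svd), "<=", err(A @ W))
print(np.isclose(err(M_svd), np.sum(s[k:] ** 2)))  # error = sum of discarded sigma_i^2
```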
## 7 Examples
#example Compute projection of $x=[2,3]^T$ along:
- $x_1$-axis: $\begin{bmatrix}1\\0\end{bmatrix}^T x=2$
- The direction of $[1,-1]^T$: $[2,3]^T\cdot [1,-1]^T / \sqrt{2}=-1/\sqrt{2}$
#example What are the eigenvectors and eigenvalues of:
$\begin{align*}
M&= \left(\begin{array}{ccc}
2 & 0 & 0 \\
0 & -1 & 0 \\
0 & 0 & 10
\end{array}\right) ;\;
u_1,u_2,u_3 = \begin{bmatrix}1\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\end{bmatrix},\begin{bmatrix}0\\0\\1\end{bmatrix};\;\lambda_1,\lambda_2,\lambda_3=2,-1,10
\end{align*}$
#example
$\begin{align*}
M=\left(\begin{array}{cc}1 & -2 \\-2 & 1
\end{array}\right) \text { has eigenvectors } u_1=\frac{1}{\sqrt{2}}\left(\begin{array}{l}
1 \\ 1 \end{array}\right), u_2=\frac{1}{\sqrt{2}}\left(\begin{array}{c}
-1 \\
1 \end{array}\right)
\end{align*}$
1. $u_1,u_2$ are orthonormal
2. $Mu_1=\frac{1}{\sqrt{2}}[-1,-1]^T=-u_1,\;\lambda_1=-1$. $Mu_2=\frac{3}{\sqrt{2}}[-1,1]^T=3u_2,\;\lambda_2=3$ (verified numerically below).
3. $M=\frac{1}{{2}}\begin{bmatrix}1&-1\\1&1\end{bmatrix}\begin{bmatrix}-1&0\\0&3\end{bmatrix}\begin{bmatrix}1&1\\-1&1\end{bmatrix}$
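A quick numerical verification of this example (eigenvectors may come back with flipped signs or in a different order):
```python
import numpy as np

M = np.array([[1.0, -2.0], [-2.0, 1.0]])
lam, U = np.linalg.eigh(M)                       # ascending eigenvalues
print(lam)                                       # [-1.  3.]
print(U)                                         # columns ~ (1,1)/sqrt(2) and (-1,1)/sqrt(2)
print(np.allclose(M, U @ np.diag(lam) @ U.T))    # the decomposition in step 3
```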
### 7.1 Case study: quantifying personality
*Lexical hypothesis*: most important personality characteristics have become encoded in natural language.
- Allport and Odbert (1936): identified 4500 words describing personality traits.
- Group these words into (approximate) synonyms, by manual clustering. E.g. Norman (1967)
- Data collection: ask subjects whether these words describe them.
*PCA*: A single PCA dimension would account for both "trust" and "generosity", which are highly correlated
![[4_svd 2023-04-25 13.45.02.excalidraw.svg|500]]
- Apply PCA to rows of this matrix.
- Gets "big five" taxonomy: Extraversion, Agreeableness, Conscientiousness, Neuroticism, Openness.