# Clustering Algorithms
Slides: [[lecture8_kmeans.pdf]], [[lecture9_gaussin_mixture.pdf]]
Video: https://podcast.ucsd.edu/watch/sp23/cse151a_a00/8
Article: [Introduction to mixture models](https://stephens999.github.io/fiveMinuteStats/intro_to_mixture_models.html)
## 1 Intro: K-Partitioning Problem
*Goal*: Group $n$ data points by optimizing a specific objective function. All groups are *disjoint* (a.k.a. *hard clustering*).
- $C_k$ : the $k$-th cluster; $\boldsymbol{c}_{\boldsymbol{k}}$ : the cluster center of the $k$-th cluster
- $\boldsymbol{x}_{\boldsymbol{i}}$ : the feature vector of the $i$-th data point
- Objective: *Sum of Squared Errors (SSE)*
$\begin{align*}
\mathrm{SSE}=\sum_{k=1}^K \sum_{i \in C_k}\Vert x_i-c_k\Vert_2^2
\end{align*}$
*Problem formulation*: Given $K$, find a partition of $K$ clusters that optimizes the chosen partition criterion
- Searching for the global optimum is too costly: the number of possible partitions is on the order of $K^n/K!$
- Need heuristic methods (i.e. greedy algorithms): *K-Means, K-Medians, K-Medoids*
## 2 K-Means, K-Medians and K-Medoids
K-means, K-medians, and K-medoids share the same algorithm; the only difference is the objective each of them minimizes (and hence how the cluster center is defined).
### 2.1 K-means (minimizes $L_2$ distance)
*Algorithm*. Select K points as initial centroids. Repeat until convergence criterion is satisfied:
1. Form $K$ clusters by (re)assigning each point to its closest centroid
2. Re-compute the centroid (i.e. the mean point) of each cluster
$\begin{align*}
\min \text{SSE}&= \min\sum\limits_{k=1}^K\sum\limits_{i\in C_k}\Vert x_i-c_k \Vert_2^2,\quad
c_k\coloneqq \frac{\sum_{i\in C_k}x_i}{|C_k|}\\
\text{Proof. }\frac{\partial\, \text{SSE}}{\partial c_k} &= -2\sum\limits_{i\in C_k} (x_i-c_k)=0 \quad\rightarrow\quad c_k=\frac{\sum_{i\in C_k}x_i}{|C_k|}
\end{align*}$
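A minimal NumPy sketch of this loop (Lloyd's algorithm); the array `X` ($n\times d$) and the helper name `kmeans` are illustrative, not from the slides:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # K random points as initial centroids
    for _ in range(n_iters):
        # Step 1: assign every point to its closest centroid (squared L2 distance)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # shape (n, K)
        labels = d2.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):   # convergence: centroids stop moving
            break
        centroids = new_centroids
    return centroids, labels
```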
- Different kinds of *metrics* can be used: Manhattan distance (L1 norm), Euclidean distance (L2 norm), Cosine similarity
*Time complexity* of K-means: $\boxed{O(tnKd)}$
$\begin{align*}
\begin{array}{c|c|c|c}
t&n&K&d\\\hline
\text{\# iters}&\text{\# data points}&\text{\# clusters}&\text{dimension of feature}
\end{array}
\end{align*}$
*Heuristics*.
- Run K-means multiple times and pick the best-performing one. K-means is easily parallelizable.
- The "elbow" method for determining $K$. (plot $SSE(k)$ against $k$)
- *$\text{K-means}_{++}$* : Select the 1st centroid uniformly at random; select each subsequent centroid randomly with probability proportional to its squared distance from the current set of centroids (so far-away points are favored).
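A seeding sketch under that rule (sampling proportional to squared distance; names are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ seeding: later centroids favor points far from the current set."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]            # 1st centroid: uniform at random
    for _ in range(K - 1):
        C = np.array(centroids)
        # squared distance of each point to its nearest already-chosen centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])   # sample proportional to d2
    return np.array(centroids)
```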
*Cosine similarity* and the $L_2$ norm: using cosine similarity is equivalent to first normalizing the data, then applying the $L_2$ K-means method
$\begin{align*}
\cos \theta&= \frac{x_i\cdot x_j}{\Vert x_i \Vert_2 \Vert x_j \Vert_2}=x_i^\top x_j & \quad\text{if } x_i,x_j \text{ are normalized}\\
\Vert x_i-x_j \Vert_2^2&= (x_i-x_j)^\top(x_i-x_j)=x_i^\top x_i+x_j^\top x_j-2x_i^\top x_j\\
&= 2-2\cos \theta\\
\text{Objective: }& \max \cos \theta=\max x_i^\top x_j \quad\equiv\quad \min \text{SSE}
\end{align*}$
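A quick numeric check of this identity (made-up vectors, normalized first):

```python
import numpy as np

rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=5), rng.normal(size=5)
x_i, x_j = x_i / np.linalg.norm(x_i), x_j / np.linalg.norm(x_j)   # normalize to unit length

cos_theta = x_i @ x_j
# ||x_i - x_j||^2 == 2 - 2*cos(theta) for unit vectors
assert np.isclose(np.sum((x_i - x_j) ** 2), 2 - 2 * cos_theta)
```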
### 2.2 K-medians (minimizes $L_1$ distance)
$\begin{align*}
\min \text{SAE}&= \min\sum\limits_{k=1}^K\sum\limits_{i\in C_k}\Vert x_i - c_k \Vert_1,\quad c_k\coloneqq \operatorname{median}(\{x_i \mid i\in C_k\}) \text{ (coordinate-wise)}
\end{align*}$
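Relative to K-means only the center update changes; a sketch of the coordinate-wise median update (assumes `X` and `labels` come from an L1 assignment step, analogous to the K-means sketch above):

```python
import numpy as np

def update_medians(X, labels, K):
    """K-medians center update: coordinate-wise median of each cluster's points."""
    return np.array([np.median(X[labels == k], axis=0) for k in range(K)])
```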
### 2.3 K-medoids (minimizes distance to a representative data point)
- Goal: Find the $K$ most centrally located data points, one per cluster
- A *medoid* is an existing data point in the input
*PAM* - Partitioning Around Medoids [Kaufmann & Rousseeuw 1987]. $\boxed{O(K(n-K)^2 d)}$
*CLARANS* [Ng & Han, 1994]: $\boxed{O\big((Ks^2+K(n-K))d\big)}$
- Randomized re-sampling, ensuring efficiency and quality
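A naive sketch of the PAM swap search (each pass tries every medoid/non-medoid swap and keeps the best improving one); the precomputed pairwise distance matrix `D` is an assumption of this sketch:

```python
import numpy as np
from itertools import product

def pam(D, K, seed=0):
    """Naive PAM: D is an (n, n) pairwise dissimilarity matrix; medoids are data indices."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=K, replace=False))
    cost = D[:, medoids].min(axis=1).sum()            # total distance to nearest medoid
    while True:
        best_cost, best_medoids = cost, medoids
        for pos, o in product(range(K), range(n)):    # try every (medoid, non-medoid) swap
            if o in medoids:
                continue
            cand = medoids[:pos] + [o] + medoids[pos + 1:]
            cand_cost = D[:, cand].min(axis=1).sum()
            if cand_cost < best_cost:
                best_cost, best_medoids = cand_cost, cand
        if best_cost >= cost:                         # no improving swap left -> stop
            return medoids
        medoids, cost = best_medoids, best_cost
```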
### 2.4 Kernel K-Means (similar to [[4_svm#3 Kernel SVM|kernel svm]])
Project the data into a high-dimensional kernel feature space, then perform K-means clustering there.
It is typically much slower than K-means in the original input space.
![[Pasted image 20230508140532.png|500]]
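The projection is never carried out explicitly: squared distances to a cluster mean in feature space can be computed from the kernel matrix alone. A sketch of that distance computation (the kernel matrix `Kmat` and the current hard assignments `labels` are assumed given):

```python
import numpy as np

def kernel_kmeans_distances(Kmat, labels, K):
    """Squared feature-space distance from every point to every cluster mean,
    using only kernel evaluations (no explicit projection):
    ||phi(x_i) - mu_k||^2 = K_ii - 2/|C_k| * sum_{j in C_k} K_ij
                            + 1/|C_k|^2 * sum_{j,l in C_k} K_jl
    """
    n = Kmat.shape[0]
    d2 = np.zeros((n, K))
    for k in range(K):
        idx = np.where(labels == k)[0]                # assumes every cluster is non-empty
        nk = len(idx)
        d2[:, k] = (np.diag(Kmat)
                    - 2.0 / nk * Kmat[:, idx].sum(axis=1)
                    + Kmat[np.ix_(idx, idx)].sum() / nk**2)
    return d2   # assignment step: labels = d2.argmin(axis=1)
```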
## 3 Mixture Models - Probabilistic Clustering
***Mixture models***:
- Assume the data are generated by a mixture of underlying probability distributions
- Infer the parameters of these components (i.e. clusters) and assign data points to specific components of the mixture
*Probabilistic topic models* for text clustering and analysis
- Probabilistic latent semantic analysis (PLSA)
- Latent Dirichlet allocation (LDA)
**Notations**:
- A set of K *probabilistic clusters*, $\boldsymbol{C}=C_1, C_2, \ldots, C_K$
- Each cluster has its own *probability density functions* $f_1, f_2, \ldots, f_K$
- Their probabilities (or say weights, priors, ...) are $w_1, w_2, \ldots, w_K$
- The probability that the $i$-th data point $\boldsymbol{x}_{\boldsymbol{i}}$ is generated by cluster $C_j$ :
$
P\left(\boldsymbol{x}_{\boldsymbol{i}}, C_j\right)=P\left(C_j\right) P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid C_j\right)=w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right)
$
- The probability that the $i$-th data point $\boldsymbol{x}_{\boldsymbol{i}}$ is generated by the set of clusters $\boldsymbol{C}$ :
$
P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid \boldsymbol{C}\right)=\sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right)
$
- Since data points are assumed to be generated independently, for a data set $D=\left\{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_{\boldsymbol{n}}\right\}$ :
$
P(D \mid \boldsymbol{C})=\prod_{i=1}^n P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid \boldsymbol{C}\right)=\prod_{i=1}^n \sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right)
$
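A toy 1-D instance of these quantities (two made-up Gaussian components; `scipy.stats.norm` supplies the densities $f_j$):

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.6, 0.4])                                   # cluster priors w_j
f = [norm(loc=0.0, scale=1.0), norm(loc=5.0, scale=2.0)]   # component densities f_j

x = np.array([0.5, 4.0, 10.0])                             # a few data points
p_x = sum(w_j * f_j.pdf(x) for w_j, f_j in zip(w, f))      # P(x_i | C) = sum_j w_j f_j(x_i)
p_D = np.prod(p_x)                                         # P(D | C)   = prod_i P(x_i | C)
print(p_x, p_D)
```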
*Goal*: Find a set $\boldsymbol{C}$ of K probabilistic clusters so that $P(D \mid \boldsymbol{C})$ is maximized
- Maximizing $P(D \mid \boldsymbol{C})$ is often *intractable* since the probability density function of a cluster can take an arbitrarily complicated form
- To make computation feasible (as a compromise), assume the probability density functions are some *parameterized distributions*
### 3.1 Gaussian mixture and MLE
We assume each cluster $C_j$ is characterized by a multivariate normal distribution
$
f_j(\boldsymbol{x})=f\left(\boldsymbol{x} \mid \boldsymbol{\mu}_{\boldsymbol{j}}, \boldsymbol{\Sigma}_j\right)=\frac{1}{\sqrt{(2 \pi)^d\left|\Sigma_j\right|}} e^{-\frac{\left(\boldsymbol{x}-\boldsymbol{\mu}_j\right)^T \Sigma_{\boldsymbol{j}}^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_j\right)}{2}}
$
Here, the cluster mean $\boldsymbol{\mu}_{\boldsymbol{j}}$ and covariance matrix $\boldsymbol{\Sigma}_j$ are parameters to learn. We assume the probability density function of $\boldsymbol{x}$ is a Gaussian mixture model over all $K$ clusters, defined as
$
f(\boldsymbol{x})=\sum_{j=1}^K P\left(C_j\right) f_j(\boldsymbol{x})=\sum_{j=1}^K w_j f\left(\boldsymbol{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\right)=P(\boldsymbol{x} \mid \boldsymbol{C}),\quad \sum_{j=1}^K w_j=1
$
- Here $w_j$ is the *cluster prior*, which is also a parameter to learn.
Since data points are assumed to be generated independently, for a data set $D=\left\{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_{\boldsymbol{n}}\right\}$ :
$
P(D \mid \boldsymbol{C})=\prod_{i=1}^n P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid \boldsymbol{C}\right)=\prod_{i=1}^n \sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right)
$
*Maximum Likelihood Estimation* (MLE) for Gaussian mixture:
- Maximize $P(D \mid C)$ by learning parameters $\theta: \theta^*=\arg \max _\theta P(D \mid \boldsymbol{C})$
- Here, $\theta=\left\{\mu_{1 \ldots K}, \Sigma_{1 \ldots K}, w_{1 \ldots K}\right\}$
- Sometimes, maximizing the log-likelihood might be easier
$\begin{align*}
\log P(D \mid \boldsymbol{C})&= \sum_{i=1}^n \log P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid \boldsymbol{C}\right)=\sum_{i=1}^n \log \sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right)\\
\theta^*&= \arg \max _\theta \sum_{i=1}^n \log \sum_{j=1}^K w_j f_j\left(x_i\right)
\end{align*}$
- Direct optimization is hard -> use **EM to find the MLE**
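Evaluating the objective itself is easy (only maximizing it is hard); a sketch of the Gaussian-mixture log-likelihood, using `scipy.special.logsumexp` for the inner sum to avoid underflow (the parameter containers are placeholders):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """log P(D|C) = sum_i log sum_j w_j f_j(x_i), computed in log space."""
    # log_terms[i, j] = log w_j + log f_j(x_i)
    log_terms = np.column_stack([
        np.log(w_j) + multivariate_normal.logpdf(X, mean=mu_j, cov=cov_j)
        for w_j, mu_j, cov_j in zip(weights, means, covs)
    ])
    return logsumexp(log_terms, axis=1).sum()
```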
### 3.2 Expectation maximization (EM) for Gaussian mixture
![[Pasted image 20230508150338.png|700]]
**Expectation Step**: Given the current estimation of $\theta$, compute the cluster ***posterior***
- Can also think of this as "fixing the cluster assignment"
$
P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)=\frac{P\left(C_j\right) P\left(x_i \mid C_j\right)}{P\left(\boldsymbol{x}_i\right)}=\frac{w_j f_j\left(\boldsymbol{x}_i\right)}{\sum_{a=1}^K w_a f_a\left(\boldsymbol{x}_{\boldsymbol{i}}\right)}
$
**Maximization Step**: Given $P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)$, how to estimate $\theta$ ?
- $\theta^*=\arg \max _\theta \sum_{i=1}^n \log \sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right)$
- Set the partial derivatives to zero and solve $\rightarrow$ weighted estimates:
$
w_j=\frac{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_i\right)}{n}, \quad \boldsymbol{\mu}_{\boldsymbol{j}}=\frac{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right) \boldsymbol{x}_{\boldsymbol{i}}}{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)}, \quad \boldsymbol{\Sigma}_{\boldsymbol{j}}=\frac{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_i\right)\left(\boldsymbol{x}_{\boldsymbol{i}}-\boldsymbol{\mu}_{\boldsymbol{j}}\right)\left(\boldsymbol{x}_{\boldsymbol{i}}-\boldsymbol{\mu}_{\boldsymbol{j}}\right)^{\boldsymbol{T}}}{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_i\right)}
$
- $\boldsymbol{\mu}_j$ is the weighted mean of the data points; $\boldsymbol{\Sigma}_j$ is the weighted empirical covariance.
**EM algorithm for Gaussian mixture**:
Initialize $\theta=\left\{\boldsymbol{\mu}_{1 \ldots K}, \boldsymbol{\Sigma}_{1 \ldots K}, w_{1 \ldots K}\right\}$
- Randomly initialize $\boldsymbol{\mu}_{1 \ldots K}$
- $\boldsymbol{\Sigma}_{1 \ldots K} \leftarrow \boldsymbol{I}$ (identity matrix)
Repeat until convergence:
- **E-Step**: Assign data points to clusters according to the current parameters of the probabilistic clusters
- For $\mathrm{j}=1$ to $\mathrm{K}$. For $\mathrm{i}=1$ to $\mathrm{n}$: Compute $P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)$
- **M-Step**: Given the current assignments, re-estimate the parameters to maximize the expected likelihood (the K-means analogue of this step minimizes the SSE)
- For $\mathrm{j}=1$ to $\mathrm{K}$: Compute $w_j, \boldsymbol{\mu}_{\boldsymbol{j}}$, and $\boldsymbol{\Sigma}_{\boldsymbol{j}}$
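A compact sketch of this loop (responsibilities `R[i, j]` play the role of $P(C_j \mid \boldsymbol{x}_i)$; the small ridge added to each covariance is an extra guard of this sketch against collapse, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: alternate E-step (posteriors) and M-step (weighted MLE)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=K, replace=False)]   # random mu_{1..K}
    covs = np.array([np.eye(d)] * K)                  # Sigma_{1..K} <- I
    weights = np.full(K, 1.0 / K)                     # uniform priors w_{1..K}
    for _ in range(n_iters):
        # E-step: responsibilities R[i, j] = P(C_j | x_i)
        R = np.column_stack([
            w * multivariate_normal.pdf(X, mean=m, cov=c)
            for w, m, c in zip(weights, means, covs)
        ])
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates for w_j, mu_j, Sigma_j
        Nk = R.sum(axis=0)                            # effective cluster sizes
        weights = Nk / n
        means = (R.T @ X) / Nk[:, None]
        covs = np.array([
            ((X - means[j]).T * R[:, j]) @ (X - means[j]) / Nk[j] + 1e-6 * np.eye(d)
            for j in range(K)
        ])
    return weights, means, covs, R
```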
**Heuristics**
- First run K-means and use its cluster assignment to initialize the Gaussian mixture (makes EM converge faster)
- A Gaussian component can *collapse* onto a particular data point (a singularity: its variance goes to 0)
- When a Gaussian component is detected to be collapsing, reset its mean and covariance, then continue with the optimization
## 4 K-means vs. Gaussian Mixture
**K-means as EM**: Cluster centroids are model parameters $\theta$
- The cluster assignment of each data point plays the role of $P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)$; the likelihood is inversely related to the distance to the centroid
- *E-step*: Given centroids, a data point is expected to belong to the closest cluster.
- *M-step*: Given the cluster assignments, the algorithm adjusts the centroids to minimize the sum of their distance to assigned data points
**K-means vs. Gaussian Mixture**: Assignment $P(C_j|x_i)$
$\begin{align*}
\begin{array}{c|c}
\text{K-means (hard clustering)} & \text{Gaussian Mixture (soft clustering)}\\\hline
\text{one-hot encoding}&\text{distribution over clusters}\\
\text{Every data point belongs to exactly one cluster}&\text{Data points can belong to multiple clusters}
\end{array}
\end{align*}$
- Gaussian Mixture can be viewed as a soft K-means.
**Pros** of Gaussian mixture:
- More general than partitioning
- Clusters can be characterized by a small number of parameters
- The results may satisfy the statistical assumptions of generative models
**Cons** of Gaussian mixture:
- Converges to a local optimum -> mitigated by running multiple times with different initializations
- Needs large datasets. Hard to estimate number of clusters.
- Computationally expensive if the number of distributions is large.
## Readings
- A. Dempster, N. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. 1977
- G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988
- K. Burnham and D. Anderson. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer Verlag, 2002
- C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006
- M. J. Zaki and W. Meira, Jr.. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014
- H. Deng and J. Han. Probabilistic Models for Clustering. Chapter 3 in C. Aggarwal and C. K. Reddy (eds.), Data Clustering: Algorithms and Applications. CRC Press, 2014