# Clustering Algorithms

Slides: [[lecture8_kmeans.pdf]], [[lecture9_gaussin_mixture.pdf]]
Video: https://podcast.ucsd.edu/watch/sp23/cse151a_a00/8
Article: [Introduction to mixture models](https://stephens999.github.io/fiveMinuteStats/intro_to_mixture_models.html)

## 1 Intro: K-Partitioning Problem

*Goal*: Group $n$ data points by optimizing a specific objective function. All groups are *disjoint* (a.k.a. *hard clustering*).

- $C_k$ : the $k$-th cluster; $\boldsymbol{c}_{\boldsymbol{k}}$ : the cluster center of the $k$-th cluster
- $\boldsymbol{x}_{\boldsymbol{i}}$ : the feature vector of the $i$-th data point
- Objective: *Sum of Squared Errors (SSE)*

$\begin{align*} \mathrm{SSE}=\sum_{k=1}^K \sum_{i \in C_k}\left\Vert x_i-c_k\right\Vert_2^2 \end{align*}$

*Problem formulation*: Given $K$, find a partition of $K$ clusters that optimizes the chosen partition criterion
- Searching for the global optimum is too costly: there are roughly $O(K^n/K!)$ possible partitions
- Need heuristic methods (i.e. greedy algorithms): *K-Means, K-Medians, K-Medoids*

## 2 K-Means, K-Medians and K-Medoids

The algorithm for K-means, K-medians, and K-medoids is the same; the only difference is the objective each one minimizes.

### 2.1 K-means (minimizes $L_2$ distance)

*Algorithm* (a code sketch is given at the end of this subsection). Select $K$ points as initial centroids, then repeat until the convergence criterion is satisfied:
1. (Re)assign the $K$ clusters by assigning each point to its closest centroid
2. Re-compute the centroid (i.e. mean point) of each cluster

$\begin{align*} \min SSE&= \min\sum\limits_{k=1}^K\sum\limits_{i\in C_k}\Vert x_i-c_k \Vert_2^2,\quad c_k\coloneqq \frac{\sum_{i\in C_k}x_i}{|C_k|}\\ \text{Proof. }\frac{\partial SSE}{\partial c_k} &= -\sum\limits_{i\in C_k} 2(x_i-c_k)=0 \quad\rightarrow\quad c_k=\frac{\sum_{i\in C_k}x_i}{|C_k|} \end{align*}$

- Different kinds of *metrics* can be used: Manhattan distance (L1 norm), Euclidean distance (L2 norm), cosine similarity

*Time complexity* of K-means: $\boxed{O(tnKd)}$

$\begin{align*} \begin{array}{c|c|c|c} t&n&K&d\\\hline \text{\# iters}&\text{\# data points}&\text{\# clusters}&\text{dimension of feature} \end{array} \end{align*}$

*Heuristics*.
- Run K-means multiple times and pick the best-performing run. K-means is easily parallelizable.
- The "elbow" method for determining $K$: plot $SSE(K)$ against $K$ and pick the elbow point.
- *$\text{K-means}_{++}$*: Select the 1st centroid at random; choose each subsequent centroid far from the current set of centroids (with probability proportional to the squared distance to the nearest chosen centroid).

*Cosine similarity* and the $L_2$ norm: equivalent to first normalizing the data, then applying K-means

$\begin{align*} \cos \theta&= \frac{x_i\cdot x_j}{\Vert x_i \Vert_2 \Vert x_j \Vert_2}=x_i^\top x_j & \quad\text{if } x_i,x_j \text{ are normalized}\\ \Vert x_i-x_j \Vert_2^2&= (x_i-x_j)^\top(x_i-x_j)=x_i^\top x_i+x_j^\top x_j-2x_i^\top x_j\\ &= 2-2\cos \theta\\ \text{Objective: }& \max \cos \theta=\max x_i^\top x_j \quad\equiv\quad \min \text{SSE} \end{align*}$
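As a concrete illustration of the loop above, here is a minimal NumPy sketch of K-means (Lloyd's algorithm); the function name, random initialization, empty-cluster handling, and convergence test are illustrative choices, not from the lecture.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking K distinct data points at random
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest centroid (L2 distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):  # convergence criterion
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, sse
```

Running it several times with different seeds and keeping the run with the lowest SSE implements the multi-restart heuristic above.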
### 2.2 K-medians (minimizes $L_1$ distance)

$\begin{align*} \min SAE&= \sum\limits_{k=1}^K\sum\limits_{i\in C_k}\Vert x_i - c_k \Vert_1,\quad c_k\coloneqq \text{median}(\set{x_i|i\in C_k}) \end{align*}$

### 2.3 K-medoids (cluster centers restricted to actual data points)

- Goal: Find the $K$ most centrally located data points, one per cluster
- A *medoid* is an existing data point of the input

*PAM* - Partitioning Around Medoids [Kaufmann & Rousseeuw 1987]: $\boxed{O(K(n-K)^2 d)}$ per iteration

*CLARANS* [Ng & Han, 1994]: $\boxed{O\big((Ks^2+K(n-K))d\big)}$
- Randomized re-sampling, ensuring efficiency and quality

### 2.4 Kernel K-Means (similar to [[4_svm#3 Kernel SVM|kernel svm]])

Project the data into a high-dimensional kernel space, and then perform K-Means clustering there.
It is typically much slower than K-Means in the original input space.

![[Pasted image 20230508140532.png|500]]

## 3 Mixture Models - Probabilistic Clustering

***Mixture models***:
- Assume the data are generated by a mixture of underlying probability distributions
- Infer the parameters of these components (i.e. clusters) and assign data points to specific components of the mixture

*Probabilistic topic models* for text clustering and analysis
- Probabilistic latent semantic analysis (PLSA)
- Latent Dirichlet allocation (LDA)

**Notations**:
- A set of K *probabilistic clusters*, $\boldsymbol{C}=C_1, C_2, \ldots, C_K$
- Each cluster has its own *probability density function*: $f_1, f_2, \ldots, f_K$
- Their probabilities (or say weights, priors, ...) are $w_1, w_2, \ldots, w_K$
- The probability that the $i$-th data point $\boldsymbol{x}_{\boldsymbol{i}}$ is generated by the cluster $C_j$:
$ P\left(\boldsymbol{x}_{\boldsymbol{i}}, C_j\right)=P\left(C_j\right) P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid C_j\right)=w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right) $
- The probability that the $i$-th data point $\boldsymbol{x}_{\boldsymbol{i}}$ is generated by the set of clusters $\boldsymbol{C}$:
$ P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid \boldsymbol{C}\right)=\sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right) $
- Since data points are assumed to be generated independently, for a data set $D=\left\{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_{\boldsymbol{n}}\right\}$:
$ P(D \mid \boldsymbol{C})=\prod_{i=1}^n P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid \boldsymbol{C}\right)=\prod_{i=1}^n \sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right) $

*Goal*: Find a set $\boldsymbol{C}$ of K probabilistic clusters so that $P(D \mid \boldsymbol{C})$ is maximized
- Maximizing $P(D \mid \boldsymbol{C})$ is often *intractable*, since the probability density function of a cluster can take an arbitrarily complicated form
- To make computation feasible (as a compromise), assume the probability density functions are some *parameterized distributions*

### 3.1 Gaussian mixture and MLE

We assume each cluster $C_j$ is characterized by a multivariate normal distribution
$ f_j(\boldsymbol{x})=f\left(\boldsymbol{x} \mid \boldsymbol{\mu}_{\boldsymbol{j}}, \boldsymbol{\Sigma}_j\right)=\frac{1}{\sqrt{(2 \pi)^d\left|\Sigma_j\right|}} e^{-\frac{\left(\boldsymbol{x}-\boldsymbol{\mu}_j\right)^T \Sigma_{\boldsymbol{j}}^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_j\right)}{2}} $
Here, the cluster mean $\boldsymbol{\mu}_{\boldsymbol{j}}$ and covariance matrix $\boldsymbol{\Sigma}_j$ are parameters to learn.

We assume the probability density function of $\boldsymbol{x}$ is given as a Gaussian mixture model over all the $K$ clusters, defined as
$ f(\boldsymbol{x})=\sum_{j=1}^K P\left(C_j\right) f_j(\boldsymbol{x})=\sum_{j=1}^K w_j f\left(\boldsymbol{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_{\boldsymbol{j}}\right)=P(\boldsymbol{x}\mid \boldsymbol{C}),\quad \sum_{j=1}^K w_j=1 $
- Here $w_j$ is the *cluster prior*, which is also a parameter to learn.
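To make the mixture density concrete, here is a small sketch that evaluates $f(\boldsymbol{x})=\sum_j w_j f(\boldsymbol{x}\mid \boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)$ with `scipy.stats.multivariate_normal`; the function name and the toy parameters are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, w, mu, sigma):
    """Evaluate f(x) = sum_j w_j * N(x; mu_j, Sigma_j) at a single point x."""
    return sum(
        w_j * multivariate_normal.pdf(x, mean=mu_j, cov=sigma_j)
        for w_j, mu_j, sigma_j in zip(w, mu, sigma)
    )

# Toy 2-component mixture in 1-D (illustrative parameters; weights sum to 1)
w = [0.3, 0.7]
mu = [0.0, 4.0]
sigma = [1.0, 2.0]   # variances; scipy accepts scalars in the 1-D case
print(mixture_density(1.0, w, mu, sigma))
```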
As before, since data points are assumed to be generated independently, for a data set $D=\left\{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_{\boldsymbol{n}}\right\}$:
$ P(D \mid \boldsymbol{C})=\prod_{i=1}^n P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid \boldsymbol{C}\right)=\prod_{i=1}^n \sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right) $

*Maximum Likelihood Estimation* (MLE) for the Gaussian mixture:
- Maximize $P(D \mid \boldsymbol{C})$ by learning the parameters $\theta$: $\theta^*=\arg \max _\theta P(D \mid \boldsymbol{C})$
- Here, $\theta=\left\{\mu_{1 \ldots K}, \Sigma_{1 \ldots K}, w_{1 \ldots K}\right\}$
- Sometimes, maximizing the log-likelihood is easier:

$\begin{align*} \log P(D \mid \boldsymbol{C})&= \sum_{i=1}^n \log P\left(\boldsymbol{x}_{\boldsymbol{i}} \mid \boldsymbol{C}\right)=\sum_{i=1}^n \log \sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right)\\ \theta^*&= \arg \max _\theta \sum_{i=1}^n \log \sum_{j=1}^K w_j f_j\left(x_i\right) \end{align*}$

- Direct optimization is hard -> use **EM to find the MLE**

### 3.2 Expectation maximization (EM) for Gaussian mixture

![[Pasted image 20230508150338.png|700]]

**Expectation Step**: Given the current estimate of $\theta$, compute the cluster ***posterior***
- Can also think of this as "fixing the cluster assignment"
$ P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)=\frac{P\left(C_j\right) P\left(x_i \mid C_j\right)}{P\left(\boldsymbol{x}_i\right)}=\frac{w_j f_j\left(\boldsymbol{x}_i\right)}{\sum_{a=1}^K w_a f_a\left(\boldsymbol{x}_{\boldsymbol{i}}\right)} $

**Maximization Step**: Given $P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)$, how do we estimate $\theta$?
- $\theta^*=\arg \max _\theta \sum_{i=1}^n \log \sum_{j=1}^K w_j f_j\left(\boldsymbol{x}_{\boldsymbol{i}}\right)$
- Setting the partial derivatives to zero and solving yields weighted estimates:
$ w_j=\frac{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_i\right)}{n}, \quad \boldsymbol{\mu}_{\boldsymbol{j}}=\frac{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right) \boldsymbol{x}_{\boldsymbol{i}}}{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)}, \quad \boldsymbol{\Sigma}_{\boldsymbol{j}}=\frac{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_i\right)\left(\boldsymbol{x}_{\boldsymbol{i}}-\boldsymbol{\mu}_{\boldsymbol{j}}\right)\left(\boldsymbol{x}_{\boldsymbol{i}}-\boldsymbol{\mu}_{\boldsymbol{j}}\right)^{\boldsymbol{T}}}{\sum_{i=1}^n P\left(C_j \mid \boldsymbol{x}_i\right)} $
- $\boldsymbol{\mu}_j$ is the weighted mean of the data points; $\boldsymbol{\Sigma}_j$ is the weighted empirical covariance.

**EM algorithm for Gaussian mixture**:

Initialize $\theta=\left\{\boldsymbol{\mu}_{1 \ldots K}, \boldsymbol{\Sigma}_{1 \ldots K}, w_{1 \ldots K}\right\}$
- Randomly initialize $\boldsymbol{\mu}_{1 \ldots K}$
- $\boldsymbol{\Sigma}_j \leftarrow \boldsymbol{I}$ (identity matrix)

Repeat until convergence:
- **E-Step**: Assign data points to clusters according to the current parameters of the probabilistic clusters
    - For $j=1$ to $K$: for $i=1$ to $n$: compute $P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)$
- **M-Step**: Find the new parameters that maximize the expected log-likelihood
    - For $j=1$ to $K$: compute $w_j, \boldsymbol{\mu}_{\boldsymbol{j}}$, and $\boldsymbol{\Sigma}_{\boldsymbol{j}}$
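Below is a compact NumPy/SciPy sketch of the E- and M-step updates above. The uniform initialization of $w_j$ and the small ridge added to $\boldsymbol{\Sigma}_j$ are assumptions for numerical stability, not part of the notes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture, following the E-/M-step formulas above."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)]   # random means
    sigma = np.stack([np.eye(d)] * K)              # Sigma_j <- I
    w = np.full(K, 1.0 / K)                        # uniform priors (assumed)
    for _ in range(n_iters):
        # E-step: posterior P(C_j | x_i), shape (n, K)
        dens = np.column_stack([
            w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
            for j in range(K)
        ])
        post = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted estimates of w_j, mu_j, Sigma_j
        Nj = post.sum(axis=0)                      # effective cluster sizes
        w = Nj / n
        mu = (post.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            sigma[j] = (post[:, j, None] * diff).T @ diff / Nj[j]
            sigma[j] += 1e-6 * np.eye(d)           # guard against collapse/singularity
    return w, mu, sigma, post
```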
**Heuristics**
- First run K-means to get a cluster assignment, then initialize the Gaussian mixture from that assignment (makes EM converge faster)
- A Gaussian component can *collapse* onto a particular data point (i.e. a singularity: its variance shrinks to 0)
    - When a Gaussian component is detected to be collapsing, reset its mean and covariance, then continue the optimization

## 4 K-means vs. Gaussian Mixture

**K-means as EM**: Cluster centroids are the model parameters $\theta$
- The cluster assignment of each data point is $P\left(C_j \mid \boldsymbol{x}_{\boldsymbol{i}}\right)$; the likelihood is inversely proportional to the distance
- *E-step*: Given the centroids, each data point is expected to belong to its closest cluster
- *M-step*: Given the cluster assignments, adjust the centroids to minimize the sum of distances to their assigned data points

**K-means vs. Gaussian Mixture**: assignment $P(C_j\mid x_i)$

$\begin{align*} \begin{array}{c|c} \text{K-means (hard clustering)} & \text{Gaussian Mixture (soft clustering)}\\\hline \text{one-hot encoding}&\text{distribution}\\ \text{Every data point belongs to exactly 1 cluster}&\text{Data points can belong to multiple clusters} \end{array} \end{align*}$

- A Gaussian mixture can be viewed as a soft K-means (a short code comparison is given at the end of this section).

**Pros** of Gaussian mixture:
- More general than partitioning
- Clusters can be characterized by a small number of parameters
- Results may satisfy the statistical assumptions of generative models

**Cons** of Gaussian mixture:
- Converges to a local optimum -> mitigate by running multiple times with different initializations
- Needs large datasets; hard to estimate the number of clusters
- Computationally expensive if the number of distributions is large
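To see the hard vs. soft assignment contrast from the table above concretely, here is a small scikit-learn sketch (assuming scikit-learn is available); the synthetic blobs and all parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two illustrative, partially overlapping Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[2.5, 0.0], scale=1.0, size=(100, 2)),
])

# K-means: each point gets exactly one cluster label (hard, one-hot assignment)
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Gaussian mixture: each point gets a posterior P(C_j | x_i) over clusters (soft assignment)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)

print(hard[:3])           # e.g. one integer label per point
print(soft[:3].round(2))  # e.g. rows like [0.85 0.15] -- a distribution per point
```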