## Overview

K-Means Clustering is an unsupervised machine learning algorithm that partitions data into K distinct groups based on feature similarity. The goal is to minimize the variance within each cluster (the within-cluster sum of squared distances, also called inertia), which in turn keeps the clusters well separated from one another. K-Means is widely used for tasks like customer segmentation, anomaly detection, and image compression.

## Key Components

1. **Centroids** - The central point of a cluster, computed as the mean of all points assigned to it.
2. **Clusters** - Groups of similar data points, with each data point assigned to the nearest centroid.
3. **Iteration Process** - K-Means alternates between assigning data points to clusters and updating the centroids until convergence.
4. **Euclidean Distance** - The distance metric typically used to measure proximity between data points and centroids.

## How It Works

1. **Initialization** - Randomly choose K data points as the initial centroids of the K clusters.
2. **Cluster Assignment** - Assign each data point to the nearest centroid based on the chosen distance metric (typically Euclidean distance).
3. **Update Centroids** - After all points are assigned to clusters, recompute each centroid as the mean of the points in its cluster.
4. **Repeat** - Repeat the assignment and update steps until the centroids stop changing (or change by less than a tolerance) or the maximum number of iterations is reached.
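The loop above is short enough to write out directly. Below is a minimal from-scratch sketch of Lloyd's algorithm in NumPy; the function name `kmeans_lloyd` and its defaults are illustrative rather than taken from any library, and it uses plain random initialization instead of the smarter `k-means++` scheme discussed under Hyperparameters.

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal Lloyd's algorithm: assign points, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: initialization - pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: cluster assignment - label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: update - move each centroid to the mean of its assigned points,
        # keeping the old centroid if a cluster happens to end up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids barely move.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment against the converged centroids.
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return centroids, labels

# Toy usage on the same data as the scikit-learn example below.
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])
centroids, labels = kmeans_lloyd(X, k=2)
print(labels)  # e.g. [0 0 1 1 0 1] - the exact cluster ids depend on the seed
```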
## Implementation Example

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Example data (X is a feature matrix with one row per sample)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Initialize a KMeans model with 2 clusters and 10 random restarts
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# Fit the model
kmeans.fit(X)

# Get the cluster centroids and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Plot the points colored by cluster, with centroids marked as red crosses
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red', marker='X')
plt.show()
```

## Advantages

- **Simple and Fast**: K-Means is a straightforward algorithm that typically converges quickly.
- **Scalable**: Each iteration is linear in the number of samples, making it suitable for large datasets, especially when the number of clusters is small.
- **Works Well with Spherical Clusters**: Performs best when clusters are roughly spherical and of similar size.
- **Versatile**: Used for various tasks like image compression, customer segmentation, and anomaly detection.

## Disadvantages

- **Choosing K**: The number of clusters (K) must be specified in advance, and selecting the optimal K can be challenging.
- **Sensitivity to Initial Centroids**: The final clusters depend on the initial random selection of centroids, so different runs can produce different results.
- **Assumes Spherical Clusters**: K-Means struggles with irregularly shaped clusters or clusters of very different sizes and densities.
- **Sensitive to Outliers**: Because centroids are means, outliers can pull them away from the bulk of a cluster and distort the overall result.

## Hyperparameters

1. **Number of Clusters (`n_clusters`)** - The number of clusters (K) to form. It must be specified beforehand; methods like the Elbow Method or Silhouette Score can help determine an appropriate value.
2. **Max Iterations (`max_iter`)** - The maximum number of iterations the algorithm will run before stopping. The algorithm stops early if the centroids no longer change significantly between iterations.
3. **Initialization Method (`init`)** - Determines how the initial centroids are chosen. Options include:
   - `k-means++` (the default): spreads the initial centroids apart, which typically speeds up convergence and yields better solutions.
   - `random`: selects the initial centroids uniformly at random from the data.
4. **Tolerance (`tol`)** - The threshold for declaring convergence. If the change in centroids between iterations falls below this threshold, the algorithm stops.
5. **Algorithm (`algorithm`)** - The variant used to compute the clusters. Recent scikit-learn versions accept `lloyd` (the default) and `elkan`; `elkan` uses the triangle inequality to skip distance computations and can be faster on dense data with well-defined clusters, at the cost of extra memory. (`auto` and `full` are deprecated aliases for `lloyd` in older versions.)

## Best Practices

1. **Choosing K** - Use methods like the Elbow Method, Silhouette Score, or Gap Statistic to determine the optimal number of clusters (a minimal sweep is sketched at the end of this note).
2. **Scaling Features** - Standardize or normalize your data before applying K-Means, especially when the features are measured on different scales, as distance metrics are sensitive to scale.
3. **Multiple Initializations** - Run K-Means with several initializations (`n_init > 1`); scikit-learn keeps the run with the lowest inertia, reducing the impact of an unlucky random start.
4. **Avoiding Local Minima** - K-Means only converges to a local minimum of the inertia, so use `k-means++` initialization or multiple restarts to find a better solution.

## Common Applications

- **Customer Segmentation**
- **Anomaly Detection**
- **Image Compression**
- **Market Basket Analysis**
- **Document Clustering**

## Performance Optimization

1. **Choose a Good K** - Use techniques like the Elbow Method or Silhouette Score to select an appropriate number of clusters.
2. **Dimensionality Reduction** - For high-dimensional data, consider applying PCA (Principal Component Analysis) to reduce the number of features before clustering; distances become more meaningful and each iteration cheaper.
3. **Scaling** - Feature scaling is crucial in K-Means. Apply Min-Max Scaling or Standard Scaling to the data before clustering.

## Evaluation Metrics

- **Inertia**: The sum of squared distances from each point to its assigned cluster's centroid. Lower inertia means tighter clusters, but it decreases monotonically as K grows, so it is only comparable between runs with the same K (hence the "elbow" heuristic).
- **Silhouette Score**: Measures how similar a point is to its own cluster compared to the nearest other cluster. It ranges from -1 to 1; higher scores indicate better-defined clusters.
- **Calinski-Harabasz Index**: The ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-separated clusters.

Check further: [[DBSCAN Clustering]]
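For reference, here is a minimal sketch of computing the metrics above with scikit-learn, reusing the toy `X` from the implementation example; it also shows the elbow-style sweep over K suggested under Best Practices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Score a single clustering with the three metrics above
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("Inertia:", kmeans.inertia_)                                       # lower = tighter (for a fixed K)
print("Silhouette:", silhouette_score(X, kmeans.labels_))                # in [-1, 1], higher is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, kmeans.labels_))  # higher is better

# Elbow-style sweep: inertia always drops as K grows,
# so look for the K where the decrease levels off
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"K={k}: inertia={km.inertia_:.2f}")
```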