## Overview
K-Means Clustering is an unsupervised machine learning algorithm that partitions data into K distinct groups based on feature similarity. Its objective is to minimize the variance within each cluster, i.e. the sum of squared distances from each point to its cluster centroid, which in turn keeps the clusters well separated from one another. K-Means is widely used for tasks like customer segmentation, anomaly detection, and image compression.
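Formally, for clusters $C_1, \dots, C_K$ with centroids $\mu_1, \dots, \mu_K$, K-Means minimizes the within-cluster sum of squares (also called inertia):
$$
\min_{C_1, \dots, C_K} \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad \mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
$$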
## Key Components
1. **Centroids**
- The central point of a cluster, representing the average of all the points within the cluster.
2. **Clusters**
- Groups of similar data points, with each data point assigned to the nearest centroid.
3. **Iteration Process**
- K-Means uses an iterative process to assign data points to clusters and update the centroids until convergence.
4. **Euclidean Distance**
- The distance metric used to calculate the proximity between data points and centroids.
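For a data point $x$ and a centroid $\mu$ in $d$-dimensional feature space, this distance is:
$$
d(x, \mu) = \sqrt{\sum_{j=1}^{d} \left( x_j - \mu_j \right)^2}
$$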
## How It Works
1. **Initialization**
- Randomly choose K data points as the initial centroids of the K clusters.
2. **Cluster Assignment**
- Assign each data point to the nearest centroid based on a chosen distance metric (typically Euclidean distance).
3. **Update Centroids**
- After all points are assigned to clusters, update the centroids by calculating the mean of all points in each cluster.
4. **Repeat**
- Repeat the process of assigning points to clusters and updating centroids until the centroids no longer change or the maximum number of iterations is reached.
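The four steps above translate directly into a few lines of NumPy. A minimal from-scratch sketch (Lloyd's algorithm; the `kmeans` function and its defaults are illustrative, not a library API):
```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids have effectively stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels
```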
## Implementation Example
```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Example data (X is a feature matrix of shape (n_samples, n_features))
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Initialize the KMeans model with 2 clusters
# (n_init is set explicitly; its default changed across scikit-learn versions)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# Fit the model
kmeans.fit(X)

# Get the cluster centroids and labels
centroids = kmeans.cluster_centers_
labels = kmeans.labels_

# Plot the points colored by cluster, with centroids marked as red X's
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red', marker='X')
plt.show()
```
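Once fitted, the model can assign new, unseen points to the learned clusters via `predict` (the printed labels are illustrative; cluster numbering is arbitrary):
```python
# Each new point is assigned to its nearest learned centroid
print(kmeans.predict(np.array([[0, 0], [10, 10]])))  # e.g. [0 1]; label order depends on initialization
```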
## Advantages
- **Simple and Fast**: K-Means is a straightforward algorithm with relatively fast convergence.
- **Scalable**: Suitable for large datasets, especially when the number of clusters is small.
- **Works Well with Spherical Clusters**: Performs best when clusters are roughly spherical and of similar sizes.
- **Versatile**: Used for various tasks like image compression, customer segmentation, and anomaly detection.
## Disadvantages
- **Choosing K**: The number of clusters (K) must be predefined, and selecting the optimal K can be challenging.
- **Sensitivity to Initial Centroids**: The final clusters may depend on the initial random selection of centroids, leading to different results on different runs.
- **Assumes Spherical Clusters**: K-Means works best when clusters are spherical and roughly of the same size. It may struggle with irregularly shaped clusters.
- **Sensitive to Outliers**: Outliers can disproportionately affect the centroid location and the overall clustering result.
## Hyperparameters
1. **Number of Clusters (`n_clusters`)**
- The number of clusters (K) to form. It must be specified beforehand. Methods like the Elbow Method or Silhouette Score can help determine an optimal value for K (see the sketch after this list).
2. **Max Iterations (`max_iter`)**
- The maximum number of iterations the algorithm will run before stopping. The algorithm stops early if the centroids don’t change significantly between iterations.
3. **Initialization Method (`init`)**
- Determines how the initial centroids are chosen. Options include:
- `k-means++`: A smarter initialization scheme that spreads the initial centroids far apart, which typically speeds up convergence and yields better clusterings.
- `random`: Randomly selects the initial centroids.
4. **Tolerance (`tol`)**
- The threshold for determining whether the algorithm has converged. If the change in centroids is below this threshold, the algorithm stops.
5. **Algorithm (`algorithm`)**
- The method used to compute the clusters. In current scikit-learn versions the options are `lloyd` (the default, the standard iterative algorithm) and `elkan` (which uses the triangle inequality to skip distance computations and can be faster on dense data with well-separated clusters); `auto` and `full` are deprecated aliases from older versions.
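Selecting K (item 1 above, and Best Practices below) usually means fitting the model for a range of K values and comparing scores. A minimal sketch, reusing the toy data from the implementation example and setting the main hyperparameters explicitly:
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

for k in range(2, 6):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10,
                max_iter=300, tol=1e-4, random_state=42).fit(X)
    # Inertia always falls as k grows: look for the "elbow" where it flattens.
    # Silhouette lies in [-1, 1]; higher means better-separated clusters.
    print(k, round(km.inertia_, 2), round(silhouette_score(X, km.labels_), 2))
```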
## Best Practices
1. **Choosing K**
- Use methods like the Elbow Method, Silhouette Score, or Gap Statistic to determine the optimal number of clusters.
2. **Scaling Features**
- Standardize or normalize your data before applying K-Means, especially when the features are measured on different scales, since Euclidean distance is dominated by the largest-scale features (see the sketch after this list).
3. **Multiple Initializations**
- Run K-Means with several initializations (`n_init > 1`) to reduce the impact of random initialization; scikit-learn keeps the run with the lowest inertia, which raises the chance of landing near the best solution.
4. **Avoiding Local Minima**
- To mitigate the sensitivity to initial centroids, use `k-means++` initialization or run the algorithm multiple times with different initializations.
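A sketch combining practices 2 and 3: scale first, then cluster with `k-means++` and multiple restarts. The synthetic blob data here is an illustrative assumption:
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data whose second feature is on a much larger scale
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X[:, 1] *= 100

# Without scaling, the second feature would dominate the distances
model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42),
)
labels = model.fit_predict(X)
```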
## Common Applications
- **Customer Segmentation**
- **Anomaly Detection**
- **Image Compression**
- **Market Basket Analysis**
- **Document Clustering**
## Performance Optimization
1. **Choose a Good K**
- Use techniques like the Elbow Method or Silhouette Score to select an appropriate number of clusters.
2. **Dimensionality Reduction**
- If working with high-dimensional data, consider applying PCA (Principal Component Analysis) to reduce the number of features before clustering; this speeds up distance computations and can improve cluster quality (see the sketch after this list).
3. **Scaling**
- Feature scaling is crucial in K-Means. Apply Min-Max Scaling or Standard Scaling to the data before clustering.
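A sketch of the reduce-then-cluster pattern on the 64-dimensional digits dataset; the choice of 10 principal components is an illustrative assumption:
```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened into 64 features
X, _ = load_digits(return_X_y=True)

# Scale, then project onto the first 10 principal components
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10, random_state=42).fit_transform(X_scaled)

labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X_reduced)
```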
## Evaluation Metrics
- **Inertia**: Measures the sum of squared distances from each point to its assigned cluster's centroid. Lower inertia means tighter clusters, but inertia always decreases as K grows, so compare it only across runs with the same K (or across K via the Elbow Method).
- **Silhouette Score**: Measures how similar a point is to its own cluster compared to the next-nearest cluster. It ranges from -1 to 1, and a higher score indicates better-defined clusters.
- **Calinski-Harabasz Index**: The ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-defined, well-separated clusters.
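All three metrics are available in scikit-learn; a minimal sketch on synthetic blob data:
```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("Inertia:", kmeans.inertia_)  # lower = tighter clusters, for a fixed K
print("Silhouette:", silhouette_score(X, kmeans.labels_))  # in [-1, 1], higher is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, kmeans.labels_))  # higher is better
```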
See also: [[DBSCAN Clustering]]