## Overview
Hierarchical Clustering is an unsupervised machine learning algorithm that builds a hierarchy of clusters either by iteratively merging smaller clusters (agglomerative) or by iteratively splitting larger clusters (divisive). This method is useful for discovering nested groupings of data, and the results can be visualised as a tree-like diagram called a **Dendrogram**.
## Key Components
1. **Agglomerative Clustering**
- This is the most common approach where each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
2. **Divisive Clustering**
- The opposite of agglomerative clustering, starting with all points in one cluster and recursively splitting it into smaller clusters.
3. **Dendrogram**
- A tree diagram that visually represents the arrangement of clusters at various levels of similarity.
4. **Linkage Methods**
- Defines how the distance between clusters is calculated during the merging or splitting process; a short comparison sketch follows this list. Common methods include:
- **Single Linkage**: Distance between the closest points of two clusters.
- **Complete Linkage**: Distance between the farthest points of two clusters.
- **Average Linkage**: Average distance between all points in the two clusters.
- **Ward’s Linkage**: Merges the pair of clusters that gives the smallest increase in total within-cluster variance.
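The sketch below, using scipy on a small illustrative matrix (`X_toy` is a made-up toy example, not data from later in this note), shows how the choice of linkage method can change the flat clusters obtained when the same tree is cut into two groups.
```python
# A minimal comparison of linkage methods; X_toy is an illustrative toy matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X_toy = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

for method in ('single', 'complete', 'average', 'ward'):
    Z = linkage(X_toy, method=method)                 # build the merge tree
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut it into 2 flat clusters
    print(f'{method:>8}: {labels}')
```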
## How It Works
1. **Start with Each Data Point as a Cluster**
- Initially, each data point is treated as its own cluster.
2. **Measure Pairwise Distances**
- A distance measure (usually Euclidean) is calculated for every pair of clusters.
3. **Merge Closest Clusters (Agglomerative) or Split (Divisive)**
- In the agglomerative case, the two closest clusters are merged at each step; in the divisive case, a chosen cluster (typically the most heterogeneous one) is split. A short sketch of these steps follows this list.
4. **Repeat Until a Stopping Condition is Met**
- Merging continues until all points belong to a single cluster (agglomerative), splitting continues until each point is its own cluster (divisive), or the process stops early once the desired number of clusters is reached.
5. **Generate Dendrogram**
- The process is visualised as a dendrogram to show the cluster hierarchy.
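As a rough sketch of these steps (again on an illustrative toy matrix `X_toy`), scipy's `pdist` computes the pairwise distances of step 2, and each row of the linkage matrix records one merge from steps 3–4.
```python
# A minimal walkthrough of the agglomerative steps on an illustrative toy matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

X_toy = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Step 2: Euclidean distance between every pair of points (condensed form).
D = pdist(X_toy, metric='euclidean')
print(squareform(D))  # the full 6 x 6 distance matrix

# Steps 3-4: each row of the linkage matrix is one merge:
# [cluster_i, cluster_j, merge_distance, size_of_new_cluster]
Z = linkage(X_toy, method='ward')
for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f'step {step}: merge {int(a)} and {int(b)} at distance {dist:.2f} '
          f'-> cluster of size {int(size)}')
```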
## Implementation Example
```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# Example data (X is a feature matrix)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11], [5.5, 9]])
# Create the agglomerative model: 3 clusters, Ward linkage (Euclidean distances)
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
# Fit the model and obtain a cluster label for each point
labels = agg_clustering.fit_predict(X)
# Plot the clustered points, coloured by label
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.title('Agglomerative Clustering')
plt.show()
# Create and plot dendrogram
linked = linkage(X, 'ward')
plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.title('Dendrogram')
plt.show()
```
## Advantages
- **No Need to Predefine Number of Clusters**: Unlike [[K-Means Clustering]], hierarchical clustering doesn't require the number of clusters to be specified upfront.
- **Visualizable**: The dendrogram provides a clear visualization of the hierarchy and how clusters are merged or split.
- **Handles Non-Spherical Clusters**: Depending on the linkage method (single linkage in particular), it can recover elongated or non-convex clusters, whereas K-Means assumes roughly spherical clusters.
- **Works Well with Small Datasets**: Suitable for smaller datasets where the computational cost of building the hierarchical tree is not prohibitive.
## Disadvantages
- **Computationally Expensive**: Hierarchical clustering can be quite slow for large datasets due to the need to calculate pairwise distances between all data points.
- **Sensitive to Noise**: Outliers can significantly affect the results, since there is no built-in noise-handling mechanism like the one in DBSCAN.
- **Doesn't Scale Well**: Standard agglomerative implementations need O(n^2) memory for the distance matrix and between O(n^2) and O(n^3) time depending on the linkage, so cost grows rapidly with dataset size.
- **Struggles with High Dimensionality**: Distance measures become less informative when the number of features is very large, which reduces clustering quality.
## Hyperparameters
1. **n_clusters**
- The number of clusters to form. In scikit-learn this can be set to `None` together with `distance_threshold`, in which case the tree is cut at a distance rather than at a fixed cluster count.
2. **linkage**
- Determines how the distance between clusters is computed. Options include:
- 'ward' (minimizes the variance within clusters)
- 'complete' (farthest points)
- 'single' (closest points)
- 'average' (average distance)
3. **affinity / metric**
- The metric used to compute pairwise distances (renamed from `affinity` to `metric` in recent scikit-learn releases). Common values are 'euclidean', 'manhattan', etc.; note that 'ward' linkage only supports Euclidean distances.
4. **compute_full_tree**
- Whether to compute the full tree or stop early once the desired number of clusters is reached; it must be True when `distance_threshold` is used (see the sketch after this list).
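A minimal sketch of these hyperparameters in scikit-learn is shown below; the `distance_threshold=5.0` value is arbitrary and chosen only for illustration.
```python
# A minimal sketch of AgglomerativeClustering hyperparameters.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11], [5.5, 9]])

# Fixed number of clusters with Ward linkage (Euclidean distances only).
model_fixed = AgglomerativeClustering(n_clusters=3, linkage='ward')
print(model_fixed.fit_predict(X))

# Alternatively, cut the tree at a distance threshold instead of fixing
# n_clusters; compute_full_tree must then be True. The threshold value 5.0
# is arbitrary and purely illustrative.
model_threshold = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=5.0,
    linkage='average',
    metric='manhattan',   # renamed from 'affinity' in recent scikit-learn versions
    compute_full_tree=True,
)
print(model_threshold.fit_predict(X))
```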
## Best Practices
1. **Choosing the Right Linkage Method**
- Experiment with different linkage methods (e.g., ward, single, complete) to see which works best for your data. Ward’s linkage is often a good default choice as it minimises variance.
2. **Scaling Data**
- Before applying hierarchical clustering, it’s a good idea to standardise or normalise the data to ensure that each feature contributes equally to the distance metric.
3. **Using Dendrograms**
- Visualise the dendrogram to help determine the optimal number of clusters by "cutting" it at a specific height; a sketch combining this with feature scaling follows this list.
4. **Handling Noise and Outliers**
- Preprocess data by removing outliers or using robust distance measures to avoid their effect on the clustering results.
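The sketch below combines practices 2 and 3: the toy feature matrix from the implementation example is standardised first, and the resulting tree is then cut at an illustrative height to obtain flat cluster labels.
```python
# A minimal sketch: scale features, build the hierarchy, cut at a height.
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11], [5.5, 9]])

# Scale each feature to zero mean and unit variance so no single feature
# dominates the Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)

# Build the hierarchy, then "cut" it at a height of 2.0 (an illustrative
# value; in practice, inspect the dendrogram to choose it).
Z = linkage(X_scaled, method='ward')
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)
```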
## Common Applications
- **Gene Expression Clustering**
- Grouping genes with similar expression profiles, useful in biological research.
- **Market Segmentation**
- Clustering customers with similar purchasing behaviours to identify distinct customer groups.
- **Document Clustering**
- Grouping text documents into similar categories or topics based on content.
- **Anomaly Detection**
- Identifying outlier behaviour in datasets, useful in fraud detection or system monitoring.
## Performance Optimisation
1. **Efficient Distance Computation**
- Use approximate methods or dimensionality reduction techniques (e.g., PCA) for faster distance calculations when dealing with large datasets; see the sketch after this list.
2. **Data Pruning**
- If the dataset contains irrelevant features or outliers, remove them to speed up clustering and improve results.
3. **Parallelisation**
- For very large datasets, consider parallelising distance calculations or using optimized libraries that implement hierarchical clustering more efficiently.
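A minimal sketch of the dimensionality-reduction idea from point 1, using a synthetic high-dimensional matrix (`X_high_dim`) purely for illustration: PCA projects the data onto a handful of components before the pairwise distances are computed.
```python
# A minimal sketch: reduce dimensionality with PCA before clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X_high_dim = rng.normal(size=(500, 100))   # 500 samples, 100 features (synthetic)

# Project onto the top 10 principal components; distances are then computed
# in 10 dimensions rather than 100.
X_reduced = PCA(n_components=10).fit_transform(X_high_dim)

labels = AgglomerativeClustering(n_clusters=5, linkage='ward').fit_predict(X_reduced)
print(labels[:20])
```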
## Evaluation Metrics
- **Silhouette Score**: Measures how similar an object is to its own cluster compared to other clusters. A higher silhouette score indicates better clustering.
- **Cophenetic Correlation Coefficient**: Measures how faithfully the dendrogram distances reproduce the original pairwise distances between the data points (see the sketch after this list).
- **Dunn Index**: Measures the compactness and separation of clusters. Higher values are better.
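A minimal sketch computing the first two metrics on the toy data from the implementation example; the Dunn index is not implemented in scikit-learn or scipy, so it is omitted here.
```python
# A minimal sketch of two evaluation metrics for a hierarchical clustering.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11], [5.5, 9]])

# Silhouette score of the flat clustering (higher is better).
labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
print('Silhouette score:', silhouette_score(X, labels))

# Cophenetic correlation: how faithfully the dendrogram distances reproduce
# the original pairwise distances (closer to 1 is better).
Z = linkage(X, method='ward')
coeff, _ = cophenet(Z, pdist(X))
print('Cophenetic correlation:', coeff)
```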