Dimensionality reduction is a technique in machine learning and data analysis that reduces the number of input variables or features in a dataset while retaining as much of the important information as possible. The main goal is to simplify the data, improve computational efficiency, and reduce noise, which can help in visualising, interpreting, and training models on high-dimensional data.
![[DR.png]]
When dealing with high-dimensional data (where each data point has many features), dimensionality reduction helps in:
- **Reducing overfitting**: Fewer features can lead to simpler models, less prone to overfitting.
- **Improving visualisation**: High-dimensional data can be hard to visualise, but by reducing the dimensions to 2 or 3, the data becomes easier to explore visually.
- **Speeding up computation**: Lower-dimensional data requires less memory and can be processed faster by machine learning models.
There are two main types of dimensionality reduction techniques:
1. **Feature Selection**: Involves selecting a subset of the most important features from the original dataset. This is often done based on statistical techniques or domain knowledge.
2. **Feature Extraction**: Involves creating new features by transforming the original features into a lower-dimensional space, typically through mathematical methods. A minimal comparison of the two approaches is sketched below.
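To make the distinction concrete, here is a minimal scikit-learn sketch; the Wine dataset and the choice of 5 features/components are illustrative assumptions, not anything prescribed above. Selection keeps 5 of the original columns, while extraction manufactures 5 new ones:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)  # 178 samples, 13 features

# Feature selection: keep the 5 original features most related to the labels.
X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Feature extraction: build 5 new features as combinations of all 13.
X_ext = PCA(n_components=5).fit_transform(X)

print(X_sel.shape, X_ext.shape)  # both (178, 5), but with different meanings
```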
Common dimensionality reduction techniques include the following; a short code sketch of each appears after the list:
- **Principal Component Analysis (PCA)**: A linear technique that projects the data onto a new set of orthogonal axes (principal components) ordered by how much variance they capture. Keeping only the leading components reduces the dimensionality of large datasets while preserving as much of the data's variation as possible.
- **Linear Discriminant Analysis (LDA)**: A supervised method that reduces dimensionality while preserving as much class discriminatory information as possible. It's mainly used when the data has labeled classes.
- **t-Distributed Stochastic Neighbor Embedding (t-SNE)**: A non-linear technique often used for high-dimensional data visualisation, which maps data to 2 or 3 dimensions while preserving local structures.
- **Autoencoders**: A type of neural network used for unsupervised learning of efficient codings: it learns to map input data to a lower-dimensional space and then reconstruct it back to the original dimension, so the encoder half acts as the dimensionality reducer.
- **Independent Component Analysis (ICA)**: Similar to PCA but focuses on making the components statistically independent, which can be useful in certain applications like blind signal separation.
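A minimal PCA sketch using scikit-learn; the Wine dataset and the choice of two components are illustrative assumptions:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the 13-feature Wine dataset and standardise it (PCA is scale-sensitive).
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of greatest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance captured per component
```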
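A similarly minimal LDA sketch, again assuming the Wine dataset for illustration; unlike PCA, LDA needs the class labels:

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)

# LDA is supervised: it uses the labels y and can produce at most
# n_classes - 1 components (Wine has 3 classes, so up to 2).
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)
```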
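A t-SNE sketch for visualisation, assuming scikit-learn's digits dataset as a stand-in for any high-dimensional data; the perplexity value is just a common default:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images: each sample lives in a 64-dimensional space.
X, y = load_digits(return_X_y=True)

# Map to 2-D for plotting; perplexity controls the size of the local
# neighbourhoods whose structure t-SNE tries to preserve.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)  # coordinates suitable for a scatter plot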
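A small autoencoder sketch in PyTorch (an assumed framework choice; the layer sizes and random stand-in data are purely illustrative). The encoder half performs the dimensionality reduction:

```python
import torch
import torch.nn as nn

n_features, n_latent = 64, 2  # illustrative sizes

# Encoder compresses to a 2-D bottleneck; decoder reconstructs the input.
encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, n_latent))
decoder = nn.Sequential(nn.Linear(n_latent, 16), nn.ReLU(), nn.Linear(16, n_features))
model = nn.Sequential(encoder, decoder)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

X = torch.randn(256, n_features)  # stand-in data; substitute real features
for _ in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(X), X)  # reconstruction error is the training signal
    loss.backward()
    optimiser.step()

codes = encoder(X).detach()  # the learned low-dimensional representation
```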
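An ICA sketch of blind signal separation with scikit-learn's `FastICA`, using assumed synthetic sources for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two synthetic sources mixed together (a toy blind-source-separation setup).
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]  # sine + square wave
A = np.array([[1.0, 0.5], [0.5, 1.0]])            # mixing matrix
X = S @ A.T                                       # observed mixed signals

# FastICA recovers components that are statistically independent,
# approximating the original sources up to scale and ordering.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)
```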
Dimensionality reduction techniques help counter the "curse of dimensionality", where the performance of many algorithms degrades as the number of dimensions increases, partly because distances between points become less informative; the sketch below illustrates this distance-concentration effect.
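A quick demonstration on random uniform data (an illustrative setup): the ratio of the nearest to the farthest distance from a point approaches 1 as the dimension grows, so "near" and "far" neighbours become hard to tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                       # 500 random points in d dimensions
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
    print(d, round(dists.min() / dists.max(), 3))  # ratio creeps towards 1 as d grows
```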
## Dimensionality Reduction Resources
### Video Tutorials
1. **StatQuest with Josh Starmer**
- [PCA Step-by-Step](https://www.youtube.com/watch?v=FgakZw6K1QQ): Clear, intuitive explanation of Principal Component Analysis with visual examples
- [t-SNE, Clearly Explained](https://www.youtube.com/watch?v=NEaUSP4YerM): Simplified explanation of t-SNE for visualisation of high-dimensional data
2. **Krish Naik's Dimensionality Reduction**
- [Dimensionality Reduction Techniques](https://www.youtube.com/watch?v=QdBy02ExhGI): Practical overview of multiple techniques with Python implementation
### Books
1. **Feature Engineering for Machine Learning**
- Authors: Alice Zheng & Amanda Casari
- [O'Reilly Link](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/)
- Dedicated chapters on dimensionality reduction with practical applications
2. **Python Machine Learning**
- Author: Sebastian Raschka
- [Packt Link](https://www.packtpub.com/product/python-machine-learning-third-edition/9781789955750)
- Comprehensive coverage of PCA, LDA, and kernel-based dimensionality reduction
### Online Courses
1. **Coursera: Machine Learning - Dimensionality Reduction**
- [Course Link](https://www.coursera.org/lecture/machine-learning/principal-component-analysis-problem-formulation-GBFTt)
- Andrew Ng's clear explanation of PCA with mathematical foundations and applications
2. **DataCamp: Dimensionality Reduction in Python**
- [Course Link](https://www.datacamp.com/courses/dimensionality-reduction-in-python)
- Hands-on practice with PCA, t-SNE, and UMAP for data visualisation and preprocessing
### Datasets for Practice
1. **MNIST Dataset** - Perfect for visualising high-dimensional image data in 2D using t-SNE or PCA
2. **Gene Expression Cancer RNA-Seq** - Biological dataset with high dimensionality, ideal for demonstrating the power of dimensionality reduction
3. **Wine Dataset** - Small dataset with multiple features, good for beginners to practice PCA and LDA
These resources focus specifically on the dimensionality reduction techniques most commonly used in practice, with an emphasis on visualisation and practical implementation.