## Overview
K-Nearest Neighbors (KNN) is a simple and intuitive supervised learning algorithm used for both classification and regression. It predicts the label of a data point from the majority class (classification) or the average value (regression) of its K nearest neighbors in the training data.
## Key Components
1. **K Value**: Number of nearest neighbors considered
2. **Distance Metric**: Determines how closeness is measured (e.g., Euclidean distance)
3. **Voting Mechanism**: Majority vote for classification, average for regression
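For reference, the most commonly used distance metrics between two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ are:

$$
\begin{aligned}
d_{\text{Euclidean}}(x, y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \\
d_{\text{Manhattan}}(x, y) &= \sum_{i=1}^{n} |x_i - y_i| \\
d_{\text{Minkowski}}(x, y) &= \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
\end{aligned}
$$

Minkowski distance reduces to Manhattan distance for $p = 1$ and to Euclidean distance for $p = 2$.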
## How It Works
1. **Choose K (Number of Neighbors)**
- A small K makes the model sensitive to noise
- A large K provides smoother decision boundaries
2. **Compute Distance**
- Uses Euclidean, Manhattan, or Minkowski distance
- Smaller distances indicate more similar points
3. **Determine Nearest Neighbors**
- Sort points by distance
- Pick K closest data points
4. **Classify or Predict**
- Classification: Majority voting of nearest neighbors
- Regression: Average value of neighbors
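To make the steps above concrete, here is a minimal from-scratch sketch of KNN classification using NumPy; the function name `knn_predict`, the Euclidean metric, and the toy data are illustrative choices, not part of any library API.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify a single query point by majority vote of its k nearest neighbors."""
    # Step 2: compute the Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: sort by distance and keep the indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of those neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: two classes in 2D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: 0
```

In practice, library implementations add tie-breaking rules and faster neighbor-search structures on top of this basic procedure.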
## Implementation Example
The example below uses the Iris dataset from scikit-learn so that the snippet runs end to end.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load an example dataset (Iris) and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize model with K=5 and Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# Train model (as a lazy learner, fit simply stores the training data)
knn.fit(X_train, y_train)

# Make predictions and evaluate
predictions = knn.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')
```
## Advantages
- Simple and easy to implement
- Virtually no training cost; as a lazy learner, fitting only stores the training data
- Works well with small datasets
- Can handle multi-class classification
## Disadvantages
- Computationally expensive for large datasets
- Sensitive to irrelevant features and noise
- Requires choosing an optimal K value
- Performance depends on data scaling
## Hyperparameters
1. **K Value**
- Small K → More variance, risk of overfitting
- Large K → More bias, risk of underfitting
2. **Distance Metric**
- **Euclidean**: Default, works well with continuous data
- **Manhattan**: Suitable for high-dimensional data
- **Minkowski**: Generalizes both (p = 1 gives Manhattan, p = 2 gives Euclidean)
3. **Weighting Strategy**
- Uniform: Equal weight to all neighbors
- Distance-based: Closer points have higher influence
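The sketch below shows how these hyperparameters map onto scikit-learn's `KNeighborsClassifier`; the specific values (K = 7, Manhattan distance) are illustrative rather than recommendations.

```python
from sklearn.neighbors import KNeighborsClassifier

# Distance-weighted KNN: closer neighbors get proportionally larger votes
knn_weighted = KNeighborsClassifier(
    n_neighbors=7,             # K value
    metric='minkowski', p=1,   # Minkowski with p=1 is Manhattan; p=2 is Euclidean
    weights='distance',        # 'uniform' would give every neighbor an equal vote
)
```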
## Best Practices
1. **Feature Scaling**
- Standardize or normalize data for distance-based calculations
2. **Choosing Optimal K**
- Use cross-validation to find the best value
3. **Handling Imbalanced Data**
- Use weighted KNN to balance class influence
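One way to combine the first two practices is to scale features inside a pipeline and select K by cross-validation. This is a minimal sketch using scikit-learn's `Pipeline` and `GridSearchCV`; the Iris data and the parameter grid are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Standardize features before computing distances, then classify with KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Search over K and the weighting strategy with 5-fold cross-validation
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```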
## Common Applications
- Handwriting recognition
- Image classification
- Recommender systems
- Anomaly detection
- Medical diagnosis
## Performance Optimization
1. **Grid Search**
- Fine-tune K and distance metrics
- Optimize performance
2. **Dimensionality Reduction**
- Use PCA to reduce the number of features before distance computation
- Speed up computations
3. **KD-Tree / Ball-Tree**
- Efficient nearest neighbor search
- Reduces time complexity
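A sketch of the second and third ideas together: project the data to fewer dimensions with PCA, then use a KD-tree for neighbor search. The digits dataset, the 20 components, and the leaf size are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# Reduce the 64 pixel features to 20 principal components, then use a
# KD-tree instead of brute-force distance computation for neighbor lookups
model = Pipeline([
    ('pca', PCA(n_components=20)),
    ('knn', KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=30)),
])
model.fit(X, y)
print(model.score(X, y))
```

Note that KD-trees and Ball-trees help most in low to moderate dimensions; in very high-dimensional spaces their advantage over brute force shrinks.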
## Evaluation Metrics
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC (area under the ROC curve)
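All of these metrics are available in `sklearn.metrics`. A brief sketch, using a binary dataset so ROC-AUC applies directly; the breast cancer dataset and K = 5 are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Binary classification example so ROC-AUC can be computed directly
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Accuracy, precision, recall, and F1 score in one report
print(classification_report(y_test, knn.predict(X_test)))

# ROC-AUC is computed from the predicted probability of the positive class
print(roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]))
```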