## Overview
K-Nearest Neighbors (KNN) is a simple and intuitive supervised learning algorithm used for both classification and regression. It predicts the label of a data point from the majority class (classification) or the average value (regression) of its K nearest neighbors in the training data.
## Key Components
1. **K Value**: Number of nearest neighbors considered
2. **Distance Metric**: Determines how closeness is measured (e.g., Euclidean distance)
3. **Voting Mechanism**: Majority vote for classification, average for regression
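For reference, the most commonly used distance metrics between two points $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ are:

$$
\begin{aligned}
d_{\text{Euclidean}}(x, y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \\
d_{\text{Manhattan}}(x, y) &= \sum_{i=1}^{n} |x_i - y_i| \\
d_{\text{Minkowski}}(x, y) &= \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
\end{aligned}
$$

Minkowski distance reduces to Manhattan distance for $p = 1$ and to Euclidean distance for $p = 2$.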
## How It Works
1. **Choose K (Number of Neighbors)**
- A small K makes the model sensitive to noise
- A large K provides smoother decision boundaries
2. **Compute Distance**
- Uses Euclidean, Manhattan, or Minkowski distance
- Smaller distances indicate more similar points
3. **Determine Nearest Neighbors**
- Sort points by distance
- Pick K closest data points
4. **Classify or Predict**
- Classification: Majority voting of nearest neighbors
- Regression: Average value of neighbors
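To make the steps above concrete, here is a minimal from-scratch sketch of KNN classification using NumPy; the function name `knn_predict`, the Euclidean metric, and the toy data are illustrative choices, not part of any library API.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify a single query point by majority vote of its k nearest neighbors."""
    # Step 2: compute the Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: sort by distance and keep the indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the labels of those neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: two classes in 2D
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: 0
```

In practice, library implementations add tie-breaking rules and faster neighbor-search structures on top of this basic procedure.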
## Implementation Example
The example below uses the Iris dataset from scikit-learn so that the snippet runs end to end.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load an example dataset (Iris) and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize model with K=5 and Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# Train model (as a lazy learner, fit simply stores the training data)
knn.fit(X_train, y_train)

# Make predictions and evaluate
predictions = knn.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')
```
## Advantages
- Simple and easy to implement
- Virtually no training cost; as a lazy learner, fitting only stores the training data
- Works well with small datasets
- Can handle multi-class classification
## Disadvantages
- Computationally expensive for large datasets
- Sensitive to irrelevant features and noise
- Requires choosing an optimal K value
- Performance depends on data scaling
## Hyperparameters
1. **K Value**
- Small K → More variance, risk of overfitting
- Large K → More bias, risk of underfitting
2. **Distance Metric**
- **Euclidean**: Default, works well with continuous data
- **Manhattan**: Suitable for high-dimensional data
- **Minkowski**: Generalizes both (p = 1 gives Manhattan, p = 2 gives Euclidean)
3. **Weighting Strategy**
- Uniform: Equal weight to all neighbors
- Distance-based: Closer points have higher influence
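The sketch below shows how these hyperparameters map onto scikit-learn's `KNeighborsClassifier`; the specific values (K = 7, Manhattan distance) are illustrative rather than recommendations.

```python
from sklearn.neighbors import KNeighborsClassifier

# Distance-weighted KNN: closer neighbors get proportionally larger votes
knn_weighted = KNeighborsClassifier(
    n_neighbors=7,             # K value
    metric='minkowski', p=1,   # Minkowski with p=1 is Manhattan; p=2 is Euclidean
    weights='distance',        # 'uniform' would give every neighbor an equal vote
)
```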
## Best Practices
1. **Feature Scaling**
- Standardize or normalize data for distance-based calculations
2. **Choosing Optimal K**
- Use cross-validation to find the best value
3. **Handling Imbalanced Data**
- Use weighted KNN to balance class influence
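One way to combine the first two practices is to scale features inside a pipeline and select K by cross-validation. This is a minimal sketch using scikit-learn's `Pipeline` and `GridSearchCV`; the Iris data and the parameter grid are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Standardize features before computing distances, then classify with KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Search over K and the weighting strategy with 5-fold cross-validation
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11],
    'knn__weights': ['uniform', 'distance'],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```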
## Common Applications
- Handwriting recognition
- Image classification
- Recommender systems
- Anomaly detection
- Medical diagnosis
## Performance Optimization
1. **Grid Search**
- Fine-tune K and distance metrics
- Optimize performance
2. **Dimensionality Reduction**
- Use PCA to reduce the number of features before distance computation
- Speed up computations
3. **KD-Tree / Ball-Tree**
- Efficient nearest neighbor search
- Reduces time complexity
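A sketch of the second and third ideas together: project the data to fewer dimensions with PCA, then use a KD-tree for neighbor search. The digits dataset, the 20 components, and the leaf size are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

# Reduce the 64 pixel features to 20 principal components, then use a
# KD-tree instead of brute-force distance computation for neighbor lookups
model = Pipeline([
    ('pca', PCA(n_components=20)),
    ('knn', KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=30)),
])
model.fit(X, y)
print(model.score(X, y))
```

Note that KD-trees and Ball-trees help most in low to moderate dimensions; in very high-dimensional spaces their advantage over brute force shrinks.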
## Evaluation Metrics
- Accuracy
- Precision
- Recall
- F1 Score
- ROC-AUC (area under the ROC curve)
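All of these metrics are available in `sklearn.metrics`. A brief sketch, using a binary dataset so ROC-AUC applies directly; the breast cancer dataset and K = 5 are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Binary classification example so ROC-AUC can be computed directly
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Accuracy, precision, recall, and F1 score in one report
print(classification_report(y_test, knn.predict(X_test)))

# ROC-AUC is computed from the predicted probability of the positive class
print(roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]))
```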