## Overview
Random Forest Regression is an ensemble machine learning method that uses a collection of decision trees to make predictions. Each tree in the forest predicts a value, and the final prediction is the average of all the trees' predictions. Compared to a single decision tree, this reduces overfitting, improves accuracy, and increases robustness.
## Key Components
1. **Multiple Decision Trees**
- Random Forest builds many decision trees, each trained on a random subset of the data using bootstrapping (sampling with replacement).
2. **Random Feature Selection**
- At each node, a random subset of features is considered for splitting, which helps improve diversity among the trees.
3. **Prediction Aggregation**
- For regression tasks, the predictions of all individual trees are averaged to obtain the final prediction.
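
A minimal NumPy sketch of the first two components (the array shapes, feature counts, and random seed below are arbitrary, chosen only for illustration): a bootstrap sample draws row indices with replacement, and a random subset of feature indices is drawn without replacement for use at a split.
```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 5))          # toy data: 8 samples, 5 features

# Bootstrapping: sample row indices with replacement (same size as the data)
bootstrap_idx = rng.choice(len(X), size=len(X), replace=True)
X_bootstrap = X[bootstrap_idx]

# Random feature selection: consider only a subset of features at a split
n_features_at_split = 2
feature_idx = rng.choice(X.shape[1], size=n_features_at_split, replace=False)

print("Bootstrap row indices:", bootstrap_idx)      # duplicates are expected
print("Candidate features at this split:", feature_idx)
```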
## How It Works
1. **Bootstrapping**
- Random subsets of data are selected (with replacement) to train each tree in the forest.
2. **Tree Construction**
- Each decision tree is trained independently using the bootstrapped data and random feature selection.
3. **Prediction**
- When making predictions, the input data is passed through all the trees in the forest. Each tree provides a prediction, and the final output is the average of all tree predictions.
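
The three steps above can be sketched by hand with scikit-learn's `DecisionTreeRegressor`, as a simplified stand-in for what `RandomForestRegressor` does internally (the synthetic data and the choice of 10 trees are for illustration only):
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy dataset purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(10):
    # Step 1: bootstrap sample of the training data
    idx = rng.choice(len(X), size=len(X), replace=True)
    # Step 2: grow a tree on the bootstrap sample
    # (max_features limits the features considered at each split)
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: average the individual tree predictions
tree_predictions = np.column_stack([t.predict(X) for t in trees])
ensemble_prediction = tree_predictions.mean(axis=1)
print(ensemble_prediction[:5])
```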
## Implementation Example
```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Example data (synthetic, for illustration; replace with your own dataset)
X, y = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Random Forest Regressor model
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train model
rf_regressor.fit(X_train, y_train)

# Make predictions on the test set
predictions = rf_regressor.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse:.2f}')
```
## Advantages
- **Reduces Overfitting**: Combines multiple decision trees to reduce the risk of overfitting that a single decision tree might have.
- **Improves Accuracy**: Averaging many trees reduces variance, which typically yields more accurate predictions than any single tree.
- **Handles Non-linearity**: Can model complex relationships in the data.
- **Robust to Outliers**: Less sensitive to noisy data compared to a single decision tree.
- **Works with both Numerical and Categorical Data**: No scaling or normalisation is needed; categorical features only need to be encoded as numbers for implementations such as scikit-learn.
## Disadvantages
- **Computationally Intensive**: Building multiple trees can require significant computational resources and memory, especially for large datasets.
- **Difficult to Interpret**: Unlike a single decision tree, the model is harder to interpret due to the ensemble nature of the trees.
- **Slow Prediction**: Making predictions can be slower compared to simpler models since each prediction requires evaluating many trees.
## Hyperparameters
1. **Number of Trees (`n_estimators`)**
- The number of decision trees in the forest. More trees typically lead to better performance but increase computational cost.
2. **Maximum Depth (`max_depth`)**
- The maximum depth of each tree. Limiting depth helps prevent overfitting.
3. **Minimum Samples Split (`min_samples_split`)**
- The minimum number of samples required to split an internal node. This helps control the size of the tree and reduces overfitting.
4. **Minimum Samples Leaf (`min_samples_leaf`)**
- The minimum number of samples required to be at a leaf node. This ensures that leaf nodes contain a sufficient number of data points.
5. **Maximum Features (`max_features`)**
- The maximum number of features to consider when splitting a node. Using fewer features at each split increases tree diversity.
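
As a rough illustration, these hyperparameters map directly onto constructor arguments of scikit-learn's `RandomForestRegressor`; the values below are arbitrary starting points, not recommendations:
```python
from sklearn.ensemble import RandomForestRegressor

# Arbitrary starting values; tune them for your own dataset
rf = RandomForestRegressor(
    n_estimators=200,        # number of trees in the forest
    max_depth=10,            # cap tree depth to limit overfitting
    min_samples_split=5,     # samples needed to split an internal node
    min_samples_leaf=2,      # samples required at each leaf
    max_features='sqrt',     # features considered at each split
    random_state=42,
)
```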
## Best Practices
1. **Hyperparameter Tuning**
- Use Grid Search or Randomised Search to find optimal hyperparameters such as `n_estimators`, `max_depth`, and `min_samples_split` (see the sketch after this list).
2. **Feature Engineering**
- Remove irrelevant or highly correlated features, as they can reduce model performance and increase complexity.
3. **Cross-Validation**
- Use k-fold cross-validation to evaluate model performance and ensure good generalization.
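
A minimal sketch of the tuning and cross-validation practices, assuming `X_train` and `y_train` from the implementation example above (the parameter grid is illustrative only):
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Illustrative parameter grid; adjust ranges for your own problem
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)
print('Best parameters:', grid_search.best_params_)

# k-fold cross-validation of the tuned model
scores = cross_val_score(grid_search.best_estimator_, X_train, y_train,
                         cv=5, scoring='r2')
print(f'Mean CV R^2: {scores.mean():.3f}')
```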
## Common Applications
- **Predicting stock prices**
- **Housing price prediction**
- **Sales forecasting**
- **Energy consumption forecasting**
- **Medical prediction tasks** (e.g., predicting disease outcomes)
## Performance Optimisation
1. **Use More Trees**
- Increasing the number of trees generally leads to more robust and accurate models, but at the cost of increased computation.
2. **Parallelisation**
- Random forests support parallel training across trees; in scikit-learn this is enabled by setting `n_jobs=-1` for faster fitting (see the sketch after this list).
3. **Feature Selection**
- Consider removing irrelevant features to reduce model complexity and computation time.
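
A brief sketch of the parallelisation and feature-selection points, assuming `X_train` and `y_train` from the implementation example above; the 0.01 importance threshold is an arbitrary illustration:
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Parallel training across all available CPU cores
rf_parallel = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
rf_parallel.fit(X_train, y_train)

# Use impurity-based feature importances to drop near-useless features
# (the 0.01 threshold is arbitrary and only for illustration)
importances = rf_parallel.feature_importances_
keep = np.where(importances > 0.01)[0]
X_train_reduced = X_train[:, keep]
print('Kept feature indices:', keep)
```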
## Evaluation Metrics
- **Mean Squared Error (MSE)**: Measures the average squared difference between the predicted and actual values.
- **Root Mean Squared Error (RMSE)**: The square root of the MSE, providing an interpretation in the same units as the target variable.
- **R² Score**: Measures how well the model explains the variance in the data. A value closer to 1 indicates better performance.
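
A short sketch computing all three metrics with scikit-learn, assuming `y_test` and `predictions` from the implementation example above:
```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)                      # same units as the target variable
r2 = r2_score(y_test, predictions)

print(f'MSE:  {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R^2:  {r2:.3f}')
```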