## Overview
Random Forest Regression is an ensemble machine learning method that uses a collection of decision trees to make predictions. Each tree in the forest predicts a value, and the final prediction is the average of all the trees' predictions. Compared to a single decision tree, this reduces overfitting, improves accuracy, and increases robustness.
## Key Components
1. **Multiple Decision Trees**
- Random Forest builds many decision trees, each trained on a random subset of the data using bootstrapping (sampling with replacement).
2. **Random Feature Selection**
- At each node, a random subset of features is considered for splitting, which helps improve diversity among the trees.
3. **Prediction Aggregation**
- For regression tasks, the predictions of all individual trees are averaged to obtain the final prediction.
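
A minimal NumPy sketch of the first two components (the array shapes, feature counts, and random seed below are arbitrary, chosen only for illustration): a bootstrap sample draws row indices with replacement, and a random subset of feature indices is drawn without replacement for use at a split.
```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(8, 5))          # toy data: 8 samples, 5 features

# Bootstrapping: sample row indices with replacement (same size as the data)
bootstrap_idx = rng.choice(len(X), size=len(X), replace=True)
X_bootstrap = X[bootstrap_idx]

# Random feature selection: consider only a subset of features at a split
n_features_at_split = 2
feature_idx = rng.choice(X.shape[1], size=n_features_at_split, replace=False)

print("Bootstrap row indices:", bootstrap_idx)      # duplicates are expected
print("Candidate features at this split:", feature_idx)
```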
## How It Works
1. **Bootstrapping**
- Random subsets of data are selected (with replacement) to train each tree in the forest.
2. **Tree Construction**
- Each decision tree is trained independently using the bootstrapped data and random feature selection.
3. **Prediction**
- When making predictions, the input data is passed through all the trees in the forest. Each tree provides a prediction, and the final output is the average of all tree predictions.
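
The three steps above can be sketched by hand with scikit-learn's `DecisionTreeRegressor`, as a simplified stand-in for what `RandomForestRegressor` does internally (the synthetic data and the choice of 10 trees are for illustration only):
```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy dataset purely for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(10):
    # Step 1: bootstrap sample of the training data
    idx = rng.choice(len(X), size=len(X), replace=True)
    # Step 2: grow a tree on the bootstrap sample
    # (max_features limits the features considered at each split)
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: average the individual tree predictions
tree_predictions = np.column_stack([t.predict(X) for t in trees])
ensemble_prediction = tree_predictions.mean(axis=1)
print(ensemble_prediction[:5])
```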
## Implementation Example
```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Example data (synthetic, for illustration; replace with your own dataset)
X, y = make_regression(n_samples=1000, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Random Forest Regressor model
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Train model
rf_regressor.fit(X_train, y_train)

# Make predictions on the test set
predictions = rf_regressor.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, predictions)
print(f'Mean Squared Error: {mse:.2f}')
```
## Advantages
- **Reduces Overfitting**: Combines multiple decision trees to reduce the risk of overfitting that a single decision tree might have.
- **Improves Accuracy**: Averaging many trees reduces variance, which typically yields more accurate predictions than any single tree.
- **Handles Non-linearity**: Can model complex relationships in the data.
- **Robust to Outliers**: Less sensitive to noisy data compared to a single decision tree.
- **Works with both Numerical and Categorical Data**: No scaling or normalisation is needed; categorical features only need to be encoded as numbers for implementations such as scikit-learn.
## Disadvantages
- **Computationally Intensive**: Building multiple trees can require significant computational resources and memory, especially for large datasets.
- **Difficult to Interpret**: Unlike a single decision tree, the model is harder to interpret due to the ensemble nature of the trees.
- **Slow Prediction**: Making predictions can be slower compared to simpler models since each prediction requires evaluating many trees.
## Hyperparameters
1. **Number of Trees (`n_estimators`)**
- The number of decision trees in the forest. More trees typically lead to better performance but increase computational cost.
2. **Maximum Depth (`max_depth`)**
- The maximum depth of each tree. Limiting depth helps prevent overfitting.
3. **Minimum Samples Split (`min_samples_split`)**
- The minimum number of samples required to split an internal node. This helps control the size of the tree and reduces overfitting.
4. **Minimum Samples Leaf (`min_samples_leaf`)**
- The minimum number of samples required to be at a leaf node. This ensures that leaf nodes contain a sufficient number of data points.
5. **Maximum Features (`max_features`)**
- The maximum number of features to consider when splitting a node. Using fewer features at each split increases tree diversity.
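
As a rough illustration, these hyperparameters map directly onto constructor arguments of scikit-learn's `RandomForestRegressor`; the values below are arbitrary starting points, not recommendations:
```python
from sklearn.ensemble import RandomForestRegressor

# Arbitrary starting values; tune them for your own dataset
rf = RandomForestRegressor(
    n_estimators=200,        # number of trees in the forest
    max_depth=10,            # cap tree depth to limit overfitting
    min_samples_split=5,     # samples needed to split an internal node
    min_samples_leaf=2,      # samples required at each leaf
    max_features='sqrt',     # features considered at each split
    random_state=42,
)
```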
## Best Practices
1. **Hyperparameter Tuning**
- Use Grid Search or Randomised Search to find optimal hyperparameters such as `n_estimators`, `max_depth`, and `min_samples_split` (see the sketch after this list).
2. **Feature Engineering**
- Remove irrelevant or highly correlated features, as they can reduce model performance and increase complexity.
3. **Cross-Validation**
- Use k-fold cross-validation to evaluate model performance and ensure good generalization.
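
A minimal sketch of the tuning and cross-validation practices, assuming `X_train` and `y_train` from the implementation example above (the parameter grid is illustrative only):
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Illustrative parameter grid; adjust ranges for your own problem
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
}

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)
print('Best parameters:', grid_search.best_params_)

# k-fold cross-validation of the tuned model
scores = cross_val_score(grid_search.best_estimator_, X_train, y_train,
                         cv=5, scoring='r2')
print(f'Mean CV R^2: {scores.mean():.3f}')
```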
## Common Applications
- **Predicting stock prices**
- **Housing price prediction**
- **Sales forecasting**
- **Energy consumption forecasting**
- **Medical prediction tasks** (e.g., predicting disease outcomes)
## Performance Optimisation
1. **Use More Trees**
- Increasing the number of trees generally leads to more robust and accurate models, but at the cost of increased computation.
2. **Parallelisation**
- Random forests support parallel training across trees; in scikit-learn this is enabled by setting `n_jobs=-1` for faster fitting (see the sketch after this list).
3. **Feature Selection**
- Consider removing irrelevant features to reduce model complexity and computation time.
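
A brief sketch of the parallelisation and feature-selection points, assuming `X_train` and `y_train` from the implementation example above; the 0.01 importance threshold is an arbitrary illustration:
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Parallel training across all available CPU cores
rf_parallel = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=42)
rf_parallel.fit(X_train, y_train)

# Use impurity-based feature importances to drop near-useless features
# (the 0.01 threshold is arbitrary and only for illustration)
importances = rf_parallel.feature_importances_
keep = np.where(importances > 0.01)[0]
X_train_reduced = X_train[:, keep]
print('Kept feature indices:', keep)
```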
## Evaluation Metrics
- **Mean Squared Error (MSE)**: Measures the average squared difference between the predicted and actual values.
- **Root Mean Squared Error (RMSE)**: The square root of the MSE, providing an interpretation in the same units as the target variable.
- **R² Score**: Measures how well the model explains the variance in the data. A value closer to 1 indicates better performance.
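
A short sketch computing all three metrics with scikit-learn, assuming `y_test` and `predictions` from the implementation example above:
```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)                      # same units as the target variable
r2 = r2_score(y_test, predictions)

print(f'MSE:  {mse:.2f}')
print(f'RMSE: {rmse:.2f}')
print(f'R^2:  {r2:.3f}')
```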