Ridge regression is used to address multicollinearity in a linear model by introducing a [[regularization]] term to the regression equation, similar to [[lasso regression]]. Ridge regression places a constraint on the estimates of $\vec \beta$, which introduces some bias in exchange for lower variance (recall the [[bias-variance tradeoff]]). In ridge regression, the sum of squared residuals, which is minimized in [[OLS]] regression, is modified by adding a penalty term known as the ridge penalty, **shrinkage penalty**, or L2 regularization term. The ridge penalty is the sum of the squared coefficients multiplied by a tuning parameter, typically denoted lambda ($\lambda$); the larger the value of lambda, the greater the impact of the penalty. Ridge regression is therefore fit by minimizing the quantity

$$\text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

By adding the ridge penalty, ridge regression shrinks the coefficients of correlated variables towards zero while keeping them in the model, which reduces the impact of multicollinearity and stabilizes the fit. Unlike variable selection techniques such as backward elimination or forward selection, ridge regression does not eliminate variables entirely: the penalty reduces the variance of the coefficients but never sets them exactly to zero, so even less important variables retain some influence on the response. The amount of shrinkage depends on the value of lambda, with larger values producing more shrinkage.

Ridge regression is particularly useful for datasets with a large number of correlated predictors. It improves predictive accuracy by reducing the impact of multicollinearity and preventing overfitting. The optimal lambda value is typically chosen through cross-validation to strike a balance between model simplicity and performance. The resources linked at the end of this note give further insight into what ridge regression is and how it can be a powerful tool in machine learning.

The constraint on $\vec \beta$ is derived from its L2 norm and acts as a penalty term. Use the method of Lagrange multipliers to solve the ridge regression problem (a sketch appears after the code example below). Set up a range for $\lambda$ and select the best value based on the model's [[coefficient of determination|R-squared]] value. If the $\vec \beta$ are very small for all models, it may indicate that the range for $\lambda$ is too high.

Use the `glmnet` package to perform ridge regression in [[R]]. The `mixture` argument specifies the mix of regularization types: set `mixture=0` for ridge regression, `mixture=1` for [[lasso regression]], and a value in between to use both L2 and L1 regularization. The code below requires `tidymodels`.

```R
# Ridge specification: mixture = 0 selects pure L2 regularization;
# penalty = 0 means no shrinkage is applied yet
ridge_spec <- linear_reg(mixture = 0, penalty = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

# Fit the model and inspect the coefficients
ridge_fit <- fit(ridge_spec, y ~ x, data = data)

tidy(ridge_fit)
```

Below, after a short sketch of the constrained formulation, is a recipe for using basic grid search to tune the `penalty` hyperparameter with the `Hitters` dataset from `ISLR`.
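As a minimal sketch of the Lagrange-multiplier view (assuming centered and scaled predictors so the intercept can be left out; the design matrix $X$, response $\vec y$, identity matrix $I$, and constraint budget $t$ are notation introduced only for this derivation), the constrained problem is

$$\min_{\vec \beta} \; \lVert \vec y - X \vec \beta \rVert_2^2 \quad \text{subject to} \quad \lVert \vec \beta \rVert_2^2 \le t$$

Attaching the constraint with a multiplier gives the penalized form above, with $\lambda$ acting as the tuning parameter:

$$\mathcal{L}(\vec \beta, \lambda) = \lVert \vec y - X \vec \beta \rVert_2^2 + \lambda \left( \lVert \vec \beta \rVert_2^2 - t \right)$$

Setting the gradient with respect to $\vec \beta$ to zero yields the closed-form ridge estimate:

$$-2 X^\top \left( \vec y - X \vec \beta \right) + 2 \lambda \vec \beta = 0 \;\Longrightarrow\; \hat{\vec \beta}_{\text{ridge}} = \left( X^\top X + \lambda I \right)^{-1} X^\top \vec y$$

Adding $\lambda I$ to $X^\top X$ keeps the matrix invertible even when predictors are highly correlated, which is why the ridge solution is more stable than the [[OLS]] one.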
```R
library(ISLR)
library(tidymodels)

# Load data
Hitters <- as_tibble(Hitters)
Hitters <- na.omit(Hitters)

# Split data
Hitters_split <- initial_split(Hitters, strata = "Salary")
Hitters_train <- training(Hitters_split)
Hitters_test <- testing(Hitters_split)
Hitters_fold <- vfold_cv(Hitters_train, v = 10)

# Specify recipe
ridge_recipe <- recipe(formula = Salary ~ ., data = Hitters_train) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors())

# Model specification
ridge_spec <- linear_reg(mixture = 0, penalty = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

# Set workflow
ridge_workflow <- workflow() %>%
  add_recipe(ridge_recipe) %>%
  add_model(ridge_spec)

# Create hyperparameter grid
# (set a reasonable range; note the range is on the log10 scale)
penalty_grid <- grid_regular(penalty(range = c(-5, 5)), levels = 50)

# Fit all models
tune_res <- tune_grid(
  ridge_workflow,
  resamples = Hitters_fold,
  grid = penalty_grid
)

# Plot performance over the grid space
autoplot(tune_res)

# Get best penalty using R-squared
best_penalty <- select_best(tune_res, metric = "rsq")

# Finalize the workflow with the best penalty
ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final_fit <- fit(ridge_final, data = Hitters_train)

# Validate performance on testing data
augment(ridge_final_fit, new_data = Hitters_test) %>%
  rsq(truth = Salary, estimate = .pred)
```

> [!Tip]- Additional Resources
> - [Data Aspirant - Mastering Ridge Regression: Comprehensive Guide and Practical Applications](https://dataaspirant.com/ridge-regression/)
> - [Pluralsight - Linear, Lasso and Ridge Regression with R](https://www.pluralsight.com/resources/blog/guides/linear-lasso-and-ridge-regression-with-r)
> - [datacamp - Regularization in R Tutorial: Ridge, Lasso and Elastic Net](https://www.datacamp.com/tutorial/tutorial-ridge-lasso-elastic-net)
> - [Balancing Performance and Interpretability: Selecting Features with Bootstrapped Ridge Regression](https://pmc.ncbi.nlm.nih.gov/articles/PMC6371276/)