# Regression and Regularization
Slides: [[lecture4_linear_models.pdf|regression]], [[lecture5_overfitting.pdf|regularization]]
## 1 Ordinary Least Square (OLS)
$\begin{align*}
\begin{array}{c|c}
\text{Predictions}&\hat{Y}=X\mathbf{w}+b\\
\text{Loss (SSE)}& L=\frac{1}{2}\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2\\
\text{Closed-form sol.} & \mathbf{w}=(X^TX)^{-1}X^TY\\
\text{Gradient}& \nabla_w L=-X^T(Y-\hat{Y})
\end{array}
\end{align*}$
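A minimal NumPy sketch of the closed-form solution and gradient (the synthetic data and variable names below are made up for illustration; the bias $b$ is absorbed by appending a column of ones):
```python
import numpy as np

# Synthetic data (illustrative only): n=100 samples, d=3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Absorb the bias b as an extra all-ones column
X1 = np.hstack([X, np.ones((100, 1))])

# Closed-form solution: w = (X^T X)^{-1} X^T Y (solve is more stable than inv)
w = np.linalg.solve(X1.T @ X1, X1.T @ Y)

# Gradient of the SSE loss, ~0 at the solution
Y_hat = X1 @ w
grad = -X1.T @ (Y - Y_hat)
```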
### 1.1 Caveats
1. **Discrete $x$ features**: One-hot encoding. If the one-hot vector is high-dimensional, consolidate infrequent classes into an "others" category (see the preprocessing sketch after this list).
2. **Missing values**: Interpolation using mean/median/etc. Or build model based on other features.
3. **Normalization**:
$\begin{align*}
\begin{array}{c|c}
\text{z-score}& x=\frac{x-\mu}{\sigma}\\
\text{min-max}& x=\frac{x-x_{\min}}{x_{\max}-x_{\min}}
\end{array}
\end{align*}$
4. **Non-linear relationships**: Create an artificial feature
$\begin{align*}
\text{Weight}=b+w_1\cdot\text{height}+w_2\cdot\textbf{height}^2
\end{align*}$
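A minimal scikit-learn sketch of these four preprocessing steps (the toy DataFrame and column names are assumptions for illustration only):
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer

# Toy data (illustrative only): one categorical and one numeric column
df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                   "height": [1.60, 1.75, None, 1.82]})

# 1. Discrete features: one-hot encoding
onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["color"]])

# 2. Missing values: mean imputation
height = SimpleImputer(strategy="mean").fit_transform(df[["height"]])

# 3. Normalization: z-score or min-max
height_z = StandardScaler().fit_transform(height)
height_mm = MinMaxScaler().fit_transform(height)

# 4. Non-linear relationships: add height^2 as an artificial feature
height_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(height)
```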
## 2 Logistic Regression
$\begin{align*}
\begin{array}{c|c}
\text{Prediction} & \widehat{\boldsymbol{y}}_{\boldsymbol{i}}=\sigma\left(\sum_{j=1}^d x_{i, j} \boldsymbol{w}_j+b\right)\\
\text{Loss (Cross entropy)} & l\left(\boldsymbol{y}_i, \widehat{\boldsymbol{y}}_i\right)=-\sum_{j=1}^k y_{i, j} \log \hat{y}_{i, j}\\
\text{Gradient} & \frac{\partial l\left(y_i, \widehat{y}_i\right)}{\partial w_j} = \left(\hat{y}_i-y_i\right) x_{i, j}
\end{array}
\end{align*}$
- ***Sigmoid function***: $\sigma(x)=\frac{e^x}{e^x+1}=\frac{1}{1+e^{-x}}$
- $\boldsymbol{y}_{i, j} \in\{0,1\}$, $\widehat{\boldsymbol{y}}_{i, j} \in[0,1]$
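A minimal NumPy sketch of binary logistic regression trained with the gradient above (data, learning rate, and iteration count are illustrative assumptions):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))               # n=100 samples, d=4 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # binary labels in {0, 1}

w, b, lr = np.zeros(4), 0.0, 0.1
for _ in range(500):
    y_hat = sigmoid(X @ w + b)              # predictions in [0, 1]
    grad_w = X.T @ (y_hat - y) / len(y)     # (ŷ_i - y_i) x_{i,j}, averaged over samples
    grad_b = np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b
```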
## 3 Regularization
![[3_regression_regularization 2023-04-20 09.39.40.excalidraw.svg|700]]
```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Sklearn Ridge regression
ridge = linear_model.Ridge().fit(X_train, y_train)
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge.predict(X_test)))

# Sklearn LASSO regression
lasso = linear_model.Lasso().fit(X_train, y_train)
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso.predict(X_test)))
```
- Ridge regression shrinks weights toward zero but rarely makes them exactly zero
- Lasso regression drives some weights exactly to zero, which acts as feature selection (predictive performance is generally worse than ridge)
***Ridge regression***:
$\begin{align*}
\begin{array}{c|c}
\text{Loss} & L= \sum_{i=1}^n l\left(y_i, \hat{y}_i\right)+\beta *\Vert \boldsymbol{w} \Vert_2^2\\
\text{Closed-form solution} & w=\left(\boldsymbol{X}^{\boldsymbol{T}} \boldsymbol{X}+\beta \boldsymbol{I}\right)^{-1} \boldsymbol{X}^{\boldsymbol{T}} \boldsymbol{Y}\\
\text{Gradient} & \nabla_w L= 2X^T(Xw-Y)+2 \beta w
\end{array}
\end{align*}$
$\nabla_w L= 2X^T(Xw-Y)+2 \beta w=0 \Longleftrightarrow (X^TX+\beta I) w=X^TY$
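A quick sketch checking that this closed-form solution matches sklearn's `Ridge` (synthetic data; `fit_intercept=False` so both solve the same system, with sklearn's `alpha` playing the role of $\beta$):
```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta = 1.0
# Closed form: w = (X^T X + beta I)^{-1} X^T Y
w_closed = np.linalg.solve(X.T @ X + beta * np.eye(3), X.T @ Y)

# sklearn's alpha corresponds to beta here (no intercept, same objective)
w_sklearn = Ridge(alpha=beta, fit_intercept=False).fit(X, Y).coef_
assert np.allclose(w_closed, w_sklearn)
```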
***Lasso regression***:
$\begin{align*}
\begin{array}{c|c}
\text{Loss} & L= \sum_{i=1}^n l\left(y_i, \hat{y}_i\right)+\alpha *\Vert \boldsymbol{w} \Vert_1\\
\text{Closed-form solution} & \text{N/A}
\end{array}
\end{align*}$
***Elastic net***: Combines the ridge and lasso penalties, keeping lasso's feature selection while shrinking the remaining weights like ridge
$\begin{align*}
L=\sum_{i=1}^n l\left(y_i, \hat{y}_i\right)+\alpha *\Vert \boldsymbol{w} \Vert_1+\beta *\Vert \boldsymbol{w} \Vert_2^2
\end{align*}$
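A minimal sklearn sketch of elastic net, reusing `X_train`/`y_train` from the ridge/lasso block above (parameter values are illustrative; note that sklearn parameterizes the combined penalty with `alpha` and `l1_ratio` rather than separate $\alpha$ and $\beta$):
```python
from sklearn.linear_model import ElasticNet

# alpha scales the whole penalty; l1_ratio splits it between the L1 and L2 terms
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)
print(enet.coef_)  # some coefficients are driven exactly to zero (feature selection)
```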
![[Pasted image 20230420102035.png|300]]