# Regression and Regularization
Slides: [[lecture4_linear_models.pdf|regression]], [[lecture5_overfitting.pdf|regularization]]
## 1 Ordinary Least Square (OLS)
$\begin{align*}
\begin{array}{c|c}
\text{Predictions}&\hat{Y}=X\mathbf{w}+b\\
\text{Loss (SSE)}& L=\frac{1}{2}\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2\\
\text{Closed-form sol.} & \mathbf{w}=(X^TX)^{-1}X^TY\\
\text{Gradient}& \nabla_w L=-X^T(Y-\hat{Y})
\end{array}
\end{align*}$
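A minimal NumPy sketch of the closed-form solution and gradient (the synthetic data and variable names below are made up for illustration; the bias $b$ is absorbed by appending a column of ones):
```python
import numpy as np

# Synthetic data (illustrative only): n=100 samples, d=3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Absorb the bias b as an extra all-ones column
X1 = np.hstack([X, np.ones((100, 1))])

# Closed-form solution: w = (X^T X)^{-1} X^T Y (solve is more stable than inv)
w = np.linalg.solve(X1.T @ X1, X1.T @ Y)

# Gradient of the SSE loss, ~0 at the solution
Y_hat = X1 @ w
grad = -X1.T @ (Y - Y_hat)
```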
### 1.1 Caveats
1. **Discrete $x$ features**: One-hot encoding. If the one-hot vector is high-dimensional, consolidate infrequent classes into an "others" category (see the preprocessing sketch after this list).
2. **Missing values**: Interpolation using mean/median/etc. Or build model based on other features.
3. **Normalization**:
$\begin{align*}
\begin{array}{c|c}
\text{z-score}& x=\frac{x-\mu}{\sigma}\\
\text{min-max}& x=\frac{x-x_{\min}}{x_{\max}-x_{\min}}
\end{array}
\end{align*}$
4. **Non-linear relationships**: Create an artificial feature
$\begin{align*}
\text{Weight}=b+w_1\cdot\text{height}+w_2\cdot\textbf{height}^2
\end{align*}$
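A minimal scikit-learn sketch of these four preprocessing steps (the toy DataFrame and column names are assumptions for illustration only):
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer

# Toy data (illustrative only): one categorical and one numeric column
df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                   "height": [1.60, 1.75, None, 1.82]})

# 1. Discrete features: one-hot encoding
onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(df[["color"]])

# 2. Missing values: mean imputation
height = SimpleImputer(strategy="mean").fit_transform(df[["height"]])

# 3. Normalization: z-score or min-max
height_z = StandardScaler().fit_transform(height)
height_mm = MinMaxScaler().fit_transform(height)

# 4. Non-linear relationships: add height^2 as an artificial feature
height_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(height)
```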
## 2 Logistic Regression
$\begin{align*}
\begin{array}{c|c}
\text{Prediction} & \widehat{\boldsymbol{y}}_{\boldsymbol{i}}=\sigma\left(\sum_{j=1}^d x_{i, j} \boldsymbol{w}_j+b\right)\\
\text{Loss (Cross entropy)} & l\left(\boldsymbol{y}_i, \widehat{\boldsymbol{y}}_i\right)=-\sum_{j=1}^k y_{i, j} \log \hat{y}_{i, j}\\
\text{Gradient} & \frac{\partial l\left(y_i, \widehat{y}_i\right)}{\partial w_j} = \left(\hat{y}_i-y_i\right) x_{i, j}
\end{array}
\end{align*}$
- ***Sigmoid function***: $\sigma(x)=\frac{e^x}{e^x+1}=\frac{1}{1+e^{-x}}$
- $\boldsymbol{y}_{i, j} \in\{0,1\}$, $\widehat{\boldsymbol{y}}_{i, j} \in[0,1]$
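A minimal NumPy sketch of binary logistic regression trained with the gradient above (data, learning rate, and iteration count are illustrative assumptions):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))               # n=100 samples, d=4 features
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # binary labels in {0, 1}

w, b, lr = np.zeros(4), 0.0, 0.1
for _ in range(500):
    y_hat = sigmoid(X @ w + b)              # predictions in [0, 1]
    grad_w = X.T @ (y_hat - y) / len(y)     # (ŷ_i - y_i) x_{i,j}, averaged over samples
    grad_b = np.mean(y_hat - y)
    w -= lr * grad_w
    b -= lr * grad_b
```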
## 3 Regularization
![[3_regression_regularization 2023-04-20 09.39.40.excalidraw.svg|700]]
```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Sklearn Ridge regression
ridge = linear_model.Ridge().fit(X_train, y_train)
ridge_rmse = np.sqrt(mean_squared_error(y_test, ridge.predict(X_test)))

# Sklearn LASSO regression
lasso = linear_model.Lasso().fit(X_train, y_train)
lasso_rmse = np.sqrt(mean_squared_error(y_test, lasso.predict(X_test)))
```
- Ridge regression shrinks weights toward zero but rarely makes them exactly zero
- Lasso regression drives some weights exactly to zero, which acts as feature selection (predictive performance is generally worse than ridge)
***Ridge regression***:
$\begin{align*}
\begin{array}{c|c}
\text{Loss} & L= \sum_{i=1}^n l\left(y_i, \hat{y}_i\right)+\beta *\Vert \boldsymbol{w} \Vert_2^2\\
\text{Closed-form solution} & w=\left(\boldsymbol{X}^{\boldsymbol{T}} \boldsymbol{X}+\beta \boldsymbol{I}\right)^{-1} \boldsymbol{X}^{\boldsymbol{T}} \boldsymbol{Y}\\
\text{Gradient} & \nabla_w L= 2X^T(Xw-Y)+2 \beta w
\end{array}
\end{align*}$
$\nabla_w L= 2X^T(Xw-Y)+2 \beta w=0 \Longleftrightarrow (X^TX+\beta I) w=X^TY$
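A quick sketch checking that this closed-form solution matches sklearn's `Ridge` (synthetic data; `fit_intercept=False` so both solve the same system, with sklearn's `alpha` playing the role of $\beta$):
```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta = 1.0
# Closed form: w = (X^T X + beta I)^{-1} X^T Y
w_closed = np.linalg.solve(X.T @ X + beta * np.eye(3), X.T @ Y)

# sklearn's alpha corresponds to beta here (no intercept, same objective)
w_sklearn = Ridge(alpha=beta, fit_intercept=False).fit(X, Y).coef_
assert np.allclose(w_closed, w_sklearn)
```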
***Lasso regression***:
$\begin{align*}
\begin{array}{c|c}
\text{Loss} & L= \sum_{i=1}^n l\left(y_i, \hat{y}_i\right)+\alpha *\Vert \boldsymbol{w} \Vert_1\\
\text{Closed-form solution} & \text{N/A}
\end{array}
\end{align*}$
***Elastic net***: Combines the ridge and lasso penalties, keeping lasso's feature selection while shrinking the remaining weights like ridge
$\begin{align*}
L=\sum_{i=1}^n l\left(y_i, \hat{y}_i\right)+\alpha *\Vert \boldsymbol{w} \Vert_1+\beta *\Vert \boldsymbol{w} \Vert_2^2
\end{align*}$
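A minimal sklearn sketch of elastic net, reusing `X_train`/`y_train` from the ridge/lasso block above (parameter values are illustrative; note that sklearn parameterizes the combined penalty with `alpha` and `l1_ratio` rather than separate $\alpha$ and $\beta$):
```python
from sklearn.linear_model import ElasticNet

# alpha scales the whole penalty; l1_ratio splits it between the L1 and L2 terms
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)
print(enet.coef_)  # some coefficients are driven exactly to zero (feature selection)
```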
![[Pasted image 20230420102035.png|300]]