# Ensemble Learning
[[lecture12_ensemble.pdf]]
## 1 Bias-Variance to Measure Overfitting
![[Pasted image 20230518155056.png|700]]
$\begin{align*}
\mathbb{E}_D\left[\left(h_D\left(x^*\right)-y^*\right)^2\right]= \boxed{\underset{\text{variance}}{\underbrace{ \mathbb{E}_D\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right)^2\right] }}+\underset{\text{bias}^2}{\underbrace{ \left(\bar{h}\left(x^*\right)-y^*\right)^2 }}}
\end{align*}$
- *Variance* (smaller better): How much your solution changes if you train on a different training set.
- How "*over-specialized*" is your solution to a particular training set (overfitting).
- *Bias* (smaller better): The *inherent* error of your solution that remains even when averaging over many training sets (or with infinite training data).
### 1.1 Example: OLS
![[7_emsemble 2023-05-18 15.29.16.excalidraw.svg|500]]
$\begin{align*}
\text{Ground truth: } y&= x+2\sin(1.5 x)\\
\text{Observations: } D&= \left\{\left(x^{(1)}, y^{(1)}\right),\left(x^{(2)}, y^{(2)}\right), \ldots\left(x^{(n)}, y^{(n)}\right)\right\}
\end{align*}$
*Solution*. By ordinary least squares (OLS):
$\begin{align*}
h&= \operatorname{argmin}_h \sum_{i=1}^n\left(h\left(x^{(i)}\right)-y^{(i)}\right)^2\\
\Rightarrow h(x)&= -0.16+1.2x
\end{align*}$
*Evaluation using 1 datapoint*. Pick an unseen datapoint $x^*$ and compute $L=\left(h\left(x^*\right)-y^*\right)^2=3.86$ (we want to minimize $L$).
- This evaluation is not faithful enough: a different training set $\rightarrow$ a different $h(x)$ $\rightarrow$ a different $L$.
- We want a genuine evaluation of $h(x)$ $\rightarrow$ evaluate $L$ *in expectation* (see the code sketches below).
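A minimal numpy sketch of this example: fit OLS on one sampled training set and evaluate the squared loss at a single unseen $x^*$. The noise level, sample size, and choice of $x^*$ are assumptions, so the fitted coefficients and $L$ will differ from the numbers above.
```python
import numpy as np

rng = np.random.default_rng(0)

def ground_truth(x):
    return x + 2 * np.sin(1.5 * x)

# Assumed setup: n points on [0, 5] with unit-variance Gaussian observation noise.
n = 20
x_train = rng.uniform(0, 5, size=n)
y_train = ground_truth(x_train) + rng.normal(0, 1, size=n)

# OLS with a linear hypothesis h(x) = w1*x + w0 (np.polyfit minimizes squared error).
w1, w0 = np.polyfit(x_train, y_train, deg=1)
h = lambda x: w1 * x + w0

# Evaluate the squared loss at one unseen point x* (assumed here to be 4.0).
x_star, y_star = 4.0, ground_truth(4.0)
L = (h(x_star) - y_star) ** 2
print(f"h(x) = {w1:.2f}x + {w0:.2f},  L = {L:.2f}")
```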
*Evaluation in expectation*. $\mathbb{E}_D\left[\left(h_D\left(x^*\right)-y^*\right)^2\right]$: average the error over different training sets.
$\begin{align*}
\begin{array}{cc}
\text{Expected test error:} & \mathbb{E}_D\left[\left(h_D\left(x^*\right)-y^*\right)^2\right]\\
\text{Avg. predictor:} &\bar{h}(x)=\mathbb{E}_D\left[h_D(x)\right]
\end{array}
\end{align*}$
We can decompose the expected test error $\mathbb{ E}_D$:
$\begin{align*}
\mathbb{E}_D&\left[\left(h_D\left(x^*\right)-y^*\right)^2\right] =\mathbb{E}_D\left[\left(\underbrace{h_D\left(x^*\right)-\bar{h}\left(x^*\right)}+\underbrace{\bar{h}\left(x^*\right)-y^*}\right)^2\right] \\
& =\mathbb{E}_D\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right)^2\right]+\left(\bar{h}\left(x^*\right)-y^*\right)^2+2 \cancelto{0}{\mathbb{E}_D\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right) \cdot\left(\bar{h}\left(x^*\right)-y^*\right)\right]}\\
&= \boxed{\underset{\text{variance}}{\underbrace{ \mathbb{E}_D\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right)^2\right] }}+\underset{\text{bias}^2}{\underbrace{ \left(\bar{h}\left(x^*\right)-y^*\right)^2 }}}
\end{align*}$
Note that the canceled expression vanishes because $\bar{h}\left(x^*\right)-y^*$ does not depend on $D$:
$\begin{align*}
\mathbb{E}_D&\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right) \cdot\left(\bar{h}\left(x^*\right)-y^*\right)\right] \\
& =\left(\mathbb{E}_D\left[h_D\left(x^*\right)\right]-\bar{h}\left(x^*\right)\right)\left(\bar{h}\left(x^*\right)-y^*\right)\\
&= \left(\bar{h}\left(x^*\right)-\bar{h}\left(x^*\right)\right)\left(\bar{h}\left(x^*\right)-y^*\right)=0
\end{align*}$
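A hedged numerical check of this decomposition, reusing the assumed setup from the sketch above: draw many training sets $D$, fit $h_D$ on each, and compare the Monte Carlo estimate of the expected test error at $x^*$ against variance $+$ bias$^2$.
```python
import numpy as np

rng = np.random.default_rng(0)
ground_truth = lambda x: x + 2 * np.sin(1.5 * x)

def sample_D(n=20):
    """Draw one training set D (same assumed setup as in the sketch above)."""
    x = rng.uniform(0, 5, size=n)
    y = ground_truth(x) + rng.normal(0, 1, size=n)
    return x, y

x_star, y_star = 4.0, ground_truth(4.0)

# Train h_D on many independent training sets and record h_D(x*).
preds = []
for _ in range(5000):
    x, y = sample_D()
    w1, w0 = np.polyfit(x, y, deg=1)
    preds.append(w1 * x_star + w0)
preds = np.array(preds)

expected_error = np.mean((preds - y_star) ** 2)   # E_D[(h_D(x*) - y*)^2]
h_bar = preds.mean()                              # average predictor h_bar(x*)
variance = np.mean((preds - h_bar) ** 2)
bias_sq = (h_bar - y_star) ** 2

# The two quantities should agree up to Monte Carlo error.
print(expected_error, variance + bias_sq)
```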
## 2 Bagging to Reduce Variance
![[7_emsemble 2023-05-18 15.56.06.excalidraw.svg|600]]
### 2.1 Example: (extremely) random forest
- Features can be randomly sub-sampled for each model, and again at every split node.
- Splitting rules/thresholds can be chosen at random instead of exhaustively optimized (a generic bagging sketch follows below).
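A minimal from-scratch sketch of the bagging idea on the regression example from Section 1.1 (not a full random forest): bootstrap-resample the training set, fit one high-variance base learner per resample, and average their predictions. The base learner (a degree-6 polynomial) and all hyperparameters are assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
ground_truth = lambda x: x + 2 * np.sin(1.5 * x)

# One assumed training set; degree-6 polynomials serve as a high-variance base learner.
n = 40
x_train = rng.uniform(0, 5, size=n)
y_train = ground_truth(x_train) + rng.normal(0, 1, size=n)

def fit_base_learner(x, y, deg=6):
    coeffs = np.polyfit(x, y, deg)
    return lambda x_new: np.polyval(coeffs, x_new)

# Bagging: fit B base learners on bootstrap resamples, then average their predictions.
B = 50
models = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)   # sample n indices with replacement
    models.append(fit_base_learner(x_train[idx], y_train[idx]))

bagged = lambda x: np.mean([m(x) for m in models], axis=0)

# Compare squared error against the noiseless ground truth on a small test grid.
x_test = np.linspace(0.5, 4.5, 9)
single = fit_base_learner(x_train, y_train)
print("single :", np.mean((single(x_test) - ground_truth(x_test)) ** 2))
print("bagged :", np.mean((bagged(x_test) - ground_truth(x_test)) ** 2))  # typically lower
```
The averaging step is what reduces variance: each bootstrap model overfits its own resample differently, and those fluctuations partially cancel in the average.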
## 3 Boosting to Reduce Bias
![[7_ensemble 2023-05-18 16.04.26.excalidraw.svg|600]]
### 3.1 Example: AdaBoost
[Paper](https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf).
*Algorithm*. Given a training set $\left(\boldsymbol{x}_{\boldsymbol{1}}, y_1\right),\left(\boldsymbol{x}_{\mathbf{2}}, y_2\right), \ldots,\left(\boldsymbol{x}_{\boldsymbol{n}}, y_n\right)$, where $y_i \in\{+1,-1\}$ is the ground-truth label of instance $\boldsymbol{x}_{\boldsymbol{i}}$.
For $t=1, \ldots, T$:
- ==Construct a distribution $D_t$ on $\{1 \ldots n\}$==
- Train a weak learner $h_t$, which maps $\boldsymbol{x}_{\boldsymbol{i}}$ to $\{+1,-1\}$
- Its error rate is $\epsilon_t$ on $D_t$: $\epsilon_t=P_{D_t}\left(h_t\left(\boldsymbol{x}_{\boldsymbol{i}}\right) \neq y_i\right) \leftarrow$ weighted error
==Aggregate all $\boldsymbol{h}_t$ models to the final model $\boldsymbol{H}_{\text {final }}$==
![[Pasted image 20230518161410.png|300]]
- $T$ is usually 100 to 1000
*Construct distribution* $D_t$ on $\{1,\cdots,n\}$.
![[Pasted image 20230518161131.png|500]]
Note: $\uparrow \epsilon_t \;\Leftrightarrow\; \downarrow \alpha_t$
- Weak learner assumption: $\epsilon_t<0.5 \rightarrow \alpha_t$ is always positive
- Better "weak learner" (smaller error) $\rightarrow$ Larger weight
- In fact, this choice of $\alpha_t$ minimizes an upper bound on the training error of the aggregated model built from the first $t$ weak learners.
$
D_{t+1}(i) \propto D_t(i) \times e^{-\alpha_t\, y_i\, h_t\left(\boldsymbol{x}_{\boldsymbol{i}}\right)}
$
- Examples that we predicted *correctly* are *demoted* (weight decreased); the *others* are *promoted* (weight increased). A from-scratch sketch of the full loop follows below.
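A from-scratch sketch of this loop on a toy 1-D dataset, using decision stumps as the weak learners. The dataset and the stump search are assumptions; the weighted error $\epsilon_t$, the weight $\alpha_t=\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, and the distribution update follow the steps above.
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D dataset (an assumption): label +1 inside [2, 6], -1 outside,
# so no single threshold classifies it, but a few stumps combined can.
X = rng.uniform(0, 8, size=200)
y = np.where((X > 2) & (X < 6), 1, -1)

def fit_stump(X, y, D):
    """Weak learner: the threshold classifier with the smallest D-weighted error."""
    best = None
    for thr in np.unique(X):
        for sign in (1, -1):
            pred = np.where(X > thr, sign, -sign)
            err = np.sum(D[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: np.where(x > thr, sign, -sign)

T = 20
n = len(X)
D = np.full(n, 1.0 / n)              # D_1: uniform distribution over {1..n}
learners, alphas = [], []

for t in range(T):
    h_t = fit_stump(X, y, D)
    pred = h_t(X)
    eps = np.sum(D[pred != y])                         # weighted error on D_t
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # smaller error -> larger weight
    D = D * np.exp(-alpha * y * pred)                  # demote correct, promote incorrect
    D /= D.sum()                                       # renormalize to a distribution
    learners.append(h_t)
    alphas.append(alpha)

# H_final(x) = sign( sum_t alpha_t * h_t(x) )
H = lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, learners)))
print("training accuracy:", np.mean(H(X) == y))
```
Each round, the reweighted distribution forces the new stump to focus on the examples the current ensemble still gets wrong, which is what drives the bias down.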
## 4 Boosting vs. Bagging
Both methods can improve performance compared with a single model.
Note: the weak learners in an ensemble can be different types of models.
*Bagging focuses on reducing variance*
- A safer choice: more stable
*Boosting focuses on reducing bias*
- Higher potential, but it may lead to high variance if the training data contains noise.