# Ensemble Learning [[lecture12_ensemble.pdf]]

## 1 Bias-Variance to Measure Overfitting

![[Pasted image 20230518155056.png|700]]

$\begin{align*} \mathbb{E}_D\left[\left(h_D\left(x^*\right)-y^*\right)^2\right]= \boxed{\underset{\text{variance}}{\underbrace{ \mathbb{E}_D\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right)^2\right] }}+\underset{\text{bias}^2}{\underbrace{ \left(\bar{h}\left(x^*\right)-y^*\right)^2 }}} \end{align*}$

- *Variance* (smaller is better): how much your solution changes if you train on a different training set.
    - How "*over-specialized*" your solution is to a particular training set (overfitting).
- *Bias* (smaller is better): the *inherent* error of your solution that remains even with multiple training sets or infinite training data.

### 1.1 Example: OLS

![[7_emsemble 2023-05-18 15.29.16.excalidraw.svg|500]]

$\begin{align*} \text{Ground truth: } y&= x+2\sin(1.5 x)\\ \text{Observations: } D&= \left\{\left(x^{(1)}, y^{(1)}\right),\left(x^{(2)}, y^{(2)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)\right\} \end{align*}$

*Solution*. By ordinary least squares (OLS):

$\begin{align*} h&= \operatorname{argmin}_h \sum_{i=1}^n\left(h\left(x^{(i)}\right)-y^{(i)}\right)^2\\ \Rightarrow h(x)&= -0.16+1.2x \end{align*}$

*Evaluation using 1 datapoint*. Pick an unseen datapoint $x^*$. Then $L=\left(h\left(x^*\right)-y^*\right)^2=3.86$ (we want to minimize $L$).

- This evaluation is not faithful enough: a different training set -> a different $h(x)$ -> a different $L$.
- We want a genuine evaluation of $h(x)$ -> evaluate $L$ *in expectation*.

*Evaluation in expectation*. $\mathbb{E}_D\left[\left(h_D\left(x^*\right)-y^*\right)^2\right]$: average the error over different training sets.

$\begin{align*} \begin{array}{cc} \text{Expected test error:} & \mathbb{E}_D\left[\left(h_D\left(x^*\right)-y^*\right)^2\right]\\ \text{Avg. predictor:} &\bar{h}(x)=\mathbb{E}_D\left[h_D(x)\right] \end{array} \end{align*}$

We can decompose the expected test error $\mathbb{E}_D$:

$\begin{align*} \mathbb{E}_D&\left[\left(h_D\left(x^*\right)-y^*\right)^2\right] =\mathbb{E}_D\left[\left(\underbrace{h_D\left(x^*\right)-\bar{h}\left(x^*\right)}+\underbrace{\bar{h}\left(x^*\right)-y^*}\right)^2\right] \\ & =\mathbb{E}_D\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right)^2\right]+\left(\bar{h}\left(x^*\right)-y^*\right)^2+2 \cancelto{0}{\mathbb{E}_D\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right) \cdot\left(\bar{h}\left(x^*\right)-y^*\right)\right]}\\ &= \boxed{\underset{\text{variance}}{\underbrace{ \mathbb{E}_D\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right)^2\right] }}+\underset{\text{bias}^2}{\underbrace{ \left(\bar{h}\left(x^*\right)-y^*\right)^2 }}} \end{align*}$

The canceled cross term is zero because $\bar{h}\left(x^*\right)-y^*$ is a constant with respect to $D$:

$\begin{align*} \mathbb{E}_D&\left[\left(h_D\left(x^*\right)-\bar{h}\left(x^*\right)\right) \cdot\left(\bar{h}\left(x^*\right)-y^*\right)\right] \\ & =\left(\mathbb{E}_D\left[h_D\left(x^*\right)\right]-\bar{h}\left(x^*\right)\right)\left(\bar{h}\left(x^*\right)-y^*\right)\\ &= \left(\bar{h}\left(x^*\right)-\bar{h}\left(x^*\right)\right)\left(\bar{h}\left(x^*\right)-y^*\right)=0 \end{align*}$

## 2 Bagging to Reduce Variance

![[7_emsemble 2023-05-18 15.56.06.excalidraw.svg|600]]

### 2.1 Example: (extremely) random forest

- Features could be randomly selected for each model, and at every node.
- Splitting rules could be randomly evaluated.
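To make Sections 1 and 2 concrete, here is a minimal sketch (my own illustration, not code from the lecture) that estimates the variance and $\text{bias}^2$ terms by Monte Carlo on the Section 1.1 toy problem, once for a single model and once for a bagged ensemble. The unstable high-degree polynomial base learner, the noise level, and the number of bootstrap models are assumptions made for illustration; with an unstable learner like this, the bagged version typically shows a noticeably smaller variance term, which is the effect bagging targets.

```python
# A minimal sketch: Monte Carlo estimate of the variance and bias^2 terms,
# for a single model vs. a bagged ensemble, on y = x + 2*sin(1.5x).
# Polynomial degree, noise level, and model counts are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x + 2 * np.sin(1.5 * x)           # ground-truth function

def sample_training_set(n=25):
    x = rng.uniform(0.0, 5.0, n)
    y = f(x) + rng.normal(0.0, 1.0, n)          # noisy observations
    return x, y

def fit_poly(x, y, degree=6):
    w = np.polyfit(x, y, degree)                # intentionally unstable learner
    return lambda x_new: np.polyval(w, x_new)

def fit_bagged(x, y, n_models=30):
    # Bagging: fit one model per bootstrap resample, average their predictions.
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), len(x))   # sample with replacement
        models.append(fit_poly(x[idx], y[idx]))
    return lambda x_new: np.mean([m(x_new) for m in models], axis=0)

x_star, y_star = 4.0, f(4.0)                    # unseen test point (noise-free)

for name, fit in [("single", fit_poly), ("bagged", fit_bagged)]:
    # h_D(x*) for many independently drawn training sets D
    preds = np.array([fit(*sample_training_set())(x_star) for _ in range(300)])
    h_bar = preds.mean()                        # average predictor h_bar(x*)
    variance = np.mean((preds - h_bar) ** 2)    # E_D[(h_D(x*) - h_bar(x*))^2]
    bias_sq = (h_bar - y_star) ** 2             # (h_bar(x*) - y*)^2
    print(f"{name:6s}  variance={variance:.3f}  bias^2={bias_sq:.3f}")
```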
## 3 Boosting to Reduce Bias

![[7_ensemble 2023-05-18 16.04.26.excalidraw.svg|600]]

### 3.1 Example: AdaBoost

[Paper](https://cseweb.ucsd.edu/~yfreund/papers/IntroToBoosting.pdf).

*Algorithm*. Given a training set $\left(\boldsymbol{x}_{\boldsymbol{1}}, y_1\right),\left(\boldsymbol{x}_{\mathbf{2}}, y_2\right), \ldots,\left(\boldsymbol{x}_{\boldsymbol{n}}, y_n\right)$, where $y_i \in\{+1,-1\}$ is the ground-truth label of instance $\boldsymbol{x}_{\boldsymbol{i}}$.

For $t=1, \ldots, T$:
- ==Construct a distribution $D_t$ on $\{1, \ldots, n\}$==
- Train a weak learner $h_t$, which maps $\boldsymbol{x}_{\boldsymbol{i}}$ to $\{+1,-1\}$
    - Its error rate on $D_t$ is the weighted error $\epsilon_t=P_{D_t}\left(h_t\left(\boldsymbol{x}_{\boldsymbol{i}}\right) \neq y_i\right)$

==Aggregate all $\boldsymbol{h}_t$ models into the final model $\boldsymbol{H}_{\text{final}}$==

![[Pasted image 20230518161410.png|300]]

- $T$ is usually 100 to 1000

*Construct distribution* $D_t$ on $\{1,\ldots,n\}$.

![[Pasted image 20230518161131.png|500]]

Note: a larger $\epsilon_t$ gives a smaller $\alpha_t$.

- Weak-learner assumption: $\epsilon_t<0.5$, so $\alpha_t$ is always positive
- A better "weak learner" (smaller error) $\rightarrow$ a larger weight $\alpha_t$
- In fact, this choice of $\alpha_t$ minimizes the error rate of the aggregated model built from the first $t$ weak learners

$D_{t+1}(i) \propto D_t(i) \cdot e^{-\alpha_t \, y_i \, h_t\left(\boldsymbol{x}_{\boldsymbol{i}}\right)}$

- Examples that we predicted *correctly* are *demoted*; the *others are promoted* (a minimal code sketch of the full loop appears at the end of these notes)

## 4 Boosting vs. Bagging

Both methods can improve performance compared with single models.

Note: the weak learners in an ensemble could be different types of models.

*Bagging focuses on reducing variance*
- A safer choice: more stable

*Boosting focuses on reducing bias*
- Higher potential, but may lead to high variance if the training data contains noise
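To make the AdaBoost recipe in Section 3.1 concrete, here is a minimal sketch (my own illustration, not code from the lecture or the linked paper). It assumes decision stumps as the weak learners and uses the standard weight $\alpha_t=\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$; all function and variable names are illustrative.

```python
# A minimal AdaBoost sketch with decision stumps as weak learners.
# Uses the standard weight alpha_t = 0.5 * ln((1 - eps_t) / eps_t).
import numpy as np

def train_stump(X, y, D):
    """Return the axis-aligned threshold rule with the smallest weighted error under D."""
    best_err, best_stump = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(D * (pred != y))
                if err < best_err:
                    best_err, best_stump = err, (j, thr, sign)
    return best_stump

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, T=100):
    """y must take values in {+1, -1}. Returns a list of (alpha_t, h_t) pairs."""
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1: uniform over {1, ..., n}
    ensemble = []
    for _ in range(T):
        h_t = train_stump(X, y, D)               # weak learner trained under D_t
        pred = stump_predict(h_t, X)
        eps = np.clip(np.sum(D * (pred != y)), 1e-12, 1 - 1e-12)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)    # smaller error -> larger weight
        D = D * np.exp(-alpha * y * pred)        # demote correct, promote misclassified
        D = D / D.sum()                          # renormalize to a distribution
        ensemble.append((alpha, h_t))
    return ensemble

def predict(ensemble, X):
    # H_final(x) = sign( sum_t alpha_t * h_t(x) )
    return np.sign(sum(alpha * stump_predict(h, X) for alpha, h in ensemble))
```

For example, `predict(adaboost(X_train, y_train), X_test)` returns $\pm 1$ predictions; each stump is found by an exhaustive threshold search, which keeps individual weak learners cheap enough to run for $T$ in the hundreds.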