# Naive Bayes and Decision Trees [[lecture11_nb_and_dt.pdf]]

## 1 Naive Bayes

$\begin{align*} &\boxed{\text { Posterior }=\frac{\text { Prior } \times \text { Likelihood }}{\text { Evidence }}}\\\\ P\left(y_i, \boldsymbol{x}_{\boldsymbol{i}}\right) &= P\left(x_{i, 1}, x_{i, 2}, \ldots, x_{i, d} \mid y_i\right) P\left(y_i\right) & \quad\text{chain rule of conditional prob.}\\ &=P\left(x_{i, 1} \mid y_i\right) P\left(x_{i, 2} \mid y_i\right) \ldots P\left(x_{i, d} \mid y_i\right) P\left(y_i\right) &\quad\text{conditional independence}\\ & =\boxed{P\left(y_i\right) \prod_{j=1}^d P\left(x_{i, j} \mid y_i\right)} \end{align*}$

***Naïve Bayes Assumption***: All features $X_1, X_2, \ldots, X_d$ are *conditionally mutually independent* given the label $Y$.

$\begin{align*} P\left(X_1, X_2, \ldots X_d \mid Y\right)=P\left(X_1 \mid Y\right)\times \cdots \times P\left(X_d \mid Y\right) \end{align*}$

*Space complexity*. $\boxed{|Y|+|Y| \sum_{j=1}^d\left|X_j\right| \geq 4d+2}$. The parameters are $P\left(x_{i, j} \mid y_i\right)$ and $P\left(y_i\right)$.
- This number is much smaller than in the case [[#1.2 Motivation: simple classification problem|without the Naïve Bayes Assumption]]!

### 1.1 Conditional Independence

Note: independent variables may not be conditionally independent.

*Example*. ![[Pasted image 20230518135408.png|200]]
1. Verify that $X_1, X_2$ are (unconditionally) independent.
$\begin{align*} P(X_1=0,X_2=0)&= 1/4 =P(X_1=0)P(X_2=0)\\ \vdots\\ P(X_1=1,X_2=1)&= 1/4 =P(X_1=1)P(X_2=1) \end{align*}$
2. Verify that $X_1,X_2$ are *NOT conditionally* independent given $Y$.
$\begin{align*} P(X_1=1 \mid Y=0)&=P(X_2=1 \mid Y=0)=\frac{1/4}{3/4}=1/3 \\ P(X_1=1, X_2=1 \mid Y=0)&=0\neq P(X_1=1\mid Y=0)\,P(X_2=1\mid Y=0)=\frac{1}{9} \end{align*}$

### 1.2 Motivation: simple classification problem

Suppose we have $n$ training instances $(\boldsymbol{x}_i,y_i)$ with $\boldsymbol{x}_i\in \mathbb{R}^{d}$ and $y_i\in\{0,1\}$.

*Goal*. Estimate $P\left(Y=y_i \mid X=\boldsymbol{x}_i\right)$, i.e. the probability of label $y_i$ given the observation $\boldsymbol{x}_i$.
$\begin{align*} P(y_i\mid \boldsymbol{x}_i)=\frac{P(y_i,\boldsymbol{x}_i)}{P(\boldsymbol{x}_i)} & \quad\text{definition of conditional probability} \end{align*}$
- We can estimate $P(y_i,\boldsymbol{x}_i)$ and $P(\boldsymbol{x}_i)$ simply by counting (given enough training data):
$\begin{align*} P(y_i\mid \boldsymbol{x}_i)=\frac{P(y_i,\boldsymbol{x}_i)}{P(\boldsymbol{x}_i)} &= \frac{\text{count}(y_i,\boldsymbol{x}_i)}{\text{count}(\boldsymbol{x}_i)} \end{align*}$
- $P(\boldsymbol{x}_i)$ can be aggregated from the joint probabilities $P(y_j, \boldsymbol{x}_i)$:
$\begin{align*} P(\boldsymbol{x}_i)=\sum\limits_{j=1}^k P(y_j, \boldsymbol{x}_i) \quad\text{sum over } k \text{ labels} \end{align*}$
- Model parameters: $\Theta=\{P(y,\boldsymbol{x}) \text{ for every possible pair } (y,\boldsymbol{x})\}$. The number of parameters (assuming at least 2 labels and at least 2 possible values in each feature dimension) is
$\begin{align*} \underset{\text{\# distinct labels}}{\underbrace{ |Y| }} \times \prod_{j=1}^d\underset{\text{\# distinct values of feature } j}{\underbrace{ \left|X_j\right| }} \geq \boxed{2^{d+1}} \quad\text{exponential space complexity!} \end{align*}$

We need the *naive Bayes assumption* to simplify the parameter space.

### 1.3 Example: email spam detection

The naive Bayes model is:
$\begin{align*} P\left(y_i, \boldsymbol{x}_{\boldsymbol{i}}\right)=P\left(y_i\right) \prod_{j=1}^d P\left(x_{i, j} \mid y_i\right) \end{align*}$
- $X_j$: whether word $j$ appears in the email or not
- $Y$: whether the email is spam or not

*Fit the naive Bayes model by counting* (this still needs a sufficient amount of training data; a code sketch follows the list below).
- $P\left(y_i=\text{spam}\right)=\frac{\#\text{ spam}}{\#\text{ emails}}$: in general, how likely is an email to be spam?
- $P\left(x_{i, j} \mid y_i\right)$: in spam/normal emails, how likely is word $j$ to appear?
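
As a minimal sketch of this counting procedure: the tiny email set, the vocabulary, and the constant `eps` below are made-up illustrations, not from the lecture, and clamping the estimates away from both 0 and 1 is a small extension of the $\epsilon$-smoothing idea mentioned in Section 1.4.

```python
from collections import defaultdict

# Toy training set: each email is (set of words it contains, label).
emails = [
    ({"free", "winner", "click"}, "spam"),
    ({"free", "click", "offer"}, "spam"),
    ({"meeting", "project", "click"}, "normal"),
    ({"project", "deadline"}, "normal"),
]
vocab = sorted({w for words, _ in emails for w in words})
eps = 1e-6  # small constant used to avoid zero probabilities

# P(y): in general, how likely an email is spam / normal.
label_count = defaultdict(int)
for _, y in emails:
    label_count[y] += 1
prior = {y: c / len(emails) for y, c in label_count.items()}

# P(x_j = 1 | y): in spam / normal emails, how likely word j appears.
word_count = {y: defaultdict(int) for y in prior}
for words, y in emails:
    for w in words:
        word_count[y][w] += 1
likelihood = {
    # clamp into [eps, 1 - eps] so neither presence nor absence gets probability 0
    y: {w: min(max(word_count[y][w] / label_count[y], eps), 1 - eps) for w in vocab}
    for y in prior
}

def joint(words, y):
    """Unnormalized P(y, x) = P(y) * prod_j P(x_j | y) for a new email."""
    p = prior[y]
    for w in vocab:
        p_w = likelihood[y][w]
        p *= p_w if w in words else 1.0 - p_w
    return p

new_email = {"free", "offer", "deadline"}
scores = {y: joint(new_email, y) for y in prior}
evidence = sum(scores.values())                 # P(x) = sum_y P(y, x)
posterior = {y: s / evidence for y, s in scores.items()}
print(posterior)
```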
### 1.4 Naive Bayes with numerical (continuous) features

- *Method 1*: Binning. Put continuous values into discrete bins.
    - *Pro*: easy to implement. *Con*: based on a strong assumption, which may not hold in real datasets.
    - If some $P(x_{i,j}\mid y_i)=0$: (1) *bin* those values into an "others" bin, or (2) apply *smoothing*, which replaces 0 with a small $\epsilon$.
- *Method 2*: Assume the continuous values associated with each class are distributed according to *some known distribution*, e.g. Gaussian Naïve Bayes, Multinomial Naïve Bayes, Bernoulli Naïve Bayes.

## 2 Decision Tree

![[Pasted image 20230518150404.png|500]]

Every *node* has a condition to check. Each possible outcome of the check becomes a different branch.

*Algorithm*. Input: a node with all of its assigned training data.
- Decide the best *splitting rule* (i.e., a variable and a splitting value)
    - (may enumerate only some combinations for efficiency)
- Create child nodes and assign the training data according to the *splitting rule*
- If all the training data assigned to a node are in the same class, stop
- Recurse! (This outline is followed by popular tree-building algorithms such as CART, C4.5, and ID3.)

*Deciding the best variable to split on*:
- maximize information gain (i.e., difference in entropy)
- Gini index difference
- mean squared error (MSE) difference

### 2.1 Entropy-based decision tree

$\begin{align*} S=-\sum_y P(Y=y) \log P(Y=y) \end{align*}$

Given the distribution $P(Y=y)$ of a categorical variable with $k$ values:
- one-hot distribution $\rightarrow S=0$
- uniform distribution $\rightarrow S=\log k$

Start with the $n$ training data $\left(\boldsymbol{x}_{\boldsymbol{i}}, y_i\right)$ at the "parent node". Use all $y_i$ to estimate $P(Y=y) \rightarrow$ entropy $S$.

Suppose the selected feature is $X_j$ and we split on $\left(x_{i, j}<1.5\right)$ (the value 1.5 is just an arbitrary example):
- Create two branches and distribute the training data accordingly
- Compute $S_1$ and $S_2$ from the two branches
- Compute the weighted entropy $S^{\prime}=\frac{n_1 S_1+n_2 S_2}{n_1+n_2}$ (with $n_1+n_2=n$)
- Check the difference $S-S^{\prime}$ (a.k.a. information gain). The bigger, the better.

*Example*. Decision tree using entropy difference as the splitting criterion.

![[Pasted image 20230518144327.png|300]]

$\begin{align*} P(Y=\text{Yes})=\frac{4}{9}\quad P(Y=\text{No})=\frac{5}{9} \end{align*}$

The weighted entropy from using `color of shirt` as the split rule is:

$\begin{align*} \begin{array}{c|ccc} \text{Color of shirt}&\text{G}&\text{B}&\text{R}\\\hline \text{Label instances}&\text{YYY}&\text{YN}&\text{NNNN}\\\hline \text{Branch entropy }S_i &0&\log 2&0\end{array}\quad\quad S' = \frac{3\times 0+2\times \log 2+4\times 0}{9}=\frac{2\log 2}{9} \end{align*}$

The final decision tree is:

$\begin{align*} \begin{array}{c|c|c|c} \text{Color of shirt}&\text{G}&\text{R}&\text{B}\\\hline \text{Label decision} &\text{Y}&\text{N} &P(Y)=P(N)=\frac{1}{2} \end{array} \end{align*}$

### 2.2 Gini-index-based decision tree

$\begin{align*} \text{Gini}=1-\sum_y P(Y=y)^2 \end{align*}$

- one-hot distribution $\rightarrow$ Gini $=0$
- uniform distribution $\rightarrow$ Gini $=1-\frac{1}{k}$
- Similar to entropy, we score a split by the difference between the parent's Gini index and the weighted Gini index of the branches (see the sketch below).
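
As a minimal sketch of these split-scoring computations, using the shirt-color example above (the function names are my own; only the formulas come from the notes):

```python
from collections import Counter
from math import log

def entropy(labels):
    """S = -sum_y P(Y=y) log P(Y=y), using the natural log."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini = 1 - sum_y P(Y=y)^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_score(branches, impurity):
    """S' = (sum_i n_i * S_i) / n over the child branches of a split."""
    n = sum(len(b) for b in branches)
    return sum(len(b) * impurity(b) for b in branches) / n

parent = list("YYYYNNNNN")                            # 4 Yes, 5 No at the parent node
branches = [list("YYY"), list("YN"), list("NNNN")]    # split by color: G, B, R

info_gain = entropy(parent) - weighted_score(branches, entropy)
gini_gain = gini(parent) - weighted_score(branches, gini)
print(f"information gain = {info_gain:.3f}, Gini gain = {gini_gain:.3f}")
```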
### 2.3 Overfitting

Training accuracy is usually 100% if we grow the tree until the training data are perfectly classified (assuming the feature vectors of the training data are pairwise distinct, so no two identical inputs carry different labels).

- *Overfitting*: an induced tree may overfit the training data
    - Too many branches, some of which may reflect anomalies due to noise or outliers
    - Poor accuracy on unseen samples
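
For concreteness, a hedged sketch of limiting tree growth to reduce overfitting, assuming scikit-learn's `DecisionTreeClassifier` is available; the synthetic dataset and the hyperparameter values are arbitrary choices, not from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, split into train and test sets.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Fully grown tree: training accuracy reaches ~100%, but it may memorize noise.
full = DecisionTreeClassifier(criterion="entropy",
                              random_state=0).fit(X_train, y_train)

# Restricted tree: fewer branches, which usually generalizes better to unseen samples.
pruned = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                                min_samples_leaf=5,
                                random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("depth<=4", pruned)]:
    print(name, "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```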