# Naive Bayes and Decision Trees
[[lecture11_nb_and_dt.pdf]]
## 1 Naive Bayes
$\begin{align*}
&\boxed{\text { Posterior }=\frac{\text { Prior } \times \text { Likelihood }}{\text { Evidence }}}\\\\
P\left(y_i, \boldsymbol{x}_{\boldsymbol{i}}\right) &= P\left(x_{i, 1}, x_{i, 2}, \ldots, x_{i, d} \mid y_i\right) P\left(y_i\right) & \quad\text{chain rule of conditional prob.}\\
&=P\left(x_{i, 1} \mid y_i\right) P\left(x_{i, 2} \mid y_i\right) \ldots P\left(x_{i, d} \mid y_i\right) P\left(y_i\right) &\quad\text{conditional independence}\\
& =\boxed{P\left(y_i\right) \prod_{j=1}^d P\left(x_{i, j} \mid y_i\right)}
\end{align*}$
***Naïve Bayes Assumption***: All features $X_1, X_2, \ldots, X_d$ are *conditionally* *mutually independent* (conditional on the label $Y$).
$\begin{align*}
P\left(X_1, X_2, \ldots X_d \mid Y\right)=P\left(X_1 \mid Y\right)\times \cdots \times P\left(X_d \mid Y\right)
\end{align*}$
*Space complexity*. $\boxed{|Y|+|Y| \sum_{j=1}^d\left|X_j\right| \geq 4d+2}$. The parameters are $P\left(y_i\right)$ and $P\left(x_{i, j} \mid y_i\right)$; the lower bound assumes at least 2 labels and at least 2 possible values per feature.
- This number is much smaller than the case [[#1.2 Motivation: simple classification problem|without Naïve Bayes Assumption]]!
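For a concrete sense of scale (a worked comparison under assumed numbers: $d=30$ binary features and a binary label):
$\begin{align*}
\underset{\text{naive Bayes}}{\underbrace{|Y|+|Y|\sum_{j=1}^{d}|X_j|}} = 2+2\times 2\times 30=122
\qquad\text{vs.}\qquad
\underset{\text{full joint (section 1.2)}}{\underbrace{|Y|\prod_{j=1}^{d}|X_j|}} = 2^{31}\approx 2.1\times 10^{9}
\end{align*}$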
### 1.1 Conditional Independence
Note: Independent variables may not be conditionally independent
*Example*.
![[Pasted image 20230518135408.png|200]]
1. Verify $X_1, X_2$ are (~~conditionally~~) independent.
$\begin{align*}
P(X_1=0,X_2=0)&= 1/4 =P(X_1=0)P(X_2=0)\\
\vdots\\
P(X_1=1,X_2=1)&= 1/4 =P(X_1=1)P(X_2=1)
\end{align*}$
2. Verify $X_1,X_2$ are *NOT conditionally* independent.
$\begin{align*}
P(X_1=1 \mid Y=0)&=P(X_2=1 \mid Y=0)=\frac{1/4}{3/4}=\frac{1}{3} \\
P(X_1=1, X_2=1 \mid Y=0)&=0 \neq P(X_1=1\mid Y=0)\,P(X_2=1\mid Y=0)=\frac{1}{9}
\end{align*}$
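A quick numerical check of both claims. The joint table from the figure is not reproduced here, so the sketch below assumes four equally likely outcomes $(X_1,X_2,Y)\in\{(0,0,0),(0,1,0),(1,0,0),(1,1,1)\}$, which is consistent with the probabilities computed above:
```python
from itertools import product

# Assumed joint distribution (reconstructed to match the numbers above;
# the actual table is in the figure): four equally likely outcomes.
joint = {(0, 0, 0): 0.25, (0, 1, 0): 0.25, (1, 0, 0): 0.25, (1, 1, 1): 0.25}

def p(**fixed):
    """Marginal/joint probability of the named variables (x1, x2, y)."""
    total = 0.0
    for (x1, x2, y), pr in joint.items():
        vals = {"x1": x1, "x2": x2, "y": y}
        if all(vals[k] == v for k, v in fixed.items()):
            total += pr
    return total

# Marginal independence: P(X1=a, X2=b) == P(X1=a) * P(X2=b) for all a, b
for a, b in product([0, 1], repeat=2):
    assert abs(p(x1=a, x2=b) - p(x1=a) * p(x2=b)) < 1e-12

# Conditional (on Y=0) independence FAILS:
# P(X1=1, X2=1 | Y=0) = 0, but P(X1=1 | Y=0) * P(X2=1 | Y=0) = 1/9
lhs = p(x1=1, x2=1, y=0) / p(y=0)
rhs = (p(x1=1, y=0) / p(y=0)) * (p(x2=1, y=0) / p(y=0))
print(lhs, rhs)  # 0.0 vs 0.111...
```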
### 1.2 Motivation: simple classification problem
Suppose we have $n$ training instances $(x_i,y_i)$ with $x_i\in \mathbb{R}^{d}$ and $y_i\in\{0,1\}$.
*Goal*. Estimate $P\left(Y=y_i \mid X=x_i\right)$, i.e. the probability of label $y_i$ given the observation $x_i$.
$\begin{align*}
P(y_i|x_i)=\frac{P(y_i,x_i)}{P(x_i)} & \quad\text{conditional probability def.}
\end{align*}$
- We can estimate $P(y_i,x_i)$ and $P(x_i)$ simply by counting (given enough training data)
$\begin{align*}
P(y_i|x_i)=\frac{P(y_i,x_i)}{P(x_i)} &= \frac{\text{count}(y_i,x_i)}{\text{count}(x_i)}
\end{align*}$
- $P(x_i)$ can be aggregated from $P(y_i, x_i)$
$\begin{align*}
P(x_i)=\sum\limits_{j=1}^k P(y_j, x_i) \quad\text{sum over } k \text{ labels}
\end{align*}$
- Model parameters: $\Theta=\{P(y, \boldsymbol{x})\}$ for every possible combination of label $y$ and feature vector $\boldsymbol{x}$. The number of parameters (assuming at least 2 labels and at least 2 possible values in each feature dimension) is:
$\begin{align*}
\underset{\text{\# distinct labels}}{\underbrace{ |Y| }} \times \prod_{i=1}^d\underset{\text{\# distinct values of } X_i}{\underbrace{ \left|X_i\right| }} \geq \boxed{2^{d+1}} \quad\text{exponential space complexity!}
\end{align*}$
We need *naive Bayes* to simplify the parameter space.
### 1.3 Example: email spam detection
The naive Bayes model is:
$\begin{align*}
P\left(y_i, \boldsymbol{x}_{\boldsymbol{i}}\right)=P\left(y_i\right) \prod_{j=1}^d P\left(x_{i, j} \mid y_i\right)
\end{align*}$
- $X_j$: whether word $j$ appears in the email or not
- $Y$: whether the email is spam or not
*Fit the naive Bayes model by counting*. (still needs a sufficient amount of training data)
- $P\left(y_i=\text{spam}\right)=\frac{\#\text{ spam}}{\#\text{ emails}}$: in general, how likely is an email to be spam?
- $P\left(x_{i, j} \mid y_i\right)$: in spam/normal emails, how likely is word $j$ to appear?
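A minimal sketch of this counting-based fit, using a hypothetical toy corpus and binary "word present" features (the word lists, labels, and the small clipping constant are illustrative assumptions; the $\epsilon$-clipping anticipates the smoothing note in section 1.4):
```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: (set of words in the email, label)
emails = [
    ({"win", "money", "now"}, "spam"),
    ({"meeting", "tomorrow"}, "ham"),
    ({"win", "prize"}, "spam"),
    ({"project", "meeting", "notes"}, "ham"),
]
vocab = sorted(set().union(*(words for words, _ in emails)))

# P(y): fraction of emails carrying each label
label_counts = Counter(label for _, label in emails)
prior = {y: c / len(emails) for y, c in label_counts.items()}

# P(word j present | y), clipped away from 0 and 1 with a small epsilon
# (the "smoothing" idea mentioned in section 1.4)
eps = 1e-3
word_given_label = defaultdict(dict)
for y in label_counts:
    docs = [words for words, label in emails if label == y]
    for w in vocab:
        frac = sum(w in words for words in docs) / len(docs)
        word_given_label[y][w] = min(max(frac, eps), 1 - eps)

def joint_score(words, y):
    """P(y) * prod_j P(x_j | y) with binary "word present" features."""
    p = prior[y]
    for w in vocab:
        p_w = word_given_label[y][w]
        p *= p_w if w in words else (1.0 - p_w)
    return p

test = {"win", "money"}
print(max(prior, key=lambda y: joint_score(test, y)))  # expected: spam
```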
### 1.4 Naive Bayes with numerical (continuous) features
- *Method 1*: Binning. Put continuous values into discrete bins
- *Pro*: Easy to implement. *Con*: Based on a strong assumption, which may not hold in real datasets.
- If some $P(x_{i,j}|y_i)=0$: 1) *bin* them into an "others" category, or 2) *smoothing*: replace 0 with a small $\epsilon$.
- *Method 2*: Assume the continuous values associated with each class are distributed according to *some known distribution*, e.g. Gaussian Naïve Bayes; Multinomial and Bernoulli Naïve Bayes make the analogous distributional assumptions for count and binary features (see the sketch below)
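A minimal sketch of the Gaussian case, assuming each feature gets a per-class mean and variance fitted from the data; the toy arrays and function names below are illustrative assumptions:
```python
import math

# Hypothetical toy data: each row is a feature vector, y holds class labels
X = [[1.0, 2.1], [0.9, 1.9], [3.2, 4.0], [3.0, 4.2]]
y = [0, 0, 1, 1]

def fit_gaussian_nb(X, y):
    """Estimate P(y) and per-class, per-feature mean/variance."""
    params = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        means = [sum(col) / len(col) for col in zip(*rows)]
        # small variance floor so the density stays finite
        variances = [max(sum((v - m) ** 2 for v in col) / len(col), 1e-9)
                     for col, m in zip(zip(*rows), means)]
        params[c] = (prior, means, variances)
    return params

def log_gaussian(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(params, x):
    """argmax_y  log P(y) + sum_j log N(x_j; mu_{y,j}, sigma^2_{y,j})"""
    def log_joint(c):
        prior, means, variances = params[c]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(params, key=log_joint)

params = fit_gaussian_nb(X, y)
print(predict(params, [1.1, 2.0]))  # expected: 0
```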
## 2 Decision Tree
![[Pasted image 20230518150404.png|500]]
Every *node* has a condition to check; different outcomes of the check lead to different branches.
*Algorithm*. Input: a node with all its assigned training data.
- Decide the best *splitting rule* (i.e., a variable & a splitting value)
- (may only enumerate some combinations for efficiency)
- Create child nodes & assign training data according to the *splitting rule*
- If all the training data assigned to the node are in the same class, then stop
- Otherwise, recurse on each child node. (This outline is followed by popular tree-building algorithms such as CART, C4.5, and ID3; a minimal sketch follows the splitting criteria below.)
*Decide best variable to split*.
- Maximize information gain (i.e. difference in entropy)
- Gini Index Difference
- Mean Square Error (MSE) Difference
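A minimal sketch of this greedy recursion in Python, assuming categorical features and equality tests, with Gini impurity (section 2.2) as the splitting criterion; the data layout, helper names, and toy example are illustrative assumptions:
```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_y P(y)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Pick the (feature index, value) equality test with the lowest
    weighted child impurity; brute force over all observed values."""
    best, best_score = None, float("inf")
    for j in range(len(X[0])):
        for v in set(row[j] for row in X):
            left = [label for row, label in zip(X, y) if row[j] == v]
            right = [label for row, label in zip(X, y) if row[j] != v]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, v), score
    return best

def build_tree(X, y):
    """Grow the tree until each leaf is pure (or no useful split exists)."""
    split = None if len(set(y)) == 1 else best_split(X, y)
    if split is None:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    j, v = split
    left_idx = [i for i, row in enumerate(X) if row[j] == v]
    right_idx = [i for i in range(len(X)) if i not in left_idx]
    return {
        "test": (j, v),  # "feature j == v ?"
        "yes": build_tree([X[i] for i in left_idx], [y[i] for i in left_idx]),
        "no": build_tree([X[i] for i in right_idx], [y[i] for i in right_idx]),
    }

# Tiny usage example with two categorical features
X = [["G", "tall"], ["G", "short"], ["R", "tall"], ["R", "short"]]
y = ["Y", "Y", "N", "N"]
print(build_tree(X, y))  # splits on feature 0 (shirt colour), giving two pure leaves
```
Swapping `gini` for an entropy function turns the score into the information-gain criterion of section 2.1.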
### 2.1 Entropy-based decision tree
$\begin{align*}
S=-\sum_y P(Y=y) \log P(Y=y)
\end{align*}$
Given the distribution $P(Y=y)$ of a categorical variable with $k$ possible values:
- One-hot distribution $\rightarrow S=0$
- Uniform distribution $\rightarrow S=\log k$
Start with $n$ training data $\left(\boldsymbol{x}_{\boldsymbol{i}}, y_i\right)$ at the "parent node". Use all $y_i$ to estimate $P(Y=y) \rightarrow$ entropy $S$.
Suppose the selected feature is $X_j$, and we split on $\left(x_{i, j}<1.5\right)$ (the threshold 1.5 is just an arbitrary example).
- Create two branches and distribute training data accordingly
- Compute $S_1$ and $S_2$ from the two branches
- Compute the weighted entropy as $S^{\prime}=\frac{n_1 S_1+n_2 S_2}{n_1+n_2}$ (where $n_1+n_2=n$)
- Check the difference: $S-S^{\prime}$ (a.k.a., info gain). The bigger, the better
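A minimal sketch of this entropy / information-gain computation, assuming labels arrive as plain Python lists (the example labels are made up):
```python
import math
from collections import Counter

def entropy(labels):
    """S = -sum_y P(Y=y) log P(Y=y), estimated from label counts (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_label_groups):
    """S - S', where S' is the size-weighted average entropy of the children."""
    n = len(parent_labels)
    weighted = sum(len(g) * entropy(g) for g in child_label_groups) / n
    return entropy(parent_labels) - weighted

# Splitting on (x_j < 1.5) just partitions the labels into two groups:
parent = ["Y", "Y", "N", "N", "N"]
left, right = ["Y", "Y"], ["N", "N", "N"]   # e.g. x_j < 1.5 vs x_j >= 1.5
print(info_gain(parent, [left, right]))     # maximal gain: both children are pure
```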
*Example*. Decision tree using entropy difference as criteria.
![[Pasted image 20230518144327.png|300]]
$\begin{align*}
P(Y=\text{Yes})=\frac{4}{9}\quad P(Y=\text{No})=\frac{5}{9}
\end{align*}$
To compute the information gain from using `color of shirt` as the split rule, first compute the per-branch entropies and the weighted entropy $S'$:
$\begin{align*}
\begin{array}{c|ccc}
\text{Color of shirt}&\text{G}&\text{B}&\text{R}\\\hline
\text{Label instances}&\text{YYY}&\text{YN}&\text{NNNN}\\\hline
\text{Branch entropy }S_i &0&\log 2&0\end{array}\quad\quad S' = \frac{3\times 0+2\times \log 2+4\times 0}{9}=\frac{2\log 2}{9}
\end{align*}$
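Plugging in the numbers (taking logs base 2, so $\log 2 = 1$):
$\begin{align*}
S &= -\tfrac{4}{9}\log_2\tfrac{4}{9}-\tfrac{5}{9}\log_2\tfrac{5}{9}\approx 0.991,
\qquad
S' = \tfrac{2}{9}\approx 0.222\\
S-S' &\approx 0.769 \quad\text{(information gain of splitting on color of shirt)}
\end{align*}$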
The final decision tree is:
$\begin{align*}
\begin{array}{c|c|c|c}
\text{Color of shirt}&\text{G}&\text{R}&\text{B}\\\hline
\text{Label decision} &\text{Y}&\text{N} &P(Y)=P(N)=\frac{1}{2}
\end{array}
\end{align*}$
### 2.2 Gini-index-based decision tree
$\begin{align*}
\text{Gini}=1-\sum_y P(Y=y)^2
\end{align*}$
- One-hot distribution $\rightarrow$ gini $=0$
- Uniform distribution $\rightarrow$ gini $=1-\frac{1}{k}$
- Similar to entropy, we can calculate the difference (Gini gain) between parent and children, as worked out below
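For comparison, applying the Gini criterion to the same color-of-shirt example from section 2.1 (a worked check using the numbers from the table above):
$\begin{align*}
\text{Gini}_{\text{parent}} &= 1-\left(\tfrac{4}{9}\right)^2-\left(\tfrac{5}{9}\right)^2=\tfrac{40}{81}\approx 0.494\\
\text{Gini}' &= \frac{3\times 0+2\times\tfrac{1}{2}+4\times 0}{9}=\tfrac{1}{9}\approx 0.111
\qquad
\text{Gini}_{\text{parent}}-\text{Gini}' \approx 0.383
\end{align*}$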
### 2.3 Overfitting
Training accuracy is usually 100% if we grow the tree until every leaf is perfectly classified (assuming the feature vectors of the training instances are pairwise distinct).
- *Overfitting*: An induced tree may overfit the training data
- Too many branches, some may reflect anomalies due to noise or outliers
- Poor accuracy for unseen samples