# Maximum Likelihood Estimation (MLE)
Slides: [[lec9_fitting_distribution.pdf|fitting distribution]]
## 1 Distributional Modeling
Fit a probability distribution to the dataset.
- Simple and compact.
- Captures the big picture while smoothing out the wrinkles in the data.
- In subsequent application, use distribution as a proxy for the data.
*Choosing a distribution*: There exist a few distributions of great universality which occur in a surprisingly large number of problems. The three principal distributions, with ramifications throughout probability theory, are the: 1) *binomial distribution*, 2) *normal distribution*, 3) *Poisson distribution* (William Feller)
- For *higher dimension*, can use combinations of 1-d models: *products* and *mixtures*.
### 1.1 Normal Distribution
![[Pasted image 20230503155008.png|200]]
The ***normal*** (Gaussian) $N\left(\mu, \sigma^2\right)$ has mean $\mu$, variance $\sigma^2$, and density function
$
p(x)=\frac{1}{\left(2 \pi \sigma^2\right)^{1 / 2}} \exp \left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)
$
- $68.3 \%$ of the distribution lies within one standard deviation of the mean, $\mu \pm \sigma$
- $95.5 \%$ lies within $\mu \pm 2 \sigma$
- $99.7 \%$ lies within $\mu \pm 3 \sigma$
***Central Limit Theorem***: Let $X_1, X_2, \ldots$ be independent with $\mathbb{E} X_i=\mu_i, \operatorname{var}\left(X_i\right)=v_i$. Then
$
\frac{\left(X_1+\cdots+X_n\right)-\left(\mu_1+\cdots \mu_n\right)}{\sqrt{v_1+\cdots+v_n}} \stackrel{d}{\longrightarrow} N(0,1)
$
**Fitting Gaussian to Data:** Given: Data points $x_1, \ldots, x_n$ to which we want to fit a distribution. What Gaussian distribution $N\left(\mu, \sigma^2\right)$ should we choose?
- *Empirical mean*: $\mu=\frac{1}{n}\sum\limits_{i=1}^n x_i$
- *Empirical variance*: $\sigma^2=\frac{1}{n}\sum\limits_{i=1}^n(x_i-\mu)^2$
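As a quick numerical check (a minimal Python/NumPy sketch on synthetic data, not part of the slides), fitting a Gaussian amounts to computing exactly these two quantities:
```python
import numpy as np

# Minimal sketch: fit N(mu, sigma^2) to data via the empirical mean and variance.
# Synthetic data from a known Gaussian, so the fit can be sanity-checked.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)   # true mu = 5, sigma^2 = 4

mu_hat = x.mean()                        # empirical mean
var_hat = ((x - mu_hat) ** 2).mean()     # empirical variance (divide by n, not n-1)

print(f"mu_hat  = {mu_hat:.3f}   (true 5.0)")
print(f"var_hat = {var_hat:.3f}   (true 4.0)")
```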
### 1.2 Poisson Distribution
![[Pasted image 20230503160605.png|200]]
***Poisson distribution***: a distribution over the non-negative integers $\{0,1,2, \ldots\}$.
$\text{Poisson}(\lambda)$, with $\lambda>0$ :
$\begin{align*}
\operatorname{Pr}(X=k)&= e^{-\lambda} \frac{\lambda^k}{k !}\\
\end{align*}$
- *Mean*: $\mathbb{E} X=\lambda$
- *Variance*: $\mathbb{E}(X-\lambda)^2=\lambda$

*Proof (normalization and the mean).*
$\begin{align*}
\sum\limits_{k=0}^\infty e^{-\lambda}\frac{\lambda^k}{k!}&= e^{-\lambda}\left(1+\lambda+\frac{\lambda^2}{2!}+\frac{\lambda^3}{3!}+\cdots\right)=e^{-\lambda}e^{\lambda}=1\\
\mathbb{E}[X]&= \sum\limits_{k=0}^\infty k\,\text{Pr}(X=k)=\sum\limits_{k=1}^\infty ke^{-\lambda}\frac{\lambda^k}{k!}=e^{-\lambda}\sum\limits_{k=1}^\infty\frac{\lambda^k}{(k-1)!}\\
&= e^{-\lambda}\lambda\sum\limits_{k=1}^\infty\frac{\lambda^{k-1}}{(k-1)!}=e^{-\lambda}\lambda\sum\limits_{k=0}^\infty\frac{\lambda^k}{k!}=\lambda
\end{align*}$
**Definition/Origin**: Count the number of events (collisions, phone calls, etc) that occur in a certain interval of time. Call this number $X$, and say it has expected value $\lambda$.
- Now suppose we divide the interval into small pieces of equal length
- If the probability of an event occurring in a small interval is: 1) *independent* of what happens in other small intervals, and 2) the same across small intervals
- Then: $X\sim \text{Poisson}(\lambda)$
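This construction is easy to simulate. The sketch below (Python/NumPy, with arbitrary choices of $\lambda$ and the number of slices $m$) flips a coin of bias $\lambda/m$ in each of $m$ small intervals and checks that the resulting count behaves like $\text{Poisson}(\lambda)$:
```python
import numpy as np

# Minimal sketch: split an interval into m tiny pieces, each containing an event
# independently with probability lam/m; the total count X is then Binomial(m, lam/m),
# which should look Poisson(lam) when m is large.
rng = np.random.default_rng(0)
lam, m, trials = 3.0, 10_000, 100_000

counts = rng.binomial(n=m, p=lam / m, size=trials)

print("empirical mean     :", counts.mean())     # ~ lam
print("empirical variance :", counts.var())      # ~ lam (mean = variance)
print("Pr(X = 0) empirical:", (counts == 0).mean(),
      "  vs  e^{-lam} =", np.exp(-lam))
```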
*Example*: Rutherford's experiments with radioactive disintegration
![[Pasted image 20230503162141.png|400]]
*Example*: Flying bomb hitting London in WWII
![[Pasted image 20230503170459.png|400]]
### 1.3 Binomial distribution
$\operatorname{Binomial}(n, p)$: the number of heads from $n$ independent tosses of a coin with bias (heads probability) $p$.
$\begin{aligned}
\operatorname{Pr}(X=k) =\left(\begin{array}{l}n \\k\end{array}\right) p^k(1-p)^{n-k}\quad\Bigg|\quad \mathbb{E} X =n p
\quad\Bigg|\quad \operatorname{var}(X) =n p(1-p)
\end{aligned}$
*Proof*. For the $i$-th coin flip let $X_i= \begin{cases}1 & \text { if heads } \\ 0 & \text { if tails }\end{cases}$. Then $X=\sum\limits_{i=1}^nX_i$.
$\begin{align*}
\mathbb{E}[X_i]=p &\quad\rightarrow\quad \mathbb{E}[X]=n\mathbb{E}[X_i]=np\\
\\
\text{var}(X_i)&= \mathbb{E}[X_i^2]-(\mathbb{E}[X_i])^2=p-p^2 =p(1-p) \\
&\quad\rightarrow\quad\text{var}(X)=n\,\text{var}(X_i)=np(1-p)\quad\text{(by independence)}
\end{align*}$
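A one-off simulation check of these two formulas (NumPy sketch, with arbitrary $n$ and $p$):
```python
import numpy as np

# Minimal sketch: check E[X] = np and var(X) = np(1-p) by simulation.
rng = np.random.default_rng(0)
n, p = 30, 0.3
x = rng.binomial(n, p, size=200_000)

print("mean:", x.mean(), "  vs  np      =", n * p)
print("var :", x.var(),  "  vs  np(1-p) =", n * p * (1 - p))
```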
*Example*: Survey on food tastes.
- You choose 1000 people at random and ask them whether they like sushi. 600 say yes.
- What is a good estimate for the fraction of people who like sushi? Clearly, $60 \%$.
### 1.4 Central limit theorem
*Claim*. The binomial is well approximated by a Gaussian when $n$ is large and $p$ is not too close to $0$ or $1$.
![[Pasted image 20230506154423.png|400]]
When a coin of bias $p$ is tossed $n$ times, let $S_n$ be the number of heads.
- We know $S_n$ has *mean* $n p$ and *variance* $n p(1-p)$.
- By ***central limit theorem***: As $n$ grows, the distribution of $S_n$ looks increasingly like a Gaussian with this mean and variance, i.e.,
$
\frac{S_n-n p}{\sqrt{n p(1-p)}} \stackrel{d}{\longrightarrow} N(0,1)
$
- Note: $\overset{d}{\longrightarrow}$ means "converges in distribution to".
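A quick check of the claim (NumPy sketch with arbitrary $n$ and $p$): standardize $S_n$ and compare against the 68-95-99.7 rule from Section 1.1.
```python
import numpy as np

# Minimal sketch: standardize S_n and compare its spread with N(0,1)'s 68-95-99.7 rule.
rng = np.random.default_rng(0)
n, p, trials = 1_000, 0.4, 200_000

s = rng.binomial(n, p, size=trials)
z = (s - n * p) / np.sqrt(n * p * (1 - p))     # standardized number of heads

for k, gauss in [(1, 0.683), (2, 0.955), (3, 0.997)]:
    print(f"Pr(|Z| <= {k}) = {(np.abs(z) <= k).mean():.3f}   (Gaussian: {gauss})")
```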
### 1.5 Poisson limit theorem
*Claim*. Binomial is well approximated by Poisson when $p$ is close to $0$
Toss coins with bias $p_1, \ldots, p_n$ and let $S_n$ be the number of heads.
***Le Cam's inequality***. Let $\lambda=p_1+\cdots+p_n$:
$
\sum_{k=0}^{\infty}\left|\operatorname{Pr}\left(S_n=k\right)-e^{-\lambda} \frac{\lambda^k}{k !}\right| \leq \sum_{i=1}^n p_i^2
$
- An upper bound on the $L_1$ distance between the distribution of $S_n$ and $\text{Poisson}(\lambda)$
***Poisson limit theorem***: If all $p_i=\lambda / n$, then, as $n \rightarrow \infty$,
$
S_n \stackrel{d}{\longrightarrow} \text { Poisson }(\lambda)
$
- Also called "the law of rare events" or "the law of small numbers"
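Both statements are easy to verify numerically. The sketch below (plain Python, with arbitrary $\lambda$ and $n$) computes the exact $L_1$ distance between $\text{Binomial}(n, \lambda/n)$ and $\text{Poisson}(\lambda)$ and compares it with Le Cam's bound $\sum_i p_i^2 = \lambda^2/n$:
```python
from math import comb, exp, factorial

# Minimal sketch: L1 distance between Binomial(n, lam/n) and Poisson(lam),
# checked against Le Cam's bound sum_i p_i^2 = n * (lam/n)^2 = lam^2 / n.
lam, n = 2.0, 100
p = lam / n

binom = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
poiss = [exp(-lam) * lam**k / factorial(k) for k in range(n + 1)]

# For k > n the binomial puts zero mass, so those terms are just the leftover Poisson tail.
l1 = sum(abs(b - q) for b, q in zip(binom, poiss)) + (1 - sum(poiss))

print(f"L1 distance  : {l1:.5f}")
print(f"Le Cam bound : {lam**2 / n:.5f}")
```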
### 1.6 Multinomial Distribution
$\begin{align*}
\operatorname{Pr}\left(n_1, \ldots, n_k\right)=\left(\begin{array}{c}n \\ n_1, n_2, \ldots, n_k\end{array}\right) p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} \quad\Bigg\vert\quad \mathbb{E} X=\left(n p_1, n p_2, \ldots, n p_k\right)
\end{align*}$
Imagine a $k$-faced die, with probabilities $p_1, \ldots, p_k$. Toss such a die $n$ times, and count the number of times each of the $k$ faces occurs.
- The distribution of $X=\left(X_1, \ldots, X_k\right)$ is called the ***multinomial***. $X_j=\#$ of times face $j$ occurs
- Parameters: $p_1, \ldots, p_k \geq 0$, with $\sum\limits_{i=1}^kp_i=1$. Note that $\sum\limits_{i=1}^k n_i=n$
- $\left(\begin{array}{c}n \\n_1, n_2, \ldots, n_k\end{array}\right)=\frac{n !}{n_{1} ! n_{2} ! \cdots n_{k} !}$ is the $\#$ of ways to place balls numbered $\{1, \ldots, n\}$ into bins numbered $\{1, \ldots, k\}$ so that bin $j$ receives exactly $n_j$ balls.
*Example*. Bag-of-words representation of text document.
- Fix $V=$ some vocabulary, so the multinomial has $k=|V|$ outcomes, one per word.
- Treat the words of a document (of length $n$) as independent draws from a multinomial over $V$:
$
p=\left(p_1, \ldots, p_{|V|}\right), \text { such that } p_i \geq 0 \text { and } \sum_i p_i=1
$
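A minimal bag-of-words sketch (plain Python; the tiny vocabulary and probabilities below are made up purely for illustration): count the words of a document and score the counts under the multinomial.
```python
from collections import Counter
from math import lgamma, log

# Hypothetical toy vocabulary and word probabilities (illustration only).
vocab = ["sushi", "pizza", "likes", "everyone"]
p = {"sushi": 0.4, "pizza": 0.3, "likes": 0.2, "everyone": 0.1}

doc = "everyone likes sushi sushi pizza".split()
counts = Counter(doc)          # bag of words: order is discarded, only counts remain
n = len(doc)

# log of the multinomial coefficient n! / (n_1! ... n_k!), via log-gamma for stability
log_coef = lgamma(n + 1) - sum(lgamma(counts[w] + 1) for w in vocab)
log_prob = log_coef + sum(counts[w] * log(p[w]) for w in vocab if counts[w] > 0)

print(f"log Pr(counts | p) = {log_prob:.3f}")
```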
## 2 Maximum Likelihood Estimation (MLE)
Let $\mathcal{P}=\left\{P_\theta: \theta \in \Theta\right\}$ be a class of probability distributions (Gaussians, Poissons, etc).
***Maximum likelihood principle***: pick the $\theta \in \Theta$ that makes the data maximally likely, that is, maximizes $\operatorname{Pr}(\text{data} \mid \theta)=P_\theta(\text{data})$.
Steps:
1. Write down an expression for the likelihood, $\operatorname{Pr}(\text{data}|\theta)$; the goal is to find the $\theta$ that maximizes it.
2. Maximizing this is the same as maximizing its log, the *log-likelihood*.
3. Solve for the maximum-likelihood parameter $\theta$.
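As a small numerical sketch of these three steps (Python/NumPy, not from the slides), take the survey numbers from Section 1.3 ($k=600$ "yes" answers out of $n=1000$) and maximize the log-likelihood by brute-force grid search over the bias $p$; the analytic answer $p=k/n$ is derived in Section 2.3 below.
```python
import numpy as np

# Minimal sketch of the recipe, for a coin/survey with k successes out of n trials.
k, n = 600, 1_000

# Steps 1-2: log-likelihood as a function of the parameter (constants in p dropped)
def log_likelihood(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

# Step 3: maximize -- here by grid search instead of setting the derivative to zero
grid = np.linspace(0.001, 0.999, 999)
p_hat = grid[np.argmax(log_likelihood(grid))]
print("grid-search MLE:", p_hat)     # ~ k/n = 0.6
```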
### 2.1 MLE for Poisson
Recall Poisson distribution: $\text{Pr}(x|\lambda)=e^{-\lambda}\frac{\lambda^x}{x!}$.
Suppose $\mathcal{P}=\{\text { Poisson }(\lambda): \lambda>0\} . \text { We observe } x_1, \ldots, x_n$.
1. Write down an expression for the likelihood, $\operatorname{Pr}(\text{data}|\lambda)$.
$\begin{align*}
\text{Pr}(x_1,\cdots,x_n|\lambda)&= \prod_{i=1}^n\text{Pr}(x_i|\lambda)=\prod_{i=1}^n e^{-\lambda}\frac{\lambda^{x_i}}{x_i!}\\
&= \frac{e^{-n\lambda}\lambda^{x_1+\cdots+x_n}}{x_1!\cdots x_n!}
\end{align*}$
2. Maximizing this is the same as maximizing its log, the log-likelihood.
$\begin{align*}
\text{LL}=\log\mathcal{L}(\lambda)=-n \lambda+ \ln \lambda\sum\limits_{i=1}^nx_i-\ln(x_1!\cdots x_n!)
\end{align*}$
3. Solve for the maximum-likelihood parameter $\lambda$.
$\begin{align*}
\frac{\partial }{\partial \lambda}\log\mathcal{L}(\lambda)&= -n+ \frac{1}{\lambda}\sum\limits_{i=1}^nx_i=0\\
\rightarrow \lambda&= \underset{\text{empirical mean}}{\underbrace{ \frac{1}{n}\sum\limits_{i=1}^nx_i }}
\end{align*}$
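A quick sanity check on synthetic data (NumPy sketch, arbitrary true $\lambda$): the closed-form MLE is the empirical mean, and the log-likelihood above is indeed larger there than at nearby values of $\lambda$.
```python
import numpy as np

# Minimal sketch: Poisson MLE equals the empirical mean; nearby lambdas score worse.
rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=2_000)

def LL(lam):
    # log-likelihood up to the constant -sum(log(x_i!)), which does not depend on lam
    return -len(x) * lam + np.log(lam) * x.sum()

lam_hat = x.mean()
print("lambda_hat =", lam_hat)
for lam in (lam_hat, 0.9 * lam_hat, 1.1 * lam_hat):
    print(f"LL({lam:.3f}) = {LL(lam):.1f}")
```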
### 2.2 MLE for Gaussian
You see $n$ data points $x_1, \ldots, x_n \in \mathbb{R}$, and want to fit a Gaussian $N\left(\mu, \sigma^2\right)$ to them.
1. Maximum likelihood: pick $\mu, \sigma$ to maximize
$
\operatorname{Pr}\left(\text { data } \mid \mu, \sigma^2\right)=\prod_{i=1}^n\left(\frac{1}{\left(2 \pi \sigma^2\right)^{1 / 2}} \exp \left(-\frac{\left(x_i-\mu\right)^2}{2 \sigma^2}\right)\right)
$
2. Work with the log, since it makes things easier:
$
\mathrm{LL}\left(\mu, \sigma^2\right)=\frac{n}{2} \ln \frac{1}{2 \pi \sigma^2}-\sum_{i=1}^n \frac{\left(x_i-\mu\right)^2}{2 \sigma^2}
$
3. Setting the derivatives to zero, we get
$
\begin{aligned}
\mu & =\frac{1}{n} \sum_{i=1}^n x_i &\quad\text{(empirical mean)}\\
\sigma^2 & =\frac{1}{n} \sum_{i=1}^n\left(x_i-\mu\right)^2 &\quad\text{(empirical variance)}
\end{aligned}
$
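As with the Poisson case, a hedged sketch on synthetic data: plug the empirical mean and variance into the log-likelihood formula above and check that perturbed parameters score lower.
```python
import numpy as np

# Minimal sketch: the Gaussian log-likelihood is maximized at the empirical mean/variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=3.0, size=5_000)

def LL(mu, var):
    return 0.5 * len(x) * np.log(1.0 / (2 * np.pi * var)) - np.sum((x - mu) ** 2) / (2 * var)

mu_hat, var_hat = x.mean(), x.var()
print("LL at MLE           :", LL(mu_hat, var_hat))
print("LL at (mu+0.5, var) :", LL(mu_hat + 0.5, var_hat))
print("LL at (mu, 2*var)   :", LL(mu_hat, 2 * var_hat))
```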
### 2.3 MLE for Binomial
Say you observe $n$ tosses of a coin of unknown bias, and $k$ come up heads. What distribution binomial $(n, p)$ is the best fit to this data?
- Choose the $p$ that maximizes $\text{Pr}(k|p)$; this gives $p=k/n$:
$\begin{align*}
\text{Pr}(k|p )&= \binom{n}{k}p^k(1-p)^{n-k}\\
\text{LL}(p)&= k\ln p+(n-k)\ln(1-p)+\text{const}\\
\frac{\partial }{\partial p}\text{LL}&= \frac{k}{p}-\frac{n-k}{1-p}=0\\
\rightarrow p&= \frac{k}{n}
\end{align*}$
*Caveat*. Say we have two coins of unknown bias.
- Toss 1st coin 10 times, got 10 heads. MLE yields estimated bias: $p_1=1$
- Toss 2nd coin 10 times, got 1 head. MLE yields estimated bias: $p_2=0.1$
- Toss a coin 20 times, got 19 heads. Which coin is more likely to be the one being tossed?
$\begin{align*}
\text{Pr}( \text{19 heads from 20 tosses}|p_1=1)&= 0 \quad\text{counterintuitive!}\\
\text{Pr}( \text{19 heads from 20 tosses}|p_2=0.1)&\gt 0
\end{align*}$
- **We never want MLE to give an estimate of $p=1$ or $p=0$**.
***Laplace smoothing***. Smooth MLE by estimating bias using the alternative formula:
$\begin{align*}
p&= \frac{k+1}{n+2}
\end{align*}$
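The caveat and the fix can be checked directly. The sketch below (plain Python, using the two-coin numbers from the caveat above) compares MLE and Laplace-smoothed bias estimates via the probability each assigns to 19 heads in 20 tosses:
```python
from math import comb

# Minimal sketch: plain MLE vs Laplace-smoothed estimates for the two coins above,
# scored by the probability each assigns to "19 heads out of 20 tosses".
def pr_19_of_20(p):
    return comb(20, 19) * p**19 * (1 - p)

for name, (k, n) in {"coin 1 (10/10 heads)": (10, 10),
                     "coin 2 (1/10 heads)": (1, 10)}.items():
    p_mle = k / n                 # maximum-likelihood estimate
    p_lap = (k + 1) / (n + 2)     # Laplace-smoothed estimate
    print(f"{name}: MLE p={p_mle:.2f} -> Pr={pr_19_of_20(p_mle):.2e},  "
          f"Laplace p={p_lap:.2f} -> Pr={pr_19_of_20(p_lap):.2e}")
```
With plain MLE, coin 1 assigns probability exactly 0 to the observation; with Laplace smoothing it is (correctly) judged far more likely than coin 2.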
*Example*. *Laplace's law of succession*: What is the probability that the sun won't rise tomorrow?
- Let $p$ be the probability that the sun won't rise on a randomly chosen day. We want to estimate $p$.
- For the past 5000 years (= 1825000 days), the sun has risen every day. Using Laplace smoothing, estimate
$p=\frac{0+1}{1825000+2}=\frac{1}{1825002}$
### 2.4 Other distribution families
1. *Gamma*: two-parameter family of distributions over $\mathbb{R}^{+}$
2. *Beta*: two-parameter family of distributions over $[0,1]$
3. *Dirichlet*: $k$-parameter family of distributions over the probability simplex $\{p \in \mathbb{R}^k: p_i \geq 0,\ \sum_i p_i=1\}$

All of these are exponential families of distributions.
## 3 MLE Alternatives and Desiderata for Estimators
Choosing a model in $\left\{P_\theta: \theta \in \Theta\right\}$ given observations $x_1, x_2, \ldots, x_n$.
*MLE*. Most common.
- Pick $\theta$ that maximizes $\prod_{i=1}^np_\theta(x_i)$
*Method of moments*. Pick the model whose moments $\mathbb{E}_{X \sim P_\theta} f(X)$ match empirical estimates.
*Bayesian estimation*. Return the maximum a-posteriori distribution, or the overall posterior.
- Put a prior $\pi$ on the space $\Theta$. $\text{posterior}(\theta)\propto\pi(\theta)\prod_{i}p_\theta(x_i)$
*Maximum entropy*. We'll see this soon.
*Other* optimization-based or game-theoretic criteria. As in generative adversarial nets, for instance.
- *Noise-contrastive estimation*
### 3.1 Desiderata for probability estimators
Goal: Given data $x_1, \ldots, x_n$, want to choose a model $P_\theta, \theta \in \Theta$.
- Let $T\left(x_1, \ldots, x_n\right)$ be some estimator of $\theta$.
- Suppose $X_1, \ldots, X_n$ are i.i.d. draws from $P_\theta$. Ideally $T\left(X_1, \ldots, X_n\right) \approx \theta$.
Some typical *desiderata*, if $X_1, \ldots, X_n \sim P_\theta$.
1. *Unbiased*: $\mathbb{E} T\left(X_1, \ldots, X_n\right)=\theta$.
2. *Asymptotically consistent*: $T\left(X_1, \ldots, X_n\right) \rightarrow \theta$ as $n \rightarrow \infty$.
3. *Low variance*: $\operatorname{var}\left(T\left(X_1, \ldots, X_n\right)\right)$ is small.
4. *Computationally feasible*: is $T\left(X_1, \ldots, X_n\right)$ easy to compute?
### 3.2 MLE vs. Desiderata
*Is MLE unbiased?* In general, no. Example: fit a normal distribution to observations $X_1, \ldots, X_n \sim N\left(\mu, \sigma^2\right)$.
- MLE yields:
$
\begin{aligned}
\widehat{\mu} & =\frac{X_1+\cdots+X_n}{n} \quad\quad
\widehat{\sigma}^2 =\frac{\left(X_1-\widehat{\mu}\right)^2+\cdots+\left(X_n-\widehat{\mu}\right)^2}{n}
\end{aligned}
$
- Can check that $\mathbb{E}[\widehat{\mu}]=\mu$ (empirical mean is unbiased) but the *empirical variance is biased*.
$
\mathbb{E}\left[\widehat{\sigma}^2\right]=\frac{n-1}{n} \sigma^2
$
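A small simulation (a sketch with arbitrary choices $n=5$, $\sigma^2=4$) reproduces the bias: averaging $\widehat{\sigma}^2$ over many repetitions lands near $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$.
```python
import numpy as np

# Minimal sketch: the MLE variance (divide by n) is biased low by a factor (n-1)/n.
rng = np.random.default_rng(0)
n, sigma2, reps = 5, 4.0, 200_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(reps, n))
sigma2_hat = samples.var(axis=1)          # MLE variance of each sample (ddof=0)

print("average sigma2_hat :", sigma2_hat.mean())
print("(n-1)/n * sigma^2  :", (n - 1) / n * sigma2)
```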
*Is MLE asymptotically consistent*? Not always, but under some conditions, yes.
Intuition: Given data $X_1, \ldots, X_n \sim P_{\theta^*}$, want to choose a model $P_\theta, \theta \in \Theta$. I.e. pick the $\theta$ that maximizes
$\begin{aligned}
\frac{1}{n} \text{LL}(\theta)=\frac{1}{n} \sum_{i=1}^n \ln P_\theta\left(X_i\right) & \rightarrow \mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_\theta(X)\right] \quad\text{(Monte-Carlo estimate; converges by the law of large numbers)}\\
& =\mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_{\theta^*}(X)\right]-KL\left(P_{\theta^*}, P_\theta\right)\\\\
\leftrightarrow \underset{\theta}{\arg\max}\,\text{LL}(\theta)&= \underset{\theta}{\arg\min}\;KL(P_{\theta^*}, P_\theta)
\end{aligned}$
WTS: $\mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_\theta(X)\right]=\mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_{\theta^*}(X)\right]-KL\left(P_{\theta^*}, P_\theta\right)$. *Proof*.
$\begin{align*}
& \mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_{\theta^*}(X)\right]-\mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_\theta(X)\right] \\
& =\sum_x P_{\theta^*}(x) \ln P_{\theta^*}(x)-\sum_x P_{\theta^*}(x) \ln P_\theta(x) \\
& =\sum_x P_{\theta^*}(x) \ln \frac{P_{\theta^*}(x)}{P_\theta(x)}=KL\left(P_{\theta^*}, P_\theta\right) \geq 0
\end{align*}$
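To make the last step concrete, a small numeric check (a sketch over a toy 3-outcome distribution, not from the slides) that $KL(P_{\theta^*}, P_\theta) \geq 0$, with equality exactly when $P_\theta=P_{\theta^*}$:
```python
import numpy as np

# Minimal sketch: KL(p*, q) >= 0 for random candidates q, and KL(p*, p*) = 0.
p_star = np.array([0.5, 0.3, 0.2])

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones(3))     # a random distribution on 3 outcomes
    print(f"KL(p*, q)  = {kl(p_star, q):.4f}")
print(f"KL(p*, p*) = {kl(p_star, p_star):.4f}")
```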
## Reference
[information theory - Why KL divergence is non-negative? - Cross Validated (stackexchange.com)](https://stats.stackexchange.com/questions/335197/why-kl-divergence-is-non-negative)