# Maximum Likelihood Estimation (MLE)

Slides: [[lec9_fitting_distribution.pdf|fitting distribution]]

## 1 Distributional Modeling

Fit a probability distribution to the dataset.
- Simple and compact.
- Captures the big picture while smoothing out the wrinkles in the data.
- In subsequent applications, use the distribution as a proxy for the data.

*Choosing a distribution*: There exist a few distributions of great universality which occur in a surprisingly large number of problems. The three principal distributions, with ramifications throughout probability theory, are: 1) the *binomial distribution*, 2) the *normal distribution*, 3) the *Poisson distribution*. (William Feller)
- For *higher dimensions*, we can use combinations of 1-d models: *products* and *mixtures*.

### 1.1 Normal Distribution

![[Pasted image 20230503155008.png|200]]

The ***normal*** (Gaussian) distribution $N\left(\mu, \sigma^2\right)$ has mean $\mu$, variance $\sigma^2$, and density function
$p(x)=\frac{1}{\left(2 \pi \sigma^2\right)^{1 / 2}} \exp \left(-\frac{(x-\mu)^2}{2 \sigma^2}\right)$
- $68.3 \%$ of the distribution lies within one standard deviation of the mean, $\mu \pm \sigma$
- $95.5 \%$ lies within $\mu \pm 2 \sigma$
- $99.7 \%$ lies within $\mu \pm 3 \sigma$

***Central Limit Theorem***: Let $X_1, X_2, \ldots$ be independent with $\mathbb{E} X_i=\mu_i$ and $\operatorname{var}\left(X_i\right)=v_i$. Then
$\frac{\left(X_1+\cdots+X_n\right)-\left(\mu_1+\cdots+\mu_n\right)}{\sqrt{v_1+\cdots+v_n}} \stackrel{d}{\longrightarrow} N(0,1)$

**Fitting a Gaussian to data:** Given data points $x_1, \ldots, x_n$ to which we want to fit a distribution, which Gaussian $N\left(\mu, \sigma^2\right)$ should we choose?
- *Empirical mean*: $\mu=\frac{1}{n}\sum\limits_{i=1}^n x_i$
- *Empirical variance*: $\sigma^2=\frac{1}{n}\sum\limits_{i=1}^n(x_i-\mu)^2$

### 1.2 Poisson Distribution

![[Pasted image 20230503160605.png|200]]

***Poisson distribution***: a distribution over the non-negative integers $\{0,1,2, \ldots\}$. $\text{Poisson}(\lambda)$, with $\lambda>0$:
$\begin{align*} \operatorname{Pr}(X=k)&= e^{-\lambda} \frac{\lambda^k}{k !} \end{align*}$
- *Mean*: $\mathbb{E} X=\lambda$
- *Variance*: $\mathbb{E}(X-\lambda)^2=\lambda$

*Proof* (normalization and mean).
$\begin{align*} \sum\limits_{k=0}^\infty e^{-\lambda}\frac{\lambda^k}{k!}&= e^{-\lambda}\left(1+\lambda+\frac{\lambda^2}{2!}+\frac{\lambda^3}{3!}+\cdots\right)=e^{-\lambda}e^{\lambda}=1\\ \mathbb{E}[X]&= \sum\limits_{k=0}^\infty k\operatorname{Pr}(X=k)=\sum\limits_{k=0}^\infty ke^{-\lambda}\frac{\lambda^k}{k!}=e^{-\lambda}\sum\limits_{k=1}^\infty\frac{\lambda^k}{(k-1)!}\\ &= e^{-\lambda}\lambda\sum\limits_{k=1}^\infty\frac{\lambda^{k-1}}{(k-1)!}=e^{-\lambda}\lambda\sum\limits_{k=0}^\infty\frac{\lambda^k}{k!}=\lambda \end{align*}$

**Definition/Origin**: Count the number of events (collisions, phone calls, etc.) that occur in a certain interval of time. Call this number $X$, and say it has expected value $\lambda$.
- Now suppose we divide the interval into small pieces of equal length.
- If the probability of an event occurring in a small piece is 1) *independent* of what happens in the other small pieces, and 2) the same across small pieces,
- then $X\sim \text{Poisson}(\lambda)$. (A simulation sketch of this construction appears at the end of this section.)

*Example*: Rutherford's experiments with radioactive disintegration
![[Pasted image 20230503162141.png|400]]

*Example*: Flying bombs hitting London in WWII
![[Pasted image 20230503170459.png|400]]
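To make the small-pieces construction above concrete, here is a minimal simulation sketch (assuming `numpy` is available; the choices $\lambda=4$, $m=1000$ pieces, and all variable names are illustrative, not from the slides). Summing many independent Bernoulli$(\lambda/m)$ indicators gives counts whose empirical frequencies track the Poisson pmf $e^{-\lambda}\lambda^k/k!$.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)

lam = 4.0         # expected number of events per interval (illustrative choice)
m = 1000          # number of small pieces the interval is divided into
trials = 100_000  # number of simulated intervals

# Each interval: m independent Bernoulli(lam/m) events, so the count is Binomial(m, lam/m).
counts = rng.binomial(n=m, p=lam / m, size=trials)

print("empirical mean    :", counts.mean())  # close to lambda
print("empirical variance:", counts.var())   # also close to lambda

# Compare empirical frequencies with the Poisson pmf e^{-lam} * lam^k / k!
for k in range(8):
    empirical = (counts == k).mean()
    pmf = exp(-lam) * lam**k / factorial(k)
    print(f"k={k}: empirical={empirical:.4f}  Poisson pmf={pmf:.4f}")
```

Making $m$ larger (with $\lambda$ fixed) tightens the agreement; this is the same binomial-to-Poisson limit that reappears in Section 1.5.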
### 1.3 Binomial Distribution

$\operatorname{Binomial}(n, p)$: the number of heads from $n$ independent coin tosses of bias (heads probability) $p$.
$\begin{aligned} \operatorname{Pr}(X=k) =\left(\begin{array}{l}n \\k\end{array}\right) p^k(1-p)^{n-k}\quad\Bigg|\quad \mathbb{E} X =n p \quad\Bigg|\quad \operatorname{var}(X) =n p(1-p) \end{aligned}$

*Proof*. For the $i$-th coin flip let $X_i= \begin{cases}1 & \text { if heads } \\ 0 & \text { if tails }\end{cases}$. Then $X=\sum\limits_{i=1}^nX_i$.
$\begin{align*} \mathbb{E}[X_i]=p &\quad\rightarrow\quad \mathbb{E}[X]=n\mathbb{E}[X_i]=np\\ \text{var}(X_i)&= \mathbb{E}[X_i^2]-(\mathbb{E}[X_i])^2=p-p^2 =p(1-p)\\ &\quad\rightarrow\quad\text{var}(X)=np(1-p) \end{align*}$

*Example*: Survey on food tastes.
- You choose 1000 people at random and ask them whether they like sushi. 600 say yes.
- What is a good estimate for the fraction of people who like sushi? Clearly, $60 \%$.

### 1.4 Central Limit Theorem

*Claim*. The binomial is well approximated by a Gaussian when $n$ is large and $p$ is not too close to $0$ or $1$.

![[Pasted image 20230506154423.png|400]]

When a coin of bias $p$ is tossed $n$ times, let $S_n$ be the number of heads.
- We know $S_n$ has *mean* $n p$ and *variance* $n p(1-p)$.
- By the ***central limit theorem***, as $n$ grows the distribution of $S_n$ looks increasingly like a Gaussian with this mean and variance, i.e.,
$\frac{S_n-n p}{\sqrt{n p(1-p)}} \stackrel{d}{\longrightarrow} N(0,1)$
- Note: $\overset{d}{\longrightarrow}$ means "converges in distribution to".

### 1.5 Poisson Limit Theorem

*Claim*. The binomial is well approximated by a Poisson when $p$ is close to $0$.

Toss coins with biases $p_1, \ldots, p_n$ and let $S_n$ be the number of heads.

***Le Cam's inequality***. Let $\lambda=p_1+\cdots+p_n$:
$\sum_{k=0}^{\infty}\left|\operatorname{Pr}\left(S_n=k\right)-e^{-\lambda} \frac{\lambda^k}{k !}\right| \leq \sum_{i=1}^n p_i^2$
- An upper bound on the $L_1$ distance between the distribution of $S_n$ and $\text{Poisson}(\lambda)$.

***Poisson limit theorem***: If all $p_i=\lambda / n$, then
$S_n \stackrel{d}{\longrightarrow} \text { Poisson }(\lambda)$
- Also called "the law of rare events" or "the law of small numbers".

### 1.6 Multinomial Distribution

$\begin{align*} \operatorname{Pr}\left(n_1, \ldots, n_k\right)=\left(\begin{array}{c}n \\ n_1, n_2, \ldots, n_k\end{array}\right) p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k} \quad\Bigg\vert\quad \mathbb{E} X=\left(n p_1, n p_2, \ldots, n p_k\right) \end{align*}$

Imagine a $k$-faced die, with probabilities $p_1, \ldots, p_k$. Toss such a die $n$ times, and count the number of times each of the $k$ faces occurs.
- The distribution of $X=\left(X_1, \ldots, X_k\right)$ is called the ***multinomial***, where $X_j$ is the number of times face $j$ occurs.
- Parameters: $p_1, \ldots, p_k \geq 0$, with $\sum\limits_{i=1}^kp_i=1$. Note that $\sum\limits_{i=1}^k n_i=n$.
- $\left(\begin{array}{c}n \\n_1, n_2, \ldots, n_k\end{array}\right)=\frac{n !}{n_{1} ! n_{2} ! \cdots n_{k} !}$ is the number of ways to place balls numbered $\{1, \ldots, n\}$ into bins numbered $\{1, \ldots, k\}$ so that bin $j$ receives exactly $n_j$ balls.

*Example*. Bag-of-words representation of a text document (a small numerical sketch follows below).
- Fix a vocabulary $V$; the multinomial then has $k=|V|$ categories, one per word.
- Treat the words in the document as independent draws from a multinomial over $V$:
$p=\left(p_1, \ldots, p_{|V|}\right), \text { such that } p_i \geq 0 \text { and } \sum_i p_i=1$
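To make the bag-of-words example concrete, here is a minimal sketch that evaluates the multinomial probability $\operatorname{Pr}(n_1, \ldots, n_k)$ above for a toy document. The tiny vocabulary, counts, and probabilities are invented for illustration; log-factorials (`math.lgamma`) keep the computation numerically stable for longer documents.

```python
import math

# Toy vocabulary, a hypothetical document's word counts, and word probabilities.
vocab  = ["the", "cat", "sat", "mat"]
counts = [4, 2, 1, 1]          # n_1, ..., n_k, with n = 8 words in the document
probs  = [0.5, 0.2, 0.2, 0.1]  # p_1, ..., p_k, summing to 1

def multinomial_log_prob(counts, probs):
    """log Pr(n_1,...,n_k) = log n! - sum_j log n_j! + sum_j n_j * log p_j."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coef + sum(c * math.log(p) for c, p in zip(counts, probs))

log_p = multinomial_log_prob(counts, probs)
print("log Pr(counts) =", log_p)
print("Pr(counts)     =", math.exp(log_p))
```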
## 2 Maximum Likelihood Estimation (MLE)

Let $\mathcal{P}=\left\{P_\theta: \theta \in \Theta\right\}$ be a class of probability distributions (Gaussians, Poissons, etc.).

***Maximum likelihood principle***: pick the $\theta \in \Theta$ that makes the data maximally likely, that is, maximizes $\operatorname{Pr}(\text{data} \mid \theta)=P_\theta(\text{data})$.

Steps:
1. Write down an expression for the likelihood, $\operatorname{Pr}(\text{data}\mid\theta)$. The goal is to find the $\theta$ that maximizes $\operatorname{Pr}(\text{data}\mid\theta)$.
2. Maximizing this is the same as maximizing its log, the *log-likelihood*.
3. Solve for the maximum-likelihood parameter $\theta$.

### 2.1 MLE for Poisson

Recall the Poisson distribution: $\text{Pr}(x|\lambda)=e^{-\lambda}\frac{\lambda^x}{x!}$. Suppose $\mathcal{P}=\{\text{Poisson}(\lambda): \lambda>0\}$ and we observe $x_1, \ldots, x_n$.
1. Write down an expression for the likelihood, $\operatorname{Pr}(\text{data}\mid\lambda)$.
$\begin{align*} \text{Pr}(x_1,\cdots,x_n|\lambda)&= \prod_{i=1}^n\text{Pr}(x_i|\lambda)=\prod_{i=1}^n e^{-\lambda}\frac{\lambda^{x_i}}{x_i!}\\ &= \frac{e^{-n\lambda}\lambda^{x_1+\cdots+x_n}}{x_1!\cdots x_n!} \end{align*}$
2. Maximizing this is the same as maximizing its log, the log-likelihood.
$\begin{align*} \text{LL}(\lambda)=\log\mathcal{L}(\lambda)=-n \lambda+ \ln \lambda\sum\limits_{i=1}^nx_i-\ln(x_1!\cdots x_n!) \end{align*}$
3. Solve for the maximum-likelihood parameter $\lambda$.
$\begin{align*} \frac{\partial }{\partial \lambda}\log\mathcal{L}(\lambda)&= -n+ \frac{1}{\lambda}\sum\limits_{i=1}^nx_i=0\\ \rightarrow \lambda&= \underset{\text{empirical mean}}{\underbrace{ \frac{1}{n}\sum\limits_{i=1}^nx_i }} \end{align*}$

### 2.2 MLE for Gaussian

You see $n$ data points $x_1, \ldots, x_n \in \mathbb{R}$ and want to fit a Gaussian $N\left(\mu, \sigma^2\right)$ to them.
1. Maximum likelihood: pick $\mu, \sigma$ to maximize
$\operatorname{Pr}\left(\text { data } \mid \mu, \sigma^2\right)=\prod_{i=1}^n\left(\frac{1}{\left(2 \pi \sigma^2\right)^{1 / 2}} \exp \left(-\frac{\left(x_i-\mu\right)^2}{2 \sigma^2}\right)\right)$
2. Work with the log, since it makes things easier:
$\mathrm{LL}\left(\mu, \sigma^2\right)=\frac{n}{2} \ln \frac{1}{2 \pi \sigma^2}-\sum_{i=1}^n \frac{\left(x_i-\mu\right)^2}{2 \sigma^2}$
3. Setting the derivatives to zero, we get (a numerical sanity check follows below)
$\begin{aligned} \mu & =\frac{1}{n} \sum_{i=1}^n x_i &\quad\text{(empirical mean)}\\ \sigma^2 & =\frac{1}{n} \sum_{i=1}^n\left(x_i-\mu\right)^2 &\quad\text{(empirical variance)} \end{aligned}$
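As a quick numerical check of the closed-form answer (a sketch only: the sample values and the parameter grid are arbitrary, and `numpy` is assumed), we can evaluate $\mathrm{LL}(\mu, \sigma^2)$ over a grid and confirm that no grid point beats the empirical mean and variance.

```python
import numpy as np

x = np.array([2.1, 1.3, 3.7, 2.9, 1.8, 2.4, 3.1, 2.0])  # arbitrary sample

def gaussian_log_lik(mu, sigma2, x):
    """LL(mu, sigma^2) = (n/2) * ln(1 / (2*pi*sigma^2)) - sum_i (x_i - mu)^2 / (2*sigma^2)."""
    n = len(x)
    return 0.5 * n * np.log(1.0 / (2 * np.pi * sigma2)) - np.sum((x - mu) ** 2) / (2 * sigma2)

mu_hat = x.mean()                      # empirical mean
var_hat = np.mean((x - mu_hat) ** 2)   # empirical variance (divides by n)

# Brute-force search over a grid of candidate (mu, sigma^2) values.
best = max(
    ((mu, s2) for mu in np.linspace(0, 5, 201) for s2 in np.linspace(0.05, 5, 200)),
    key=lambda p: gaussian_log_lik(p[0], p[1], x),
)

print("closed-form MLE      :", (mu_hat, var_hat))
print("best grid point      :", best)  # lands next to the closed-form MLE
print("LL at closed-form MLE:", gaussian_log_lik(mu_hat, var_hat, x))
print("LL at best grid point:", gaussian_log_lik(best[0], best[1], x))  # never larger
```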
### 2.3 MLE for Binomial

Say you observe $n$ tosses of a coin of unknown bias, and $k$ come up heads. Which $\operatorname{Binomial}(n, p)$ is the best fit to this data?
- Choose the $p$ that maximizes $\text{Pr}(k|p)$; the $\binom{n}{k}$ factor is dropped below since it does not depend on $p$:
$\begin{align*} \text{Pr}(k|p )&\propto p^k(1-p)^{n-k}\\ \text{LL}(p)&= k\ln p+(n-k)\ln(1-p)\\ \frac{\partial }{\partial p}\text{LL}&= \frac{k}{p}-\frac{n-k}{1-p}=0\\ \rightarrow p&= \frac{k}{n} \end{align*}$

*Caveat*. Say we have two coins of unknown bias.
- Toss the 1st coin 10 times and get 10 heads. MLE yields estimated bias $p_1=1$.
- Toss the 2nd coin 10 times and get 1 head. MLE yields estimated bias $p_2=0.1$.
- Now one of the coins is tossed 20 times and comes up heads 19 times. Which coin is more likely to be the one being tossed?
$\begin{align*} \text{Pr}( \text{19 heads from 20 tosses}|p_1=1)&= 0 \quad\text{counterintuitive!}\\ \text{Pr}( \text{19 heads from 20 tosses}|p_2=0.1)&\gt 0 \end{align*}$
- **We never want the MLE to give an estimate of $p=1$ or $p=0$.**

***Laplace smoothing***. Smooth the MLE by estimating the bias with the alternative formula:
$\begin{align*} p&= \frac{k+1}{n+2} \end{align*}$

*Example*. *Laplace's law of succession*: What is the probability that the sun won't rise tomorrow?
- Let $p$ be the probability that the sun won't rise on a randomly chosen day. We want to estimate $p$.
- For the past 5000 years (= 1,825,000 days), the sun has risen every day.
- Using Laplace smoothing, estimate $p=\frac{1}{1825002}$.

### 2.4 Other Distribution Families

1. *Gamma*: two-parameter family of distributions over $\mathbb{R}^{+}$
2. *Beta*: two-parameter family of distributions over $[0,1]$
3. *Dirichlet*: $k$-parameter family of distributions over the probability simplex in $\mathbb{R}^k$

All of these are exponential families of distributions.

## 3 MLE Alternatives and Desiderata for Estimators

Choosing a model in $\left\{P_\theta: \theta \in \Theta\right\}$ given observations $x_1, x_2, \ldots, x_n$:

*MLE*. Most common.
- Pick the $\theta$ that maximizes $\prod_{i=1}^np_\theta(x_i)$.

*Method of moments*. Pick the model whose moments $\mathbb{E}_{X \sim P_\theta} f(X)$ match empirical estimates.

*Bayesian estimation*. Return the maximum a-posteriori distribution, or the overall posterior.
- Put a prior $\pi$ on the space $\Theta$: $\text{posterior}(\theta)\propto\pi(\theta)\prod_{i}p_\theta(x_i)$

*Maximum entropy*. We'll see this soon.

*Other* optimization-based or game-theoretic criteria, as in generative adversarial nets, for instance.
- *Noise-contrastive estimation*

### 3.1 Desiderata for Probability Estimators

Goal: Given data $x_1, \ldots, x_n$, we want to choose a model $P_\theta$, $\theta \in \Theta$.
- Let $T\left(x_1, \ldots, x_n\right)$ be some estimator of $\theta$.
- Suppose $X_1, \ldots, X_n$ are i.i.d. draws from $P_\theta$. Ideally $T\left(X_1, \ldots, X_n\right) \approx \theta$.

Some typical *desiderata*, if $X_1, \ldots, X_n \sim P_\theta$:
1. *Unbiased*: $\mathbb{E} T\left(X_1, \ldots, X_n\right)=\theta$.
2. *Asymptotically consistent*: $T\left(X_1, \ldots, X_n\right) \rightarrow \theta$ as $n \rightarrow \infty$.
3. *Low variance*: $\operatorname{var}\left(T\left(X_1, \ldots, X_n\right)\right)$ is small.
4. *Computationally feasible*: Is $T\left(X_1, \ldots, X_n\right)$ easy to compute?

### 3.2 MLE vs. Desiderata

*Is MLE unbiased?* In general, no.

Example: Fit a normal distribution to observations $X_1, \ldots, X_n \sim N\left(\mu, \sigma^2\right)$.
- MLE yields:
$\begin{aligned} \widehat{\mu} & =\frac{X_1+\cdots+X_n}{n} \quad\quad \widehat{\sigma}^2 =\frac{\left(X_1-\widehat{\mu}\right)^2+\cdots+\left(X_n-\widehat{\mu}\right)^2}{n} \end{aligned}$
- One can check that $\mathbb{E}[\widehat{\mu}]=\mu$ (the empirical mean is unbiased), but the *empirical variance is biased* (see the simulation sketch below):
$\mathbb{E}\left[\widehat{\sigma}^2\right]=\frac{n-1}{n} \sigma^2$
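A minimal simulation sketch of this bias (assuming `numpy`; the true parameters, the sample size $n=5$, and the number of repetitions are arbitrary choices): averaging $\widehat{\sigma}^2$ over many samples comes out near $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$, which is why the familiar unbiased estimator divides by $n-1$ instead of $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2 = 1.0, 4.0   # true parameters (illustrative)
n, reps = 5, 200_000    # small n makes the bias factor (n-1)/n = 0.8 easy to see

# reps independent samples of size n from N(mu, sigma^2)
samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
mu_hat = samples.mean(axis=1, keepdims=True)
var_hat = np.mean((samples - mu_hat) ** 2, axis=1)  # MLE variance, divides by n

print("average of var_hat:", var_hat.mean())        # close to (n-1)/n * sigma^2 = 3.2
print("(n-1)/n * sigma^2 :", (n - 1) / n * sigma2)
print("true sigma^2      :", sigma2)
```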
*Is MLE asymptotically consistent?* Not always, but under some conditions, yes.

Intuition: Given data $X_1, \ldots, X_n \sim P_{\theta^*}$, we want to choose a model $P_\theta$, $\theta \in \Theta$, i.e., pick the $\theta$ that maximizes
$\begin{aligned} \frac{1}{n} LL(\theta)=\frac{1}{n} \sum_{i=1}^n \ln P_\theta\left(X_i\right) & \rightarrow \mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_\theta(X)\right] \quad\text{(law of large numbers: the average is a Monte-Carlo estimate)}\\ & =\mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_{\theta^*}(X)\right]-KL\left(P_{\theta^*}, P_\theta\right) \end{aligned}$
So, in the limit, maximizing the log-likelihood amounts to minimizing the KL divergence from the true distribution:
$\underset{\theta}{\arg\max}\,LL(\theta)\;\approx\;\underset{\theta}{\arg\min}\;KL(P_{\theta^*}, P_\theta)$

It remains to show that $\mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_\theta(X)\right]=\mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_{\theta^*}(X)\right]-KL\left(P_{\theta^*}, P_\theta\right)$.

*Proof*.
$\begin{align*} & \mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_{\theta^*}(X)\right]-\mathbb{E}_{X \sim P_{\theta^*}}\left[\ln P_\theta(X)\right] \\ & =\sum_x P_{\theta^*}(x) \ln P_{\theta^*}(x)-\sum_x P_{\theta^*}(x) \ln P_\theta(x) \\ & =\sum_x P_{\theta^*}(x) \ln \frac{P_{\theta^*}(x)}{P_\theta(x)}=KL\left(P_{\theta^*}, P_\theta\right) \geq 0 \end{align*}$

## Reference

[information theory - Why KL divergence is non-negative? - Cross Validated](https://stats.stackexchange.com/questions/335197/why-kl-divergence-is-non-negative)