The maximum likelihood estimator (MLE) is an important [[estimator]] in [[statistics]]. The basic idea is to select the value in the parameter space that makes the observed data "most likely". Given the data $X_1, X_2, \dots, X_n$, an [[independent and identically distributed|iid]] random sample from a distribution with unknown parameter $\theta$, we want to find the value of $\theta$ in the parameter space that maximizes our "probability" of observing the data.

For discrete random variables, we are evaluating the [[joint probability mass function]] $P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)$, in other words the probability that each random variable takes on its observed value. Express this probability as a function of $\theta$ (we call this function a "likelihood") and find the $\theta$ that gives you the [[function maximum]]. For continuous random variables, the role of $P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)$ is played by the [[joint probability density function]]

$$f(\vec{x}; \theta) = \prod_{i=1}^n f(x_i; \theta)$$

The joint pmf/pdf is called a [[likelihood]] function $L(\theta)$ and includes all proportional variations of the function, because the maximum will always be located at the same value of $\theta$ regardless of how "tall" the function is. Thus, you can drop any constants of proportionality when calculating the likelihood, including any terms that depend only on $x$, since the $x$'s (the data) are fixed (i.e., considered constants); they will not affect the outcome.

Often it is easiest to take the $\log$ of the likelihood and maximize that instead, which is referred to as the "log-likelihood". This is a simple convenience and not a requirement. To solve for multiple parameters, take the partial derivative with respect to each parameter and solve the resulting system of equations.

## variance of the MLE

The variance of the MLE is calculated much the same way as any other [[variance]]. The computational formula for the variance of an estimator $\hat \theta$ is

$$V(\hat \theta) = E(\hat \theta^2) - (E(\hat \theta))^2$$

Under regularity conditions, the variance of the MLE can also be approximated by the inverse of the Fisher information, $I_n^{-1}(\theta)$, meaning that as the Fisher information increases (i.e., as we get more information from the data), the variance of the MLE decreases and the estimator becomes more precise.

## efficiency

The efficiency of an estimator is given by

$$\frac{CRLB_\theta}{V(\hat \theta)}$$

where $CRLB_\theta$ is the [[Cramér-Rao Lower Bound]] on the variance of all unbiased estimators and $V(\hat \theta)$ is the variance of the estimator. If the efficiency approaches $1$ as $n \to \infty$, the estimator is said to be asymptotically efficient.

## invariance property

The invariance property of MLEs allows you to estimate a function of a parameter by estimating the parameter using maximum likelihood estimation and plugging the result into the function. For example, if you are interested in estimating $\lambda^2$ of the [[Poisson distribution]], you can estimate $\lambda$ and square it.
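As a numerical sanity check of these ideas (and of the invariance property), here is a minimal Python sketch, assuming `numpy` and `scipy` are available: it maximizes the Poisson log-likelihood numerically and compares the result to the closed-form MLE $\hat \lambda = \bar X$, then squares it to estimate $\lambda^2$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=500)  # simulated iid Poisson data, true lambda = 4

# Negative log-likelihood for the Poisson model; terms that depend only on x
# shift the curve but do not move the maximizer.
def neg_log_lik(lam):
    return -np.sum(poisson.logpmf(x, lam))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50.0), method="bounded")
lam_hat = res.x

print(lam_hat, x.mean())   # numerical maximizer vs. closed-form MLE (x-bar): nearly identical
print(lam_hat**2)          # invariance property: the MLE of lambda^2 is lam_hat squared
```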
## other properties

Under certain "regularity conditions", for the MLE $\hat \theta$ of $\theta$:

- $\hat \theta$ exists and is unique
- $\hat \theta$ [[converges in probability]] to $\theta$ ($\hat \theta \overset{P}{\to} \theta$), so we say the MLE is a [[consistent estimator]] of $\theta$
- $\hat \theta$ is an asymptotically unbiased estimator of $\theta$
- $\hat \theta$ is asymptotically efficient (as $n$ approaches infinity, the ratio of the [[Cramér-Rao Lower Bound]] and $Var[\hat \theta]$ approaches $1$)
- Most MLEs are asymptotically normal, i.e., they [[converges in distribution|converge in distribution]] to the [[normal distribution]] as $n$ goes to infinity

## discrete example

Given data from a [[Bernoulli distribution]] $X \overset{iid}{\sim} Bern(p)$, find the MLE for $p$.

The pmf for a single $x$ is

$$f(x;p) = p^x(1-p)^{1-x} \ I_{0,1}(x)$$

Because the data are [[independent and identically distributed]] (iid), according to the [[multiplication rule]], we can calculate the joint pmf by multiplying all the individual pmfs together. Thus, the joint pmf for all $x$'s is the product from $1$ to $n$ of the individual pmfs.

$$f(\vec{x};p) = \prod_{i=1}^n f(x_i; p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} \ I_{0,1}(x_i)$$

To simplify the above, we can expand the products and notice that a product of exponents with the same base can be written as the base raised to the sum of the exponents, from the [[exponent rules]]:

$$p^{x_1} \cdot p^{x_2} \cdots p^{x_n} = p^{\sum x_i}$$

for $i=1$ to $n$. Similarly, notice that the exponent of the second term resolves to the sum of $(1 - x_i)$ from $i=1$ to $n$: there are $n$ ones to sum, which equals $n$, and $n$ $x_i$'s to subtract.

$$(1 - p)^{1 - x_1} \cdots (1 - p)^{1 - x_n} = (1 - p)^{n - \sum x_i}$$

We can drop the indicator expression, as it is constant with respect to $p$ and will not affect the location of the function maximum. Thus, for the purposes of maximum likelihood estimation, the joint pmf can be simplified to

$$f(\vec{x}; p) = p^{\sum x_i}(1 - p)^{n - \sum x_i}$$

I've simplified the summation notation for legibility, but note that all sums are from $i=1$ to $n$. Finally, we rename the joint pmf the likelihood function.

$$L(p) = p^{\sum x_i}(1 - p)^{n - \sum x_i}$$

Let's take the log of both sides to bring down the exponents and simplify using the [[logarithm rules]]. Note that the log-likelihood is written as $\ell$.

$$\ell(p) = \ln L(p) = \sum_{i=1}^n x_i \ln p + \Big(n - \sum_{i=1}^n x_i\Big) \ln (1-p)$$

Now we're ready to calculate the function maximum. To maximize a function, we calculate the [[derivative]] and set it equal to $0$. Recall that the derivative of a log is the derivative of the inside over the inside itself. Also note that the $x$'s are fixed, so we can treat them as constants.

$$\frac{d}{dp} \ell(p) = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0$$

To solve for $p$, we can multiply both sides of the equation by $p(1-p)$ to clear the denominators.

$$(1 - p)\sum x_i - p \Big( n - \sum x_i \Big) = 0$$

Multiplying the terms through and simplifying, we get

$$\hat p = \frac{\sum_{i=1}^n X_i}{n} = \bar X$$

Notice that the little $x$'s have become uppercase to indicate that they are random variables. The MLE for $p$ of a sample from a Bernoulli distribution is the sample mean, which for the Bernoulli distribution is the proportion of $1$s in the observed data, so this estimator makes a lot of sense.
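A quick numerical check of this result, as a minimal sketch assuming `numpy` is available: it maximizes the log-likelihood derived above over a grid of $p$ values and compares the maximizer to the closed-form MLE $\hat p = \bar X$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(n=1, p=0.3, size=1000)  # iid Bernoulli(p = 0.3) sample

# Log-likelihood from the derivation above: sum(x) ln(p) + (n - sum(x)) ln(1 - p)
def log_lik(p):
    s, n = x.sum(), x.size
    return s * np.log(p) + (n - s) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 999)
p_grid = grid[np.argmax(log_lik(grid))]  # crude grid-search maximizer
p_mle = x.mean()                         # closed-form MLE, p-hat = x-bar

print(p_grid, p_mle)  # both should be close to the true p = 0.3
```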
## continuous example

Given data from an [[exponential distribution]] $X \overset{iid}{\sim} exp(rate = \lambda)$, find the MLE for $\lambda$, calculate the variance of the estimator, and find the efficiency.

### find the MLE

The pdf for one $x$ is

$$f(x; \lambda) = \lambda e^{-\lambda x} \ I_{(0,\infty)}(x)$$

The joint pdf is

$$f(\vec{x}; \lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} \ I_{(0,\infty)}(x_i)$$

Distributing the product and re-writing with summation notation, we get

$$f(\vec{x}; \lambda) = \lambda^n e^{-\lambda \sum x_i} \ I_{(0,\infty)}(x_i)$$

A likelihood is

$$L(\lambda) = \lambda^n e^{-\lambda \sum x_i}$$

The log-likelihood is

$$\ell(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^n x_i$$

Maximize by taking the derivative and setting it equal to $0$:

$$\frac{d}{d\lambda} \ell(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0$$

Finally, solve for $\lambda$:

$$\hat \lambda = \frac{n}{\sum_{i=1}^n X_i} = \frac{1}{\bar X}$$

### calculate the variance

Continuing this example, let's calculate the variance of $\hat \lambda$. The computational formula for variance is

$$V(\hat \lambda) = E(\hat \lambda^2) - (E(\hat \lambda))^2$$

Let's begin with the first term, $E(\hat \lambda^2)$. Substituting in the estimator for $\hat \lambda$ found above, we have

$$E(\hat \lambda^2) = E\Big[\Big(\frac{1}{\bar X}\Big)^2\Big]$$

Noting that $\bar X = \frac1n \sum_{i=1}^n X_i$, we can re-write the sum of exponential random variables as a gamma random variable $Y \sim \Gamma(n, \lambda)$. Substituting $Y$ for the sum of $X$'s, we have

$$E(\hat \lambda^2) = E\Big[\Big(\frac{1}{\bar X}\Big)^2\Big] = E\Big[\Big(\frac{n}{\sum_{i=1}^n X_i}\Big)^2\Big] = E\Big[\Big(\frac{n}{Y}\Big)^2\Big] = E\Big[\frac{n^2}{Y^2}\Big]$$

Integrating to find this expectation, we have

$$E\Big[\frac{n^2}{Y^2}\Big] = n^2 \int_{0}^{\infty} \frac{1}{y^2} f_Y(y) \ dy = n^2 \int_{0}^{\infty} \frac{1}{y^2} \frac{1}{\Gamma(n)} \lambda^n y^{n-1}e^{-\lambda y} \ dy$$

Combining $1/y^2 = y^{-2}$ and $y^{n-1}$ gives $y^{-2 + n - 1} = y^{n-3}$, so we can simplify to

$$E\Big[\frac{n^2}{Y^2}\Big] = n^2 \int_{0}^{\infty} \frac{1}{\Gamma(n)} \lambda^n y^{n-3}e^{-\lambda y} \ dy$$

Notice that the integrand looks very similar to a gamma distribution $\Gamma(n-2, \lambda)$. To use the [[integration by pdf]] trick, we can pull out constants to transform this into the gamma pdf; integrating the gamma pdf over all values gives $1$.

$$n^2 \int_{0}^{\infty} \frac{1}{\Gamma(n)} \lambda^n y^{n-3}e^{-\lambda y} \ dy = n^2 \lambda^2 \frac{\Gamma(n-2)}{\Gamma(n)} \int_{0}^{\infty} \frac{1}{\Gamma(n-2)} \lambda^{n-2} y^{n-3}e^{-\lambda y} \ dy$$

We can simplify this to

$$E\Big[\frac{n^2}{Y^2}\Big] = n^2 \lambda^2 \frac{\Gamma(n-2)}{\Gamma(n)}$$

Recall that the [[gamma function]] has the recursive property

$$\Gamma(n) = (n - 1)\Gamma(n - 1)$$

Thus the gamma functions in the numerator and denominator can be expressed as

$$\frac{\Gamma(n-2)}{\Gamma(n)} = \frac{\Gamma(n-2)}{(n-1)(n-2)\Gamma(n-2)} = \frac{1}{(n-1)(n-2)}$$

With this we can resolve the first term:

$$E\Big[\frac{n^2}{Y^2}\Big] = \frac{n^2}{(n-1)(n-2)} \lambda^2$$

Now let's find the second term, $(E(\hat \lambda))^2$.
By the same [[integration by pdf]] approach,

$$E(\hat \lambda) = E\Big[\frac{n}{Y}\Big] = n \int_{0}^{\infty} \frac{1}{y} \frac{1}{\Gamma(n)} \lambda^n y^{n-1}e^{-\lambda y} \ dy = n \lambda \frac{\Gamma(n-1)}{\Gamma(n)} = \frac{n}{n-1}\lambda$$

and so

$$(E(\hat \lambda))^2 = \Big(\frac{n}{n-1}\lambda\Big)^2$$

Finally, bringing the first and second terms of the computational formula for variance together, we have

$$V(\hat \lambda) = \frac{n^2}{(n-1)(n-2)} \lambda^2 - \Big(\frac{n}{n-1}\lambda\Big)^2 = \frac{n^2}{(n-1)^2(n-2)} \lambda^2$$

### calculate the efficiency

To calculate the efficiency, we'll calculate the ratio of the CRLB to the variance (multiplying the CRLB by the inverse of the variance for notational compactness).

$$\frac{CRLB_\lambda}{V(\hat \lambda)} = \frac{\lambda^2}{n} \cdot \frac{(n-1)^2(n-2)}{n^2\lambda^2} = \frac{(n-1)^2(n-2)}{n^3}$$

The numerator and denominator are both third-degree polynomials in $n$ with leading coefficient $1$, so the limit as $n \to \infty$ is $1/1 = 1$. That means the MLE is asymptotically efficient.

## example with multiple parameters

Given data from a [[normal distribution]] $X \overset{iid}{\sim} N(\mu, \sigma^2)$, find the MLE for $\mu$ and $\sigma^2$.

The pdf for a single $X$ is (note that an indicator is not necessary, as the function is defined from $-\infty$ to $\infty$)

$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2 \sigma^2}(x - \mu)^2}$$

The joint pdf for all $X$'s is

$$f(\vec{x}; \mu, \sigma^2) = \prod_{i=1}^n f(x_i; \mu, \sigma^2) = (2 \pi \sigma^2)^{-n/2} e^{-\frac{1}{2 \sigma^2} \sum(x_i - \mu)^2}$$

A likelihood is

$$L(\mu, \sigma^2) = (2 \pi \sigma^2)^{-n/2} e^{-\frac{1}{2 \sigma^2} \sum(x_i - \mu)^2}$$

The log-likelihood is

$$\ell(\mu, \sigma^2) = -\frac{n}{2} \ln(2 \pi \sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

To solve for multiple parameters, take the partial derivative with respect to each parameter and solve the system of equations. In this case we need to solve both

$$\frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = 0 \qquad \text{and} \qquad \frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = 0$$

The derivative with respect to $\mu$ is

$$\frac{\partial}{\partial \mu} \ell(\mu, \sigma^2) = -\frac{1}{2 \sigma^2} \sum_{i=1}^n 2(x_i - \mu)(-1) = \frac{1}{\sigma^2} \sum_{i=1}^n(x_i - \mu)$$

> [!NOTE]
> In general, you will need to set both derivatives equal to $0$ and solve the system of equations. In this case, we can actually solve for $\mu$ directly.

Setting the derivative equal to $0$, note that the sum must be $0$ for the whole expression to equal $0$.

$$\frac{1}{\sigma^2} \sum_{i=1}^n(x_i - \mu) = 0$$

Distributing the sum through, we can simplify to

$$\sum_{i=1}^n (x_i - \mu) = \sum_{i=1}^n x_i - n\mu = 0$$

Solving for $\mu$, we get

$$\hat \mu = \frac{\sum_{i=1}^n X_i}{n} = \bar X$$

The derivative with respect to $\sigma^2$ is

$$\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = -\frac{n}{2} \frac{2 \pi}{2 \pi \sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2$$

We can simplify further; notice that the constant of proportionality $2\pi$ could have been dropped earlier.

$$\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \mu)^2$$

Since we have already calculated the estimator for $\mu$ as $\bar x$, we can substitute it into the equation and set it equal to $0$.

$$\frac{\partial}{\partial \sigma^2} \ell(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (x_i - \bar x)^2 = 0$$

We can multiply both sides by $2(\sigma^2)^2$, simplify, and solve for $\sigma^2$ to get

$$\hat \sigma^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n}$$

This estimator for $\sigma^2$ is the version of the [[sample variance]] that divides by $n$ rather than $n-1$ (it is biased, but asymptotically unbiased). Put hats on the estimators and make the variables capital letters to denote random variables.
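Two quick numerical checks of the results above, as a minimal sketch assuming `numpy` is available: the first simulates many exponential samples and compares the empirical variance of $\hat \lambda = 1/\bar X$ to the formula $\frac{n^2}{(n-1)^2(n-2)}\lambda^2$; the second computes the normal MLEs $\hat \mu = \bar X$ and $\hat \sigma^2 = \frac1n \sum (X_i - \bar X)^2$ directly.

```python
import numpy as np

rng = np.random.default_rng(2)

# --- Monte Carlo check of the variance of the exponential-rate MLE ---
lam, n, reps = 2.0, 30, 100_000
samples = rng.exponential(scale=1.0 / lam, size=(reps, n))  # numpy parameterizes by scale = 1/rate
lam_hat = 1.0 / samples.mean(axis=1)                        # MLE for each simulated sample

empirical_var = lam_hat.var()
analytic_var = n**2 / ((n - 1) ** 2 * (n - 2)) * lam**2
print(empirical_var, analytic_var)  # the two should be close

# --- MLEs for the normal example ---
x = rng.normal(loc=5.0, scale=3.0, size=1000)
mu_hat = x.mean()                        # MLE for mu is the sample mean
sigma2_hat = ((x - mu_hat) ** 2).mean()  # MLE for sigma^2 divides by n (same as np.var with ddof=0)
print(mu_hat, sigma2_hat)                # should be near 5 and 9
```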
## example when the parameter is part of the support of the distribution

Given data from a [[uniform distribution]] $X \overset{iid}{\sim} unif(0,\theta)$, where $\theta$ is the upper bound of the distribution, find the MLE for $\theta$.

The pdf for one $X$ is

$$f(x; \theta) = \frac{1}{\theta} \ I_{(0,\theta)}(x)$$

The joint pdf for all $X$'s is

$$f(\vec{x}; \theta) = \prod_{i=1}^n f(x_i; \theta) = \frac{1}{\theta^n} \prod_{i=1}^n I_{(0,\theta)}(x_i)$$

In most cases, we would drop the indicator. In this case, the indicator includes the parameter $\theta$, so we cannot drop it. However, the indicators simply define the range of the function (as in a piecewise function), so they can be ignored during operations like logs and derivatives; they just come along for the ride. For now, we'll file our indicator away but bring it back when we need it.

A likelihood is

$$L(\theta) = \frac{1}{\theta^n}$$

The log-likelihood is

$$\ell(\theta) = -n \ln \theta$$

Setting the derivative with respect to $\theta$ equal to $0$, we get

$$\frac{d}{d\theta}\ell(\theta) = -\frac{n}{\theta} = 0$$

This equation has no solution: the derivative is negative for every $\theta > 0$, so setting the derivative to zero is not a good approach to finding the maximum of this likelihood. Instead, note that the likelihood $L(\theta) = 1/\theta^n$ is a decreasing function of $\theta$. To maximize $L(\theta)$, we need to take $\theta$ as small as possible. Given the data, what is the smallest value $\theta$ can take? We can bring back our indicator and see that the $x_i$'s (our data) range from $0$ to $\theta$.

$$L(\theta) = \frac{1}{\theta^n} \prod_{i=1}^n I_{(0,\theta)}(x_i)$$

Therefore, the smallest value that $\theta$ can take is the maximum value of our data. Our MLE for $\theta$ is therefore

$$\hat \theta = \max(X_1, X_2, \dots, X_n)$$

> [!Tip] Additional Resources
> - https://www.youtube.com/watch?v=n8YcOUZRZy8 — example of two coin tosses and parameter $p \in \{0.2, 0.5, 0.8\}$ (biased tails, fair, biased heads)

#expand
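To close out the uniform example above, a minimal numerical sketch (assuming `numpy` is available): for a simulated sample, the likelihood $L(\theta) = \theta^{-n}$, which is zero for $\theta < \max(x_i)$ because of the indicator, is largest right at the sample maximum, matching $\hat \theta = \max(X_1, \dots, X_n)$.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = 7.0
x = rng.uniform(low=0.0, high=theta_true, size=50)  # iid Uniform(0, theta) sample

# Likelihood: 1/theta^n when theta >= max(x_i), and 0 otherwise (the indicator).
def likelihood(theta):
    n = x.size
    return np.where(theta >= x.max(), theta ** (-float(n)), 0.0)

grid = np.linspace(0.01, 12.0, 5000)
theta_grid = grid[np.argmax(likelihood(grid))]  # grid-search maximizer
theta_mle = x.max()                             # closed-form MLE: the sample maximum

print(theta_grid, theta_mle)  # the grid maximizer lands at (just above) max(x)
```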