# Maximum Entropy and Exponential Family [[lec12_maximum_entropy.pdf]]

## 1 Maximum Entropy Distribution Modeling

Suppose we have a prior distribution $\pi$ on $S$ (e.g. the distribution of a broader class of birds).
$\begin{align}
&\min_p KL(p, \pi) & \quad\text{(want distribution } p \text{ close to }\pi) \\
&\sum_{x \in S} p_x T_i(x)=b_i, \quad 1 \leq i \leq k & \quad\text{(observed constraints)}\\
&p_x \geq 0,\; x \in S ,\quad\sum_{x \in S} p_x=1 & \quad\text{(constrain } p_x \text{ to be a pmf over } S)
\end{align}$
- If $\pi=U(S)$, this is exactly the maximum-entropy formulation of Section 1.1 (since $H(p)=\log|S|-KL(p, U(S))$).
- Why not $KL(\pi,p)$? KL divergence is not symmetric. We want the smooth distribution (the prior $\pi$) in the denominator, since the denominator cannot be 0.
$\begin{align*}
KL(p,q)=\sum\limits_x p(x)\log\frac{p(x)}{\underset{\text{cannot be 0}}{\boxed{q(x)}}}
\end{align*}$

*Solution*.
$\begin{align*}
\boxed{p(x)=\frac{1}{Z} \exp \left(\sum_{i=1}^k \theta_i T_i(x)\right) \pi(x)=\frac{1}{Z} e^{\theta^\top T(x)} \pi(x)}
\end{align*}$
- $T(x)=\left(T_1(x), \ldots, T_k(x)\right)$ and $\theta=\left(\theta_1, \ldots, \theta_k\right)$.
- By varying $(S, T, \pi)$, we get Gaussians, Poissons, binomials, Markov random fields... These are all *exponential families*.

*Derive the solution via Lagrange multipliers.*
$\begin{align*}
F(p)&= \sum_x p_x \ln \frac{p_x}{\pi_x}-\sum_{i=1}^{k} \theta_i\left(\sum_x p_x T_i(x)-b_i\right)-\nu\left(\sum_x p_x-1\right) \\
\frac{\partial F}{\partial p_x}&= p_x \cdot\frac{1}{p_x}+\ln p_x-\ln \pi_x-\sum_{i=1}^k \theta_i T_i(x)-\nu
= 1+\ln\frac{p_x}{\pi_x}-\sum_{i=1}^k \theta_i T_i(x)-\nu \\
\frac{\partial F}{\partial p_x}= 0 &\;\Rightarrow\; \ln \frac{p_x}{\pi_x}=\sum_{i=1}^k \theta_i T_i(x)+\nu-1\\
\Rightarrow\; p_x&= \frac{1}{Z}\,\pi_x\exp\left(\sum_{i=1}^k \theta_i T_i(x) \right)
\end{align*}$

### 1.1 Motivation: model geo-distribution of birds

*Problem set-up*.
- We observe $\boxed{1611}$ data points. Desired resolution: a $386\times 286$ grid, i.e. $\boxed{|S|=110{,}396}$ pixels $\gg 1611$ sightings!
- Want a distribution over $S=\{\text{locations}\}$, $|S|=110{,}396$.
- Represent each pixel $x \in S$ by environmental features $T(x)=\left(T_1(x), \ldots, T_k(x)\right)$.
	- Say $T_1(x)$ is the elevation, $T_k(x)$ the rainfall, etc. (slope, precipitation, ...). Use the data to estimate $\mathbb{E}T_i(x)$ for $i\in[1,k]$.
	- $\mathbb{E}T_1=$ average elevation at which this bird is seen.

*The maximum entropy approach*.
1. Suppose we had no sightings at all. The most reasonable model, in the absence of any information, might just be the uniform distribution over $S$.
2. But we have some sightings, yielding estimates $\mathbb{E} T_i(x)=b_i, 1 \leq i \leq k$. Pick a distribution $p$ over $S$ that respects these constraints, i.e., for all $i$ (but is otherwise *as random as possible*):
$\begin{align*}
\sum_{x \in S} p(x) T_i(x)=b_i
\end{align*}$
3. Want a distribution $p$ over $S$ such that the average elevation, etc. matches the empirical observations.

*Formalizing as a ==convex optimization== problem*.
- Domain: finite set $S$, e.g. geographical locations
- Features $T: S \rightarrow \mathbb{R}^k$, e.g. environmental features
- Observed *constraints*: $\mathbb{E} T_i(x)=b_i$, for $i=1, \ldots, k$

Find the distribution $p$ on $S$ that has maximum entropy subject to the constraints.
$\begin{align*}
&\max \sum_{x \in S} p_x \ln \frac{1}{p_x} &\quad\text{(maximize entropy)}\\
&\sum_{x \in S} p_x T_i(x)=b_i, \quad 1 \leq i \leq k & \quad\text{(observed constraints)}\\
&p_x \geq 0, \; x \in S ,\quad \sum_{x \in S} p_x=1 &\quad (\text{constrain } p_x\text{ to be a distribution over }S)
\end{align*}$

*Solution*.
$\begin{align*}
p(x) \propto \pi(x) \exp \left(\theta_1 \cdot \operatorname{avgtemp}(x)+\theta_2 \cdot \operatorname{elevation}(x)+\cdots\right)
\end{align*}$
- E.g. $\theta_2=0.81 \Longrightarrow$ each additional unit of elevation multiplies the probability by $e^{0.81} \approx 2.25$.
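To make the solution concrete, here is a minimal numerical sketch of the fit on a toy finite $S$. Everything in it (the grid size `n`, the feature values, the target moments `b`, the step size) is invented for illustration and is not the lecture's bird data; it fits $\theta$ by gradient ascent on the dual, i.e. by matching the model moments $\mathbb{E}_{p_\theta}[T]$ to $b$.

```python
# A minimal sketch of the max-entropy / min-KL fit on a toy finite S.
# The grid size, features, and target moments b are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

n = 50                                  # |S|: a tiny "map" with 50 locations
T = rng.uniform(-1, 1, size=(n, 2))     # features T(x) = (T_1(x), T_2(x)) per location
pi = np.full(n, 1.0 / n)                # prior pi (uniform here; any positive prior works)
b = np.array([0.3, -0.2])               # observed constraints E_p[T_i] = b_i

def p_of(theta):
    # Solution form: p_theta(x) = pi(x) exp(theta . T(x)) / Z_theta
    w = pi * np.exp(T @ theta)
    return w / w.sum()

# Gradient ascent on the concave dual  theta.b - log sum_x pi(x) exp(theta.T(x));
# its gradient is b - E_{p_theta}[T].
theta = np.zeros(2)
for _ in range(2000):
    grad = b - p_of(theta) @ T          # moment mismatch
    theta += 0.5 * grad

p = p_of(theta)
print("E_p[T] =", p @ T, "  target b =", b)   # these should (approximately) agree
```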
## 2 Entropy and KL Divergence

The ***entropy*** of a distribution $p$ over a finite set $S$ is
$\begin{align*}
H(p)=\sum_{x \in S} p(x) \log \frac{1}{p(x)}
\end{align*}$
- Characterizes how *random* a distribution is
- *Concave* as a function of $p$

*Example*. Calculate the entropy of the following coins, $S=\set{0,1}$:
$\begin{align*}
\begin{array}{c|c|c}
\text{Fair}&p(0)=p(1)=\frac{1}{2}&H(p)=\log 2=1\\
\text{Bias } 3/4&p(0)=\frac{1}{4},\;p(1)=\frac{3}{4}&H(p)\approx0.81\\
\text{Bias }0.99&p(0)=0.01,\;p(1)=0.99&H(p)\approx0.08
\end{array}
\end{align*}$

*Example*. Two fair coins
$\begin{align*}
S&= \set{00,01,10,11}, \;p(00)=p(01)=p(10)=p(11)=1/4\\
H(p)&= 4\times \left(\tfrac{1}{4}\log4\right)=2
\end{align*}$

*Lemma*. The entropy $H(p)$ of a uniform distribution over $k$ outcomes is $\log k$:
$\begin{align*}
H(p)=\sum\limits_{x}p(x)\log\frac{1}{p(x)}=k\cdot\frac{1}{k}\log k=\log k
\end{align*}$

### 2.1 Justifying properties of entropy

*Entropy* is the **only** measure that satisfies these 6 properties. [Aczel-Forte-Ng (1975)]
1. *Expansibility*. If $X$ has distribution $\left(p_1, \ldots, p_n\right)$ and $Y$ has distribution $\left(p_1, \ldots, p_n, 0\right)$ then $H(X)=H(Y)$.
2. *Symmetry*. Distribution $(p, 1-p)$ has the same entropy as $(1-p, p)$.
3. *Additivity*. If $X, Y$ are independent then $H(X, Y)=H(X)+H(Y)$, where $H(X,Y)$ is the entropy of the joint distribution.
4. *Subadditivity*. $H(X, Y) \leq H(X)+H(Y)$.
5. *Normalization*. A fair coin has entropy 1.
6. "*Small for small probability*". The entropy of a coin of bias $p$ goes to 0 as $p \downarrow 0$.

*Proof*. Entropy is additive. Say $X\sim p(x)$, $Y\sim q(y)$; since $X,Y$ are independent, the joint distribution is $r(x,y)=p(x)q(y)$.
$\begin{align*}
H(X,Y)&= \sum\limits_{x,y}r(x,y)\log\frac{1}{r(x,y)}=\sum\limits_{x,y}p(x)q(y)\log\frac{1}{p(x)q(y)}\\
&= \sum\limits_{x,y}p(x)q(y)\log\frac{1}{p(x)}+\sum\limits_{x,y}p(x)q(y)\log\frac{1}{q(y)}\\
&= \sum\limits_{x}p(x)\log\frac{1}{p(x)}+\sum\limits_{y}q(y)\log\frac{1}{q(y)}=H(X)+H(Y) \quad\text{Q.E.D.}
\end{align*}$

*Proof*. Entropy is subadditive. Let $r(x,y)$ be the joint distribution of $(X,Y)$, with marginals $p(x)$ and $q(y)$.
$\begin{align*}
H(X)+&H(Y)-H(X,Y)\\
&= \sum\limits_x p(x)\log \frac{1}{p(x)}+\sum\limits_y q(y)\log \frac{1}{q(y)} - \sum\limits_{x,y}r(x,y)\log\frac{1}{r(x,y)}\\
&= \sum\limits_{x,y}r(x,y)\log\frac{1}{p(x)}+\sum\limits_{x,y}r(x,y)\log\frac{1}{q(y)} -\sum\limits_{x,y}r(x,y)\log\frac{1}{r(x,y)}\\
&= \sum\limits_{x,y}r(x,y)\log\frac{r(x,y)}{p(x)q(y)}=KL\big(r(x,y)\,\Vert\, p(x)q(y)\big) \geq0
\end{align*}$

### 2.2 KL divergence and entropy

The canonical distance function between probability distributions is the ***KL divergence***:
$\begin{align*}
KL(p, q)=\sum_{x \in S} p(x) \log \frac{p(x)}{q(x)}
\end{align*}$
- Can show that $KL(p, q) \geq 0$, with equality iff $p=q$.

*Lemma*. If $p$ is a distribution on $S$ and $u$ is the uniform distribution on $S$, then
$\begin{align*}
H(p)&= \log |S|-KL(p, u)\\\\
\text{Proof: } KL(p,u)&= \sum\limits_{x\in S}p(x)\log\frac{p(x)}{u(x)}=\sum\limits_x p(x)\log\frac{p(x)}{1/|S|}\\
&= \sum\limits_x p(x)\log|S|-\sum\limits_x p(x)\log \frac{1}{p(x)}\\
&= \log |S|-H(p)
\end{align*}$
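A small numeric check of the coin examples and of the lemma above (an illustrative sketch only; it uses base-2 logarithms so that a fair coin has entropy 1 bit, and the distribution `p` in the last check is made up):

```python
# Verify the coin entropies and the lemma H(p) = log|S| - KL(p, u).
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # convention: 0 log(1/0) = 0
    return np.sum(p * np.log2(1.0 / p))

def KL(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

print(H([0.5, 0.5]))                      # fair coin: 1.0
print(H([0.25, 0.75]))                    # bias-3/4 coin: ~0.811
print(H([0.01, 0.99]))                    # bias-0.99 coin: ~0.081
print(H([0.25, 0.25, 0.25, 0.25]))        # two fair coins: 2.0

p = np.array([0.1, 0.2, 0.3, 0.4])        # an arbitrary distribution on |S| = 4 outcomes
u = np.full(4, 0.25)                      # uniform distribution on S
print(H(p), np.log2(4) - KL(p, u))        # lemma: the two values agree
```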
## 3 Information Projection

Think of the *probability simplex* $\Delta=\left\{p \in \mathbb{R}^{|S|}: p_x \geq 0, \sum_x p_x=1\right\}$.
- Find the point $p \in L$ that is closest to $\pi$ in KL divergence, where $L \subset \Delta$ is the affine subspace specified by the observed constraints.
- We say $p$ is the ***I-projection*** (i.e. information projection) of $\pi$ onto the affine subspace $L$.

**Solution by Lagrange multipliers**. The Lagrangian is:
$\begin{align*}
F(p)&= \sum\limits_x p_x\ln\frac{p_x}{\pi_x}-\sum\limits_{i=1}^k\theta_i\left(\sum\limits_xp_x T_i(x)-b_i\right)-\nu\left(\sum\limits_x p_x-1 \right)\\
\frac{\partial F}{\partial p_x}&= 1+\ln p_x-\ln \pi_x-\sum\limits_{i=1}^k\theta_iT_i(x)-\nu=0\\
\rightarrow \ln \frac{p_x}{\pi_x}&=\sum\limits_{i=1}^k\theta_iT_i(x)+\nu-1 \\
\rightarrow p_x&= {\pi_x \exp\left(\sum\limits_{i=1}^k \theta_iT_i(x)\right)}/Z
\end{align*}$

![[Pasted image 20230509134201.png|400]]

**Back to the birds example**
$\begin{align*}
p(x) \propto \pi(x) \exp \left(\theta_1 \cdot \operatorname{avgtemp}(x)+\theta_2 \cdot \operatorname{elevation}(x)+\cdots\right)
\end{align*}$
E.g. $\theta_2=0.81 \Longrightarrow$ each additional unit of elevation multiplies the probability by $e^{0.81} \approx 2.25$.

## 4 Exponential Distribution Family

The following three ingredients fully specify an exponential family:
- *Outcome space* $S \subset \mathbb{R}^r$.
- *Base* (i.e. *prior*) measure $\pi: \mathbb{R}^r \rightarrow \mathbb{R}$.
- *Features* $T_1, \ldots, T_k: S \rightarrow \mathbb{R}$. Write $T(x)=\left(T_1(x), \ldots, T_k(x)\right) \in \mathbb{R}^k$.

The ***exponential family*** generated by $(S, \pi, T)$ consists of the distributions
$\begin{align*}
\boxed{p_\theta(x)=\frac{1}{Z_\theta} e^{\theta \cdot T(x)} \pi(x)}, \quad \theta \in \mathbb{R}^k
\end{align*}$
- *Partition function* $Z_\theta=\sum_{x \in S} e^{\theta \cdot T(x)} \pi(x)$ or $\int_S e^{\theta \cdot T(x)} \pi(x)\, d x$.
- *Log partition function* $G(\theta)=\ln Z_\theta$.
- *Natural parameter space* $\Theta=\left\{\theta \in \mathbb{R}^k: G(\theta)<\infty\right\}$.
- *Conventional form*:
$\begin{align*}
p_\theta(x)=\exp (\theta \cdot T(x)-G(\theta))\, \pi(x), \quad \theta \in \Theta
\end{align*}$

### 4.1 Classical distributions as exponential family

**Bernoulli distribution**: $\text{Bernoulli}(x;q)=\begin{cases} 1-q &\text{if }x=0\\q&\text{if }x=1 \end{cases}$

(Express a coin of bias $q$ in exponential family form)
$\begin{align*}
&S=\set{0,1},\quad T(x)=x,\quad \pi\equiv1\\
\quad \begin{cases} p_\theta(1)=\frac{e^\theta}{Z_\theta} \\p_\theta(0)=\frac{1}{Z_\theta} \end{cases}&\Rightarrow \begin{cases} q=\frac{e^\theta}{1+e ^\theta}\\Z_\theta=1+e^\theta\end{cases},\quad \boxed{p_\theta(x)=\frac{e^{\theta x}}{1+e^\theta}}
\end{align*}$
- Note that instead of parameterizing the distribution by the bias $q\in[0,1]$, we can parameterize the Bernoulli distribution by $\theta\in \mathbb{R}$; $q=1/2$ corresponds to $\theta=0$.
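A quick numerical check of this correspondence (an illustrative sketch; the three bias values are taken from the coin examples above, and the relation $\theta=\log\frac{q}{1-q}$ follows from inverting $q=e^\theta/(1+e^\theta)$):

```python
# Bernoulli as an exponential family: with T(x) = x and pi = 1,
# p_theta(x) = e^{theta x} / (1 + e^theta), so the bias is q = e^theta / (1 + e^theta)
# and the natural parameter theta = log(q / (1 - q)) is the log-odds.
import numpy as np

def p_theta(x, theta):
    return np.exp(theta * x) / (1.0 + np.exp(theta))

for q in [0.5, 0.75, 0.99]:
    theta = np.log(q / (1 - q))                      # natural parameter = log-odds
    print(q, p_theta(1, theta), p_theta(0, theta))   # p_theta(1) = q, p_theta(0) = 1 - q
```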
**Poisson distribution**: $\operatorname{Poisson}(x;\lambda)=e^{-\lambda} \lambda^x / x!$
$\begin{align*}
&\begin{array}{c|c|c|c}S= \mathbb{N} & T(x)=x & \pi(x)=\frac{1}{x!} & \theta=\ln \lambda \end{array}\\\\
&Z_\theta = \underset{\text{Taylor series of } e^z \text{ at } z=e^\theta}{\underbrace{\sum\limits_{x=0}^\infty e^{\theta x}/x!}}=e^{e^\theta}\\
&p_\theta(x)= \boxed{\frac{1}{Z_\theta}e^{\theta x}\cdot \frac{1}{x!}}=\text{Poisson}(\lambda=e^\theta)
\end{align*}$
- With base measure $\pi(x)=1/x!$ on $S=\mathbb{N}$, the maximum-entropy (minimum-KL) choice is the Poisson if all we know is $\mathbb{E}X$, since $T(x)=x$.

**Normal distribution**: $N(\mu, \sigma^2)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-(x-\mu)^2 / 2 \sigma^2}$
$\begin{align*}
\begin{array}{c|c|c|c} S=\mathbb{R} & T(x)=(x,x^2)& \pi(x)=1&\theta=\left(\frac{\mu}{ \sigma^2},-\frac{1}{2\sigma^2}\right)\end{array}
\end{align*}$
- To read off the features $T(x)$, re-write $N(\mu,\sigma^2)$ as:
$\begin{align*}
N(\mu,\sigma^2)&= \frac{1}{\sigma \sqrt{2 \pi}}\exp\left(-\frac{\boxed{x^2}}{2 \sigma ^2}+\frac{\boxed{x}\,\mu}{\sigma^2}-\frac{\mu^2}{2\sigma^2}\right) \\
\Rightarrow T(x)&= (x,x^2),\quad\theta=\left(\frac{\mu}{ \sigma^2},-\frac{1}{2\sigma^2}\right)
\end{align*}$
- The maximum entropy choice is Gaussian if all we know is $\mathbb{E}X$, $\mathbb{E}[X^2]$ and that $S=\mathbb{R}$.

### 4.2 Fitting exponential family distribution

Pick an exponential family $\left\{p_\theta: \theta \in \Theta\right\}$ with
- Outcome space $S \subset \mathbb{R}^r$.
- Base measure $\pi: \mathbb{R}^r \rightarrow \mathbb{R}$.
- Features $T_1, \ldots, T_k: S \rightarrow \mathbb{R}$. Write $T(x)=\left(T_1(x), \ldots, T_k(x)\right) \in \mathbb{R}^k$.

Given data $x_1, \ldots, x_n \in S$, we want to choose a model $p_\theta$.
1. Maximum likelihood coincides with the method of moments. The maximum-likelihood solution is the $\theta$ for which
$\begin{align*}
\mathbb{E}_{X \sim p_\theta}[T(X)]=\frac{1}{n} \sum_{i=1}^n T\left(x_i\right).
\end{align*}$
   Hence $T(x)$ is a *sufficient statistic* for estimating $\theta$ (see the sketch after this list).
2. We can find this $\theta$ by solving a convex program.
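As an illustration of point 1, here is a minimal sketch for the Gaussian family (the data are simulated, and the values $\mu=2$, $\sigma=1.5$ are invented for the example): matching the empirical sufficient statistics $T(x)=(x,x^2)$ recovers the maximum-likelihood $\mu,\sigma^2$, and hence $\theta=(\mu/\sigma^2,\,-1/(2\sigma^2))$.

```python
# "Maximum likelihood = method of moments" for the Gaussian exponential family:
# match the empirical averages of T(x) = (x, x^2) and read off theta.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)   # simulated "data" with mu=2, sigma=1.5

# Empirical sufficient statistics: (1/n) sum_i T(x_i)
m1, m2 = x.mean(), (x ** 2).mean()

# Matching E[X] = m1 and E[X^2] = m2 gives the maximum-likelihood mu and sigma^2
mu_hat = m1
var_hat = m2 - m1 ** 2                    # note: the MLE variance (no n-1 correction)

theta_hat = (mu_hat / var_hat, -1.0 / (2 * var_hat))
print(mu_hat, var_hat)                    # approximately (2.0, 2.25)
print(theta_hat)
```

For a family where the moments have no closed-form inverse, the same $\theta$ can be found numerically, since the negative log-likelihood is convex in $\theta$; the moment-matching gradient ascent sketched in Section 1.1 (with $b$ set to the empirical average of $T$) is one way to do this.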