# Maximum Entropy and Exponential Family
[[lec12_maximum_entropy.pdf]]
## 1 Maximum Entropy Distribution Modeling
Suppose we have a prior distribution $\pi$ on $S$ (e.g. distribution of a broader class of birds).
$\begin{align}
&\min_{p}\; KL(p, \pi) & \quad\text{(want distribution } p \text{ close to }\pi) \\
&\sum_{x \in S} p_x T_i(x)=b_i, \quad 1 \leq i \leq k & \quad\text{(observed constraints)}\\
&p_x \geq 0,\; x \in S ,\quad\sum_{x \in S} p_x=1 & \quad\text{(constrain } p_x \text{ to be a pmf over } S)
\end{align}$
- If $\pi=U(S)$, the uniform distribution on $S$, this reduces exactly to the maximum-entropy formulation (Section 1.1)
- Why not $KL(\pi,p)$? KL divergence is not symmetric: we want the smooth, everywhere-positive distribution in the denominator position $q(x)$, since the denominator cannot be 0.
$\begin{align*}
KL(p,q)=\sum\limits_Xp(x)\log\frac{p(x)}{\underset{\text{cannot be 0}}{\boxed{q(x)}}}
\end{align*}$
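To make the asymmetry concrete, here is a minimal numeric sketch (the two coin distributions are arbitrary choices for illustration):
```python
import numpy as np

def kl(p, q):
    """KL(p, q) = sum_x p(x) log(p(x) / q(x)); terms with p(x) = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.5])   # fair coin
q = np.array([0.9, 0.1])   # biased coin
print(kl(p, q), kl(q, p))  # ~0.511 vs ~0.368: KL is not symmetric
# If q(x) = 0 somewhere p(x) > 0, KL(p, q) is infinite -- hence we keep the
# smooth, everywhere-positive distribution in the denominator.
```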
*Solution*.
$\begin{align*}
\boxed{p(x)=\frac{1}{Z} \exp \left(\sum_{i=1}^k \theta_i T_i(x)\right) \pi(x)=\frac{1}{Z} e^{\theta^\top T(x)} \pi(x)}\\
\end{align*}$
- $T(x)=\left(T_1(x), \ldots, T_k(x)\right)$ and $\theta=\left(\theta_1, \ldots, \theta_k\right)$.
- By varying $(S, T, \pi)$, we get Gaussians, Poissons, binomials, Markov random fields... These are all *exponential families*.
*Derive solution via Lagrange multipliers*
$\begin{align*}
F(p)&= \sum_x p_x \ln \frac{p_x}{\pi_x}-\sum_{i=1}^{k} \theta_i\left(\sum_x p_x T_i(x)-b_i\right)-\nu\left(\sum_x p_x-1\right) \\
\frac{\partial F}{\partial p_x}&= 1+\ln p_x-\ln \pi_x-\sum_{i=1}^k \theta_i T_i(x)-\nu \\
\quad \frac{\partial F}{\partial p_x}&= 0 \Rightarrow \ln \frac{p_x}{\pi_x}=\sum_{i=1}^k \theta_i T_i(x)+\nu-1\\
p_x&= \frac{1}{Z}\,\pi_x\exp\left(\sum_{i=1}^k \theta_i T_i(x) \right)
\end{align*}$
### 1.1 Motivation: model geo-distribution of birds
*Problem set-up*.
- Observe $\boxed{1611}$ data points. Desired resolution: $386\times 286$ grid $= \boxed{|S|=110,396}\gg 1611$ pixels!
- Want a distribution over $S=\{$ locations $\}, |S|=110,396$
- Represent each pixel $x \in S$ by environmental features $T(x)=\left(T_1(x), \ldots, T_k(x)\right)$
- Say $T_1(x)$ is the elevation, $T_k(x)$ the rainfall, etc. (slope, precipitation, ...).
Use data to estimate $\mathbb{E}\,T_i(x)$ for $1 \leq i \leq k$
- $\mathbb{E}T_1=$ Avg. elevation at which this bird is seen.
*The maximum entropy approach*.
1. Suppose we had no sightings at all. The most reasonable model, in the absence of any information, might just be the uniform distribution over $S$.
2. But we have some sightings, yielding estimates $\mathbb{E} T_i(x)=b_i, 1 \leq i \leq k$. Pick a distribution $p$ over $S$ that respects these constraints, i.e., for all $i$: (but is otherwise *as random as possible*)
$\begin{align*}
\sum_{x \in S} p(x) T_i(x)=b_i
\end{align*}$
3. Want a distribution $p$ over $S$ such that the average elevation, rainfall, etc. under $p$ match the empirical observations
*Formalizing as a ==convex optimization== problem*.
- Domain: finite set $S$, e.g. geographical locations
- Features $T: S \rightarrow \mathbb{R}^k$, e.g. environmental features
- Observed *constraints*: $\mathbb{E} T_i(x)=b_i$, for $i=1, \ldots, k$
Find the distribution $p$ on $S$ that has maximum entropy subject to the constraints.
$\begin{align*}
&\max_{p} \sum_{x \in S} p_x \ln \frac{1}{p_x} &\quad\text{(maximize entropy)}\\
&\sum_{x \in S} p_x T_i(x)=b_i, \quad 1 \leq i \leq k & \quad\text{(observed constraints)}\\
&p_x \geq 0, \; x \in S ,\quad \sum_{x \in S} p_x=1 &\quad (\text{constrain } p_x\text{ to be a distribution over }S)
\end{align*}$
*Solution*.
$
p(x) \propto \pi(x) \exp \left(\theta_1 \cdot \operatorname{avgtemp}(x)+\theta_2 \cdot \operatorname{elevation}(x)+\cdots\right)
$
- E.g. $\theta_2=0.81 \Longrightarrow$ each additional unit of elevation multiplies the probability by $e^{0.81} \approx 2.25$
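The convex program above can also be solved numerically. A minimal sketch on a made-up five-point domain with a single feature, using scipy's SLSQP solver as one possible choice (the feature values and target $b$ are invented for illustration):
```python
import numpy as np
from scipy.optimize import minimize

# Toy domain: 5 "locations" with a single feature (say, elevation) and an
# observed constraint E[T] = b.  All numbers here are made up.
T = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])   # |S| x k feature matrix
b = np.array([2.5])                                 # observed feature average

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)          # avoid log(0)
    return float(np.sum(p * np.log(p)))  # minimizing this maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},   # p is a pmf
    {"type": "eq", "fun": lambda p: T.T @ p - b},     # moment constraints
]
p0 = np.full(len(T), 1.0 / len(T))                    # start from uniform
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * len(T),
               constraints=constraints, method="SLSQP")
p = res.x
print(np.round(p, 4), (T.T @ p)[0])   # max-entropy p with E_p[T] ~= 2.5
```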
## 2 Entropy and KL Divergence
The ***entropy*** of distribution $p$ over finite set $S$ is
$
H(p)=\sum_{x \in S} p(x) \log \frac{1}{p(x)}
$
- Characterizes how *random* a distribution is
- *Concave* w.r.t. $p$
*Example*. Calculate entropy for the following coins. $S=\set{0,1}$
$\begin{align*}
\begin{array}{c|c|c}
\text{Fair}&p(0)=p(1)=\frac{1}{2}&H(p)=\log 2=1\\
\text{Bias } 3/4&p(0)=\frac{1}{4},p(1)=\frac{3}{4}&H(p)\approx0.81\\
\text{Bias }0.99&p(0)=0.01,p(1)=0.99&H(p)\approx0.08
\end{array}
\end{align*}$
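These values can be reproduced in a few lines (base-2 logarithms, matching the normalization $H(\text{fair coin})=1$):
```python
import numpy as np

def H(p):
    """Shannon entropy in bits: sum_x p(x) log2(1/p(x)), with 0 log 0 := 0."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(H([0.5, 0.5]))     # 1.0
print(H([0.25, 0.75]))   # ~0.811
print(H([0.01, 0.99]))   # ~0.081
```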
*Example*. Two fair coins
$\begin{align*}
S&= \set{00,01,10,11}, \;p(00)=p(01)=p(10)=p(11)=1/4\\
H(p)&= 4\times (\frac{1}{4}\log4)=2
\end{align*}$
*Lemma*. Entropy $H(p)$ for a uniform distribution over $k$ outcomes is $\log k$
$\begin{align*}
H(p)=\sum\limits_{x}p(x)\log\frac{1}{p(x)}=k\cdot\frac{1}{k}\log k=\log k
\end{align*}$
### 2.1 Justifying properties of entropy
*Entropy* is the **only** measure that satisfies these 6 properties. [Aczel-Forte-Ng (1975)]
1. *Expansibility*. If $X$ has distribution $\left(p_1, \ldots, p_n\right)$ and $Y$ has distribution $\left(p_1, \ldots, p_n, 0\right)$ then $H(X)=H(Y)$
2. *Symmetry*. Distribution $(p, 1-p)$ has the same entropy as $(1-p, p)$.
3. *Additivity*. If $X, Y$ are independent then $H(X, Y)=H(X)+H(Y)$, where $H(X,Y)$ is the joint distribution.
4. *Subadditivity*. $H(X, Y) \leq H(X)+H(Y)$
5. *Normalization*. A fair coin has entropy 1.
6. "*Small for small probability*". The entropy of a coin of bias $p$ goes to 0 as $p \downarrow 0$.
*Proof*. Entropy is additive. Say $X\sim p(x), Y\sim q(y)$, then the joint distribution is $r(x,y)=p(x)q(y)$
$\begin{align*}
H(X,Y)&= \sum\limits_{x,y}r(x,y)\log\frac{1}{r(x,y)}=\sum\limits_{x,y}p(x)q(y)\log\frac{1}{p(x)q(y)}\\
&= \sum\limits_{x,y} p(x)q(y)\log\frac{1}{p(x)}+\sum\limits_{x,y} p(x)q(y)\log\frac{1}{q(y)}\\
&= \sum\limits_{x}p(x)\log\frac{1}{p(x)}+\sum\limits_{y}q(y)\log\frac{1}{q(y)}=H(X)+H(Y) \quad\text{Q.E.D.}
\end{align*}$
*Proof*. Entropy is subadditive. Let $r(x,y)$ be the joint distribution of $(X,Y)$, with marginals $p(x)=\sum_y r(x,y)$ and $q(y)=\sum_x r(x,y)$.
$\begin{align*}
H(X)+&H(Y)-H(X,Y)\\
&= \sum\limits_{x}p(x)\log \frac{1}{p(x)}+\sum\limits_{y}q(y)\log \frac{1}{q(y)} - \sum\limits_{x,y}r(x,y)\log\frac{1}{r(x,y)}\\
&= \sum\limits_{x,y}r(x,y)\log\frac{1}{p(x)}+\sum\limits_{x,y}r(x,y)\log\frac{1}{q(y)} -\sum\limits_{x,y}r(x,y)\log\frac{1}{r(x,y)}\\
&= \sum\limits_{x,y}r(x,y)\log\frac{r(x,y)}{p(x)q(y)}=KL\big(r(x,y)\,\Vert\, p(x)q(y)\big) \geq0
\end{align*}$
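A numeric check of this identity on an arbitrary, made-up joint distribution:
```python
import numpy as np

def H(p):
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# An arbitrary 2x2 joint distribution r(x, y); X and Y are dependent here.
r = np.array([[0.4, 0.1],
              [0.1, 0.4]])
p = r.sum(axis=1)   # marginal of X
q = r.sum(axis=0)   # marginal of Y

gap = H(p) + H(q) - H(r)                              # >= 0 by subadditivity
kl  = float((r * np.log2(r / np.outer(p, q))).sum())  # KL(r || p x q)
print(gap, kl)   # the two quantities agree (~0.278), and both are nonnegative
```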
### 2.2 KL divergence and entropy
The canonical (asymmetric) notion of distance between probability distributions: ***KL divergence***.
$\begin{align*}
KL(p, q)=\sum_{x \in S} p(x) \log \frac{p(x)}{q(x)}
\end{align*}$
- Can show that $KL(p, q) \geq 0$, with equality iff $p=q$.
*Lemma*. If $p$ is a distribution on $S$ and $u$ is a uniform distribution on $S$ then
$\begin{align*}
H(p)&= \log |S|-KL(p, u)\\\\
\text{Proof: } KL(p,u)&= \sum\limits_{x\in S}p(x)\log\frac{p(x)}{u(x)}=\sum\limits_{x}p(x)\log\frac{p(x)}{1/|S|}\\
&= \sum\limits_{x}p(x)\log|S|-\sum\limits_{x}p(x)\log \frac{1}{p(x)}\\
&= \log |S|-H(p)
\end{align*}$
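A numeric sanity check of the lemma (the random distribution is just for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
S = 8
p = rng.random(S); p /= p.sum()   # an arbitrary distribution on |S| = 8 points
u = np.full(S, 1.0 / S)           # uniform distribution on the same set

H  = float(-(p * np.log2(p)).sum())
KL = float((p * np.log2(p / u)).sum())
print(H, np.log2(S) - KL)         # equal: H(p) = log|S| - KL(p, u)
```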
## 3 Information Projection
Picture the *probability simplex* $\Delta=\left\{p \in \mathbb{R}^{|S|}: p_x \geq 0, \sum_x p_x=1\right\}$ (see the figure below).
- Find the point $p \in L$ that is closest to $\pi$ in KL divergence, where $L=\{p \in \Delta: \sum_{x} p_x T_i(x)=b_i,\ 1 \leq i \leq k\}$ is the affine subspace specified by the observed constraints.
- We say $p$ is the ***I-projection*** (i.e. information projection) of $\pi$ onto the affine subspace $L$.
**Solution by Lagrange multipliers**. Form the Lagrangian:
$\begin{align*}
F(p)&= \sum\limits_x p_x\ln\frac{p_x}{\pi_x}-\sum\limits_{i=1}^k\theta_i\left(\sum\limits_xp_x T_i(x)-b_i\right)-\nu\left(\sum\limits_x p_x-1 \right)\\
\frac{\partial F}{\partial p_x}&= 1+\ln p_x-\ln \pi_x-\sum\limits_{i=1}^k\theta_iT_i(x)-\nu=0\\
\rightarrow \ln \frac{p_x}{\pi_x}&=\sum\limits_{i=1}^k\theta_iT_i(x)+\nu-1 \\
\rightarrow p_x&= {\pi_x \exp\left(\sum\limits_{i=1}^k \theta_iT_i(x)\right)}/Z
\end{align*}$
![[Pasted image 20230509134201.png|400]]
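A minimal numeric sketch of computing an I-projection on a toy finite domain, by gradient descent on the dual objective $G(\theta)-\theta\cdot b$ (the solver, the four-point domain, and all numbers are illustrative assumptions, not part of the lecture):
```python
import numpy as np

def i_projection(pi, T, b, lr=0.5, steps=5000):
    """I-projection of the prior pi onto {p : E_p[T] = b}, finite outcome space.

    Minimizes the dual objective G(theta) - theta . b, where
    G(theta) = ln sum_x pi(x) exp(theta . T(x)).  Its gradient is
    E_{p_theta}[T] - b, so at the optimum the constraints are satisfied.
    """
    theta = np.zeros(T.shape[1])
    for _ in range(steps):
        w = pi * np.exp(T @ theta)
        p = w / w.sum()                 # p_theta(x) = pi(x) e^{theta . T(x)} / Z
        theta -= lr * (T.T @ p - b)     # gradient step on the dual
    return theta, p

# Toy example (numbers made up): 4 outcomes, one feature, non-uniform prior.
pi = np.array([0.4, 0.3, 0.2, 0.1])
T  = np.array([[0.0], [1.0], [2.0], [3.0]])
b  = np.array([1.5])

theta, p = i_projection(pi, T, b)
print(theta, np.round(p, 4), (T.T @ p)[0])   # E_p[T] ~= 1.5
```
At the optimum the gradient $\mathbb{E}_{p_\theta}[T]-b$ vanishes, so the fitted $p$ satisfies the observed constraints while keeping the form $p_x \propto \pi_x e^{\theta\cdot T(x)}$.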
**Back to the birds example**
$\begin{align*}
p(x) \propto \pi(x) \exp \left(\theta_1 \cdot \operatorname{avgtemp}(x)+\theta_2 \cdot \text { elevation }(x)+\cdots\right)
\end{align*}$
E.g. $\theta_2=0.81 \Longrightarrow$ each additional unit of elevation multiplies the probability by $e^{0.81} \approx 2.25$
## 4 Exponential Distribution Family
The following three ingredients fully specify an exponential family:
- *Outcome space* $S \subset \mathbb{R}^r$.
- *Base* (i.e. *prior*) measure $\pi: \mathbb{R}^r \rightarrow \mathbb{R}$.
- *Features* $T_1, \ldots, T_k: S \rightarrow \mathbb{R}$. Write $T(x)=\left(T_1(x), \ldots, T_k(x)\right) \in \mathbb{R}^k$.
The ***exponential family*** generated by $(S, \pi, T)$ consists of distributions
$
\boxed{p_\theta(x)=\frac{1}{Z_\theta} e^{\theta \cdot T(x)} \pi(x)}, \quad \theta \in \mathbb{R}^k
$
- *Partition function* $Z_\theta=\sum_{x \in S} e^{\theta \cdot T(x)} \pi(x)$ or $\int_S e^{\theta \cdot T(x)} \pi(x) d x$.
- *Natural parameter space* $\Theta=\left\{\theta \in \mathbb{R}^k: Z_\theta<\infty\right\}$
- *Conventional form*: Write $G(\theta)=\ln Z_\theta$, the *log partition function*. Then
$\begin{align*}
p_\theta(x)=\exp (\theta \cdot T(x)-G(\theta)) \pi(x), \quad \theta \in \Theta
\end{align*}$
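For a finite outcome space, these ingredients translate directly into code. A small sketch (the class and its names are my own, not a standard API):
```python
import numpy as np

class FiniteExpFamily:
    """p_theta(x) = exp(theta . T(x) - G(theta)) pi(x) over a finite outcome
    space, with everything stored as arrays indexed by the outcomes."""

    def __init__(self, T, pi):
        self.T = np.asarray(T, float)    # |S| x k feature matrix
        self.pi = np.asarray(pi, float)  # base measure, length |S|

    def G(self, theta):
        """Log partition function G(theta) = ln Z_theta."""
        return float(np.log(np.sum(self.pi * np.exp(self.T @ theta))))

    def p(self, theta):
        """The distribution p_theta as a vector over the outcome space."""
        return self.pi * np.exp(self.T @ theta - self.G(theta))

# Example: three outcomes with feature T(x) = x and uniform base measure.
fam = FiniteExpFamily(T=[[0.0], [1.0], [2.0]], pi=[1.0, 1.0, 1.0])
print(fam.p(np.array([0.0])))   # theta = 0 gives the normalized base measure
print(fam.p(np.array([1.0])))   # larger theta tilts mass toward larger T(x)
```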
### 4.1 Classical distributions as exponential family
**Bernoulli distribution**: $\text{Bernoulli}(x;p)=\begin{cases} 1-p &\text{if }x=0\\p&\text{if }x=1 \end{cases}$
(Express coin of bias $q$ in exponential family form)
$\begin{align*}
&S=\set{0,1},\quad T(x)=x,\quad \pi\equiv1\\
\quad \begin{cases} p_\theta(1)=\frac{e^\theta}{Z_\theta} \\p_\theta(0)=\frac{1}{Z_\theta}
\end{cases}&\Rightarrow \begin{cases} q=\frac{e^\theta}{1+e ^\theta}\\Z_\theta=1+e^\theta\end{cases},\quad \boxed{p_\theta(x)=\frac{e^{\theta x}}{1+e^\theta}}
\end{align*}$
- Note that instead of parameterizing the distribution by its bias $q\in[0,1]$, we can parameterize the Bernoulli distribution by $\theta\in \mathbb{R}$; $q=1/2$ corresponds to $\theta=0$.
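In code, the map between $\theta$ and the bias $q$ is just the logistic (sigmoid) function; a quick check (the helper name is mine):
```python
import numpy as np

def bernoulli_theta(x, theta):
    """p_theta(x) = e^{theta x} / (1 + e^{theta}) for x in {0, 1}."""
    return np.exp(theta * x) / (1.0 + np.exp(theta))

theta = 1.2                               # arbitrary natural parameter
q = np.exp(theta) / (1 + np.exp(theta))   # bias q = sigmoid(theta)
print(bernoulli_theta(1, theta), q)       # equal
print(bernoulli_theta(0, theta), 1 - q)   # equal
print(bernoulli_theta(1, 0.0))            # theta = 0 gives q = 1/2
```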
**Poisson distribution**: $\operatorname{Poisson}(k;\lambda)=e^{-\lambda} \lambda^k / k !$,
$\begin{align*}
&\begin{array}{c|c|c|c}S= \mathbb{N} & T(x)=x & \pi(x)=\frac{1}{x!} & \theta=\ln \lambda
\end{array}\\\\
&Z_\theta = \underset{\text{Taylor expansion}}{\underbrace{\sum\limits_{x=0}^\infty e^{\theta x}/x! }}=e^{e^\theta}\\
&p_\theta(x)= \boxed{\frac{1}{Z_\theta}e^{\theta x}\cdot \frac{1}{x!}}=\text{Poisson}(\lambda=e^\theta)
\end{align*}$
- With base measure $\pi(x)=1/x!$, the maximum entropy choice is Poisson if all we know is $\mathbb{E}X$ and that $S=\mathbb{N}$, since $T(x)=x$
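A quick check that this family really is Poisson with $\lambda=e^\theta$ (truncating the infinite sum defining $Z_\theta$; the truncation level and the value of $\theta$ are arbitrary):
```python
from math import exp, factorial

theta = 0.7
lam = exp(theta)                 # lambda = e^theta

# Partition function: truncate sum_{x >= 0} e^{theta x} / x!  (converges fast)
Z = sum(exp(theta * x) / factorial(x) for x in range(50))
print(Z, exp(lam))               # Z_theta = e^{e^theta}

# Compare p_theta(x) = e^{theta x} / (x! Z_theta) with Poisson(x; lambda)
for x in range(5):
    p_exp = exp(theta * x) / (factorial(x) * Z)
    p_poisson = exp(-lam) * lam**x / factorial(x)
    print(x, p_exp, p_poisson)   # the two columns agree
```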
**Normal distribution**: $N(x;\mu, \sigma^2)=\frac{1}{\sigma \sqrt{2 \pi}} e^{-(x-\mu)^2 / 2 \sigma^2}$
$\begin{align*}
\begin{array}{c|c|c|c}
S=\mathbb{R} & T(x)=(x,x^2)& \pi(x)=1&\theta=(\frac{\mu}{ \sigma^2},-\frac{1}{2\sigma^2})\end{array}\\
\end{align*}$
- To get the features $T(x)$, need to re-formulate $N(\mu,\sigma^2)$:
$\begin{align*}
N(\mu,\sigma^2)&= \frac{1}{\sigma \sqrt{2 \pi}}\exp\left(-\frac{\boxed{x^2}}{2 \sigma ^2}+\frac{\boxed{x}\mu}{\sigma^2}-\frac{\mu^2}{2\sigma^2}\right) \\
\Rightarrow T(x)&= (x,x^2),\quad\theta=(\frac{\mu}{ \sigma^2},-\frac{1}{2\sigma^2})
\end{align*}$
- The maximum entropy choice is Gaussian if all we know is $\mathbb{E}X$, $\mathbb{E}[X^2]$ and that $S=\mathbb{R}$.
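A quick numeric check of the natural parameterization against the usual density formula (here $G(\theta)=\frac{\mu^2}{2\sigma^2}+\ln(\sigma\sqrt{2\pi})$ is written in terms of $(\mu,\sigma)$ for convenience; the values of $\mu,\sigma$ are arbitrary):
```python
import numpy as np

mu, sigma = 1.5, 0.8
theta = np.array([mu / sigma**2, -1.0 / (2 * sigma**2)])   # natural parameters

# Log partition function for T(x) = (x, x^2), pi = 1:
# G(theta) = mu^2 / (2 sigma^2) + ln(sigma * sqrt(2 pi))
G = mu**2 / (2 * sigma**2) + np.log(sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-2.0, 4.0, 5)
p_expfam = np.exp(theta[0] * xs + theta[1] * xs**2 - G)     # exponential-family form
p_density = np.exp(-(xs - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
print(np.allclose(p_expfam, p_density))   # True
```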
### 4.2 Fitting exponential family distribution
Pick an exponential family $\left\{p_\theta: \theta \in \Theta\right\}$ with
- Outcome space $S \subset \mathbb{R}^r$.
- Base measure $\pi: \mathbb{R}^r \rightarrow \mathbb{R}$.
- Features $T_1, \ldots, T_k: S \rightarrow \mathbb{R}$. Write $T(x)=\left(T_1(x), \ldots, T_k(x)\right) \in \mathbb{R}^k$.
Given data $x_1, \ldots, x_n \in S$, want to choose a model $p_\theta$.
1. Maximum-likelihood coincides with the method of moments. The maximum-likelihood solution is the $\theta$ for which
$
\mathbb{E}_{X \sim p_\theta}[T(X)]=\frac{1}{n} \sum_{i=1}^n T\left(x_i\right) .
$
Hence $T(x)$ is a *sufficient statistic* for estimating $\theta$.
2. We can find this $\theta$ by solving a convex program; see the sketch below.
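A sketch illustrating both points on a finite outcome space: gradient ascent on the average log-likelihood, whose gradient is the empirical moments of $T$ minus $\mathbb{E}_{p_\theta}[T(X)]$, so the fixed point is exactly the moment-matching condition of point 1. The helper and the toy coin data are my own illustration:
```python
import numpy as np

def fit_exp_family(data, outcomes, T, pi, lr=0.5, steps=5000):
    """Maximum-likelihood theta for p_theta(x) = pi(x) e^{theta . T(x)} / Z_theta
    on a finite outcome space.  The gradient of the average log-likelihood is
    (1/n) sum_i T(x_i) - E_{p_theta}[T], so gradient ascent converges to the
    theta that matches the empirical moments."""
    T = np.asarray(T, float)                   # |S| x k feature matrix
    idx = np.searchsorted(outcomes, data)      # map samples to outcome indices
    emp = T[idx].mean(axis=0)                  # empirical moments (1/n) sum_i T(x_i)
    theta = np.zeros(T.shape[1])
    for _ in range(steps):
        w = pi * np.exp(T @ theta)
        p = w / w.sum()                        # current model p_theta
        theta += lr * (emp - T.T @ p)          # gradient ascent step
    return theta, p

# Toy example: coin flips, T(x) = x, uniform base measure (all assumptions).
outcomes = np.array([0, 1])
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])      # 6 heads out of 8
theta, p = fit_exp_family(data, outcomes, T=[[0.0], [1.0]], pi=np.array([1.0, 1.0]))
print(theta, p[1])   # theta ~= ln 3, and p_theta(1) matches the empirical 0.75
```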