# Bayesian Inference: Exponential Family and Conjugate Prior

[[lec13_bayesian_inference.pdf]]

$\begin{align*} &\boxed{\text{Posterior }\propto \text{ Prior }\times\text{ Likelihood}}\\\\ &\boxed{\begin{array}{ccccc} \text{Uniform prior }&+&\text{binomial likelihood}&\Rightarrow &\text{Beta posterior}\\ \text{Beta prior } &+& \text{binomial likelihood} &\Rightarrow& \text{Beta posterior}\\ \text{Gamma prior } &+& \text{Poisson likelihood} &\Rightarrow& \text{Gamma posterior}\\ \text{Dirichlet prior } &+&\text{multinomial likelihood} &\Rightarrow& \text{Dirichlet posterior} \end{array}} \end{align*}$

## 1 Bayesian vs. Frequentist Inference

We observe data $x_1, \ldots, x_n \sim p_\theta$ and would like to estimate $\theta$.

- ***Frequentist***: treat $\theta$ as an *unknown, non-random quantity*. For instance, pick the maximum-likelihood value $ \underset{\theta}{\arg \max } \operatorname{Pr}\left(x_1, \ldots, x_n \mid \theta\right)=\underset{\theta}{\arg \max } \prod_{i=1}^n p_\theta\left(x_i\right) . $
- ***Bayesian***: treat $\theta$ as a *random variable* with a *prior* distribution $q_0$. Given the data, the posterior distribution of $\theta$ is
  $\begin{align*} q_n(\theta)=&\operatorname{Pr}\left(\theta \mid x_1, \ldots, x_n\right) \propto q_0(\theta) \operatorname{Pr}\left(x_1, \ldots, x_n \mid \theta\right)\\ &\boxed{\text{Posterior }\propto \text{ Prior }\times\text{ Likelihood}} \end{align*}$

### 1.1 Estimate binomial parameter via Bayesian inference

(From Gelman.) What fraction of human births are female?

- Laplace looked at children born in Paris, 1745-1770.
  $\begin{align*} \begin{array}{c|c|c} \# \text { girls }=241{,}945& \# \text { boys }=251{,}527& \text { total }=493{,}472 \end{array} \end{align*}$
- Female fraction $=0.490$.
- Let $\theta$ be the probability that a child is female, and let $n$ be the number of children observed.
- Suppose we see $F$ females and $M$ males. Then $F \sim \operatorname{binomial}(n, \theta)$.

*Bayesian inference*: put a uniform prior on $\theta$. What is the probability that $\theta \ge 0.5$?

- Choose a *uniform prior*, $q_0(\theta)=1$ for all $\theta \in[0,1]$.
- What is the posterior $q_n$ after seeing $F=f, M=m$?
  $\begin{align*} q_n(\theta)&\propto q_0(\theta)\,\text{Pr}\big(f,m\mid\theta\big)=1\cdot \binom{n}{f}\theta^f(1-\theta)^m\\ &\propto \theta^f(1-\theta)^m \quad\text{(the Beta}(f+1,m+1)\text{ distribution)}\\ \text{Pr}_{q_n}(\theta\ge0.5) &= \int_{0.5}^1q_n(\theta)\,d \theta\approx 1.15\times 10^{-42}\\ &\boxed{\text{Uniform prior }+\text{binomial likelihood}\Rightarrow \text{Beta posterior}} \end{align*}$

## 2 Rejection sampling

Sampling is a key tool in Bayesian inference: samples from the posterior let us estimate posterior quantities such as $\operatorname{Pr}(\theta \ge 0.5)$.

*Goal*: Sample from a target distribution $f$ that we can evaluate but cannot sample from directly, using a proposal distribution $g$ with $f(x) \le M g(x)$ for all $x$: draw $X \sim g$ and accept it with probability $f(X)/(M g(X))$, otherwise retry.

![[Pasted image 20230516124719.png|600]]

- What is the probability this procedure outputs a given $x$ on a particular trial?
  $\begin{align*} \text{Pr}(\text{outputs } x \text{ on a particular trial} ) = \underset{\text{Pr}(\text{generates } x)}{\underbrace{ g(x) }}\cdot \underset{\text{Pr}(\text{accepts } x)}{\underbrace{ \frac{f(x)}{Mg(x)} }}=\frac{f(x)}{M} \;\propto\; f(x) \quad\text{(the correct distribution)} \end{align*}$
- What is the expected number of trials before a sample is generated?
  $\begin{align*} \text{Pr}(\text{outputs something on a given trial} ) =\sum\limits_{x}\frac{f(x)}{M}&= \frac{1}{M}\\ \therefore \mathbb{E}[\text{\# trials to produce 1 sample}]&= M \end{align*}$
- In the Bayesian setting, $g$ can be the prior and $f$ the posterior.

*Cons*: 1) Slow when $M$ is large. 2) Requires evaluating $f(x)$, which can be infeasible. (A code sketch of the procedure follows.)
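Below is a minimal Python sketch of this procedure (not from the lecture): the helper name `rejection_sample`, the toy data (7 heads, 3 tails), and the uniform proposal are all illustrative choices. It targets the Beta(8, 4) posterior, for which the density at the mode can serve as the bound $M$ when the proposal is Uniform$[0,1]$.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

def rejection_sample(f, g_sample, g_pdf, M, n_samples):
    """Draw n_samples from density f using proposal g, assuming f(x) <= M * g(x)."""
    samples = []
    while len(samples) < n_samples:
        x = g_sample()                                 # propose x ~ g
        if rng.uniform() < f(x) / (M * g_pdf(x)):      # accept w.p. f(x) / (M g(x))
            samples.append(x)
    return np.array(samples)

# Toy posterior: uniform prior + 7 heads, 3 tails  =>  Beta(8, 4).
# With a Uniform[0,1] proposal (g = 1), the density at the mode (8-1)/(8+4-2) bounds f.
post = beta(8, 4)
M = post.pdf(7 / 10)
theta = rejection_sample(post.pdf, rng.uniform, lambda x: 1.0, M, n_samples=5000)
print("sample mean:", theta.mean(), " exact posterior mean:", 8 / 12)
print("Pr(theta >= 0.5):", (theta >= 0.5).mean(), " exact:", post.sf(0.5))
```

Since $M \approx 2.9$ here, roughly three proposals are needed per accepted sample, matching the $\mathbb{E}[\#\text{ trials}] = M$ analysis above.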
## 3 Beta Distribution and Conjugate Prior

### 3.1 Beta distribution

The $\operatorname{beta}(\alpha, \beta)$ distribution on $[0,1]$, for $\alpha, \beta>0$, has density
$\begin{align*} p(x)=\boxed{\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}} \end{align*}$
where $\Gamma(\cdot)$ is the gamma function. Recall $\Gamma(z)=\int_0^{\infty} t^{z-1} e^{-t} d t$.

- Useful identity: $\Gamma(z+1)=z \Gamma(z)$. (The gamma function extends the factorial: $\Gamma(n)=(n-1)!$ for integer $n$.)
- Basic properties of the $\operatorname{beta}(\alpha, \beta)$ distribution:
  $\begin{align*} \begin{array}{c|c|c} \text { Mean } =\frac{\alpha}{\alpha+\beta} & \text { Mode } =\frac{\alpha-1}{\alpha+\beta-2} & \text { Variance } =\frac{\alpha \beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \end{array} \end{align*}$

![[Pasted image 20230511132435.png|500]]

*Sparsity-inducing prior*: with $\alpha, \beta < 1$, most of the mass sits on $\theta$ very close to $0$ or $1$.

![[Pasted image 20230511133004.png|200]]

*Laplace's law of succession*: What is the probability that the sun will rise tomorrow, given that it has risen every day for the past 5000 years?

- Let $\theta$ be the probability that the sun rises on any given day.
- Place a uniform prior on $\theta$.
- We have $n \approx 1{,}826{,}213$ observations (5000 years of daily sunrises).
- Recall: ==uniform prior + binomial likelihood $\Longrightarrow$ beta posterior==.
- Posterior $q_n$ is $\operatorname{beta}(n+1,1)$.
  $ \operatorname{Pr}(\text { sun rises tomorrow })=\int_0^1 \theta\, q_n(\theta)\, d \theta=\mathbb{E}_{q_n} \theta=\frac{n+1}{n+2} . $

### 3.2 Conjugate Prior

***Conjugate prior***: the posterior is from the same family as the prior.
$\begin{align*} \boxed{\text{Beta prior } + \text{binomial likelihood} \Rightarrow \text{Beta posterior}} \end{align*}$

***Beta as conjugate prior for the binomial***. Suppose we observe $s$ successes and $f$ failures:

- Prior $\text{beta}(\alpha, \beta)$
- *Posterior is* $\text{beta}(\alpha+s, \beta+f)$

## 4 Conjugate prior for exponential family

Take any exponential family with features $T(x)=\left(T_1(x), \ldots, T_k(x)\right)$ and $\theta \in \Theta$:
$ p_\theta(x)=e^{\theta \cdot T(x)-G(\theta)} \pi(x) $
Its *conjugate prior* over $\Theta$ is given by:

- Define features $U(\theta)=\left(\theta_1, \ldots, \theta_k,-G(\theta)\right) \in \mathbb{R}^{k+1}$
- This gives a $(k+1)$-parameter family indexed by $\eta=\left(\eta_1, \ldots, \eta_k\right)$ and $\lambda$:
  $ p_{\eta, \lambda}(\theta)=e^{(\eta, \lambda) \cdot U(\theta)-F(\eta, \lambda)} \nu(\theta) $
  with base measure $\nu$ on $\Theta$.

### 4.1 Conjugate prior for binomial likelihood (as Beta distribution)

$\begin{align*} \boxed{\text{Beta prior }+\text{binomial likelihood}\Rightarrow \text{Beta posterior}} \end{align*}$

Goal: estimate a binomial parameter with an arbitrary prior. Assume a coin with unknown bias $\theta \in[0,1]$. We see $n$ observations: $h$ heads and $t$ tails. The posterior distribution is:
$\begin{align*} q_n(\theta)\propto q_0(\theta)\cdot \theta^h\cdot (1-\theta)^t \end{align*}$
How do we answer questions like "What is $\text{Pr}(\theta<0.5)$?"

- Fine gridding of $[0,1]$: approximate $q_0, q_n$ by discrete distributions over the grid points (see the sketch below).
- Sample from $q_n$. Easy if $q_n$ is beta; otherwise, use methods like *rejection sampling*.

![[7_bayesian 2023-05-16 12.43.00.excalidraw.svg|200]]
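Here is a minimal sketch of the gridding approach just mentioned; the prior `q0`, the grid resolution, and the counts are made-up illustrations, and any nonnegative prior function on $[0,1]$ would do.

```python
import numpy as np

# Hypothetical data and prior: h heads, t tails, and an arbitrary unnormalized prior q0.
h, t = 7, 3
q0 = lambda theta: np.exp(-10.0 * (theta - 0.5) ** 2)   # any nonnegative function works

# Fine grid over [0, 1]; approximate the posterior by a discrete distribution on it.
theta = np.linspace(0.0, 1.0, 10_001)
unnorm = q0(theta) * theta**h * (1.0 - theta)**t         # prior * likelihood at grid points
posterior = unnorm / unnorm.sum()                        # normalize over the grid

print("Pr(theta < 0.5) ~", posterior[theta < 0.5].sum())
print("posterior mean  ~", (theta * posterior).sum())
```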
Now take a $\text{beta}(\alpha, \beta)$ prior. We see $n$ observations, of which $s$ are successes. What is the posterior $q_n$?
$\begin{align*} q_0(\theta) &\propto \theta^{\alpha-1}(1-\theta)^{\beta-1} &\quad\text{(prior)}\\\\ q_n(\theta)&\propto q_0(\theta)\,\text{Pr}(s \text{ successes, } n-s \text{ failures}\mid\theta) &\quad\text{(prior $\times$ likelihood)}\\ &\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta^s(1-\theta)^{n-s} \end{align*}$

- This is a $\text{Beta}(\alpha+s, \beta+n-s)$ distribution.

*Equivalent sample size*: the prior $\text{beta}(\alpha, \beta)$ is the same as

- a uniform prior, plus
- seeing $\alpha-1$ successes and $\beta-1$ failures.

The prior is eventually overwhelmed by observations, and using a conjugate prior is a huge convenience.

### 4.2 Conjugate prior for Poisson likelihood (as Gamma distribution)

$\begin{align*} \boxed{\text{Gamma prior }+\text{Poisson likelihood}\Rightarrow \text{Gamma posterior}} \end{align*}$

Recall the $\text{Poisson}(\theta)$ distribution over $\mathbb{N}$: $p_\theta(x)=e^{-\theta} \theta^x / x!$.

Claim: the *conjugate prior* is the $\text{Gamma}(\alpha, \beta)$ distribution over $\mathbb{R}^{+}$:
$\begin{align*} q_0(\theta)=\operatorname{Pr}(\theta)=\frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta \theta}\\\\ \begin{array}{c|c|c} \text{Mean: } \alpha/\beta& \text{Mode: } (\alpha-1)/\beta&\text{Var: } \alpha/\beta^2 \end{array} \end{align*}$

*What is the posterior* after seeing $x_1, \ldots, x_n$?

- *Likelihood*: Poisson, $p_\theta(x)=e^{-\theta} \theta^x / x!\propto e^{-\theta}\theta^{x}$
- *Prior*: $q_0(\theta)=\text{Gamma}(\alpha, \beta)=\frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta \theta}\propto \theta^{\alpha-1}e^{-\beta \theta}$

$\begin{align*} \underset{\text{Posterior}}{\underbrace{q_n(\theta)}} &\propto \underset{\text{Prior }}{\underbrace{ q_0(\theta)}}\cdot \underset{\text{Likelihood}}{\underbrace{ \text{Pr}( x_1,\cdots, x_n\mid\theta) }}\\ &\propto \theta^{\alpha-1}e^{-\beta \theta}\,e^{-n \theta} \theta^{x_1}\cdots \theta^{x_n}\\ &= \theta^{\alpha+x_1+\cdots+x_n-1}e^{-(\beta+n)\theta}\\ &\propto \text{Gamma} (\alpha+x_1+\cdots+x_n, \; \beta+n) \end{align*}$

- *Equivalent sample size*: the prior whose effect is similar to $n_0$ observations with mean $\mu$ is $\text{Gamma}(n_0 \mu, n_0)$.

### 4.3 Conjugate prior for Multinomial likelihood (as Dirichlet distribution)

$\begin{align*} \boxed{\text{Dirichlet prior }+\text{Multinomial likelihood}\Rightarrow \text{Dirichlet posterior}} \end{align*}$

Goal: Bayesian inference for a distribution over the probability simplex, $\theta \in \Delta_k$. (Draw $n$ samples and record the counts $\left(x_1, \ldots, x_k\right)$.)

- *Prior*: $\text{Dirichlet}\left(\alpha_1, \ldots, \alpha_k\right)$
- *Likelihood*: $\text{Multinomial}(x_1,\cdots,x_k\mid\theta)$
- *Posterior*: $\text{Dirichlet}\left(\alpha_1+x_1, \alpha_2+x_2, \ldots, \alpha_k+x_k\right)$

*Proof*. The posterior is derived by:
$\begin{aligned} \operatorname{Pr}\left(\theta \mid x_1, \ldots, x_k\right) & \propto \operatorname{Pr}(\theta) \operatorname{Pr}\left(x_1, \ldots, x_k \mid \theta\right) \\ & \propto \theta_1^{\alpha_1-1} \theta_2^{\alpha_2-1} \cdots \theta_k^{\alpha_k-1} \;\cdot\;\theta_1^{x_1} \theta_2^{x_2} \cdots \theta_k^{x_k} \\ & =\theta_1^{\alpha_1+x_1-1} \theta_2^{\alpha_2+x_2-1} \cdots \theta_k^{\alpha_k+x_k-1}\\ &\equiv \text{Dir}\left(\alpha_1+x_1, \alpha_2+x_2, \ldots, \alpha_k+x_k\right) \end{aligned}$
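The three boxed conjugate updates can be exercised numerically. The sketch below uses made-up hyperparameters and counts; it simply forms the posterior parameters and reports posterior means, using the `scipy.stats` parameterizations noted in the comments.

```python
import numpy as np
from scipy.stats import beta, gamma

# --- Beta prior + binomial likelihood: s successes out of n trials ---
a, b, s, n = 2.0, 2.0, 13, 20
beta_post = beta(a + s, b + n - s)                  # Beta(alpha + s, beta + n - s)
print("Beta posterior mean:", beta_post.mean())     # (a + s) / (a + b + n)

# --- Gamma prior + Poisson likelihood: counts x_1..x_n ---
alpha, rate, x = 3.0, 1.0, np.array([2, 5, 1, 4, 3])
# Posterior is Gamma(alpha + sum x_i, rate + n); scipy uses shape and scale = 1/rate.
gamma_post = gamma(alpha + x.sum(), scale=1.0 / (rate + len(x)))
print("Gamma posterior mean:", gamma_post.mean())   # (alpha + sum x) / (rate + n)

# --- Dirichlet prior + multinomial likelihood: counts over k outcomes ---
alphas, counts = np.array([1.0, 1.0, 1.0]), np.array([2, 0, 1])
post_alphas = alphas + counts                       # Dirichlet(alpha_i + x_i)
print("Dirichlet posterior mean:", post_alphas / post_alphas.sum())
```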
## 5 Dirichlet Distribution

### 5.1 Dirichlet distribution

The ***Dirichlet distribution*** with parameter $\alpha \in \mathbb{R}_{+}^k$ is a distribution *over the probability simplex* $\Delta_k=\left\{\left(\theta_1, \ldots, \theta_k\right): \theta_i \geq 0, \sum_i \theta_i=1\right\}$ (the probability simplex over $k$ outcomes):
$ \text{Dirichlet}(\alpha)=\operatorname{Pr}\left(\theta_1, \ldots, \theta_k\right)=\frac{\Gamma\left(\alpha_1+\cdots+\alpha_k\right)}{\Gamma\left(\alpha_1\right) \cdots \Gamma\left(\alpha_k\right)} \theta_1^{\alpha_1-1} \theta_2^{\alpha_2-1} \cdots \theta_k^{\alpha_k-1} $

*Properties*.
$\begin{aligned} \begin{array}{c|c|c} \mathbb{E} \theta_i =\frac{\alpha_i}{\alpha_1+\cdots+\alpha_k} \quad &\text { Mode: } \theta_i =\frac{\alpha_i-1}{\alpha_1+\cdots+\alpha_k-k}\quad &\operatorname{var}\left(\theta_i\right) =\frac{\alpha_i\left(\alpha_1+\cdots+\alpha_k-\alpha_i\right)}{\left(\alpha_1+\cdots+\alpha_k\right)^2\left(\alpha_1+\cdots+\alpha_k+1\right)} \end{array} \end{aligned}$

- The distribution concentrates around its mean $\mu$ as the $\alpha_i$ grow.
- The distribution is uniform if $\alpha_1=\cdots=\alpha_k=1$.
- The Dirichlet is a distribution over the *probability simplex*.
- The Dirichlet is a *generalization of the Beta distribution*: $\text{Dir}(\alpha_1, \alpha_2)\equiv \text{Beta}(\alpha_1,\alpha_2)$.

*Example: probability simplex over 3 outcomes*.

![[7_bayesian 2023-05-16 13.14.57.excalidraw.svg|600]]

- $\mu=(1/3,1/3,1/3)$.
	- Dir(1,1,1) is the uniform distribution over the probability simplex.
	- Dir(1,1,1) -> Dir(2,2,2) -> Dir(4,4,4) -> Dir(8,8,8): the distribution concentrates around the mean.
	- Dir(1,1,1): any $\theta$ in the simplex is equally likely.
	- Dir(8,8,8): $\theta$ is likely to be close to $(1/3,1/3,1/3)$.
- $\mu=(1/4,1/4,1/2)$.
	- Dir(1,1,2) -> Dir(2,2,4) -> Dir(3,3,6) -> Dir(6,6,12)

### 5.2 Motivation: Model sequences of amino acids

Consider a collection of proteins from the same family (similar sequence, structure, and function). Each is a sequence of amino acids over a 20-letter alphabet $S=\left\{a_1, \ldots, a_{20}\right\}$. We align the sequences and want to model the distribution over $S$ at each position.
$\begin{align*} \begin{array}{llllll}\cdots & a_1 & a_6 & a_7 & a_{20} & \cdots \\ \cdots & a_3 & a_8 & a_7 & a_8 & \cdots \\ \cdots & a_1 & a_9 & a_7 & a_{20} & \cdots\end{array} \end{align*}$

*Goal*: Infer the distribution $\theta \in \Delta_{20}$ for the first position.

- Vector of counts at this position: $\left(x_1, \ldots, x_{20}\right)=(2,0,1,0, \ldots, 0)$
- Note that $\Delta_{20}$ is a probability simplex. Here $\left(x_1, \ldots, x_{20}\right) \sim \text{multinomial}(n=3, \theta)$.

The maximum-likelihood estimate $\theta_{ML}$ is given by:
$\begin{align*} \theta_{ML}=\left(\frac{2}{3},0,\frac{1}{3},0,0,\cdots,0 \right) \end{align*}$

By analyzing databases of proteins, we find (hypothetically) three clusters of positions:

- $25 \%$ of positions are *highly conserved*: concentrated on a single amino acid.
  $\begin{align*} &\begin{array}{llllllllll} a_1 & a_2 & a_3 & a_4 & a_5 & a_6 & \boxed{a_7} & a_8 & a_9 & a_{10} \\ a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} & a_{17} & a_{18} & a_{19} & a_{20} \end{array}\\\\ &\alpha^{(1)}\equiv\text{a Dir}(\alpha_1,\cdots, \alpha_{20})\text{ with }\alpha_i<1 \text{ (sparsity-inducing parameters)} \end{align*}$
- $12 \%$ of positions combine a particular set of amino acids with similar properties.
  $\begin{align*} &\begin{array}{llllllllll} a_1 & a_2 & a_3 & a_4 & a_5 & \boxed{a_6} & a_7 & a_8 & a_9 & a_{10} \\ \boxed{a_{11}} & \boxed{a_{12}} & \boxed{a_{13}} & a_{14} & a_{15} & a_{16} & a_{17} & a_{18} & a_{19} & a_{20} \end{array}\\\\ \alpha^{(2)}&\equiv\text{a Dir with } \alpha_i \text{ higher at } i\in\{6,11,12,13\} \end{align*}$
- $8 \%$ of positions combine a different set of amino acids.
  $\begin{align*} &\begin{array}{llllllllll} a_1 & \boxed{a_2} & \boxed{a_3} & \boxed{a_4} & a_5 & a_6 & a_7 & a_8 & a_9 & a_{10} \\ a_{11} & a_{12} & a_{13} & a_{14} & a_{15} &\boxed{ a_{16}} & \boxed{a_{17}} & a_{18} & a_{19} & a_{20} \end{array}\\\\ \alpha^{(3)}&\equiv\text{a Dir with } \alpha_i \text{ higher at } i\in\{2,3,4,16,17\} \end{align*}$

To *combine clusters*, simply use a *weighted sum* (a mixture) of Dirichlet priors:
$\begin{align*} \boxed{0.25\,\text{Dir}(\alpha^{(1)})+0.12\,\text{Dir}(\alpha^{(2)})+0.08\,\text{Dir}(\alpha^{(3)})+\cdots} \end{align*}$

## 6 Mixture of Conjugate Priors

**A mixture of conjugate priors is conjugate.**

Unknown parameter $\theta$; the *prior* is a mixture:
$ \sum_{j=1}^k \underset{w_j}{\underbrace{ \operatorname{Pr}(J=j) }}\operatorname{Pr}(\theta \mid J=j) . $
After seeing data, the *posterior* is again a mixture:
$ \sum_{j=1}^k \underset{\text{new } w_j}{\underbrace{ \operatorname{Pr}(J=j \mid \text { data }) }} \underset{\text{posterior for }j\text{-th prior}}{\underbrace{ \operatorname{Pr}(\theta \mid J=j, \text { data }) }} $

*Example*. Beta-binomial. Coin of unknown bias $\theta \in[0,1]$.

- *Prior*: $w_1 \operatorname{beta}\left(\alpha_1, \beta_1\right)+w_2 \operatorname{beta}\left(\alpha_2, \beta_2\right)$.
- We see $n$ coin tosses, of which $h$ are heads and $t$ are tails.
- The *posterior* is given by:
  $\begin{align*} \text{Pr}(\theta\mid\text{data} )&\propto \text{Pr}(\theta )\,\text{Pr}(\text{data}\mid\theta )\\ &\propto \left(w_1\frac{\Gamma(\alpha_1+\beta_1)}{\Gamma(\alpha_1)\Gamma(\beta_1)}\theta^{\alpha_1-1}(1-\theta)^{\beta_1-1}+w_2 \frac{\Gamma\left(\alpha_2+\beta_2\right)}{\Gamma\left(\alpha_2\right) \Gamma\left(\beta_2\right)} \theta^{\alpha_2-1}(1-\theta)^{\beta_2-1} \right)\theta^h(1-\theta)^t\\ &= (\text{some weight})\times \theta^{\alpha_1+h-1}(1-\theta)^{\beta_1+t-1}+ (\text{some weight})\times \theta^{\alpha_2+h-1}(1-\theta)^{\beta_2+t-1}\\ &= \text{some \textbf{mixture} of }\text{Beta}(\alpha_1+h,\beta_1+t)\text{ and }\text{Beta}(\alpha_2+h,\beta_2+t) \end{align*}$

Here each "(some weight)" is the original $w_j$ times the marginal likelihood of the data under the $j$-th prior component; normalizing these products gives the new mixture weights (see the sketch below).
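A minimal sketch of this update, with hypothetical weights, hyperparameters, and data: each component is updated conjugately, and the new weight of component $j$ is proportional to $w_j$ times that component's marginal likelihood of the data, $B(\alpha_j+h,\beta_j+t)/B(\alpha_j,\beta_j)$ up to a common binomial coefficient.

```python
import numpy as np
from scipy.special import betaln

def beta_mixture_posterior(weights, alphas, betas, h, t):
    """Posterior of a mixture-of-Beta prior after observing h heads and t tails.

    Returns the new mixture weights and the updated (alpha, beta) per component.
    """
    weights = np.asarray(weights, dtype=float)
    alphas, betas = np.asarray(alphas, dtype=float), np.asarray(betas, dtype=float)
    # Marginal likelihood of the data under component j (up to the common binomial
    # coefficient, which cancels when we normalize): B(a+h, b+t) / B(a, b).
    log_marg = betaln(alphas + h, betas + t) - betaln(alphas, betas)
    log_w = np.log(weights) + log_marg
    new_weights = np.exp(log_w - log_w.max())
    new_weights /= new_weights.sum()
    return new_weights, alphas + h, betas + t

# Hypothetical prior: 0.5 * Beta(10, 10) + 0.5 * Beta(1, 5); data: 8 heads, 2 tails.
w, a, b = beta_mixture_posterior([0.5, 0.5], [10, 1], [10, 5], h=8, t=2)
print("posterior weights:", w)   # weight shifts toward the component more consistent with 8/10 heads
print("posterior components: Beta(%s, %s) and Beta(%s, %s)" % (a[0], b[0], a[1], b[1]))
```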
## 7 Resources

- Bradley Efron. *A 250-year argument: Belief, behavior, and the bootstrap.*
- Andrew Gelman, John Carlin, Hal Stern, Donald Rubin. *Bayesian Data Analysis.*
- Kevin Murphy. *Machine Learning: A Probabilistic Perspective.*