- Fine gridding of $[0,1]$: approximate $q_0, q_n$ by a discrete distribution over the grid points.
- Sample from $q_n$. Easy if $q_n$ is a beta distribution; otherwise, use methods like *rejection sampling*.

![[7_bayesian 2023-05-16 12.43.00.excalidraw.svg|200]]

See $n$ observations, of which $s$ are successes.

- What is the posterior $q_n$?

$\begin{align*} q_0(\theta) &\propto \theta^{\alpha-1}(1-\theta)^{\beta-1} &\quad\text{(prior)}\\ q_n(\theta)&\propto q_0(\theta)\,\text{Pr}(s\text{ heads},\ n-s\text{ tails}\mid\theta) &\quad\text{(prior }\times\text{ likelihood)}\\ &\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta^s(1-\theta)^{n-s} \end{align*}$

- This is a $\text{Beta}(\alpha+s,\ \beta+n-s)$ distribution.

*Equivalent sample size*: a $\text{Beta}(\alpha, \beta)$ prior is equivalent to a uniform prior combined with seeing $\alpha-1$ successes and $\beta-1$ failures.

- The prior is eventually overwhelmed by the observations.
- Using a conjugate prior is a huge convenience.
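A minimal numerical sketch of the two computational routes above, the grid approximation and the exact conjugate update, for this beta-binomial setting (the prior parameters and toss counts below are made up for illustration):

```python
import numpy as np
from scipy.stats import beta

# Hypothetical example: Beta(2, 2) prior; n = 10 tosses with s = 7 successes.
a0, b0 = 2.0, 2.0
n, s = 10, 7

# Fine gridding of [0, 1]: evaluate prior * likelihood at grid points, then normalize.
grid = np.linspace(0.001, 0.999, 1000)
unnorm = grid**(a0 - 1) * (1 - grid)**(b0 - 1) * grid**s * (1 - grid)**(n - s)
q_n_grid = unnorm / unnorm.sum()          # discrete approximation of q_n

# Exact conjugate posterior: Beta(alpha + s, beta + n - s).
q_n = beta(a0 + s, b0 + n - s)

print("grid posterior mean :", np.sum(q_n_grid * grid))   # ~ 0.643
print("exact posterior mean:", q_n.mean())                # (a0 + s) / (a0 + b0 + n) ~ 0.643
```

Since the posterior is itself a beta distribution, sampling is just `q_n.rvs()`; with a non-conjugate prior one would instead sample grid points with probabilities `q_n_grid`, or use rejection sampling.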
### 4.2 Conjugate prior for Poisson likelihood (as Gamma distribution)

$\begin{align*} \boxed{\text{Gamma prior}+\text{Poisson likelihood}\Rightarrow \text{Gamma posterior}} \end{align*}$

Recall the $\text{Poisson}(\theta)$ distribution over $\mathbb{N}$: $p_\theta(x)=e^{-\theta} \theta^x / x!$.

Claim: the *conjugate prior* is the $\text{Gamma}(\alpha, \beta)$ distribution over $\mathbb{R}^{+}$:

$\begin{align*} \operatorname{Pr}(\theta)=\frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta \theta}=q_0(\theta)\\ \begin{array}{c|c|c} \text{Mean: } \alpha/\beta& \text{Mode: } (\alpha-1)/\beta&\text{Var: } \alpha/\beta^2 \end{array} \end{align*}$

*What is the posterior* after seeing $x_1, \ldots, x_n$?

- *Likelihood*: Poisson, $p_\theta(x)=e^{-\theta} \theta^x / x!\propto e^{-\theta}\theta^x$ (as a function of $\theta$)
- *Prior*: $q_0(\theta)=\text{Gamma}(\alpha, \beta)=\frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta \theta}\propto \theta^{\alpha-1}e^{-\beta \theta}$

$\begin{align*} \underset{\text{Posterior}}{\underbrace{q_n(\theta)}} &\propto \underset{\text{Prior}}{\underbrace{ q_0(\theta)}}\cdot \underset{\text{Likelihood}}{\underbrace{ \text{Pr}( x_1,\cdots, x_n\mid\theta) }}\\ &\propto \theta^{\alpha-1}e^{-\beta \theta}\,e^{-n \theta} \theta^{x_1}\cdots \theta^{x_n}\\ &= \theta^{\alpha+x_1+\cdots+x_n-1}e^{-(\beta+n)\theta}\\ &\equiv \text{Gamma} (\alpha+x_1+\cdots+x_n, \; \beta+n) \end{align*}$

- *Equivalent sample size*: the prior whose effect is similar to $n_0$ observations with mean $\mu$ is $\text{Gamma}(n_0 \mu,\ n_0)$.

### 4.3 Conjugate prior for Multinomial likelihood (as Dirichlet distribution)

$\begin{align*} \boxed{\text{Dirichlet prior}+\text{Multinomial likelihood}\Rightarrow \text{Dirichlet posterior}} \end{align*}$

Goal: Bayesian inference for a distribution over the probability simplex, $\theta \in \Delta_k$. (Draw $n$ samples and record the counts $\left(x_1, \ldots, x_k\right)$.)

- *Prior*: $\text{Dirichlet}\left(\alpha_1, \ldots, \alpha_k\right)$
- *Likelihood*: $\text{Multinomial}(x_1,\cdots,x_k\mid\theta)$
- *Posterior*: $\text{Dirichlet}\left(\alpha_1+x_1, \alpha_2+x_2, \ldots, \alpha_k+x_k\right)$

*Proof*. The posterior is derived by:

$\begin{aligned} \operatorname{Pr}\left(\theta \mid x_1, \ldots, x_k\right) & \propto \operatorname{Pr}(\theta) \operatorname{Pr}\left(x_1, \ldots, x_k \mid \theta\right) \\ & \propto \theta_1^{\alpha_1-1} \theta_2^{\alpha_2-1} \cdots \theta_k^{\alpha_k-1} \;\cdot\;\theta_1^{x_1} \theta_2^{x_2} \cdots \theta_k^{x_k} \\ & =\theta_1^{\alpha_1+x_1-1} \theta_2^{\alpha_2+x_2-1} \cdots \theta_k^{\alpha_k+x_k-1}\\ &\equiv \text{Dir}\left(\alpha_1+x_1, \alpha_2+x_2, \ldots, \alpha_k+x_k\right) \end{aligned}$

## 5 Dirichlet Distribution

### 5.1 Dirichlet distribution

The ***Dirichlet distribution*** with parameter $\alpha \in \mathbb{R}_{+}^k$ is a distribution *over the probability simplex* $\Delta_k=\left\{\left(\theta_1, \ldots, \theta_k\right): \theta_i \geq 0, \sum_i \theta_i=1\right\}$ (the probability simplex over $k$ outcomes):

$ \text{Dirichlet}(\alpha)=\operatorname{Pr}\left(\theta_1, \ldots, \theta_k\right)=\frac{\Gamma\left(\alpha_1+\cdots+\alpha_k\right)}{\Gamma\left(\alpha_1\right) \cdots \Gamma\left(\alpha_k\right)} \theta_1^{\alpha_1-1} \theta_2^{\alpha_2-1} \cdots \theta_k^{\alpha_k-1} $

*Properties*.

$\begin{aligned} \begin{array}{c|c|c} \mathbb{E}\,\theta_i =\frac{\alpha_i}{\alpha_1+\cdots+\alpha_k} \quad &\text { Mode: } \theta_i =\frac{\alpha_i-1}{\alpha_1+\cdots+\alpha_k-k}\quad &\operatorname{var}\left(\theta_i\right) =\frac{\alpha_i\left(\alpha_1+\cdots+\alpha_k-\alpha_i\right)}{\left(\alpha_1+\cdots+\alpha_k\right)^2\left(\alpha_1+\cdots+\alpha_k+1\right)} \end{array} \end{aligned}$

- The distribution concentrates around its mean $\mu$ as the $\alpha_i$ get larger.
- The distribution is uniform if $\alpha_1=\cdots=\alpha_k=1$.
- The Dirichlet distribution is a distribution over the *probability simplex*.
- The Dirichlet is a *generalization of the Beta distribution*: $\text{Dir}(\alpha_1, \alpha_2)\equiv \text{Beta}(\alpha_1,\alpha_2)$.

*Example: probability simplex over 3 outcomes*.

![[7_bayesian 2023-05-16 13.14.57.excalidraw.svg|600]]

- $\mu=(1/3,1/3,1/3)$:
    - Dir(1,1,1) is the uniform distribution over the probability simplex.
    - Dir(1,1,1) -> Dir(2,2,2) -> Dir(4,4,4) -> Dir(8,8,8): the distribution concentrates around the mean.
    - Dir(1,1,1): the coin can have any bias (all biases are equally likely).
    - Dir(8,8,8): the coin is likely to be fair.
- $\mu=(1/4,1/4,1/2)$:
    - Dir(1,1,2) -> Dir(2,2,4) -> Dir(3,3,6) -> Dir(6,6,12)

### 5.2 Motivation: Model a sequence of amino acids

Collection of proteins from the same family (similar sequence, structure, function). Each is a sequence of amino acids over a 20-letter alphabet $S=\left\{a_1, \ldots, a_{20}\right\}$. You align them and want to model the distribution over $S$ at each position.

$\begin{align*} \begin{array}{llllll}\cdots & a_1 & a_6 & a_7 & a_{20} & \cdots \\ \cdots & a_3 & a_8 & a_7 & a_8 & \cdots \\ \cdots & a_1 & a_9 & a_7 & a_{20} & \cdots\end{array} \end{align*}$

*Goal*: Infer the distribution $\theta \in \Delta_{20}$ for the first position.

- Vector of counts at this position: $\left(x_1, \ldots, x_{20}\right)=(2,0,1,0, \ldots, 0)$.
- Note that $\Delta_{20}$ is a probability simplex; here $\left(x_1, \ldots, x_{20}\right) \sim \text{multinomial}(n=3,\ \theta)$.

The maximum-likelihood estimate $\theta_{ML}$ is given by:

$\begin{align*} \theta_{ML}=\left(\frac{2}{3},0,\frac{1}{3},0,0,\cdots,0 \right) \end{align*}$
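Note that $\theta_{ML}$ assigns zero probability to the 17 amino acids not seen at this position. A minimal sketch of the Bayesian alternative, applying the Dirichlet–multinomial update from Section 4.3 to these counts with a (hypothetical) uniform $\text{Dir}(1,\ldots,1)$ prior:

```python
import numpy as np

# Counts at the first position (x_1 = 2, x_3 = 1, all other amino acids unseen).
x = np.zeros(20)
x[0], x[2] = 2, 1

# Hypothetical prior: uniform Dirichlet, alpha_i = 1 for all 20 amino acids.
alpha = np.ones(20)

# Conjugate update (Section 4.3): posterior is Dirichlet(alpha + x).
alpha_post = alpha + x

theta_ml = x / x.sum()                           # (2/3, 0, 1/3, 0, ..., 0)
theta_post_mean = alpha_post / alpha_post.sum()  # (3/23, 1/23, 2/23, 1/23, ..., 1/23)

print(theta_ml[:4])         # [0.667 0.    0.333 0.   ]
print(theta_post_mean[:4])  # [0.130 0.043 0.087 0.043]
```

The posterior mean spreads some mass onto the unseen amino acids (add-one smoothing); the cluster-specific priors described next push this further by encoding which amino acids tend to co-occur.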
By analyzing databases of proteins, we find (hypothetically) 3 clusters:

- $25\%$ of positions are *highly conserved*: concentrated on a single amino acid.

$\begin{align*} &\begin{array}{llllllllll} a_1 & a_2 & a_3 & a_4 & a_5 & a_6 & \boxed{a_7} & a_8 & a_9 & a_{10} \\ a_{11} & a_{12} & a_{13} & a_{14} & a_{15} & a_{16} & a_{17} & a_{18} & a_{19} & a_{20} \end{array}\\ &\alpha^{(1)}\equiv\text{a Dir}(\alpha_1,\cdots, \alpha_{20})\text{ with }\alpha_i<1 \text{ (sparsity-inducing parameters)} \end{align*}$

- $12\%$ of positions combine a particular set of amino acids with similar properties.

$\begin{align*} &\begin{array}{llllllllll} a_1 & a_2 & a_3 & a_4 & a_5 & \boxed{a_6} & a_7 & a_8 & a_9 & a_{10} \\ \boxed{a_{11}} & \boxed{a_{12}} & \boxed{a_{13}} & a_{14} & a_{15} & a_{16} & a_{17} & a_{18} & a_{19} & a_{20} \end{array}\\ &\alpha^{(2)}\equiv\text{a Dir with } \alpha_i \text{ higher at } i\in\{6,11,12,13\} \end{align*}$

- $8\%$ of positions combine a different set of amino acids.

$\begin{align*} &\begin{array}{llllllllll} a_1 & \boxed{a_2} & \boxed{a_3} & \boxed{a_4} & a_5 & a_6 & a_7 & a_8 & a_9 & a_{10} \\ a_{11} & a_{12} & a_{13} & a_{14} & a_{15} &\boxed{ a_{16}} & \boxed{a_{17}} & a_{18} & a_{19} & a_{20} \end{array}\\ &\alpha^{(3)}\equiv\text{a Dir with } \alpha_i \text{ higher at } i\in\{2,3,4,16,17\} \end{align*}$

To *combine clusters*, simply use a *weighted sum*:

$\begin{align*} \boxed{0.25\,\text{Dir}(\alpha^{(1)})+0.12\,\text{Dir}(\alpha^{(2)})+0.08\,\text{Dir}(\alpha^{(3)})+\cdots} \end{align*}$

## 6 Mixture of Conjugate Priors

**A mixture of conjugate priors is conjugate.**

Unknown parameter $\theta$; the *prior* is a mixture:

$ \sum_{j=1}^k \underset{w_j}{\underbrace{ \operatorname{Pr}(J=j) }}\operatorname{Pr}(\theta \mid J=j) . $

After seeing data, the *posterior* is again a mixture:

$ \sum_{j=1}^k \underset{\text{new } w_j}{\underbrace{ \operatorname{Pr}(J=j \mid \text{data}) }} \underset{\text{posterior for }j\text{-th prior}}{\underbrace{ \operatorname{Pr}(\theta \mid J=j, \text{data}) }} $

*Example*. Beta-binomial. Coin of unknown bias $\theta \in[0,1]$.

- *Prior*: $w_1 \operatorname{Beta}\left(\alpha_1, \beta_1\right)+w_2 \operatorname{Beta}\left(\alpha_2, \beta_2\right)$.
- See $n$ coin tosses, of which $h$ are heads and $t$ are tails.
- The *posterior* is given by:

$\begin{align*} \text{Pr}(\theta\mid\text{data} )&\propto \text{Pr}(\theta )\,\text{Pr}(\text{data}\mid\theta )\\ &= \left(w_1\frac{\Gamma(\alpha_1+\beta_1)}{\Gamma(\alpha_1)\Gamma(\beta_1)}\theta^{\alpha_1-1}(1-\theta)^{\beta_1-1}+w_2 \frac{\Gamma\left(\alpha_2+\beta_2\right)}{\Gamma\left(\alpha_2\right) \Gamma\left(\beta_2\right)} \theta^{\alpha_2-1}(1-\theta)^{\beta_2-1} \right)\theta^h(1-\theta)^t\\ &= (\text{some weight})\times \theta^{\alpha_1+h-1}(1-\theta)^{\beta_1+t-1}+ (\text{some weight})\times \theta^{\alpha_2+h-1}(1-\theta)^{\beta_2+t-1}\\ &= \text{some \textbf{mixture} of }\text{Beta}(\alpha_1+h,\beta_1+t)\text{ and }\text{Beta}(\alpha_2+h,\beta_2+t) \end{align*}$

After normalization, the new weight on component $j$ is proportional to $w_j \operatorname{Pr}(\text{data}\mid J=j)=w_j\,\dfrac{B(\alpha_j+h,\ \beta_j+t)}{B(\alpha_j,\ \beta_j)}$, i.e. the prior weight times the marginal likelihood of the data under the $j$-th prior component ($B$ is the Beta function).
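A minimal numerical sketch of this mixture update; the prior weights, Beta parameters, and counts below are made up for illustration, and the new weights follow the formula above (computed via log Beta functions for numerical stability):

```python
import numpy as np
from scipy.special import betaln

# Hypothetical mixture prior: 0.7 * Beta(10, 10) + 0.3 * Beta(1, 5).
w = np.array([0.7, 0.3])
a = np.array([10.0, 1.0])
b = np.array([10.0, 5.0])

# Observed data: 20 tosses with h = 16 heads and t = 4 tails.
h, t = 16, 4

# Marginal likelihood of the data under component j: B(a_j + h, b_j + t) / B(a_j, b_j).
log_marginal = betaln(a + h, b + t) - betaln(a, b)

# New mixture weights: prior weights times marginal likelihoods, renormalized.
new_w = w * np.exp(log_marginal)
new_w /= new_w.sum()

# Posterior mixture components: Beta(a_j + h, b_j + t).
print("posterior weights   :", new_w)
print("posterior components:", list(zip(a + h, b + t)))
```

Each component is updated exactly as in the single-prior beta-binomial case; only the mixture weights change, shifting toward whichever prior component explains the observed frequency better.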