Bayesian Statistics - Inference and Posterior Approximation

>[!quote] In a Nutshell
>Theory in statistics based on the Bayesian interpretation of probability, where probability expresses a degree of belief in an [[Sigma-Algebra|event]] ([[Frequentist vs. Bayesian Statistics]]). The degree of belief may be based on prior knowledge about the event, such as the results of previous experiments, or on personal beliefs about the event.

$$\textcolor{purple}{\text{Posterior}}=\frac{\textcolor{red}{\text{Likelihood}}\times\textcolor{blue}{\text{Prior}}}{\text{Evidence}}$$

---
#### Bayesian Inference
Method of statistical inference that uses [[Bayes Theorem]] to update the probability of a hypothesis as more evidence or information becomes available.
- **Formal Definition with Parametric Distributions**
    - Data points $x$ are distributed $x \sim p(x \mid \theta)$ according to a parametric distribution with parameters $\theta \sim p(\theta \mid \alpha)$ (prior), where the hyperparameters $\alpha$ are fixed. Assume we observed a set of $n$ samples $\mathcal{D}= \{ x_{1},\dots,x_{n}\}$ and want to update our prior in order to predict the distribution of a new sample $\tilde{x}$. This is done via
      $$\textcolor{purple}{p(\theta \mid \mathcal{D},\alpha)} = \frac{p(\mathcal{D}\mid\theta,\alpha)\,p(\theta,\alpha)}{p(\mathcal{D}\mid\alpha)\,p(\alpha)}= \frac{\textcolor{red}{p(\mathcal{D} \mid \theta,\alpha)}\, \textcolor{blue}{p(\theta \mid \alpha)}}{p(\mathcal{D} \mid \alpha)} \propto p(\mathcal{D} \mid \theta,\alpha)\, p(\theta \mid \alpha).$$
      Writing the evidence $p(\mathcal{D} \mid \alpha)$ as an integral over the parameters gives
      $$p(\theta \mid \mathcal{D}, \alpha) = \frac{p(\mathcal{D} \mid \theta, \alpha)}{\int p(\mathcal{D} \mid \theta, \alpha)\, p(\theta \mid \alpha) \, d\theta} \cdot p(\theta \mid \alpha),$$
      with $p(\mathcal{D} \mid \theta, \alpha) = \prod_{k=1}^{n} p(x_k \mid \theta)$ for **multiple observations** ([[Statistical Independence|i.i.d.]]).
    - The denominator (the evidence) is often impossible to compute efficiently: the integral over $\theta$ rarely has a closed form, and the cost of naive numerical integration scales exponentially with the number of parameters.
- **Bayesian Prediction**
    - After computing the posterior, we can predict the distribution of a new sample $\tilde{x}$ based on the observations by marginalizing over the posterior. This gives the **posterior predictive distribution**
      $$p(\tilde{x} \mid \mathcal{D},\alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \mathcal{D},\alpha)\, d\theta.$$
    - Marginalizing over the prior instead yields the **prior predictive distribution**
      $$p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \alpha)\, d\theta.$$
- **Toy Example**
    - Assume a prior belief that one of two success models is correct: an optimistic one with $\theta=0.7$ and a pessimistic one with $\theta = 0.2$. We begin with a [[Probability Distribution|uniform distribution]] over these models, i.e. $P(\theta=0.7)=P(\theta=0.2)=0.5$. If we observe a binary variable (success or failure), this yields the initial table
      ![[Pasted image 20240117104519.png|center|300]]
      with the marginal likelihood $P(Data)$ and the prior $P(Model)$.
    - Assume we observe a success and want to compute the posterior $P(Model \mid Data)$. The evidence is $P(Success)=0.7\cdot0.5+0.2\cdot0.5=0.35+0.1=0.45$, so $P(\theta=0.7 \mid Success)=0.7\cdot0.5/0.45=7/9$ and $P(\theta=0.2 \mid Success)=0.2\cdot0.5/0.45=2/9$.

---
#### Posterior Approximation
Family of approaches in [[Bayes Theorem|Bayesian]] statistics and [[The Machine Learning Pipeline|machine learning]] for updating a [[Bayes Theorem|prior]] belief with new evidence (data) to obtain the [[Bayes Theorem|posterior]] when the exact update is out of reach. In most applications, the exact update would require **[[Marginal Distribution|marginalizing]] over the whole parameter / [[Latent Variable Models|latent]] [[Space|space]], which is infeasible** for all but simple systems. Therefore, many approximate approaches have been developed.
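
The two-model toy example is one of the simple systems where the marginalization is still exact, because the evidence is a two-term sum rather than an integral. The following is a minimal sketch of that update in Python; the function names `likelihood`, `update`, and `predictive` are hypothetical and chosen only for illustration, and exact fractions are used so the results can be compared directly with the values $7/9$ and $2/9$ above.

```python
from fractions import Fraction

# Candidate models (success probabilities) and the uniform prior from the toy example.
thetas = [Fraction(7, 10), Fraction(1, 5)]
prior = [Fraction(1, 2), Fraction(1, 2)]

def likelihood(theta, observation):
    """Bernoulli likelihood: observation is 1 for success, 0 for failure."""
    return theta if observation == 1 else 1 - theta

def update(weights, observation):
    """One Bayesian update over the discrete set of models."""
    joint = [likelihood(t, observation) * w for t, w in zip(thetas, weights)]
    evidence = sum(joint)  # marginal likelihood: a sum instead of an integral
    return [j / evidence for j in joint], evidence

def predictive(weights):
    """Probability that the next sample is a success, marginalized over the models."""
    return sum(t * w for t, w in zip(thetas, weights))

posterior, evidence = update(prior, 1)  # observe one success
print(posterior)              # [Fraction(7, 9), Fraction(2, 9)]
print(evidence)               # 9/20
print(predictive(prior))      # 9/20  (prior predictive)
print(predictive(posterior))  # 53/90 (posterior predictive)
```

With a continuous or high-dimensional $\theta$, the sum in `update` turns into an integral with no comparable shortcut, which is the situation the methods below address.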
- **[[Variational Inference]]**
    - Category of algorithms in statistics and Bayesian machine learning for dealing with intractable [[Probability Distribution|distributions]]. Variational inference seeks to **approximate the problematic distribution** $p$ **by another**, tractable distribution $q$.
    - [[Stein Variational Gradient Descent - A General Purpose Bayesian Inference Algorithm]]
- **[[Markov-Chain Monte Carlo Methods]]**
    - Category of algorithms for sampling from a [[Probability Distribution|probability distribution]] by constructing a [[Markov Chain and Kernel|Markov chain]] that has the desired distribution as its [[Markov Chain and Kernel|equilibrium distribution]].
    - [[Hamiltonian Monte Carlo]]
    - [[No-U-Turn Sampler (NUTS)]]
    - [[Metropolis-Adjusted Langevin Algorithm]]
- [[LFI - Likelihood-Free Inference]]
- **[[Approximate Bayesian Computation]]**
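
To illustrate how a Markov chain method can sidestep the evidence term, here is a minimal sketch of a random-walk Metropolis sampler for a Bernoulli success probability $\theta$. Everything specific in it is an assumption made for illustration: the data (7 successes, 3 failures), the uniform prior on $(0, 1)$, the step size, and the function names. Only the unnormalized posterior (likelihood times prior) enters the acceptance ratio, so the denominator of Bayes' theorem never has to be computed.

```python
import math
import random

def log_unnormalized_posterior(theta, successes, failures):
    """log(likelihood * prior) for a Bernoulli model with a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # outside the support of the prior
    return successes * math.log(theta) + failures * math.log(1.0 - theta)

def metropolis(successes, failures, n_samples=20000, step=0.2, seed=0):
    """Random-walk Metropolis: the chain's equilibrium distribution is the posterior."""
    rng = random.Random(seed)
    theta = 0.5
    log_p = log_unnormalized_posterior(theta, successes, failures)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        log_p_new = log_unnormalized_posterior(proposal, successes, failures)
        # Accept with probability min(1, p_new / p_old); the evidence cancels in the ratio.
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return samples

# Hypothetical data: 7 successes and 3 failures.
samples = metropolis(successes=7, failures=3)
kept = samples[2000:]  # discard an initial burn-in segment
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

For this particular setup the exact posterior is known to be $\text{Beta}(8, 4)$ with mean $2/3$, so the sample mean can be checked against it. Methods such as [[Hamiltonian Monte Carlo]] replace the random-walk proposal with gradient-informed moves but rely on the same cancellation of the evidence.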