>[!quote] In a Nutshell
>Theory in statistics based on the Bayesian interpretation of probability, where probability expresses degree of belief in an [[Sigma-Algebra|event]] ([[Frequentist vs. Bayesian Statistics]]). The degree of belief may be based on prior knowledge about the event, such as the results of previous experiments, or on personal beliefs about the event.
$\textcolor{purple}{\text{Posterior}}=\frac{\textcolor{red}{\text{Likelihood}}\times\textcolor{blue}{\text{Prior}}}{\text{Evidence}}$
---
#### Bayesian Inference
Method of statistical inference that uses [[Bayes Theorem]] to update the probability of a hypothesis as more evidence or information becomes available.
- **Formal Definition with Parametric Distributions**
- Data points $x$ follow a parametric distribution $x \sim p(x \mid \theta)$ whose parameters have a prior $\theta \sim p(\theta \mid \alpha)$ with fixed hyperparameters $\alpha$. Assume we observed a set of $n$ samples $\mathcal{D}= \{ x_{1},...,x_{n}\}$ and want to update our prior in order to predict the distribution of a new sample $\tilde{x}$. This is done via $\textcolor{purple}{p(\theta \mid \mathcal{D},\alpha)} = \frac{p(\mathcal{D}\mid\theta,\alpha)p(\theta,\alpha)}{p(\mathcal{D}\mid\alpha)p(\alpha)}= \frac{\textcolor{red}{p(\mathcal{D} \mid \theta,\alpha)} \textcolor{blue}{p(\theta \mid \alpha)}}{p(\mathcal{D} \mid \alpha)} \propto p(\mathcal{D} \mid \theta,\alpha) p(\theta \mid \alpha),$ which can be written as $\begin{align}p(\boldsymbol{\theta} \mid \mathcal{D}, \boldsymbol{\alpha}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{\int p(\mathcal{D} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \, d\boldsymbol{\theta}} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}),\end{align}$ with $p(\mathcal{D} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) = \prod_{k=1}^{n} p(x_k \mid \boldsymbol{\theta})$ for **multiple observations** ([[Statistical Independence|i.i.d.]]).
- The denominator (the evidence) is often intractable: it requires integrating over the entire parameter space, and the cost of numerical integration on a grid grows exponentially with the number of parameters.
- **Bayesian Prediction**
- After we compute the posterior, we can predict the distribution of a new sample $\tilde{x}$ based on the observations by marginalizing over the posterior, which gives the **posterior predictive distribution** $p(\tilde{x} \mid \mathcal{D},\alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \mathcal{D},\alpha) d\theta.$
- Marginalizing over the prior instead yields the **prior predictive distribution** $p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta) p(\theta \mid \alpha) d\theta.$
- **Toy Example**
- Assume a priori that one of two success models is correct: an optimistic one with $\theta=0.7$ and a pessimistic one with $\theta = 0.2$. We begin with a [[Probability Distribution|uniform]] prior over these two models, i.e. $P(Model)=0.5$ each. If we observe a binary variable (success or failure), this yields the initial table![[Pasted image 20240117104519.png|center|300]]with the marginal likelihood $P(Data)$ and the prior $P(Model)$.
- Suppose we observe a success and want to compute the posterior $P(Model \mid Data)$. By Bayes' theorem, $P(\theta=0.7 \mid Success)=0.7\cdot0.5/(0.35+0.1)=7/9$ and $P(\theta=0.2 \mid Success)=0.2\cdot0.5/(0.35+0.1)=2/9$, where the denominator $0.35+0.1=0.45$ is the marginal likelihood of a success.
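
The toy example can be checked numerically. The following is a minimal sketch, assuming the two models and the uniform prior described above; the helper names `update` and `predictive` are hypothetical and chosen only for illustration. It uses exact fractions so the results match the hand calculation, and it also evaluates the prior and posterior predictive probability of a success (here the integral over $\theta$ reduces to a sum over the two models).

```python
from fractions import Fraction

# Two candidate success probabilities with a uniform prior over the models.
thetas = [Fraction(7, 10), Fraction(2, 10)]
prior = [Fraction(1, 2), Fraction(1, 2)]

def update(prior, thetas, success):
    """One Bayes update for a single binary observation."""
    likelihood = [t if success else 1 - t for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # marginal likelihood P(Data)
    return [u / evidence for u in unnormalized], evidence

def predictive(weights, thetas):
    """Probability that the next observation is a success under the given model weights."""
    return sum(w * t for w, t in zip(weights, thetas))

posterior, evidence = update(prior, thetas, success=True)
prior_pred = predictive(prior, thetas)      # prior predictive P(success)
post_pred = predictive(posterior, thetas)   # posterior predictive P(success | one success)

print(posterior)   # [Fraction(7, 9), Fraction(2, 9)]
print(evidence)    # 9/20
print(post_pred)   # 53/90
```
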
---
#### Posterior Approximation
Approach in [[Bayes Theorem|Bayesian]] statistics and [[The Machine Learning Pipeline|machine learning]] to approximate the [[Bayes Theorem|posterior]] that results from updating a [[Bayes Theorem|prior]] belief with new evidence (data). In most applications, the exact update would require **[[Marginal Distribution|marginalizing]] over the whole parameter / [[Latent Variable Models|latent]] [[Space|space]], which is infeasible** for all but simple systems. Therefore, many approximate approaches have been developed.
- **[[Variational Inference]]**
- Category of algorithms in statistics and Bayesian machine learning for dealing with intractable [[Probability Distribution|distributions]]. Variational inference seeks to **approximate the problematic distribution** $p$ **by another**, tractable distribution $q$.
- [[Stein Variational Gradient Descent - A General Purpose Bayesian Inference Algorithm]]
- **[[Markov-Chain Monte Carlo Methods]]**
- Category of algorithms for sampling from a [[Probability Distribution|probability distribution]] by constructing a [[Markov Chain and Kernel]] that has the desired distribution as its [[Markov Chain and Kernel|equilibrium distribution]].
- [[Hamiltonian Monte Carlo]]
- [[No-U-Turn Sampler (NUTS)]]
- [[Metropolis-Adjusted Langevin Algorithm]]
- [[LFI - Likelihood-Free Inference]]
- **[[Approximate Bayesian Computation]]**
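
To illustrate how sampling sidesteps the intractable evidence, here is a minimal random-walk Metropolis sketch in plain Python. It is not taken from any of the linked notes: the Bernoulli model with a uniform prior on $\theta$, the data (7 successes, 3 failures), the step size, and the function names are all assumptions made for illustration. The sampler only evaluates the unnormalized posterior (likelihood times prior), because the evidence cancels in the acceptance ratio. For this simple model the exact posterior is a Beta distribution, so the sample mean can be compared with the analytic mean $(k+1)/(n+2)$.

```python
import math
import random

def log_unnormalized_posterior(theta, successes, failures):
    """log(likelihood * prior) for Bernoulli data with a uniform prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero prior mass outside the unit interval
    return successes * math.log(theta) + failures * math.log(1.0 - theta)

def metropolis(successes, failures, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis; only posterior ratios are used, so the evidence cancels."""
    rng = random.Random(seed)
    theta = 0.5  # start inside the support
    log_p = log_unnormalized_posterior(theta, successes, failures)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        log_p_new = log_unnormalized_posterior(proposal, successes, failures)
        # Accept with probability min(1, p_new / p_old).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return samples

# Hypothetical data: 7 successes and 3 failures.
samples = metropolis(successes=7, failures=3)
kept = samples[2000:]  # discard an assumed burn-in period
estimate = sum(kept) / len(kept)
exact = (7 + 1) / (10 + 2)  # mean of the exact Beta(8, 4) posterior
print(f"MCMC mean: {estimate:.3f}, exact mean: {exact:.3f}")
```
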