Normal Approximation to the Posterior Distribution
In this post, I'm going to write about how the ever-versatile normal distribution can be used to approximate a Bayesian posterior distribution. Unlike some other normal approximations, this is not a direct application of the central limit theorem. The result has a straightforward proof using Laplace's Method, whose main ideas I will attempt to present. I'll also simulate a simple scenario to see how it works in practice.
Background¶
We'll start by reviewing some basic terminology and definitions for Bayesian methods to make the later discussion easier to follow. The main problem we're trying to solve: given a probability model $\mathcal{M}$ (for example a normal distribution) and some data denoted by $y$ (such as observations of customer purchases or readings from a machine), we wish to find a probability distribution for the parameters $\theta$ (e.g. $\mu$ and $\sigma$ in the case of a normal distribution). The basic idea is given by Bayes' theorem:
$$ P(\theta | y) = \frac{P(y | \theta) P(\theta)}{P(y)} = C \cdot P(y | \theta) P(\theta) \tag{1} $$
Some definitions [1]:
- $ P(\theta | y) $ is called the posterior distribution.
- $P(y | \theta)$ is called the likelihood function.
- $P(\theta)$ is called the prior distribution.
- $P(y)$ is called the marginal likelihood.
Notice the second form in Equation 1, where the $\frac{1}{P(y)}$ term is replaced by a constant $C$. It's usually written this way because once you've collected all your data, $P(y)$ is fixed: the probability of the observed data does not change with respect to the parameters $\theta$ of your model. Think of it as a normalizing constant that makes the posterior a proper probability distribution (i.e. integrate to $1$).
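To make Equation 1 concrete, here's a minimal sketch (the coin-flip data and the Beta prior are my own illustrative choices, not something from the discussion above) that evaluates the numerator of Equation 1 on a grid and normalizes numerically, which is exactly the role $C$ plays:

import numpy as np
from scipy import stats

# Hypothetical example: infer a coin's bias theta after seeing 7 heads in 10 flips
theta = np.linspace(0.001, 0.999, 999)      # grid over the parameter space
prior = stats.beta.pdf(theta, 2, 2)         # P(theta): a mild prior centered at 0.5
likelihood = stats.binom.pmf(7, 10, theta)  # P(y | theta) for the observed data
unnormalized = likelihood * prior           # numerator of Equation 1
posterior = unnormalized / np.trapz(unnormalized, theta)  # normalize by P(y), i.e. multiply by C
print(theta[np.argmax(posterior)])          # posterior mode, around 0.67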
Bayesian inference usually follows these high level steps:
- Decide on a probability model $\mathcal{M}$.
- Decide on a prior distribution that encodes your previous knowledge about the problem. For example, if we're modeling the average conversion rate of an email campaign, we might look at previous campaigns and find a number around (or a distribution centered at) 1%.
- Given samples $y$, find the posterior distribution of your model parameters $\theta$ according to Equation 1. This is usually done via simulation except for the most basic cases where the posterior has a closed form expression.
Of course, there is a lot of nuance to each of these steps but by and large this is what usually happens.
The big contrast between this method and more traditional (frequentist) statistical methods is that the latter focus more on the likelihood function. A popular estimate for $\theta$ is the maximum likelihood estimate (MLE), which finds the $\theta$ that maximizes the likelihood function. There's a huge amount of literature on the differences between the two, but here are some of the bigger points:
- The MLE provides a point estimate (a single value) of the parameter (or sometimes a confidence interval). The Bayesian method provides a proper probability distribution for the parameter via the posterior distribution.
- The Bayesian method is "more" subjective because of the use of the prior. The reasoning is that you should use everything you know about the problem when conducting inference. However, finding a prior that accurately represents your state of knowledge is definitely a challenge.
- The MLE is in a way still "subjective" because the choice of model $\mathcal{M}$ is an implicit "prior".
- Bayesian methods treat the model parameters as random variables (i.e. the posterior) while frequentist methods usually treat the parameters as fixed (with the assumption that there is some theoretical "true" value) and what you're trying to find is a confidence interval that "traps" the fixed value a certain percentage of the time.
With that incredibly brief introduction to Bayesian methods, let's see how we can approximate the posterior with a normal distribution. There are a ton of resources to learn more about Bayesian methods; two I've been reading are Bayesian Data Analysis and Probability Theory: The Logic of Science, both listed in the references below.
Normal Approximation to the Posterior Distribution [2]¶
The normal approximation to the posterior distribution can be used in several ways. First, directly as an approximation of the posterior; this usually works well in low dimensional $\theta$ parameter spaces. Second, even for high dimensional parameter spaces, it may work well when computing the marginal distribution of one component of $\theta$. Third, the reason you might want an approximation at all is that for more complex models a closed form analytic solution may not exist. That's usually solved by some kind of MCMC simulation that lets you draw an arbitrary number of samples from which to empirically build the posterior distribution, but an analytic normal approximation has advantages over a mere collection of samples. Additionally, a normal approximation can serve as a sanity check when debugging a solution that uses a more complex model or method, to see if the complex method is approximately correct.
Assumptions¶
The key assumption is that we have independent and identically distributed (i.i.d.) data, $y_1, \ldots, y_n$, drawn from the "true distribution", whose density function we label $f(\cdot)$. These results only hold given that we have a "true distribution" with fixed parameters from which the data was sampled. This assumption is typical in classical statistical analysis, although not as common in Bayesian analysis, where the parameters are treated as random variables and hence don't necessarily have a "true distribution" with fixed parameters. If this assumption doesn't hold, it isn't hard to pick specific values of $y$ that violate the result presented here.
If the "true distribution" $f(\cdot)$ actually comes from the same family of distributions of our model $\mathcal{M}$, then the "true distribution" should have a fixed unique parameter $\theta_0$. We expect our posterior distribution to approach $\theta_0$ as $n$ the number of samples increase. If $f(\cdot)$ is not part of the same distribution as our model, then we expect that there is a unique value $\theta_0$ in our model that minimizes the "discrepancy" between our model and the true distribution given by $f(\cdot)$. The discrepancy between the true distribution $F$ and our model distribution $\mathcal{M}$ is measured by the Kullback-Leibeler (KL) divergence:
$$ KL(F || \mathcal{M}) = KL(\theta) = E\Big[log\Big(\frac{f(y_i)}{p(y_i|\theta)}\Big)\Big] = \int log\Big(\frac{f(y_i)}{p(y_i|\theta)}\Big)f(y_i) dy_i \tag{2} $$
If we denote our model density by $p(\cdot)$, we can see the KL divergence is minimized when $p(y_i|\theta) = f(y_i)$ and the $log$ evaluates to $0$. When they're not equal but the densities are close to each other, the $log$ factor is small and so is the KL divergence; otherwise, we get a larger value.
All we're basically saying is that there is some $\theta_0$ that our posterior will approach as we increase $n$: either the exact parameter value of the "true distribution" if our model is correct, or the value that minimizes the KL divergence from the "true distribution" (the best possible value we could hope for within our given model).
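A quick numerical sketch of this idea (the exponential "true distribution" with rate $3$ and the grid of candidate rates are illustrative assumptions of mine): evaluating Equation 2 over a grid of model parameters shows the KL divergence is minimized exactly at the true rate:

import numpy as np
from scipy import integrate, stats

f = stats.expon(scale=1/3.0)  # "true distribution": exponential with rate 3

def kl(lam):
    # Equation 2: E_f[log(f(y) / p(y | lambda))], computed by numerical integration
    p = stats.expon(scale=1/lam)
    return integrate.quad(lambda y: (f.logpdf(y) - p.logpdf(y)) * f.pdf(y),
                          0, np.inf)[0]

lams = np.linspace(0.5, 6.0, 56)
print(lams[np.argmin([kl(l) for l in lams])])  # minimized at ~3.0, the true rate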
There are also a couple of other assumptions about the "regularity" of the likelihood function and prior: for example, that the likelihood function is continuous, that $\theta_0$ is not on the boundary of the parameter space, and that the prior is non-zero at the point of convergence. See Section 4.3 of Bayesian Data Analysis for a more complete treatment. They basically boil down to requiring a smooth function, well peaked about its unique maximum.
Proof Outline¶
We'll break the proof down into three parts. The first two parts basically show that our posterior distribution mass will be tightly concentrated around the theoretical $\theta_0$ value. Once we know that the posterior will be concentrated around $\theta_0$, the third part will show how a normal approximation about the posterior mode will be a good approximation to the actual posterior distribution.
Theorem 1: If the parameter space $\Theta$ is finite and $P(\theta = \theta_0) > 0$ (prior is non-zero at $\theta_0$), then $P(\theta=\theta_0 | y) \rightarrow 1$ as $n \rightarrow \infty$, where $\theta_0$ is the value that uniquely minimizes the KL divergence.
Proof:
As a recap: as we get more data points $y$, our estimate of the distribution of $\theta$ should get more accurate, i.e. the posterior distribution $P(\theta|y)$ gets more mass around $\theta_0$. For a finite parameter space ($\theta \in \Theta$), each value of $\theta$ will be assigned some probability of occurring (given the data). Under the assumption that the data was drawn from a "true distribution" with fixed parameter $\theta_0$ (or, in the case where this assumption is off, the value that minimizes the KL divergence), we are claiming that all the probability mass tends toward this single value.
To show this, we instead show that the posterior probability mass of every value $\theta$ except $\theta_0$ tends toward $0$ as $n$ grows. Start with the log posterior odds of $\theta$ vs. $\theta_0$, expanding and simplifying using Equation 1:
$$ log\Big(\frac{P(\theta|y)}{P(\theta_0|y)}\Big) = log\Big(\frac{P(\theta)}{P(\theta_0)}\Big) + \sum_{i=1}^{n} log\Big(\frac{P(y_i|\theta)}{P(y_i|\theta_0)}\Big) \tag{3} $$
Looking at the second term on the RHS, we see that it is a sum of $n$ i.i.d. random variables, whose mean is given by:
\begin{align} E\Big[\sum_{i=1}^{n} log\Big(\frac{P(y_i|\theta)}{P(y_i|\theta_0)}\Big)\Big] &= \sum_{i=1}^{n} E\Big[log\Big(\frac{P(y_i|\theta)}{P(y_i|\theta_0)}\Big)\Big] \\ &= \sum_{i=1}^{n} E\Big[log\Big(\frac{P(y_i|\theta)}{f(y_i)}\frac{f(y_i)}{P(y_i|\theta_0)}\Big)\Big] \\ &= \sum_{i=1}^{n} E\Big[log\Big(\frac{f(y_i)}{P(y_i|\theta_0)}\Big) - log\Big(\frac{f(y_i)}{P(y_i|\theta)}\Big) \Big] \\ &= n\big(KL(\theta_0) - KL(\theta)\big) \tag{4} \end{align}
Since $\theta_0$ is the unique minimizer of the KL divergence, the difference in Equation 4 is negative unless $\theta = \theta_0$. By the law of large numbers, the average of the summands in Equation 3's second term approaches $KL(\theta_0) - KL(\theta)$, so as $n \rightarrow \infty$ the sum itself approaches $-\infty$ for all $\theta \neq \theta_0$. Going back to Equation 3, so long as the prior odds term is finite (guaranteed since $P(\theta_0) > 0$), the whole expression also tends towards $-\infty$. This means $\frac{P(\theta|y)}{P(\theta_0|y)} \rightarrow 0$, and hence $P(\theta|y) \rightarrow 0$, for all $\theta \neq \theta_0$. Since all the probabilities sum to $1$, this implies $P(\theta_0|y) \rightarrow 1$. $\blacksquare$
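Theorem 1 is easy to watch in action. Here's a minimal sketch (the three-point parameter space, the uniform prior, and the exponential model are all my own choices for illustration) showing the posterior mass over a finite set of candidate rates piling onto $\theta_0 = 3$ as $n$ grows:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
thetas = np.array([1.0, 3.0, 5.0])   # finite parameter space
prior = np.array([1/3, 1/3, 1/3])    # uniform prior, so P(theta_0) > 0

for n in [10, 100, 1000]:
    y = rng.exponential(scale=1/3.0, size=n)  # data from the "true" rate 3
    # Log posterior up to a constant: log prior + log likelihood (Equation 1)
    log_post = np.log(prior) + np.array(
        [stats.expon.logpdf(y, scale=1/t).sum() for t in thetas])
    post = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    print(n, np.round(post / post.sum(), 4))  # mass concentrates on theta = 3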
Theorem 2: If $\theta$ is defined on a compact set and $A$ is a neighborhood of $\theta_0$ with nonzero prior probability, then $P(\theta \in A|y) \rightarrow 1 $ as $n \rightarrow \infty$, where $\theta_0$ is the value of $\theta$ that minimizes KL divergence.
Proof Sketch:
This proof is similar to the one before except we're dividing the continuous parameter space into neighborhoods. We use a similar argument as above to show that all neighborhoods except for the ones containing $\theta_0$ have probability tending towards $0$. $\blacksquare$
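Here's a corresponding sketch for the continuous case (it borrows the exponential/gamma conjugate pair worked out in the Simulations section below, so take the posterior formula on faith for now): the posterior probability of a fixed neighborhood $A$ around $\theta_0$ approaches $1$ as $n$ grows:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta0 = 3.0
for n in [10, 100, 1000]:
    y = rng.exponential(scale=1/theta0, size=n)
    # Exact posterior under a Gamma(1, 1) prior (Equation 10 below)
    post = stats.gamma(n + 1, scale=1/(n * y.mean() + 1))
    # Posterior mass of the neighborhood A = (theta0 - 0.5, theta0 + 0.5)
    print(n, round(post.cdf(theta0 + 0.5) - post.cdf(theta0 - 0.5), 4))  # approaches 1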
So far, we have only shown that most of the probability is concentrated around $\theta_0$. We still need to show that we can approximate it by a normal distribution.
Theorem 3 (Bernstein-von Mises Theorem [3]): Under some regularity conditions, as $n \rightarrow \infty$, the posterior distribution of $\theta$ approaches normality with mean $\theta_0$ and variance $(nI(\theta_0))^{-1}$, where $\theta_0$ is the value that minimizes the KL divergence and $I$ is the Fisher information.
Proof:
We first consider the posterior mode (the value of $\theta$ with the highest posterior probability, the "peak"), call it $\hat{\theta}$. Without proof, we state that $\hat{\theta}$ is consistent, that is, $\hat{\theta} \rightarrow \theta_0$ as $n \rightarrow \infty$.
The crux of the argument is that we can approximate the log posterior density by a Taylor expansion up to the quadratic term centered at the posterior mode, which, translated back from the log scale, is a normal distribution. This technique is called Laplace's Method and can be used for approximating things other than density functions. It's relatively simple and quite elegant; let's take a look at how it works.
First, start with the Taylor expansion up to the quadratic term of the log posterior density centered at $\hat{\theta}$:
\begin{align} log\,P(\theta|y) &\approx log\,P(\hat{\theta}|y) + (\theta - \hat{\theta}) \frac{d}{d\theta}[log\,P(\theta|y)]_{\theta=\hat{\theta}} \\ &\quad\qquad\qquad + \frac{1}{2}(\theta - \hat{\theta})^2 \frac{d^2}{d\theta^2}[log\,P(\theta|y)]_{\theta=\hat{\theta}} \\ &= log\,P(\hat{\theta}|y) + \frac{1}{2}(\theta - \hat{\theta})^2 \frac{d^2}{d\theta^2}[log\,P(\theta|y)]_{\theta=\hat{\theta}} \\ &= const + \frac{1}{2}(\theta - \hat{\theta})^2 \frac{d^2}{d\theta^2}[log\,P(\theta|y)]_{\theta=\hat{\theta}} \\ &= const - \frac{1}{2}\frac{(\theta - \mu)^2}{\sigma^2} \tag{5} \end{align}
Notice that the first order term, $(\theta - \hat{\theta}) \frac{d}{d\theta}[log\,P(\theta|y)]_{\theta=\hat{\theta}}$, vanishes because the first derivative at the mode is $0$ ($0$ slope at a maximum).
Simplifying a bit more, we can see we have something that looks like a normal distribution with $\mu = \hat{\theta}$ and $\sigma^2 = -(\frac{d^2}{d\theta^2}[log\,P(\theta|y)]_{\theta=\hat{\theta}})^{-1}$. The negated second derivative here is just the observed information, which is approximately $nI(\hat{\theta})$ for large $n$, where $I$ is the Fisher Information.
By Theorems 1 and 2, as $n \rightarrow \infty$ the mass of the posterior distribution $P(\theta|y)$ becomes concentrated around $\theta_0$, so the distance between the posterior mode and $\theta_0$ shrinks, i.e. $|\hat{\theta} - \theta_0| \rightarrow 0$. When this happens, the third and higher order terms of the Taylor expansion in Equation 5 fade in importance and the quadratic term dominates. Thus as $n \rightarrow \infty$, the Taylor expansion about the posterior mode up to the quadratic term becomes an increasingly accurate approximation of the log posterior distribution.
Finally, substituting $nI(\hat{\theta})$ for the observed information, for large $n$:
\begin{align} log\,P(\theta|y) &\approx log\,P(\hat{\theta}|y) + \frac{1}{2}(\theta - \hat{\theta})^2 \frac{d^2}{d\theta^2}[log\,P(\theta|y)]_{\theta=\hat{\theta}} \\ &= const - \frac{1}{2}\frac{(\theta - \hat{\theta})^2}{(nI(\hat{\theta}))^{-1}} \\ P(\theta|y) &\approx exp\Big\{C - \frac{(\theta - \hat{\theta})^2}{2(nI(\hat{\theta}))^{-1}}\Big\} \\ &\propto N(\hat{\theta}, (nI(\hat{\theta}))^{-1}) \tag{8} \end{align}
$\blacksquare$
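Laplace's Method is also easy to apply numerically when no closed form is handy. A small sketch (the Beta posterior target and the finite-difference step are my own illustrative choices): find the mode with an optimizer, estimate the curvature at the mode with a central finite difference, and read off the normal approximation:

import numpy as np
from scipy import optimize, stats

# Hypothetical target: log posterior of a coin's bias under a flat prior after
# 70 heads in 100 flips, so the exact posterior is Beta(71, 31)
def log_post(theta):
    return stats.binom.logpmf(70, 100, theta)

# Step 1: find the posterior mode
res = optimize.minimize_scalar(lambda t: -log_post(t),
                               bounds=(0.01, 0.99), method='bounded')
mode = res.x

# Step 2: second derivative at the mode via a central finite difference
h = 1e-4
d2 = (log_post(mode + h) - 2 * log_post(mode) + log_post(mode - h)) / h ** 2
sigma = np.sqrt(-1 / d2)
print(mode, sigma)  # ~0.70 and ~0.046, close to Beta(71, 31)'s mode and standard deviation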
Simulations¶
Now that we have some theory under our belt, let's see how this approximation works in practice. The setup: draw samples from some "true distribution", choose a prior on the parameters that has non-zero mass at the actual parameter value $\theta_0$, and compare the normal approximation above with the exact posterior. Hopefully, we should get something that matches pretty closely.
Let's work through a simple example using an exponential distribution parameterized by $\lambda$ with its conjugate prior (a gamma distribution), since it's pretty easy to work with. Let's define all the terms before we try to code it up. The prior is a gamma distribution parameterized by $\alpha, \beta$. Intuitively, $\alpha$ can be interpreted as the number of prior observations and $\beta$ as the sum of those observations. Let's use a relatively uninformative prior by setting the hyperparameters $\alpha=1, \beta=1$:
\begin{align} P(\lambda) = Gamma(\lambda; \alpha=1, \beta=1) \tag{9} \end{align}
Next, let's define our posterior density function:
\begin{align} P(\lambda|y) &= \frac{L(y|\lambda)\, Gamma(\lambda; \alpha=1, \beta=1)}{P(y)} \\ &= Gamma(\lambda; \alpha=n+1, \beta=n\bar{y} + 1) \tag{10} \end{align}
where the simplification occurs because the gamma distribution is the conjugate prior of the exponential distribution; here $n$ is the number of observations in $y$ and $\bar{y}$ is their mean. The posterior mode $\hat{\lambda}$ is then just the mode of the gamma distribution:
\begin{align} \hat{\lambda} = \frac{\alpha - 1}{\beta} = \frac{n+1 - 1}{n\bar{y} + 1} = \frac{n}{n\bar{y} + 1} \tag{11} \end{align}
To find the Fisher Information $I(\lambda)$, we have to look at the second derivative of the log posterior density (a gamma distribution). First, the first derivative:
\begin{align} P(\lambda|y) &= \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha - 1} e^{-\beta \lambda} \\ log\,P(\lambda|y) &= C + (\alpha - 1)log\,\lambda - \beta \lambda \\ \frac{d\,log\,P(\lambda|y)}{d\lambda} &= \frac{\alpha - 1}{\lambda} - \beta \tag{12} \end{align}
Notice that setting the first derivative of the log posterior to zero and solving for $\lambda$ gives exactly the expression for the mode in Equation 11: the density has a unique maximum at $\hat{\lambda}$. Now, taking another derivative, we can find the estimate of our variance (via the observed information, which approximates $nI(\hat{\lambda})$):
\begin{align} nI(\hat{\lambda}) &\approx -\frac{d^2 log\,P(\hat{\lambda}|y)}{d\lambda^2} \\ &= \frac{\alpha - 1}{\hat{\lambda}^2} \\ &= \frac{n}{\hat{\lambda}^2} \\ &= \frac{(n\bar{y} + 1)^2}{n} \tag{13} \end{align}
So, per Equation 8, our posterior distribution should be approximated by $N\big(\frac{n}{n\bar{y} + 1}, \frac{n}{(n\bar{y} + 1)^2}\big)$. Let's simulate some of this to see if it's correct.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

lamb = 3.14
plt.figure(1, figsize=(12, 10))
for n, pltnum in [(10, 221), (50, 222), (100, 223), (1000, 224)]:
    # Generate some samples from our "true distribution"
    y = stats.expon.rvs(scale=1/lamb, size=n)

    # Generate our posterior density (Equation 10)
    # (Of course for a more complex model we would probably
    #  draw samples instead of using the analytical form)
    alpha = n + 1
    beta = n * y.mean() + 1.
    post = stats.gamma(alpha, scale=1. / beta)

    # Plot the posterior
    plt.subplot(pltnum)
    x = np.linspace(post.ppf(0.001), post.ppf(0.999), 1000)
    pd.Series(post.pdf(x), index=x).plot(alpha=0.5, label='posterior')

    # Generate our normal approximation
    # (Equation 8 with the mode and variance from Equations 11 and 13)
    mu = (alpha - 1.) / beta
    sigma2 = n / beta ** 2
    norm = stats.norm(loc=mu, scale=np.sqrt(sigma2))

    # Plot the normal approximation
    x = np.linspace(norm.ppf(0.001), norm.ppf(0.999), 1000)
    pd.Series(norm.pdf(x), index=x).plot(alpha=0.5, label='norm')

    plt.title('Normal Approx. to the Posterior (N=%d, l=%.2f)' % (n, lamb))
    plt.legend(loc='best')
The normal distribution is a pretty good approximation to the posterior as $n$ grows larger. Also notice that the posterior distribution gets closer and closer to the "true" value of the parameter $3.14$ as we would expect from a bigger sample size $n$. The posterior's variance around the mode is also much tighter as our estimate of $\lambda$ gets better with more data.
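To put a number on "pretty good", here's a small follow-up sketch (it reuses `stats`, `np`, and `lamb` from the cell above) computing the total variation distance between the exact posterior and the normal approximation as $n$ grows:

for n in [10, 100, 1000, 10000]:
    y = stats.expon.rvs(scale=1/lamb, size=n, random_state=0)
    alpha, beta = n + 1, n * y.mean() + 1.
    post = stats.gamma(alpha, scale=1. / beta)
    approx = stats.norm(loc=(alpha - 1.) / beta, scale=np.sqrt(n) / beta)
    # Total variation distance: half the integrated absolute density difference
    x = np.linspace(post.ppf(1e-6), post.ppf(1 - 1e-6), 20000)
    tv = 0.5 * np.trapz(np.abs(post.pdf(x) - approx.pdf(x)), x)
    print(n, round(tv, 4))  # the distance shrinks steadily as n grows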
Conclusion¶
The normal distribution keeps popping up time and time again. From the central limit theorem, one would expect it to occur in many different large sample problems. This post shows another instance where it provides a good approximation, via a different mechanism (Laplace's Method). It's a very interesting application that I found relatively simple and elegant. I hope you found it interesting too!
References and Further Reading¶
- Wikipedia: Bayesian Inference, Maximum Likelihood Estimate, exponential distribution, gamma distribution, Fisher Information, observed information, Laplace's Method
- Bayesian Data Analysis, Gelman, Carlin, Stern, et al.
- Stack Exchange: Why exactly is the observed Fisher information used?, Intuitive explanation of Fisher Information and Cramer-Rao bound
- Laplace Approximation to the Posterior, Surya Tapas Tokdar, STA 250 Duke University
- Probability Theory: The Logic of Science (Chapters 1-3) by E. T. Jaynes.
Notes¶
In terms of notation and terminology, I use an imprecise convention where I do not distinguish between distribution and density; it should be clear from context whether I'm referring to a CDF or a PDF. I also use $P(\cdot)$ to denote a density in many places where something like $p$ or $f$ would be more proper, but again it should be clear from the context in which it occurs.
Most of this section is based on the proofs in the appendix "Outline of proofs of limit theorems" in Bayesian Data Analysis, as well as Laplace Approximation to the Posterior in the references. I recommend taking a look at them for more details.
I couldn't find many good (simple) resources describing the Bernstein-von Mises Theorem other than Laplace Approximation to the Posterior. I haven't read any of the original papers on it, but I believe the general idea is contained in this theorem.