A Note on Using Log-Likelihood for Generative Models

One of the things that I find is usually missing from many ML papers is how they relate to the fundamentals. There's always a throwaway line where it assumes something that is not at all obvious (see my post on Importance Sampling). I'm the kind of person who likes to understand things to a satisfactory degree (it's literally in the subtitle of the blog) so I couldn't help myself investigating a minor idea I read about in a paper.

This post investigates how to use continuous density outputs (e.g. a logistic or normal distribution) to model discrete image data (e.g. 8-bit RGB values). It seems like it might be something obvious such as setting the loss as the average log-likelihood of the continuous density and that's almost the whole story. But leaving it at that skips over so many (interesting) and non-obvious things that you would never know if you didn't bother to look. I'm a curious fellow so come with me and let's take a look!

Review of Some Fundamental Concepts

Before we get into the meat of the topic, I want to spend some time on some basic questions regarding mixing continuous and discrete data (inspired from [2,3]):

How you find the probability of seeing observations drawn from a continuous distributions? e.g. What's the probability of drawing \(x=0\) from a standard normal distribution?

Seems like a simple problem, right? Well, there are lots of paradoxes lurking here, so let's clarify them before moving on.

Zero Probability Events

For example, assume you have sampled some data from an IID continuous distribution, what is the probability of that happening? Let's reason it out.

To make things concrete, assume we have a \(U\) continuous uniform distribution with support \([0,1]\). Let's draw a sample \(x\) from \(U\), what is the probability that we drew \(x\)? Well we know a few things about this distribution, in particular, it's probability density function and how to compute probability from it:

\begin{align*} f_U(y) &= \frac{1}{b-a} = \left\{ \begin{array}{ll} 1 && \text{for support } [a,b] \\ 0 && \text{otherwise} \end{array} \right. \tag{1} \\ P(u \leq y \leq v) &= \int_u^v f_U(y) dy = y \big|_u^v = v - u && \text{for } u, v \in [0,1] \tag{2} \end{align*}

Very fundamental indeed! So we know how to compute probability (Equation 2) over a distribution if we have a continuous interval, but that doesn't quite get us to a single data point. Let's take the limit of Equation 2 to see how the interval behaves as it approaches a single data point:

\begin{align*} P(x|U) &= \lim_{\epsilon \to 0}{P(x - \epsilon \leq y \leq x + \epsilon)} \\ &= \lim_{\epsilon \to 0} \int_{x-\epsilon}^{x+\epsilon} f_U(y) dy \\ &= \lim_{\epsilon \to 0} y \big|_{x-\epsilon}^{x+\epsilon} \\ &= \lim_{\epsilon \to 0} { (x + \epsilon) - (x - \epsilon) } \\ &= 0 \\ \tag{3} \end{align*}

So the probability of \(x\) occurring is \(0\)?!

Yes! Which is a bit of a paradox since we did indeed draw it from a distribution, so it definitely "happened". This is also not particular to the uniform distribution, if you do the same with any (reasonable) distribution, you'll find the same thing. How can this be possible?

Another way to think about this: what if the probability of observing \(x\) wasn't \(0\) but rather some positive number \(\epsilon\)? That means every number in the range \([0, 1]\) would have some positive probability of being observed. However, all those numbers/events are mutually exclusive, so the sum of them should add up to \(1\) but this isn't possible because you have infinitely many of them, so each one must be infinitely small. Infinities are weird.

Hypothesis Testing with Continuous Data and Log-Likelihoods

So how can resolve this paradox? As usual, we're asking the wrong question! When we have observed an event, what we're really asking is "what is the probability that this event is generated by this hypothesis?". In other words, we're really doing hypothesis testing! (Here, we're talking about statistical hypotheses, which is an assumption about a distribution with a particular set of parameters.)

So let's try this again but a bit more generically. Let's say we have observed a real-valued datum \(x\) (this can easily be extended to multiple points given IID data), and we have \(N\) hypothesis, \(M_1, M_2, \ldots, M_N\), each representing a fixed parameter distribution with PDFs \(f_i\) and CDFs \(F_i\).

(Note: we'll use the notation "\(M\)" here because it also implies that our hypotheses are trained models. That is, a set of assumptions about the data with a particular set of trained parameters. You can start to see how this is going to lead us to generative models...)

Let's formulate our question: "what is the probability of hypothesis \(M_1\) given datum \(x\)" (using Bayes rule):

\begin{align*} P(M_1 | x) &= \frac{P(x | M_1)P(M_1)}{P(x)} \\ &= \frac{P(x | M_1)P(M_1)}{\sum_{i=1}^N P(x | M_i)P(M_i)} \\ \tag{4} \end{align*}

Here we're using the standard expansion of the denominator for \(P(x)\). As with Bayesian analysis, we need some prior for how likely each model occurs. Let's just assume we have no intuition about which model is better, so they're equally likely. Since we're also dealing with continuous values, we'll use the "\(\epsilon\) trick" we used above:

\begin{align*} P(M_1 | x) &= \frac{P(x | M_1)P(M_1)}{\sum_{i=1}^N P(x | M_i)P(M_i)} \\ &= \lim_{\epsilon \to 0} \frac{[\int_{x-\epsilon}^{x+\epsilon} f_1(x) dx][\frac{1}{N}]}{ \sum_{i=1}^N [\int_{x-\epsilon}^{x+\epsilon} f_i(x) dx][\frac{1}{N}]} \\ &= \lim_{\epsilon \to 0} \frac{F_1(x+\epsilon) - F_1(x-\epsilon)}{\sum_{i=1}^N \big(F_i(x+\epsilon) - F_i(x-\epsilon)\big)} \\ &= \frac{\lim_{\epsilon \to 0} \frac{F_1(x+\epsilon) - F_1(x-\epsilon)}{\epsilon}}{ \sum_{i=1}^N \lim_{\epsilon \to 0} \big(\frac{F_i(x+\epsilon) - F_i(x-\epsilon)}{\epsilon}\big)} && \text{divide top and bottom by } \epsilon \text{ and distribute limit} \\ &= \frac{f_1(x)}{\sum_{i=1}^N f_i(x)} && \text{definition of derivative} \\ \tag{5} \end{align*}

You might have to brush off your calculus a bit with the comments above, but I think you should be able to follow along. The last step is not the typical definition of a derivative but it's should be equivalent. (Note: this derivative probably only works for smooth functions.)

The interesting thing here is that we've totally resolved the problem of dealing with continuous data! We're dealing only with PDFs now and removed the zero probability case when looking at \(P(x|M_i)\) in isolation.

Example 1: Probability an Observation is from a Given Two Gaussian Hypotheses

Suppose we have a datum \(x=0\) that we know is drawn from one of two Gaussian distributions:

  • \(M_1: \mathcal{N}(\mu=0, \sigma^2=1)\)

  • \(M_2: \mathcal{N}(\mu=1, \sigma^2=1)\)

What is the probability of \(x\) being drawn from each distribution (assuming our priors are equally likely)? Equivalently, what is \(P(M_1 | x)\) and \(P(M_2 | x)\) with \(P(M_1)= P(M_2)=0.5\)?

Using Equation 5 it is simply just the PDFs of the two normalized by our priors:

\begin{align*} P(M_1|x) &= \frac{f_{M_1}P(M_1)}{f_{M_1}P(M_1)+ f_{M_2}P(M_2)} \\ &= \frac{-e^{x^2/2}(0.5)}{e^{-x^2/2}(0.5) + e^{-(x-1)^2/2}(0.5)} &\approx 0.6225 \\ P(M_2|x) &= \frac{f_{M_2}P(M_2)}{f_{M_1}P(M_1)+ f_{M_2}P(M_2)} \\ &= \frac{-e^{(x-1)^2/2}(0.5)}{e^{-x^2/2}(0.5) + e^{-(x-1)^2/2}(0.5)} &\approx 0.3775 \tag{6} \end{align*}

Therefore, \(x\) is more likely to be drawn from distribution \(M_1\).

Before we finish off this section, we should notice one thing. Given a fixed set of hypotheses (or models) the denominator in Equation 5 doesn't change. That is, \(P(X)\) is constant. Therefore, assuming all models are equally likely, we can do a relative comparison of models just by their PDFs. From Equation 5:

\begin{equation*} P(M_i | x) = \frac{f_i(x)}{\sum_{i=1}^N f_i(x)} \propto f_i(x) \tag{7} \end{equation*}

Further, given a IID data for \(x\), we can do a relative comparison of (fitted) models by taking the logarithm:

\begin{align*} P(M_i | x_1 \ldots x_n) &= \frac{f_i(x_1)\ldots f_i(x_n)}{\sum_{i=1}^N f_i(x_1)\ldots f_i(x_n)} \\ \log{P(M_i | x_1 \ldots x_n)} &= \log\big[\frac{f_i(x_1)\ldots f_i(x_n)}{\sum_{i=1}^N f_i(x_1)\ldots f_i(x_n)}\big] \\ &= \log{f_i(x_1)} + \ldots + \log{f_i(x_n)} - \log{[\sum_{i=1}^N f_i(x_1)\ldots f_i(x_n)]} \\ &\propto \log{f_i(x_1)} + \ldots + \log{f_i(x_n)} \\ &= \sum_{j=1}^n \log{f_i(x_j)} \\ \tag{8} \end{align*}

Where the last line of Equation 8 should look very familiar: it's the standard log-likelihood that we maximize in many ML and statistical models. Thus, we can directly compare how well a model represents some data using the loss from the log-likelihood as you would expect. (However, keep in mind it is not a probability.)

Cross Entropy and Expected Message Length

I won't go too deep into the concept of (information) entropy, you can find a more detailed explanation in my previous post on Maximum Entropy Distributions. Information Entropy is defined as follows over a discrete probability distribution \(P\):

\begin{equation*} H(p) = -\sum_{x \in \mathcal{X}} P(x) \log P(x) \tag{9} \end{equation*}

One of the interesting properties for discrete random variables is that the entropy provides a lower bound for how much (on average) those symbols can be compressed (via the Shannon Coding Theorem). This is squarely in the domain of information theory and communication systems.

For example, a very simplified application might be you might be sending a string of bytes to your friend over a Wifi channel, how should you encode each byte (i.e. symbol) into bits so that it minimizes the length of the message? Intuitively, you would want the most frequent byte values (symbols) to have the shortest length, the next most frequent byte to have a slightly longer length, etc. Huffman Coding is an example of such a scheme that enables lossless compression that is optimal. Optimality (that is, symbol-to-symbol optimality) is precisely the entropy lower bound.

A related concept is that of cross entropy: given two probability discrete distributions \(P\) and \(Q\) defined over the same set, we have:

\begin{align*} H(p,q) &= -\sum_{x \in \mathcal{X}} P(x) \log Q(x) \\ &= H(P) + D_{KL}(P||Q) \\ \tag{10} \end{align*}

This looks almost identical to the definition of entropy in Equation 9 except we replace \(P\) with \(Q\) inside the logarithm. As you would expect, this also has an equivalent interpretation: cross entropy gives the average number of bits (using base 2 logarithm) needed to code a symbol drawn from \(P\) using the optimal code for \(Q\). The second line of Equation 10 also shows us another interpretation of cross entropy: \(H(P,Q)\) is equal to entropy of \(P\) plus the KL divergence between \(Q\) from \(P\).

Cross entropy can also be defined for the continuous case too for continuous densities \(p\) and \(q\):

\begin{equation*} H(p, q) = -\int_{\mathcal{X}} P(x)\log Q(x) dx \tag{11} \end{equation*}

Note: We should be careful distinguishing between information entropy defined in Equation 9 on discrete variables and the continuous version called differential entropy. Differential entropy has a similar form but doesn't have the same nice intuitive meaning of encoding into bits. It also doesn't have nice properties, for example, differential entropy can be negative. A more interpretable quantity is the KL divergence, which is the "number of extra bits to encode P using Q". See this Stack Exchange question for more details.

Generative Models, Log-Likelihoods and Image Data

Evaluating generative models is a tricky subject mostly because there is no "one metric to rule them all". Unlike classifiers or regression problems, there is no singular concept of "accuracy" or "error". Generally, this is because we evaluate generative models in two broad ways (a) quality of the samples, and (b) average log-likelihoods. Both of these metrics do not necessarily track each other, in other words, we can have high log-likelihoods but low quality samples and vice versa (of course they can be high or low on both too). A more thorough discussion of this topic is in [1].

However in this post, I just want to focus on the average log-likelihood method for now, in particular, interpretations of it in terms of probability for image data. The (usual) reason why this is of more concern is that it's easy to measure. For example, how can we measure the quality of a generated image? There's no obvious ways to do it ([1] discusses a few approximations to it). That's why focusing on likelihood methods is so attractive (and perhaps misguided?) because it's easier to interpret and compare different models.

This section will go over two cases for image data in particular: models with discrete outputs and models with continuous outputs.

Discrete Models

Image data is naturally discrete. For a typical a 8-bit pixel (or subpixel), you have \(2^8 = 256\) possible values representing its intensity. This naturally lends itself well to a generative model outputting a discrete value. There are two primary ways (that I know of) to model these pixels.

The first is pretty simple: just have a 256-way softmax for each pixel with a cross entropy loss. This is the most straightforward and direct way to model each pixel. This is the method used in the original PixelCNN paper [4]. The main issue with this is that the resulting network is huge because you have 256 times the number of subpixels you have (e.g. \(32 x 32 x 3 x 256 = 3072 * 256 = 786432\)). This can't fit on any reasonably sized GPU. The other issue is that qualitatively, pixel intensity \(x \in [0, 255]\) should be close to \(x+1\), but if we model it as a softmax, they are more or less independent with respect to their loss function so your model doesn't capture this intuitive property. In any case, using this method should theoretically generate a good model if you can practically fit it.

The other method described in PixelCNN++ [5] uses a different tactic. They use a two step process: first model the intensity as a continuous distribution then "round" to the nearest pixel by integrating the continuous distribution in the region around the pixel. From my post on PixelCNN, the rounding step works like so (see the post for more details):

\begin{equation*} P(x|\mu,s) = \begin{cases} \sigma(\frac{x-\mu+0.5}{s}) & \text{for } x = 0 \\ \sigma(\frac{x-\mu+0.5}{s}) - \sigma(\frac{x-\mu-0.5}{s}) & \text{for } 0 < x < 255 \\ 1 - \sigma(\frac{x-\mu-0.5}{s}) & \text{for } x = 255 \end{cases} \tag{12} \end{equation*}

Here \(\sigma\) is the CDF of our continuous pixel intensity distribution parameterized by \(\mu, s\). To find the probability of a given pixel \(P(x|\mu,s)\), we simply integrate the distribution across a pixel width (i.e. take the difference of the CDFs).

This is actually a really elegant solution because we have the nice property adjacent pixels being similar to each other (assuming a smooth distribution) and we have a clear way to generate a probability. It also is much more parameter efficient (2 parameters vs. 256) but practically you'll need a more complex distribution. In the paper, they use a mixture of five logistic distributions, so it's 10 parameters vs. 256, still a win.

Finally, training the model is as simply as minimizing the negative log-likelihood of Equation 12 (for \(N\) IID images):

\begin{align*} \mathcal{L}({\bf \mu, s}) &= -\log P({\bf x_1, \ldots, x_N}) \\ &= -\log[P({\bf x_1}),\ldots, P({x_N})] \\ &= -\sum_{i=1}^N \log P({\bf x_i}|{\bf \mu, s}) \\ \tag{13} \end{align*}

If we want the average log-likelihood, we can divide Equation 13 by \(N\) to get:

\begin{align*} \frac{1}{N}\mathcal{L}({\bf \mu, s}) &= -\sum_{i=1}^N \frac{1}{N} \log P({\bf x_i}|{\bf \mu, s}) \\ &= -\sum_{i=1}^N p_{true}({\bf x_i}) \log P({\bf x_i}|{\bf \mu, s}) \\ \tag{14} \end{align*}

where \(p_{true} = \frac{1}{N}\) because we assume the sample dataset we is uniformly sampled from the "true distribution" of images. As you can see this directly gives us the cross-entropy from Equation 10. This means that we can directly interpret our average log-likelihood loss in terms of cross entropy, which gives us the "average number of bits (using base 2 logarithm) needed to code a sample from \(p_{true}\) using our model \(P\)". Dividing this by the number of pixels, gives us the "bits per pixel" metric that we see often in papers (e.g. [4], [5]).

Continuous Models

What do you do when your image data is discrete but your model outputs a continuous distribution (e.g. normal, logistic etc.)? When mixing discrete image data with continuous distributions you get zero probability events like we discussed above, which could lead to a infinite differential entropy during training 1. One alternative is the "rounding" method above. Another "trick" described in [1] (and previously in a few other papers), is to add uniform noise to de-quantize the data. Let's see how this works.

Suppose we have image data \({\bf x} \in \{0,\ldots,255\}^D\) with a discrete probability distribution \(P(X)\), uniform noise \({\bf u} \in [0,1]^D\), and define noisy data \({\bf y} = {\bf x} + {\bf u}\). Let \(p\) refer to the noisy data density and \(q\) refer to the continuous density output by our model. Let's take a look at the negative cross entropy of these two distributions:

\begin{align*} -H(p,q) &= \int p({\bf y})\log q({\bf y})d{\bf y} \\ &= \int_{\bf y} \int_{\bf v} p({\bf y})\log q({\bf y})d{\bf v}d{\bf y} && \text{add dummy variable } v \\ &= \int_{\bf x} \int_{\bf u} p({\bf x} + {\bf u})\log q({\bf x} + {\bf u})d{{\bf u}}d{\bf x} && \text{change of variable: } y=x+u, v=u \\ &= \sum_{\bf x} \int_{\bf u\in [0,1]^D} p({\bf x} + {\bf u})\log q({\bf x} + {\bf u})d{{\bf u}} \\ &= \sum_{\bf x} P({\bf x}) \int_{\bf u\in [0,1]^D} \log q({\bf x} + {\bf u})d{{\bf u}} && p({\bf x} + {\bf u}) := P({\bf x}) \\ &\leq \sum_{\bf x} P({\bf x}) \log \big[\int_{\bf u\in [0,1]^D} q({\bf x} + {\bf u}) d{{\bf u}\big]} && \text{by Jensen's inequality} \\ &= \sum_{\bf x} P({\bf x}) \log Q({\bf x}) \\ &= -H(P,Q) \\ \tag{15} \end{align*}

where we define \(Q({\bf x}) = \int_{\bf u\in [0,1]^D} q({\bf x} + {\bf u}) d{\bf u}\). A couple of points on the derivation:

  • When doing the change of variable we implicitly are using the determinant of the Jacobian but in this case it's just 1.

  • The integral of \(\int_x dx = \sum_x\) since \(x\) is discrete.

  • We defined the noisy data density \(p({\bf x} + {\bf u})\) as \(P({\bf x})\) since that's the most logical way to define the density (the density should sum to 1 as expected).

When training a generative model, we'll usually try to minimize \(H(p,q) \geq H(P,Q)\), which sets an upper bound on the cross entropy of our original discrete data distribution \(P({\bf x})\) and our "discretized" continuous density distribution \(Q({\bf x})\). In a similar way to the discrete models above, we can then also be interpret the loss as the average number of bits to encode the image (or "bits per pixel" if dividing through by the total number of pixels).

On a final note, as before, we would typically assume IID data for our \(N\) training data points \({\bf x}\) and that we would draw \(M\) uniform noise samples for each data point, which leads us to:

\begin{align*} H(p,q) &= -\int p({\bf y})\log q({\bf y})d{\bf y} \\ &= -\sum_{i=1}^N P({\bf x_i}) \int_{\bf u\in [0,1]^D} \log q({\bf x_i} + {\bf u})d{{\bf u}} \\ &= -\sum_{i=1}^N \frac{1}{N} \int_{\bf u\in [0,1]^D} \log q({\bf x_i} + {\bf u})d{{\bf u}} \\ &\approx -\sum_{i=1}^N \frac{1}{N} \big[\frac{1}{M}\sum_{j=1}^M \log q({\bf x_i} + {\bf u_j})\big] \\ &= -\frac{1}{NM} \sum_{i=1}^N \sum_{j=1}^M \log q({\bf x_i} + {\bf u_j}) \\ \tag{16} \end{align*}

So after all, our loss function is still just the average log-likelihood with the addition that we're averaging over \(M\) uniform noise samples per data point. Usually we just draw a single uniform noise sample per data point, per epoch, which, given enough iterations, will also converge to the same value. Note the caveat is that this "trick" is just estimating the upper bound of "bits per pixel" because of our use of a continuous density output of our model.


A bit of a digression but, as usual, I came across it while investigating a topic but realized I didn't fully understand a supporting idea. I really like some of these side posts because they explain in fundamental terms (i.e. probability), ideas that are taken for granted in many papers. What's nice is that these explanations can help all those unwitting souls (like myself) who want a deeper understanding of the throwaway lines you commonly see in ML papers. Hope you liked it!

Further Reading


Imagine you have a single data point \(x=0\) and you are trying to learn a normal distribution. What's the best fit normal distribution? Well definitely you want the mean to be 0 but what about the variance? Of course, you want it to be infinitesimally small which maximizes the density, but this means in the limit the density tends to infinity causing the degenerate scenario described above.

Hi, I'm Brian Keng. This is the place where I write about all things technical.

Twitter: @bjlkeng

Signup for Email Blog Posts