.. title: Autoregressive Autoencoders
.. slug: autoregressive-autoencoders
.. date: 2017-10-14 10:02:15 UTC-04:00
.. tags: autoencoders, autoregressive, generative models, MADE, MNIST, mathjax
.. category:
.. link:
.. description: A write up on Masked Autoencoder for Distribution Estimation (MADE).
.. type: text

.. |br| raw:: html

   <br />

.. |h2| raw:: html

   <h3>

.. |h2e| raw:: html

   </h3>

.. |h3| raw:: html

   <h4>

.. |h3e| raw:: html

   </h4>

.. |center| raw:: html

   <center>

.. |centere| raw:: html

   </center>
You might think that I'd be bored with autoencoders by now but I still find them extremely interesting! In this post, I'm going to be explaining a cute little idea that I came across in the paper *MADE: Masked Autoencoder for Distribution Estimation*. Traditional autoencoders are great because they can perform unsupervised learning by mapping an input to a latent representation. However, one drawback is that they don't have a solid probabilistic basis (of course there are other variants of autoencoders that do; see a few of my previous posts). By using what the authors define as the *autoregressive property*, we can transform the traditional autoencoder approach into a fully probabilistic model with very little modification! As usual, I'll provide some intuition, math and an implementation.

.. TEASER_END

|h2| Vanilla Autoencoders |h2e|

The basic autoencoder is a pretty simple idea. Our primary goal is to take an input sample :math:`x` and transform it to some latent representation :math:`z` (the *encoder*), which hopefully is a good representation of the original data. As usual, we need to ask ourselves: what makes a good representation? An autoencoder's answer: "*A good representation is one from which you can reconstruct the original input!*". The process of transforming the latent representation :math:`z` back to a reconstructed version of the input :math:`\hat{x}` is called the *decoder*. It's an "autoencoder" because it uses the same value :math:`x` on both the input and the output. Figure 1 shows a picture of what this looks like.

.. figure:: /images/autoencoder_structure.png
   :width: 400px
   :alt: Vanilla Autoencoder
   :align: center

   Figure 1: Vanilla Autoencoder (source: Wikipedia)

From Figure 1, we typically will use a neural network as the encoder and a different (usually similar) neural network as the decoder. Additionally, we'll typically put a sensible loss function on the output to ensure :math:`x` and :math:`\hat{x}` are as close as possible:

.. math::

    \mathcal{L}_{\text{binary}}({\bf x}) &= \sum_{i=1}^D -x_i\log \hat{x}_i - (1-x_i)\log(1-\hat{x}_i) \tag{1} \\
    \mathcal{L}_{\text{real}}({\bf x}) &= \sum_{i=1}^D (x_i - \hat{x}_i)^2 \tag{2}

Here we assume that our data point :math:`{\bf x}` has :math:`D` dimensions. The loss function we use will depend on the form of the data: for binary data, we'll use cross entropy, and for real-valued data, we'll use the mean squared error. These correspond to modelling :math:`x` as a Bernoulli and a Gaussian respectively (see the box below).

.. admonition:: Negative Log-Likelihoods (NLL) and Loss Functions

    The loss functions we typically use in training machine learning models are usually derived from an assumption about the probability distribution of each data point (typically assuming independent and identically distributed (IID) data). It just doesn't look that way because we typically use the negative log-likelihood as the loss function. We can do this because we're usually just looking for a point estimate (i.e. optimizing), so we don't need to worry about the entire distribution, just a single point that gives us the highest probability.

    For example, if our data is binary, then we can model it as a Bernoulli distribution with parameter :math:`p` on the interval :math:`(0,1)`. The probability of seeing a given 0/1 value :math:`x` is then:

    .. math::

        P(x) = p^x(1-p)^{(1-x)} \tag{3}

    If we take the logarithm and negate it, we get the binary cross entropy loss function:

    .. math::

        \mathcal{L}_{\text{binary}}(x) = -x\log p - (1-x)\log(1-p) \tag{4}

    This is precisely the expression from Equation 1, except we substitute :math:`x=x_i` and :math:`p=\hat{x}_i`, where the former is the observed data and the latter is the estimate of the parameter that our model gives.

    Similarly, we can do the same trick with a normal distribution. Given an observed real-valued data point :math:`x`, the probability density for parameters :math:`\mu, \sigma^2` is given by:

    .. math::

        p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{5}

    Taking the negative logarithm of this function, we get:

    .. math::

        -\log p(x) = \frac{1}{2}\log(2\pi \sigma^2) + \frac{1}{2\sigma^2} (x-\mu)^2 \tag{6}

    Now if we assume that the variance is the same fixed value for all our data points, then the only parameter we're optimizing for is :math:`\mu`. Adding and multiplying by constants doesn't change the optimal (highest probability) point, so we can simplify the expression (when optimizing) and still get the same point solution:

    .. math::

        \underset{\mu}{\operatorname{argmin}} \big({-\log p(x)}\big) = \underset{\mu}{\operatorname{argmin}}\ \mathcal{L}_{\text{real}}(x) = \underset{\mu}{\operatorname{argmin}}\ (x-\mu)^2 \tag{7}

    Here our observation is :math:`x` and our model produces an estimate of the parameter :math:`\mu`, i.e. :math:`\hat{x}` in this case. I have some more details on this in one of my previous posts on regularization.

|h3| Losing Your Identity |h3e|

Now this is all well and good, but an astute observer will notice that unless we add some additional constraints, our autoencoder can just set :math:`z=x` (i.e. the identity function) and generate a perfect reconstruction. What better representation for a reconstruction than *exactly* the original data? This is not desirable because we originally wanted to find a good latent representation for :math:`z`, not just regurgitate :math:`x`!

We can easily solve this, though, by making it difficult to learn the identity function. The easiest method is to make the dimensionality of :math:`z` smaller than that of :math:`x`. For example, if your image has 900 pixels (30 x 30), then make the dimensionality of :math:`z`, say, 100. In this way, you're "forcing" the autoencoder to learn a more compact representation. Another method, used in *denoising autoencoders*, is to artificially introduce noise on the input, :math:`x' = \text{noise}(x)` (e.g. Gaussian noise), but still compare the output of the decoder with the clean value of :math:`x`. The intuition here is that a good representation is robust to any noise you might add. Again, this prevents the autoencoder from just learning the identity mapping (because your input is no longer the same as your output). In both cases, you will eventually end up with a pretty good latent representation of :math:`x` that can be used in all sorts of applications such as semi-supervised learning.

|h3| A Not-So-Helpful Probabilistic Interpretation |h3e|

Although vanilla autoencoders can do pretty well at learning a latent representation of the data in an unsupervised manner, they don't have a useful probabilistic interpretation. We put a loss function on the outputs of the autoencoder in Equations 1 and 2, but that only leads to a trivial probability distribution! Let me explain.

Ideally, we would like the unsupervised autoencoder to learn the distribution of the data. That is, we would like to be able to approximate the *marginal* probability distribution :math:`P({\bf x})`, which lets us do a whole bunch of useful and interesting things (e.g. sampling). Our autoencoder, however, does not model the marginal distribution; it models the *conditional* distribution given input :math:`{\bf x}_i` (subscript denoting data point :math:`i`). Thus, our network is actually modelling :math:`P({\bf x} | {\bf x}_i)` -- a conditional probability distribution that is approximately centred on :math:`{\bf x}_i`, given that same data point as input.

This is weird in two ways. First, this conditional distribution is kind of fictional -- :math:`{\bf x}_i` is a single data point, so it doesn't really make sense to talk about a distribution centred on it. While you could argue it might have some relation to the other data points, the chances that the Bernoulli or Gaussian in Equations 1 and 2 model that correctly are pretty slim.
Second, the bigger problem is that you are giving :math:`{\bf x}_i` as input to try to generate a distribution centred on :math:`{\bf x}_i`! You don't need a neural network to do that! You can just pick a variance and slap an independent normal distribution on each output (for the continuous case). As we can see, trying to interpret this vanilla autoencoder network through the lens of probability is not very useful.

The more typical way you might actually want to use an autoencoder network is via just the decoder network. In this setting, you have a distribution conditioned on the latent variables, :math:`P({\bf x}|{\bf z})`. This is pretty much the setup we have in variational autoencoders (with some extra details outlined in my previous post on them). However, with vanilla autoencoders we have no probabilistic interpretation of the latent variable :math:`z`, so we still do not have a useful probabilistic interpretation.

For vanilla autoencoders, we started with a neural network and then tried to apply a probabilistic interpretation that didn't quite work out. Why not do it the other way around: start with a probabilistic model and then figure out how to use neural networks to add more capacity and scale it up?

|h2| Autoregressive Autoencoders |h2e|

So vanilla autoencoders don't quite get us to a proper probability distribution, but is there a way to modify them to get there? Let's review the product rule:

.. math::

    p({\bf x}) = \prod_{i=1}^{D} p(x_i | {\bf x}_{<i}) \tag{8}
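To make the product rule concrete, here is a small NumPy sketch (mine, not from the paper) that evaluates the likelihood of a binary vector one conditional at a time. The hand-coded ``toy_cond_prob`` is a hypothetical stand-in for the conditionals that MADE learns with a masked network:

```python
import itertools

import numpy as np


def autoregressive_nll(x, cond_prob):
    """Negative log-likelihood of a binary vector x under the product rule
    p(x) = prod_i p(x_i | x_{<i}).  cond_prob(x_prev) returns the Bernoulli
    parameter p(x_i = 1 | x_{<i})."""
    nll = 0.0
    for i in range(len(x)):
        p = cond_prob(x[:i])  # parameter of p(x_i = 1 | x_{<i})
        nll += -(x[i] * np.log(p) + (1 - x[i]) * np.log(1 - p))
    return nll


def toy_cond_prob(x_prev):
    # Hypothetical hand-coded conditional (each bit is biased towards the
    # previous bit); in MADE this is replaced by a learned masked network.
    if len(x_prev) == 0:
        return 0.5  # p(x_1 = 1)
    return 0.8 if x_prev[-1] == 1 else 0.2  # p(x_i = 1 | x_{i-1})


nll = autoregressive_nll(np.array([1, 1, 0, 0]), toy_cond_prob)
print(nll)  # -log(0.5 * 0.8 * 0.2 * 0.8) = -log(0.064) ~= 2.749

# Sanity check: because each conditional is a valid distribution over x_i,
# summing p(x) over all 2^4 binary vectors gives exactly 1.
total = sum(
    np.exp(-autoregressive_nll(np.array(bits), toy_cond_prob))
    for bits in itertools.product([0, 1], repeat=4)
)
print(total)  # ~= 1.0
```

The key point is that each factor only looks at the dimensions *before* it, so the product is a properly normalized joint distribution by construction -- this is exactly the *autoregressive property* that MADE enforces inside a single autoencoder pass.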