.. title: Autoregressive Autoencoders
.. slug: autoregressive-autoencoders
.. date: 2017-10-14 10:02:15 UTC-04:00
.. tags: autoencoders, autoregressive, generative models, MADE, MNIST, mathjax
.. category:
.. link:
.. description: A write up on Masked Autoencoder for Distribution Estimation (MADE).
.. type: text
.. |br| raw:: html

.. |H2| raw:: html

###
.. |H2e| raw:: html

.. |H3| raw:: html
####
.. |H3e| raw:: html

.. |center| raw:: html
.. |centere| raw:: html
You might think that I'd be bored with autoencoders by now but I still
find them extremely interesting! In this post, I'm going to be explaining
a cute little idea that I came across in the paper `MADE: Masked Autoencoder
for Distribution Estimation `_.
Traditional autoencoders are great because they can perform unsupervised
learning by mapping an input to a latent representation. However, one
drawback is that they don't have a solid probabilistic basis
(of course there are other variants of autoencoders that do, see previous posts
`here `__,
`here `__, and
`here `__).
By using what the authors define as the *autoregressive property*, we can
transform the traditional autoencoder approach into a fully probabilistic model
with very little modification! As usual, I'll provide some intuition, math and
an implementation.
.. TEASER_END
|h2| Vanilla Autoencoders |h2e|
The basic `autoencoder `_
is a pretty simple idea. Our primary goal is take an input sample
:math:`x` and transform it to some latent dimension :math:`z` (*encoder*),
which hopefully is a good representation of the original data. As
usual, we need to ask ourselves: what makes a good representation? An
autoencoder's answer: "*A good representation is one where you can reconstruct
the original input!*". The process of transforming the latent
dimension :math:`z` back to a reconstructed version of the input
:math:`\hat{x}` is called the *decoder*. It's an "autoencoder" because
it's using the same value :math:`x` value on the input and output. Figure 1
shows a picture of what this looks like.
.. figure:: /images/autoencoder_structure.png
:width: 400px
:alt: Vanilla Autoencoder
:align: center
Figure 1: Vanilla Autoencoder (source: `Wikipedia `_)
From Figure 1, we typically will use a neural network as the encoder and
a different (usually similar) neural network as the decoder. Additionally,
we'll typically put a sensible loss function on the output to ensure :math:`x`
and :math:`\hat{x}` are as close as possible:
.. math::
\mathcal{L_{\text{binary}}}({\bf x}) &= \sum_{i=1}^D -x_i\log \hat{x}_i - (1-x_i)\log(1-\hat{x_i}) \tag{1} \\
\mathcal{L_{\text{real}}}({\bf x}) &= \sum_{i=1}^D (x_i - \hat{x}_i)^2 \tag{2}
Here we assume that our data point :math:`{\bf x}` has :math:`D` dimensions.
The loss function we use will depend on the form of the data. For binary data,
we'll use cross entropy and for real-valued data we'll use the mean squared
error. These correspond to modelling :math:`x` as a Bernoulli and Gaussian
respectively (see the box).
.. admonition:: Negative Log-Likelihoods (NLL) and Loss Functions
The loss functions we typically use in training machine learning models are
usually derived by an assumption on the probability distribution of each
data point (typically assuming identically, independently distributed (IID)
data). It just doesn't look that way because we typically use the negative
log-likelihood as the loss function. We can do this because we're usually
just looking for a point estimate (i.e. optimizing) so we don't need to
worry about the entire distribution, just a single point that gives us the
highest probability.
For example, if our data is binary, then we can model it as a
`Bernoulli `__
with parameter :math:`p` on the interval :math:`(0,1)`. The probability
of seeing a given 0/1 :math:`x` value is then:
.. math::
P(x) = p^x(1-p)^{(1-x)} \tag{3}
If we take the logarithm and negate it, we get the binary cross entropy
loss function:
.. math::
\mathcal{L_{\text{binary}}}(x) = -x\log p - (1-x)\log(1-p) \tag{4}
This is precisely the expression from Equation 1, except we replace
:math:`x=x_i` and :math:`p=\hat{x_i}`, where the former is the observed
data and latter is the estimate of the parameters that our model gives.
Similarly, we can do the same trick with a
`normal distribution `__.
Given an observed real-valued data point :math:`x`, the probability density
for parameters :math:`\mu, \sigma^2` is given by:
.. math::
p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{5}
Taking the negative logarithm of this function, we get:
.. math::
-\log p(x) =
\frac{1}{2}\log(2\pi \sigma^2) + \frac{1}{2\sigma^2} (x-\mu)^2 \tag{6}
Now if we assume that the variance is the same fixed value for all our data
points, then the only parameter we're optimizing for is :math:`\mu`. So
adding and multiplying a bunch of constants to our main expression
doesn't change the optimal (highest probability) point so we can just
simplify it (when optimizing) and still get the same point solution:
.. math::
\underset{\mu}{\operatorname{argmax}} -\log p(x) =
\underset{\mu}{\operatorname{argmax}} \mathcal{L_{\text{real}}}(x) =
\underset{\mu}{\operatorname{argmax}} (x-\mu)^2
\\ \tag{7}
Here our observation is :math:`x` and our model would produce an
estimate of the parameter :math:`\mu` i.e. :math:`\hat{x}` in this case. I
have some more details on this in one of my previous posts on
`regularization `__.
|h3| Losing Your Identity |h3e|
Now this is all well and good but an astute observer will notice that unless we
put some additional constraints, our autoencoder can just set :math:`z=x` (i.e.
the identity function) and generate a perfect reconstruction. What better
representation for a reconstruction than *exactly* the original data? This
is not desirable because we originally wanted to find a good latent representation
for :math:`z`, not just regurgitate :math:`x`! We can easily solve this though
by making it difficult to learn just the identity function.
The easiest method is to just make the dimensions of :math:`z` smaller than
:math:`x`. For example, if your image has 900 pixels (30 x 30) then make the
dimensions of :math:`z`, say 100. In this way, you're "forcing" the
autoencoder to learn a more compact representation.
Another method used in *denoising autoencoders* is to artificially introduce
noise on the input :math:`x' = \text{noise}(x)` (e.g. Gaussian noise) but still
compare the output of the decoder with the clean value of :math:`x`. The
intuition here is that a good representation is robust to any noise that you
might give it. Again, this prevents the autoencoder from just learning the
identify mapping (because your input is not the same as your output anymore).
In both cases, you will eventually end up with a pretty good latent representation
of :math:`x` that can be used in all sorts of applications such as
`semi-supervised learning `__.
|h3| A Not-So-Helpful Probabilistic Interpretation |h3e|
Although vanilla autoencoders can do pretty well in learning a latent
representation of the data in an unsupervised manner, they don't have a useful
probabilistic interpretation. We put a loss function on the outputs of the
autoencoder in Equation 1 and 2 but that only leads to a trivial probabilistic
distribution! Let me explain.
Ideally, we would like the unsupervised autoencoder to learn the distribution
of the data. That is, we would like to be able to approximate the *marginal*
probability distribution :math:`P({\bf x})`, which let's us do a whole bunch
of useful and interesting things (e.g. sampling). Our autoencoder, however,
does not model the marginal distribution, it models the *conditional*
distribution given inputs :math:`\bf x_i` (subscript denoting data point
:math:`i`). Thus, our network is actually modelling :math:`P({\bf x} | {\bf
x_i})` -- a conditional probability distribution that is approximately centred
on :math:`x_i`, given that same data point as input.
This is weird in two ways. First, this conditional distribution is kind of
fictional -- :math:`x_i` is a single data point, so it doesn't really make
sense to talk about a distribution centred on it. While you could argue it
might have some relation to the other data points, the chances that the Bernoulli
or Gaussian in Equation 1 and 2 model that correctly are pretty slim.
Second, the bigger problem is that you are giving :math:`x_i` as input to
try to generate a distribution centred on :math:`x_i`! You don't need a neural
network to do that! You can just pick a variance and slap an independent normal
distribution on each output (for the continuous case). As we can see, trying
to interpret this vanilla autoencoder network using the lens of probability is not
very useful.
The more typical way you might want to actually use an autoencoder network is
by just using the decoder network. In this setting, you have a distribution
conditional on the latent variables :math:`P({\bf x}|{\bf z})`. This is
pretty much the setup we have in `variational autoencoders
`__ (with some extra details outlined in
the linked post). However, with vanilla autoencoders we have no probabilistic
interpretation of the latent variable :math:`z`, thus still do not have a useful
probabilistic interpretation.
For vanilla autoencoders, we started with some neural network and then tried to
apply some sort of probabilistic interpretation that didn't quite work out. Why
not do it the other way around: start with a probabilistic model and then
figure out how to use neural networks to help you add more capacity and scale
it?
|h2| Autoregressive Autoencoders |h2e|
So vanilla autoencoders don't quite get us to a proper probability distribution
but is there a way to modify them to get us there? Let's review the
`product rule `__:
.. math::
p({\bf x}) = \prod_{i=1}^{D} p(x_i | {\bf x}_{* m^{L}(k) \\
0 \text{ otherwise}
\end{array}
\right. \\ \tag{11}
which replaces the less than equal with just an equal. This is important
because the first node should not depend on any other ones so it should not
have any connections (will only have the bias connection), and the last node
can have connections (recursively) to every other node except its respective
input.
Finally, one last topic to discuss is how to assign :math:`m^l(k)`. It doesn't
really matter too much as long as you have enough connections for each index.
The paper did a natural thing and just sampled from a uniform distribution
with range :math:`[1, D-1]`. Why only up to :math:`D-1`? Recall, we should
never assign index :math:`D` because it will never be used so there's no use
in connecting anything to :math:`D` (nothing can ever depend on the
:math:`D^{\text{th}}` input). Figure 2 (from the original paper) shows this
whole process pictorially.
.. figure:: /images/made_mask.png
:width: 450px
:alt: MADE Masks
:align: center
Figure 2: MADE Masks (Source: `[1] `__)
A few things to notice:
* Output 1 is not connected to anything. It will just be estimated with a
single constant parameter derived from the bias node.
* Input 3 is not connected to anything because no node should depend on it
(autoregressive property).
* :math:`m^l(k)` are more or less assigned randomly.
* If you trace back from output to input, you will see that the autoregressive
property is maintained.
So then implementing MADE is as simple as providing a weight mask
and doing an extra element-wise product. Pretty simple, right?
|h3| Ordering Inputs, Masks, and Direct Connections |h3e|
A few other minor topics that can improve the performance of the MADE. The
first is the ordering of the inputs. We've been taking about "Input 1, 2, 3,
..." but usually there is no natural ordering of the inputs. We can
arbitrarily pick any ordering that we want just by shuffling :math:`{\bf m^0}`,
the selection layer for the input. This can even be performed at each
mini-batch to get an "average" over many different models.
The next idea is also very similar, instead of just resampling the input
selection, resample all :math:`{\bf m^L}` selections. In the paper, they
mention the best results are having a fixed number of configurations for these
selections (and their corresponding masks) and rotating through them in the
mini-batch training.
The last idea is just to add a direct connection path from input to output
like so:
.. math::
{\hat{\bf x}} = \text{sigm}\big({\bf c} + {\bf (V \odot M^V)h(x)}\big)
+ \big({\bf A} \odot {\bf M^A}\big){\bf x} \tag{12}
where :math:`{\bf A}` is the weight matrix that directly connects inputs to outputs,
and :math:`{\bf M^A}` is the corresponding mask matrix that follows the
autoregressive property.
|h3| Generating New Samples |h3e|
One final idea that isn't explicitly mentioned in the paper is how to generate
new samples. Remember, we now have a fully generative probabilistic model for
our autoencoder. It turns out it's quite easy but a bit slow. The main idea
(for binary data):
1. Randomly generate vector :math:`{\bf x}`, set :math:`i=1`.
2. Feed :math:`{\bf x}` into autoencoder and generate outputs
:math:`\hat{\bf x}` for the network, set :math:`p=\hat{x_i}`.
3. Sample from a Bernoulli distribution with parameter :math:`p`, set
input :math:`x_{i}=\text{Bernoulli}(p)`.
4. Increment :math:`i` and repeat steps 2-4 until `i > D`.
Basically, we're iteratively calculating :math:`p(x_i|{\bf x_{**`__).
My implementation is a lot simpler than the one used in the paper. I used
Keras and created a custom "MADE" layer that took as input the number of layers,
number of hidden units per layer, whether or not to randomize the input
selection, as well as standard stuff like dropout and activation function.
I didn't implement any of the randomized masks for minibatchs because it was
a bit of a pain. I did implement the direct connection though.
*(As an aside: I'm really a big fan of higher-level frameworks like Keras,
it's quite wonderful. The main reason is that for most things I have the nice
Keras frontend, and then occasionally I can dip down into the underlying
primitives when needed via the Keras "backend". I suspect when I eventually
get around to playing with RNNs it's not going to be as wonderful but for now
I quite like it.)*
I was able to generate some new digits that are not very pretty, shown
in Figure 3.
.. figure:: /images/mnist-made.png
:width: 400px
:alt: Generated MNIST images using Autoregressive Autoencoder
:align: center
Figure 3: Generated MNIST images using Autoregressive Autoencoder
It's a bit hard to make out any numbers here. If you squint hard enough, you
can make out some "4"s, "3"s, "6"s, maybe some "9"s? The ones in the paper look
a lot better (although still not perfect, there were definitely some that were
hard to make out).
The other thing is that I didn't use their exact version of binarized MNIST,
I just took the one from Keras and did a `round()` on each pixel. This might
also explain why I was unable to get as good of a negative log-likelihood as
them. In the paper they report values :math:`< 90` (even with a single mask)
but the lowest I was able to get on my test set was around :math:`99`, and that
was after a bunch of tries tweaking the batch and learning rate (more typical
was around :math:`120`). It could be that their test set was easier, or the
fact that they did some hyper-parameter tuning for each experiment, whereas I
just did some trial and error tuning.
|h3| Implementation Notes |h3e|
Here are some random notes that I came across when building this MADE:
* Adding a direct (auto-regressive) connection between inputs and outputs
seemed to make a huge difference (150 vs. < 100 loss). For me, this
basically was the make-or-break piece for implementing a MADE. It's funny that
it's just a throw-away paragraph in the actual paper. Probably because the
idea was from an earlier paper in 2000 and not the main contribution of the
paper. For some things, you really have to implement it to understand the
important parts, papers don't tell the whole story!
* I had to be quite careful when coding up layers since getting the indexes for
selection exactly right is important. I had a few false starts because I
mixed up the indexes. When using the high-level Keras API, there's not much
of this detailed work, but when implementing your own layers it's important!
* I tried a random ordering (just a single one for the entire training, not
one per batch) and it didn't really seem to do much.
* In their actual implementation, they also add dropout to all their layers. I
added it too but didn't play around with it much except to try to tune it to
get a lower NLL. One curious thing I found out was about using the
`set_learning_phase() `__ API. When implementing
dropout, I basically just took the code from the dropout layer and inserted
into my custom layer. However, I kept getting an error, it turns out that
I had to use `set_learning_phase(1)` during training, and
`set_learning_phase(0)` during prediction because the Keras dropout
implementation uses `in_train_phase(, )`, which
switches between two behaviours for training/testing based on the status of
this bit. For some reason when using the regular dropout layer you don't
have to do this but when doing it in a custom layer you do? I suspect I
missed something in my custom layer that happens in the dropout layer.
|h2| Conclusion |h2e|
So yet *another* post on autoencoders, I can't seem to get enough of them!
Actually I still find them quite fascinating, which is why I'm following this
line of research about fully probabilistic generative models. There's still at
least one or two more papers in this area that I'm really excited to dig into
(at which point I'll have approached the latest published work), so expect more
to come!
|h2| Further Reading |h2e|
* Previous posts: `Variational Autoencoders `__, `A Variational Autoencoder on the SVHN dataset `__, `Semi-supervised Learning with Variational Autoencoders `__
* My implementation on Github: `notebook `__
* [1] "MADE: Masked Autoencoder for Distribution Estimation", Germain, Gregor, Murray, Larochelle, `ICML 2015 `_
* Wikipedia: `Autoencoder `_
* Github code for "MADE: Masked Autoencoder for Distribution Estimation", https://github.com/mgermain/MADE
*