.. title: A Probabilistic Interpretation of Regularization
.. slug: probabilistic-interpretation-of-regularization
.. date: 2016-08-29 8:52:33 UTC-04:00
.. tags: probability, regularization, Bayesian, mathjax
.. category:
.. link:
.. description: A look at regularization through the lens of probability.
.. type: text
.. |br| raw:: html

   <br />

.. |h2| raw:: html

   <h2>

.. |h2e| raw:: html

   </h2>

.. |h3| raw:: html

   <h3>

.. |h3e| raw:: html

   </h3>

This post is going to look at a probabilistic (Bayesian) interpretation of
regularization. We'll take a look at both L1 and L2 regularization in the
context of ordinary linear regression. The discussion will start off
with a quick introduction to regularization, followed by a back-to-basics
explanation starting with the maximum likelihood estimate (MLE), then on to the
maximum a posteriori estimate (MAP), and finally playing around with priors to
end up with L1 and L2 regularization.
.. TEASER_END
|h2| Regularization |h2e|
`Regularization `_
is the process of introducing additional information in order to solve
ill-posed problems or prevent overfitting. A trivial example is trying to
fit a simple linear regression when you only have one data point. In this
case, you can't estimate both the slope and intercept (you need at least two
points), so any MLE estimate (which uses *only* the data) will be ill-posed.
Instead, if you provide some "additional information" (i.e. prior
information [1]_), you can get a much more reasonable estimate.
To make things a bit more concrete, let's talk about things in the context of
`ordinary linear regression `_.
Recall from my previous post on
`linear regression `_
(Equation 11 in that post)
that the maximum likelihood estimate for ordinary linear regression is given by:
.. math::

   {\bf \hat{\beta}_{\text{MLE}}}
   &= \arg\min_{\bf \beta} \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 \\
   &= \arg\min_{\bf \beta} \sum_{i=1}^{n} (y_i-\hat{y_i})^2
   \tag{1}

The estimate is quite intuitive: pick the coefficients (:math:`\beta_j`) that
minimize the squared error between the observed values (:math:`y_i`) and those
generated by our linear model (:math:`\hat{y_i}`).
In a similar vein as above, consider what happens when we only have one data
point :math:`(y_0, {\bf x_0})` but more than one coefficient. There are any
number of possible "lines" (or, equivalently, sets of coefficients) that we
could draw to minimize Equation 1. Thinking back to high school math, this is
analogous to estimating the slope and intercept of a line with only one point.
Definitely a problem they didn't teach you in high school.
I'm using examples where we don't have enough data, but there could be other
types of issues, such as
`collinearity `_, that may not outright prevent
fitting the model but will likely produce an unreasonable estimate.
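To make the ill-posed case concrete, here's a minimal sketch (the data point is made up for illustration) showing that with a single observation and two unknowns, every choice of slope yields a line that fits perfectly, so Equation 1 has no unique minimizer:

```python
# A single observed point (x=2, y=3) but two unknowns (intercept, slope):
# any line passing through the point achieves zero squared error.
x0, y0 = 2.0, 3.0

for slope in [0.0, 1.0, -5.0]:
    intercept = y0 - slope * x0  # pick the intercept so the line hits the point
    residual = y0 - (intercept + slope * x0)
    print(slope, intercept, residual)  # residual is 0.0 for every slope
```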
Two common schemes for regularization add a simple modification to Equation 1:
.. math::

   {\bf \hat{\beta}_{\text{L1}}}
   = \arg\min_{\bf \beta} \big( \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2
   + \lambda \sum_{j=0}^{p} | \beta_j | \big)
   \\
   \tag{2}

   {\bf \hat{\beta}_{\text{L2}}}
   = \arg\min_{\bf \beta} \big( \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2
   + \lambda \sum_{j=0}^{p} \beta_j^2 \big)
   \tag{3}

`L1 regularization `_
(also known as LASSO in the context of linear regression) promotes sparsity of coefficients.
Sparsity means that some coefficients take nonzero values while the rest are
driven to zero (or close to it), which can be seen as a form of feature selection.
`L2 regularization
`_
(also known as ridge regression in the context of linear regression and
generally as Tikhonov regularization) promotes smaller coefficients (i.e. no one
coefficient should be too large). This type of regularization is pretty common
and typically will help in producing reasonable estimates. It also has a simple probabilistic
interpretation (at least in the context of linear regression) which we will see below.
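As a quick illustration of Equation 3 in code, here's a sketch of the closed-form ridge solution using numpy; `ridge_fit` is a hypothetical helper and the data is synthetic. Note that, matching the sums from :math:`j=0` in the equations above, it penalizes the intercept too (many implementations leave the intercept unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form solution of Equation 3: minimizes the sum of squared
    residuals plus lam times the sum of squared coefficients.
    Following the sum from j=0, the intercept is penalized too."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data: intercept 1, slope 2, small noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = 1.0 + 2.0 * X[:, 1] + 0.1 * rng.normal(size=20)

print(ridge_fit(X, y, lam=0.0))    # close to [1, 2] (plain least squares)
print(ridge_fit(X, y, lam=100.0))  # coefficients shrunk toward zero
```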
(You can skip the next two sections if you're already familiar with the basics
of MLE and MAP estimates.)
|h2| The Likelihood Function |h2e|
Recall that Equation 1 can be derived from the likelihood function for
ordinary linear regression (shown here before taking the :math:`\log`):
.. math::

   \mathcal{L}({\bf \beta}|{\bf y}) &:= P({\bf y} | {\bf \beta}) \\
   &= \prod_{i=1}^{n} P_Y(y_i|{\bf \beta}, \sigma^2) \\
   &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}}
   \tag{4}

where :math:`{\bf y}` are our observed data points (:math:`y_1, \ldots, y_n`) and
:math:`P_Y(y_i|{\bf \beta}, \sigma^2)` is the probability of observing the data
point :math:`y_i` under our model.
The implicit assumption in linear regression is that the data points are normally
distributed about the regression line (see my past post on
`linear regression `_ for more on this).
Classical statistics focuses on maximizing this likelihood function,
which *usually* provides a pretty good estimate -- except when it doesn't,
as in the "small data" case above. However, looking at the problem from a more
probabilistic (i.e. Bayesian) point of view, we don't just want to maximize
the likelihood function but rather the posterior probability. Let's see how
that works.
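As a small sketch of Equation 4 in code (the design matrix and coefficients are made up for illustration), here is the log-likelihood computed directly with numpy:

```python
import numpy as np

def log_likelihood(beta, X, y, sigma):
    """Log of Equation 4: Gaussian log-likelihood of the residuals
    about the regression line X @ beta, with known noise scale sigma."""
    resid = y - X @ beta
    n = len(y)
    return -n * np.log(sigma * np.sqrt(2 * np.pi)) - np.sum(resid**2) / (2 * sigma**2)

# Made-up data: an intercept column plus one feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.1, 1.0, 2.1])

# Coefficients that fit better get a higher log-likelihood, so maximizing
# it is the same as minimizing the squared error in Equation 1.
print(log_likelihood(np.array([0.0, 1.0]), X, y, sigma=1.0))
print(log_likelihood(np.array([5.0, -1.0]), X, y, sigma=1.0))
```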
|h2| The Posterior |h2e|
We'll start by reviewing
`Bayes Theorem `_
where we usually denote the parameters we're trying to estimate by
:math:`\theta` and the data as :math:`y`:
.. math::

   P(\theta | y) &= \frac{P(y | \theta) P(\theta)}{P(y)} \\
   \text{posterior} &= \frac{\text{likelihood} \cdot \text{prior}}{\text{evidence}}
   \tag{5}

In Bayesian inference, we're primarily concerned with the posterior: the
probability of the parameters given the data. Put another way, we're looking
to estimate the probability distribution of the parameters (:math:`\theta`)
given the data we have observed (:math:`y`). Contrast this with classical
methods, which instead try to find the parameters that maximize the
likelihood: the probability of observing the data (:math:`y`) given particular
values of the parameters. It's definitely a subtle difference, but I think
most would agree the Bayesian interpretation is much more natural [2]_.
Looking at Equation 5 in more detail, we already know how to compute the likelihood but
the two new parts are the prior and the evidence. This is where proponents of
frequentist statistics usually have a philosophical dilemma. The prior is actually
something we (the modeler) explicitly choose that is **not** based on the data [3]_
(:math:`y`) \*gasp\*!
Without getting into a whole philosophical spiel, adding some additional prior
information is exactly what we want in certain situations! For example when
we don't have enough data, we probably have some idea about what is reasonable
given our knowledge of the problem. This prior allows us to encode this
knowledge. Even in cases where we don't explicitly have this problem, we can
choose a `"weak" prior `_
which will only bias the result slightly from the MLE estimate. In cases where
we have lots of data, the likelihood dominates Equation 5 anyway, so the
result will be similar to the unregularized estimate.
From Equation 5, a full Bayesian analysis would look at the distribution of the
parameters (:math:`\theta`). However, most people will settle for something a
bit less involved: finding the maximum of the posterior (which turns out to be
an easier problem in most cases). This is known as the
`maximum a posteriori probability estimate `_,
usually abbreviated by MAP. This simplifies the analysis and in particular
allows us to ignore the evidence (:math:`P(y)`), which is constant relative to
the parameters we're trying to estimate (:math:`\theta`).
Formalizing the MAP estimate, we can write it as:
.. math::

   {\bf \hat{\theta}_{\text{MAP}}} &= \arg\max_{\bf \theta} P(\theta | y) \\
   &= \arg\max_{\bf \theta} \frac{P(y | \theta) P(\theta)}{P(y)} \\
   &= \arg\max_{\bf \theta} P(y | \theta) P(\theta) \\
   &= \arg\max_{\bf \theta} \log(P(y | \theta) P(\theta)) \\
   &= \arg\max_{\bf \theta} \log P(y | \theta) + \log P(\theta)
   \tag{6}

Notice that we can get rid of the evidence term (:math:`P(y)`) because it's
constant with respect to the maximization, and we can take the :math:`\log` of
the inner function because :math:`\log` is monotonically increasing (so it
doesn't change the arg max). Contrast this with the MLE estimate:
.. math::

   {\bf \hat{\theta}_{\text{MLE}}} = \arg\max_{\bf \theta} \log P(y | \theta)
   \tag{7}

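As a tiny worked instance of Equations 6 and 7, consider estimating the mean of a normal distribution with known variance under a conjugate normal prior. The log-posterior is quadratic, so the MAP estimate has a closed form (a precision-weighted average) that shrinks the MLE toward the prior mean. A sketch, with made-up numbers:

```python
import numpy as np

# Made-up observations from a normal with known variance sigma^2 = 1.
y = np.array([2.1, 1.9, 2.3])
sigma2 = 1.0
mu0, tau2 = 0.0, 0.5  # normal prior N(mu0, tau2) on the unknown mean

mle = y.mean()  # Equation 7: maximize the likelihood alone

# Equation 6 with a conjugate normal prior: the MAP estimate is a
# precision-weighted average of the sample mean and the prior mean.
n = len(y)
map_est = (n * mle / sigma2 + mu0 / tau2) / (n / sigma2 + 1 / tau2)

print(mle, map_est)  # the MAP estimate is pulled from the MLE toward mu0
```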
|h2| Selecting Priors for Linear Regression |h2e|
The main idea is to select Bayesian priors on the coefficients of linear
regression that get us to L1 and L2 regularization (Equations 2 and 3). Let's
see how this works.
|h3| Normally Distributed Priors |h3e|
We'll start with our good old friend the normal distribution and place a
zero-mean normal prior on *each* coefficient :math:`\beta_j`, all with
identical variance :math:`\tau^2`. Starting from Equation 6 and filling in the
likelihood function from Equation 4 and our prior:
.. math::

   &\arg\max_{\bf \beta} \Big[ \log \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}}
   + \log \prod_{j=0}^{p} \frac{1}{\tau\sqrt{2\pi}}e^{-\frac{\beta_j^2}{2\tau^2}} \Big] \\
   &= \arg\max_{\bf \beta} \Big[ -\sum_{i=1}^{n} {\frac{(y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}}
   - \sum_{j=0}^{p} {\frac{\beta_j^2}{2\tau^2}} \Big] \\
   &= \arg\min_{\bf \beta} \frac{1}{2\sigma^2} \big[ \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2
   + \frac{\sigma^2}{\tau^2} \sum_{j=0}^{p} \beta_j^2 \big] \\
   &= \arg\min_{\bf \beta} \big[ \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 + \lambda \sum_{j=0}^{p} \beta_j^2 \big]
   \tag{8}

Notice that we dropped many of the constants (with respect to :math:`\beta`)
and factored a bit to simplify the expression.
You can see this is the same expression as Equation 3 (L2 Regularization)
with :math:`\lambda = \sigma^2/\tau^2` (remember :math:`\sigma` is
assumed to be constant in ordinary linear regression, and we get to pick
:math:`\tau` for our prior). We can adjust the amount of regularization by
changing :math:`\lambda` or, equivalently, by adjusting how much weight the
prior carries on the coefficients (:math:`\beta`). If the prior has a very
small variance (large :math:`\lambda`), then the coefficients will be pulled
very close to 0; if it has a large variance (small :math:`\lambda`), then the
coefficients will hardly be affected (similar to having no regularization at
all).
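To sanity-check this correspondence numerically, the following sketch (with synthetic data and an assumed gradient-descent setup) maximizes the log-posterior of Equation 8 directly and compares the result against the closed-form ridge solution with :math:`\lambda = \sigma^2/\tau^2`:

```python
import numpy as np

# Synthetic data (made up for illustration): intercept 0.5, slope 1.5.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = 0.5 + 1.5 * X[:, 1] + rng.normal(scale=0.3, size=30)
sigma, tau = 0.3, 0.5  # noise scale and prior scale

# Maximize the log-posterior of Equation 8 by gradient descent on its negative.
beta = np.zeros(2)
for _ in range(20000):
    grad = X.T @ (X @ beta - y) / sigma**2 + beta / tau**2
    beta -= 1e-4 * grad

# The closed-form ridge solution with lambda = sigma^2 / tau^2 should agree.
lam = sigma**2 / tau**2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(np.allclose(beta, ridge, atol=1e-6))  # True
```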
|h3| Laplacean Priors |h3e|
Let's first review the density of the `Laplace distribution
`_ (something that's
usually not introduced in beginner probability classes):
.. math::

   \text{Laplace}(x; \mu, b) = \frac{1}{2b} e^{-\frac{|x-\mu|}{b}}

This is sometimes called the "double exponential" distribution because it looks
like two exponential distributions placed back to back (appropriately scaled
and shifted by a location parameter). It's also quite similar in form to our
Gaussian; perhaps you can already see how we get to L1 regularization?
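To see the shape difference numerically, here's a small sketch evaluating the density above against a standard normal:

```python
import numpy as np

def laplace_pdf(x, mu=0.0, b=1.0):
    # Density from the formula above: (1 / 2b) * exp(-|x - mu| / b)
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

def normal_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Compared to a standard normal, the Laplace density has a sharp peak at
# its mean and heavier tails.
print(laplace_pdf(0.0), normal_pdf(0.0))  # 0.5 vs. ~0.3989
print(laplace_pdf(5.0), normal_pdf(5.0))  # the Laplace tail is far heavier
```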
Starting with a zero-mean Laplacean prior on all the coefficients like we did
in the previous subsection:
.. math::

   &\arg\max_{\bf \beta} \Big[ \log \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}}
   + \log \prod_{j=0}^{p} \frac{1}{2b}e^{-\frac{|\beta_j|}{b}} \Big] \\
   &= \arg\max_{\bf \beta} \Big[ -\sum_{i=1}^{n} {\frac{(y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2}{2\sigma^2}}
   - \sum_{j=0}^{p} {\frac{|\beta_j|}{b}} \Big] \\
   &= \arg\min_{\bf \beta} \frac{1}{2\sigma^2} \big[ \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2
   + \frac{2\sigma^2}{b} \sum_{j=0}^{p} |\beta_j| \big] \\
   &= \arg\min_{\bf \beta} \big[ \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_{i,1} + ... + \beta_p x_{i,p}))^2 + \lambda \sum_{j=0}^{p} |\beta_j| \big]
   \tag{9}

Again we can see that Equation 9 contains the same expression as L1
Regularization in Equation 2.
The Laplacean prior has a slightly different effect compared to L2
regularization. Instead of preventing any of the coefficients from becoming
too large (due to the squaring), L1 promotes sparsity: it zeroes out some of
the coefficients. This makes some sense if you look at the density of the
Laplace distribution, which has a sharp peak at its mean.
Another way to see this intuitively is to compare two candidate solutions [4]_.
Let's imagine we are estimating two coefficients in a regression. Under L2
regularization, the solution :math:`{\bf \beta} = (1, 0)` incurs the same
penalty as :math:`{\bf \beta} = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})`
(both have squared norm 1), so they are treated equally. Under L1
regularization, the penalty favors the sparse solution:
.. math::

   ||(1, 0)||_1 = 1 < ||(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}})||_1 = \sqrt{2}
   \tag{10}

So L2 regularization has no specific built-in mechanism to favor zeroed-out
coefficients, while L1 regularization actually favors these sparser
solutions.
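A quick numeric check of this comparison (Equation 10):

```python
import numpy as np

sparse = np.array([1.0, 0.0])
dense = np.array([1 / np.sqrt(2), 1 / np.sqrt(2)])

# Both candidates incur the same L2 penalty (squared norm 1, up to
# floating point)...
print(np.sum(sparse**2), np.sum(dense**2))
# ...but the L1 penalty is strictly smaller for the sparse solution.
print(np.sum(np.abs(sparse)), np.sum(np.abs(dense)))  # 1.0 vs. ~1.414
```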
|h2| Conclusion |h2e|
L1 and L2 regularization seem like such simple, intuitive techniques when
viewed shallowly as just extra terms in the objective function (i.e. "shrink
the coefficients"). However, I was delighted to find out that they also have a
Bayesian interpretation; it just seems so much more elegant that way. As I
have mentioned before, I'm a big fan of probabilistic interpretations of
machine learning, and you can expect many more posts on this subject!
|h2| Further Reading |h2e|
* Wikipedia:
`Regularization `_,
`ordinary linear regression `_,
`Bayes Theorem `_,
`"weak" prior `_,
`maximum a posteriori probability estimate `_
* A previous post on `linear regression `_
* Machine Learning: A Probabilistic Perspective, Kevin P. Murphy
|br|
.. [1] One philosophical counterpoint is that we should "let the data speak for itself". Although superficially satisfying, it is almost always the case that you inject some "prior" knowledge when interpreting the data. For example, selecting a linear regression model already adds some prior knowledge or intuition about the data. In the same way, if we have some vague idea that the mean of the data should be close to zero, why not add that information to the problem? If there's enough data and we've coded things right, then the prior isn't that impactful anyway.
.. [2] As you might have guessed, I fall into the Bayesian camp, although I would have to say that I'm a pragmatist above all else. I'll use whatever works -- frequentist, Bayesian, or no theoretical basis at all -- as long as I can solve the desired problem in a reasonable manner. It just so happens that Bayesian methods produce reasonable estimates very often.
.. [3] Well that's not exactly true. As a modeler, we can pick a prior that does depend on the data, although that's a bit of "double dipping". These methods are generally known as `empirical Bayes methods `_.
.. [4] I got this example from Machine Learning: A Probabilistic Perspective. It has great explanations on both L1 and L2 regularization as long as you have some moderate fluency in probability. I highly recommend this textbook if you want to dig into the meat of ML techniques. It's very math (probability) heavy but really provides good high level explanations that help with intuition.