Bounded Rationality (Posts about Bayesian)

Bayesian Learning via Stochastic Gradient Langevin Dynamics and Bayes by Backprop

Brian Keng — Wed, 08 Feb 2023 23:25:40 GMT

After a long digression, I'm finally back to one of the main lines of research that I wanted to write about. The two main ideas in this post are not that recent but have been quite impactful (one of the papers won a recent ICML test of time award). They address two of the topics that are near and dear to my heart: Bayesian learning and scalability. Dare I even ask who wouldn't be interested in the intersection of these topics?

This post is about two techniques to perform scalable Bayesian inference. They both address the problem using stochastic gradient descent (SGD) but in very different ways. One leverages the observation that SGD plus some noise will converge to Bayesian posterior sampling [Welling2011], while the other generalizes the "reparameterization trick" from variational autoencoders to enable non-Gaussian posterior approximations [Blundell2015]. Both are easily implemented in the modern deep learning toolkit and thus benefit from the massive scalability of that toolchain. As usual, I will go over the necessary background (or refer you to my previous posts), intuition, some math, and a couple of toy examples that I implemented.
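
To make the "SGD plus some noise" idea concrete, here is a minimal sketch of a stochastic gradient Langevin dynamics update on a toy problem. Everything in it is an assumption for illustration rather than the post's own experiment: the model (a Gaussian mean with known unit variance and a wide Gaussian prior), the names `sgld_samples`, `prior_std`, and `step_size`, and the constant step size (the original paper uses a decaying schedule, so a constant step only samples the posterior approximately).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed setup): x_i ~ N(theta, 1) with true theta = 2.
N = 1000
data = rng.normal(loc=2.0, scale=1.0, size=N)
prior_std = 10.0  # prior: theta ~ N(0, prior_std^2)


def sgld_samples(data, n_steps=10000, batch_size=100, step_size=1e-4, burn_in=2000):
    """Draw approximate posterior samples of theta with a constant-step SGLD loop."""
    theta = 0.0
    samples = []
    for t in range(n_steps):
        batch = rng.choice(data, size=batch_size, replace=False)
        # Gradient of the log prior and the rescaled minibatch log likelihood.
        grad_log_prior = -theta / prior_std**2
        grad_log_lik = (len(data) / batch_size) * np.sum(batch - theta)
        # Injected Gaussian noise with variance equal to the step size.
        noise = rng.normal(0.0, np.sqrt(step_size))
        theta = theta + 0.5 * step_size * (grad_log_prior + grad_log_lik) + noise
        if t >= burn_in:
            samples.append(theta)
    return np.array(samples)


samples = sgld_samples(data)

# Closed-form posterior for this conjugate model, for comparison.
posterior_precision = 1.0 / prior_std**2 + N
posterior_mean = data.sum() / posterior_precision
posterior_std = posterior_precision**-0.5

print("SGLD mean/std: ", samples.mean(), samples.std())
print("exact mean/std:", posterior_mean, posterior_std)
```

Because the minibatch gradient adds its own noise on top of the injected noise, the sampled spread should come out somewhat wider than the exact posterior at this step size.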

Hamiltonian Monte Carlo

Brian Keng — Fri, 24 Dec 2021 00:07:05 GMT

Here's a topic I thought that I would never get around to learning because it was "too hard". When I first started learning about Bayesian methods, I knew enough to realize that I should learn a thing or two about MCMC, since that's the backbone of most Bayesian analysis; so I learned something about it (see my previous post). But I didn't dare attempt to learn about the infamous Hamiltonian Monte Carlo (HMC). Even though it is among the standard algorithms used in Bayesian inference, it always seemed too daunting because it required "advanced physics" to understand. As usual, things only seem hard because you don't know them yet. After having some time to digest MCMC methods and getting comfortable learning more maths (see here, here, and here), all of a sudden learning "advanced physics" didn't seem so tough (but there sure was a lot of background needed)!

This post is the culmination of many different rabbit holes (many much deeper than I needed to go), and in it I'm going to attempt to explain HMC in simple and intuitive terms to a satisfactory degree (that's the tag line of this blog after all). I'm going to begin by briefly motivating the topic by reviewing MCMC and the Metropolis-Hastings algorithm, then move on to explaining Hamiltonian dynamics (i.e., the "advanced physics"), and finally discuss the HMC algorithm along with some toy experiments I put together. Most of the material is based on [1] and [2], which I've found to be great sources for their respective areas.
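
As a reminder of the baseline that HMC improves on, here is a minimal random-walk Metropolis sketch (the symmetric-proposal special case of Metropolis-Hastings). It is not the HMC algorithm and not code from the post; the standard normal target and the names `log_target`, `metropolis`, and `proposal_std` are assumptions chosen for illustration.

```python
import numpy as np


def log_target(x):
    """Unnormalized log density of a standard normal (assumed toy target)."""
    return -0.5 * x**2


def metropolis(log_p, n_steps=20000, proposal_std=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: propose x' = x + noise, accept with prob min(1, p(x')/p(x))."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        proposal = x + rng.normal(0.0, proposal_std)
        # The proposal is symmetric, so the Hastings correction cancels.
        log_accept = log_p(proposal) - log_p(x)
        if np.log(rng.uniform()) < log_accept:
            x = proposal
        samples[t] = x  # a rejected proposal repeats the current state
    return samples


mh_samples = metropolis(log_target)
print("sample mean/std:", mh_samples.mean(), mh_samples.std())
```

The random-walk proposal is what makes this sampler slow to explore in high dimensions; HMC swaps it for proposals generated by simulating Hamiltonian dynamics.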

Variational Bayes and The Mean-Field Approximation

Brian Keng — Mon, 03 Apr 2017 13:02:46 GMT

This post is going to cover Variational Bayesian methods and, in particular, the most common one, the mean-field approximation. This is a topic that I've been trying to understand for a while now but didn't quite have all the background that I needed. After picking up the main ideas from variational calculus and getting more fluent in manipulating probability statements like in my EM post, this variational Bayes stuff seems a lot easier.

Variational Bayesian methods are a set of techniques to approximate posterior distributions in Bayesian inference. If this sounds a bit terse, keep reading! I hope to provide some intuition so that the big ideas are easy to understand (which they are), but of course we can't do that well unless we have a healthy dose of mathematics. Missing background is, I find, the main blocker to understanding this subject (admittedly, the math can sometimes be a bit cryptic too), so for some of the background concepts I'll try to refer you to good sources (including my own). Enjoy!

A Probabilistic Interpretation of Regularization

Brian Keng — Mon, 29 Aug 2016 12:52:33 GMT

This post is going to look at a probabilistic (Bayesian) interpretation of regularization. We'll take a look at both L1 and L2 regularization in the context of ordinary linear regression. The discussion will start off with a quick introduction to regularization, followed by a back-to-basics explanation starting with the maximum likelihood estimate (MLE), then on to the maximum a posteriori estimate (MAP), and finally playing around with priors to end up with L1 and L2 regularization.
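
The link between priors and regularization can be checked numerically. The sketch below assumes a linear model with Gaussian noise of standard deviation `sigma` and an independent zero-mean Gaussian prior of standard deviation `tau` on each weight; under those assumptions the MAP estimate should coincide with the ridge (L2) solution with penalty `lam = sigma**2 / tau**2`. The data and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data (assumed setup).
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])
sigma = 0.5  # noise standard deviation
tau = 1.0    # prior standard deviation on each weight
y = X @ true_w + rng.normal(scale=sigma, size=n)

# Ridge regression closed form with penalty lam = sigma^2 / tau^2.
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def neg_log_posterior_grad(w):
    """Gradient of -log p(w | X, y) for a Gaussian likelihood and Gaussian prior."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau**2


# If the ridge solution is the MAP estimate, this gradient vanishes there.
grad_at_ridge = neg_log_posterior_grad(w_ridge)
print("ridge weights:", w_ridge)
print("gradient of negative log posterior at ridge solution:", grad_at_ridge)
```

Replacing the Gaussian prior with a Laplace prior turns the penalty into an absolute-value (L1) term, which no longer has a closed-form solution like the one used here.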

A Probabilistic View of Linear Regression

Brian Keng — Sun, 15 May 2016 00:43:05 GMT

One thing that I always disliked about introductory material on linear regression is how randomness is explained. The explanations always seemed unintuitive because, as I have frequently seen it, randomness appears as an afterthought rather than the central focus of the model. In this post, I'm going to try to take another approach to building an ordinary linear regression model starting from a probabilistic point of view (which is pretty much just a Bayesian view). After the general idea is established, I'll modify the model a bit and end up with a Poisson regression using the exact same principles, showing how generalized linear models aren't any more complicated. Hopefully, this will help explain the "randomness" in linear regression in a more intuitive way.

Normal Approximation to the Posterior Distribution

Brian Keng — Sat, 02 Apr 2016 19:22:54 GMT

In this post, I'm going to write about how the ever-versatile normal distribution can be used to approximate a Bayesian posterior distribution. Unlike some other normal approximations, this is not a direct application of the central limit theorem. The result has a straightforward proof using Laplace's Method, whose main ideas I will attempt to present. I'll also simulate a simple scenario to see how it works in practice.
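
Here is a minimal sketch of the idea on a case where the exact posterior is known. The scenario is an assumption for illustration and not necessarily the one simulated in the post: `k` successes in `n` Bernoulli trials with a uniform prior, so the exact posterior is Beta(k + 1, n - k + 1). The normal approximation centers a Gaussian at the posterior mode and takes its variance from the inverse curvature of the negative log posterior at that mode.

```python
# Assumed toy data: k successes out of n Bernoulli trials, uniform prior on theta.
n, k = 100, 30

# Log posterior up to a constant: k*log(theta) + (n - k)*log(1 - theta).
# Setting its derivative to zero gives the mode.
mode = k / n

# Negative second derivative of the log posterior, evaluated at the mode.
curvature = k / mode**2 + (n - k) / (1.0 - mode) ** 2
laplace_var = 1.0 / curvature

# Exact Beta(a, b) posterior moments for comparison.
a, b = k + 1, n - k + 1
exact_mean = a / (a + b)
exact_var = a * b / ((a + b) ** 2 * (a + b + 1))

print("normal approximation mean/var:", mode, laplace_var)
print("exact posterior mean/var:     ", exact_mean, exact_var)
```

The two sets of moments get closer as `n` grows, which matches the intuition that the posterior becomes increasingly Gaussian with more data.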