Bounded Rationality

The Logic Behind the Maximum Entropy Principle

Brian Keng — Sat, 03 Aug 2024 00:44:59 GMT

For a while now, I've really enjoyed diving deep to understand probability and related fundamentals (see here, here, and here). Entropy is a topic that comes up all over the place from physics to information theory, and of course, machine learning. I written about it in various different forms but always taken it as a given as the "expected information". Well I found a few of good explanations about how to "derive" it and thought that I should share.

In this post, I'll be showing a few of derivations of the maximum entropy principle, where entropy appears as part of the definition. These derivations will show why it is a reasonable and natural thing to maximize, and how it is determined from some well thought out reasoning. This post will be more math heavy but hopefully it will give you more insight into this wonderfully surprising topic.

Iterative Summarization using LLMs

Brian Keng — Tue, 04 Jun 2024 00:21:43 GMT

After being busy for the first part of the year, I finally have a bit of time to work on this blog. After a lot of thinking about how to best fit it into my schedule, I've decided to attempt to write shorter posts. Although I do get a lot of satisfaction writing long posts, it's not practical because of the time commitment. Better to break it up into smaller parts to be able to "ship" often rather than perfect each post. This also allows me to experiment with smaller scoped topics, which hopefully will keep more more motivated as well. Speaking of which...

This post is about answering a random thought I had the other day: what would happen if I kept passing an LLM's output back to itself? I ran a few experiments of trying to get the LLM to iteratively summarize or rephrase a piece of text and the results are... pretty much what you would expect. But if you don't know what to expect, then read on and find out what happened!

A Look at The First Place Solution of a Dermatology Classification Kaggle Competition

Brian Keng — Sat, 23 Dec 2023 00:09:46 GMT

One interesting thing I often think about is the gap between academic and real-world solutions. In general academic solutions play in the realm of idealized problem spaces, removing themselves from needing to care about the messiness of the real-world. Kaggle competitions are a (small) step in the right direction towards dealing with messiness, usually providing a true blind test set (vs. overused benchmarks), and opening a few degrees of freedom in terms the techniques that can be used, which usually eschews novelty in favour of more robust methods. To this end, I thought it would be useful to take a look at a more realistic problem (via a Kaggle competition) and understand the practical details that result in a superior solution.

This post will cover the first place solution [1] to the SIIM-ISIC Melanoma Classification [0] challenge. In addition to using tried and true architectures (mostly EfficientNets), they have some interesting tactics they use to formulate the problem, process the data, and train/validate the model. I'll cover background on the ML techniques, competition and data, architectural details, problem formulation, and implementation. I've also run some experiments to better understand the benefits of certain choices they made. Enjoy!

LLM Fun: Building a Q&A Bot of Myself

Brian Keng — Mon, 25 Sep 2023 00:56:42 GMT

Unless you've been living under a rock, you've probably heard of large language models (LLM) such as ChatGPT or Bard. I'm not one for riding a hype train but I do think LLMs are here to stay and either are going to have an impact as big as mobile as an interface (my current best guess) or perhaps something as big as the Internet itself. In either case, it behooves me to do a bit more investigation into this popular trend 1. At the same time, there are a bunch of other developer technologies that I've been wondering about like serverless computing, modern dev tools, and LLM-based code assistants, so I thought why not kill multiple birds with one stone.

This post is going to describe how I built a question and answering bot of myself using LLMs as well as my experience using the relevant developer tools such as ChatGPT, Github Copilot, Cloudflare workers, and a couple of other related ones. I start out with my motivation for doing this project, some brief background on the technologies, a description of how I built everything including some evaluation on LLM outputs, and finally some commentary. This post is a lot less heavy on the math as compared to my previous ones but it still has some good stuff so read on!

Bayesian Learning via Stochastic Gradient Langevin Dynamics and Bayes by Backprop

Brian Keng — Wed, 08 Feb 2023 23:25:40 GMT

After a long digression, I'm finally back to one of the main lines of research that I wanted to write about. The two main ideas in this post are not that recent but have been quite impactful (one of the papers won a recent ICML test of time award). They address two of the topics that are near and dear to my heart: Bayesian learning and scalability. Dare I even ask who wouldn't be interested in the intersection of these topics?

This post is about two techniques to perform scalable Bayesian inference. They both address the problem using stochastic gradient descent (SGD) but in very different ways. One leverages the observation that SGD plus some noise will converge to Bayesian posterior sampling [Welling2011], while the other generalizes the "reparameterization trick" from variational autoencoders to enable non-Gaussian posterior approximations [Blundell2015]. Both are easily implemented in the modern deep learning toolkit thus benefit from the massive scalability of that toolchain. As usual, I will go over the necessary background (or refer you to my previous posts), intuition, some math, and a couple of toy examples that I implemented.

An Introduction to Stochastic Calculus

Brian Keng — Mon, 12 Sep 2022 01:05:55 GMT

Through a couple of different avenues I wandered, yet again, down a rabbit hole leading to the topic of this post. The first avenue was through my main focus on a particular machine learning topic that utilized some concepts from physics, which naturally led me to stochastic calculus. The second avenue was through some projects at work in the quantitative finance space, which is one of the main applications of stochastic calculus. Naively, I thought I could write a brief post on it that would satisfy my curiosity -- that didn't work out at all! The result is this extra long post.

This post is about stochastic calculus, an extension of regular calculus to stochastic processes. It's not immediately obvious but the rigour needed to properly understand some of the key ideas requires going back to the measure theoretic definition of probability theory, so that's where I start in the background. From there I quickly move on to stochastic processes, the Wiener process, a particular flavour of stochastic calculus called Itô calculus, and finally end with a couple of applications. As usual, I try to include a mix of intuition, rigour where it helps intuition, and some simple examples. It's a deep and wide topic so I hope you enjoy my digest of it.

Normalizing Flows with Real NVP

Brian Keng — Sat, 23 Apr 2022 23:36:05 GMT

This post has been a long time coming. I originally started working on it several posts back but hit a roadblock in the implementation and then got distracted with some other ideas, which took me down various rabbit holes (here, here, and here). It feels good to finally get back on track to some core ML topics. The other nice thing about not being an academic researcher (not that I'm really researching anything here) is that there is no pressure to do anything! If it's just for fun, you can take your time with a topic, veer off track, and the come back to it later. It's nice having the freedom to do what you want (this applies to more than just learning about ML too)!

This post is going to talk about a class of deep probabilistic generative models called normalizing flows. Alongside Variational Autoencoders and autoregressive models 1 (e.g. Pixel CNN and Autoregressive autoencoders), normalizing flows have been one of the big ideas in deep probabilistic generative models (I don't count GANs because they are not quite probabilistic). Specifically, I'll be presenting one of the earlier normalizing flow techniques named Real NVP (circa 2016). The formulation is simple but surprisingly effective, which makes it a good candidate to understand more about normalizing flows. As usual, I'll go over some background, the method, an implementation (with commentary on the details), and some experimental results. Let's get into the flow!

Hamiltonian Monte Carlo

Brian Keng — Fri, 24 Dec 2021 00:07:05 GMT

Here's a topic I thought that I would never get around to learning because it was "too hard". When I first started learning about Bayesian methods, I knew enough that I should learn a thing or two about MCMC since that's the backbone of most Bayesian analysis; so I learned something about it (see my previous post). But I didn't dare attempt to learn about the infamous Hamiltonian Monte Carlo (HMC). Even though it is among the standard algorithms used in Bayesian inference, it always seemed too daunting because it required "advanced physics" to understand. As usual, things only seem hard because you don't know them yet. After having some time to digest MCMC methods, getting comfortable learning more maths (see here, here, and here), all of a sudden learning "advanced physics" didn't seem so tough (but there sure was a lot of background needed)!

This post is the culmination of many different rabbit holes (many much deeper than I needed to go) where I'm going to attempt to explain HMC in simple and intuitive terms to a satisfactory degree (that's the tag line of this blog after all). I'm going to begin by briefly motivating the topic by reviewing MCMC and the Metropolis-Hastings algorithm then move on to explaining Hamiltonian dynamics (i.e., the "advanced physics"), and finally discuss the HMC algorithm along with some toy experiments I put together. Most of the material is based on [1] and [2], which I've found to be great sources for their respective areas.

Lossless Compression with Latent Variable Models using Bits-Back Coding

Brian Keng — Tue, 06 Jul 2021 16:00:00 GMT

A lot of modern machine learning is related to this idea of "compression", or maybe to use a fancier term "representations". Taking a huge dimensional space (e.g. images of 256 x 256 x 3 pixels = 196608 dimensions) and somehow compressing it into a 1000 or so dimensional representation seems like pretty good compression to me! Unfortunately, it's not a lossless compression (or representation). Somehow though, it seems intuitive that there must be a way to use what is learned in these powerful lossy representations to help us better perform lossless compression, right? Of course there is! (It would be too anti-climatic of a setup otherwise.)

This post is going to introduce a method to perform lossless compression that leverages the learned "compression" of a machine learning latent variable model using the Bits-Back coding algorithm. Depending on how you first think about it, this seems like it should either be (a) really easy or (b) not possible at all. The reality is kind of in between with an elegant theoretical algorithm that is brought down by the realities of discretization and imperfect learning by the model. In today's post, I'll skim over some preliminaries (mostly referring you to previous posts), go over the main Bits-Back coding algorithm in detail, and discuss some of the implementation details and experiments that I did while trying to write a toy version of the algorithm.

Lossless Compression with Asymmetric Numeral Systems

Brian Keng — Sat, 26 Sep 2020 14:37:43 GMT

During my undergraduate days, one of the most interesting courses I took was on coding and compression. Here was a course that combined algorithms, probability and secret messages, what's not to like? 1 I ended up not going down that career path, at least partially because communications systems had its heyday around the 2000s with companies like Nortel and Blackberry and its predecessors (some like to joke that all the major theoretical breakthroughs were done by Shannon and his discovery of information theory around 1950). Fortunately, I eventually wound up studying industrial applications of classical AI techniques and then machine learning, which has really grown like crazy in the last 10 years or so. Which is exactly why I was so surprised that a new and better method of lossless compression was developed in 2009 after I finished my undergraduate degree when I was well into my PhD. It's a bit mind boggling that something as well-studied as entropy-based lossless compression still had (have?) totally new methods to discover, but I digress.

In this post, I'm going to write about a relatively new entropy based encoding method called Asymmetrical Numeral Systems (ANS) developed by Jaroslaw (Jarek) Duda [2]. If you've ever heard of Arithmetic Coding (probably best known for its use in JPEG compression), ANS runs in a very similar vein. It can generate codes that are close to the theoretical compression limit (similar to Arithmetic coding) but is much more efficient. It's been used in modern compression algorithms since 2014 including compressors developed by Facebook, Apple and Google [3]. As usual, I'm going to go over some background, some math, some examples to help with intuition, and finally some experiments with a toy ANS implementation I wrote. I hope you're as excited as I am, let's begin!