Bounded Rationalityhttp://bjlkeng.github.io/Understanding math, machine learning, and data to a satisfactory degree.enSat, 23 Dec 2023 02:03:09 GMTNikola (getnikola.com)http://blogs.law.harvard.edu/tech/rssA Look at The First Place Solution of a Dermatology Classification Kaggle Competitionhttp://bjlkeng.github.io/posts/a-look-at-the-first-place-solution-of-a-dermatology-classification-kaggle-competition/Brian Keng<div><p>One interesting thing I often think about is the gap between academic and real-world
solutions. In general, academic solutions play in the realm of idealized problem
spaces, removing themselves from needing to care about the messiness of the real world.
<a class="reference external" href="https://www.kaggle.com/competitions">Kaggle</a>
competitions are a (small) step in the right direction towards dealing with messiness,
usually providing a true blind test set (vs. overused benchmarks), and opening a
few degrees of freedom in terms of the techniques that can be used, which
usually eschews novelty in favour of more robust methods. To this end, I
thought it would be useful to take a look at a more realistic problem (via a
Kaggle competition) and understand the practical details that result in a
superior solution.</p>
<p>This post will cover the <a class="reference external" href="https://arxiv.org/abs/2010.05351">first place solution</a> [<a class="reference internal" href="http://bjlkeng.github.io/posts/a-look-at-the-first-place-solution-of-a-dermatology-classification-kaggle-competition/#id2">1</a>] to the
<a class="reference external" href="https://www.kaggle.com/competitions/siim-isic-melanoma-classification/overview">SIIM-ISIC Melanoma Classification</a> [<a class="reference internal" href="http://bjlkeng.github.io/posts/a-look-at-the-first-place-solution-of-a-dermatology-classification-kaggle-competition/#id1">0</a>] challenge.
In addition to using tried and true architectures (mostly EfficientNets), the
authors use some interesting tactics to formulate the problem, process the
data, and train/validate the model. I'll cover background on the
ML techniques, competition and data, architectural details, problem formulation, and
implementation. I've also run some experiments to better understand the
benefits of certain choices they made. Enjoy!</p>
<p><a href="http://bjlkeng.github.io/posts/a-look-at-the-first-place-solution-of-a-dermatology-classification-kaggle-competition/">Read more…</a> (36 min remaining to read)</p></div>augmentationCNNdatadermatologyEfficientNetmathjaxMobileNetNoisy Studentvalidation sethttp://bjlkeng.github.io/posts/a-look-at-the-first-place-solution-of-a-dermatology-classification-kaggle-competition/Sat, 23 Dec 2023 00:09:46 GMTLLM Fun: Building a Q&A Bot of Myselfhttp://bjlkeng.github.io/posts/building-a-qa-bot-of-me-with-openai-and-cloudflare/Brian Keng<div><p>Unless you've been living under a rock, you've probably heard of large language
models (LLMs) such as ChatGPT or Bard. I'm not one for riding a hype train but
I do think LLMs are here to stay and either are going to have an impact as big
as mobile as an interface (my current best guess) or perhaps something as big as
the Internet itself. In either case, it behooves me to do a bit more
investigation into this popular trend <a class="footnote-reference brackets" href="http://bjlkeng.github.io/posts/building-a-qa-bot-of-me-with-openai-and-cloudflare/#id2" id="id1">1</a>. At the same time, there are a bunch
of other developer technologies that I've been wondering about like serverless
computing, modern dev tools, and LLM-based code assistants, so I thought why not
kill multiple birds with one stone.</p>
<p>This post is going to describe how I built a question-answering bot of myself using
LLMs as well as my experience using the relevant developer tools such as
<a class="reference external" href="https://chat.openai.com">ChatGPT</a>, <a class="reference external" href="https://github.com/features/copilot">GitHub Copilot</a>, <a class="reference external" href="https://workers.cloudflare.com/">Cloudflare Workers</a>, and a couple of other related ones.
I start out with my motivation for doing this project, some brief background
on the technologies, a description of how I built everything including some
evaluation on LLM outputs, and finally some commentary. This post is a lot
less heavy on the math than my previous ones, but it still has some
good stuff so read on!</p>
<p><a href="http://bjlkeng.github.io/posts/building-a-qa-bot-of-me-with-openai-and-cloudflare/">Read more…</a> (41 min remaining to read)</p></div>CloudflareGPTJavascriptLangChainlarge language modelsLLMmathjaxOpenAIQ&Ahttp://bjlkeng.github.io/posts/building-a-qa-bot-of-me-with-openai-and-cloudflare/Mon, 25 Sep 2023 00:56:42 GMTBayesian Learning via Stochastic Gradient Langevin Dynamics and Bayes by Backprophttp://bjlkeng.github.io/posts/bayesian-learning-via-stochastic-gradient-langevin-dynamics-and-bayes-by-backprop/Brian Keng<div><p>After a long digression, I'm finally back to one of the main lines of research
that I wanted to write about. The two main ideas in this post are not that
recent but have been quite impactful (one of the
<a class="reference external" href="https://icml.cc/virtual/2021/test-of-time/11808">papers</a> won a recent ICML
test of time award). They address two of the topics that are near and dear to
my heart: Bayesian learning and scalability. Dare I even ask who wouldn't be
interested in the intersection of these topics?</p>
<p>This post is about two techniques to perform scalable Bayesian inference. They
both address the problem using stochastic gradient descent (SGD) but in very
different ways. One leverages the observation that SGD plus some noise will
converge to Bayesian posterior sampling <a class="citation-reference" href="http://bjlkeng.github.io/posts/bayesian-learning-via-stochastic-gradient-langevin-dynamics-and-bayes-by-backprop/#welling2011" id="id1">[Welling2011]</a>, while the other generalizes the
"reparameterization trick" from variational autoencoders to enable non-Gaussian
posterior approximations <a class="citation-reference" href="http://bjlkeng.github.io/posts/bayesian-learning-via-stochastic-gradient-langevin-dynamics-and-bayes-by-backprop/#blundell2015" id="id2">[Blundell2015]</a>. Both are easily implemented in the modern deep
learning toolkit and thus benefit from the massive scalability of that toolchain.
As usual, I will go over the necessary background (or refer you to my previous
posts), intuition, some math, and a couple of toy examples that I implemented.</p>
<p><a href="http://bjlkeng.github.io/posts/bayesian-learning-via-stochastic-gradient-langevin-dynamics-and-bayes-by-backprop/">Read more…</a> (53 min remaining to read)</p></div>Bayes by BackpropBayesianelboHMCLangevinmathjaxrmspropsgdSGLDvariational inferencehttp://bjlkeng.github.io/posts/bayesian-learning-via-stochastic-gradient-langevin-dynamics-and-bayes-by-backprop/Wed, 08 Feb 2023 23:25:40 GMTAn Introduction to Stochastic Calculushttp://bjlkeng.github.io/posts/an-introduction-to-stochastic-calculus/Brian Keng<div><p>Through a couple of different avenues I wandered, yet again, down a rabbit hole
leading to the topic of this post. The first avenue was through my main focus
on a particular machine learning topic that utilized some concepts from
physics, which naturally led me to stochastic calculus. The second avenue was
through some projects at work in the quantitative finance space, which is one
of the main applications of stochastic calculus. Naively, I thought I could
write a brief post on it that would satisfy my curiosity -- that didn't work
out at all! The result is this extra long post.</p>
<p>This post is about stochastic calculus, an extension of regular calculus to
stochastic processes. It's not immediately obvious,
but properly understanding some of the key ideas requires
going back to the measure-theoretic definition of probability, so
that's where I start in the background. From there I quickly move on to
stochastic processes, the Wiener process, a particular flavour of stochastic
calculus called Itô calculus, and finally end with a couple of applications.
As usual, I try to include a mix of intuition, rigour where it helps intuition,
and some simple examples. It's a deep and wide topic so I hope you enjoy my
digest of it.</p>
<p><a href="http://bjlkeng.github.io/posts/an-introduction-to-stochastic-calculus/">Read more…</a> (72 min remaining to read)</p></div>Black-Scholes-MertonBrownian motionLangevinmathjaxmeasure theoryprobabilitysigma algebrastochastic calculusWeiner processwhite noisehttp://bjlkeng.github.io/posts/an-introduction-to-stochastic-calculus/Mon, 12 Sep 2022 01:05:55 GMTNormalizing Flows with Real NVPhttp://bjlkeng.github.io/posts/normalizing-flows-with-real-nvp/Brian Keng<div><p>This post has been a long time coming. I originally started working on it several posts back but
hit a roadblock in the implementation and then got distracted with some other ideas, which took
me down various rabbit holes (<a class="reference external" href="http://bjlkeng.github.io/posts/hamiltonian-monte-carlo/">here</a>,
<a class="reference external" href="http://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/">here</a>, and
<a class="reference external" href="http://bjlkeng.github.io/posts/lossless-compression-with-latent-variable-models-using-bits-back-coding/">here</a>).
It feels good to finally get back on track to some core ML topics.
The other nice thing about not being an academic researcher (not that I'm
really researching anything here) is that there is no pressure to do anything!
If it's just for fun, you can take your time with a topic, veer off track, and
then come back to it later. It's nice having the freedom to do what you want (this applies to
more than just learning about ML too)!</p>
<p>This post is going to talk about a class of deep probabilistic generative
models called normalizing flows. Alongside <a class="reference external" href="http://bjlkeng.github.io/posts/variational-autoencoders/">Variational Autoencoders</a>
and autoregressive models <a class="footnote-reference brackets" href="http://bjlkeng.github.io/posts/normalizing-flows-with-real-nvp/#id3" id="id1">1</a> (e.g. <a class="reference external" href="http://bjlkeng.github.io/posts/pixelcnn/">Pixel CNN</a> and
<a class="reference external" href="http://bjlkeng.github.io/posts/autoregressive-autoencoders/">Autoregressive autoencoders</a>),
normalizing flows have been one of the big ideas in deep probabilistic generative models (I don't count GANs because they are not quite probabilistic).
Specifically, I'll be presenting one of the earlier normalizing flow
techniques named <em>Real NVP</em> (circa 2016).
The formulation is simple but surprisingly effective, which makes it a good
candidate to understand more about normalizing flows.
As usual, I'll go over some background, the method, an implementation
(with commentary on the details), and some experimental results. Let's get into the flow!</p>
<p><a href="http://bjlkeng.github.io/posts/normalizing-flows-with-real-nvp/">Read more…</a> (32 min remaining to read)</p></div>CELEBACIFAR10generative modelsmathjaxMNISTnormalizing flowshttp://bjlkeng.github.io/posts/normalizing-flows-with-real-nvp/Sat, 23 Apr 2022 23:36:05 GMTHamiltonian Monte Carlohttp://bjlkeng.github.io/posts/hamiltonian-monte-carlo/Brian Keng<div><p>Here's a topic I thought that I would never get around to learning because it was "too hard".
When I first started learning about Bayesian methods, I knew enough to know that I
should learn a thing or two about MCMC since that's the backbone
of most Bayesian analysis; so I learned something about it
(see my <a class="reference external" href="http://bjlkeng.github.io/posts/markov-chain-monte-carlo-mcmc-and-the-metropolis-hastings-algorithm/">previous post</a>).
But I didn't dare attempt to learn about the infamous Hamiltonian Monte Carlo (HMC).
Even though it is among the standard algorithms used in Bayesian inference, it
always seemed too daunting because it required "advanced physics" to
understand. As usual, things only seem hard because you don't know them yet.
After having some time to digest MCMC methods, getting comfortable learning
more maths (see
<a class="reference external" href="http://bjlkeng.github.io/posts/tensors-tensors-tensors/">here</a>,
<a class="reference external" href="http://bjlkeng.github.io/posts/manifolds/">here</a>, and
<a class="reference external" href="http://bjlkeng.github.io/posts/hyperbolic-geometry-and-poincare-embeddings/">here</a>),
all of a sudden learning "advanced physics" didn't seem so tough (but there
sure was a lot of background needed)!</p>
<p>This post is the culmination of many different rabbit holes (many much deeper
than I needed to go) where I'm going to attempt to explain HMC in simple and
intuitive terms to a satisfactory degree (that's the tag line of this blog
after all). I'm going to begin by briefly motivating the topic by reviewing
MCMC and the Metropolis-Hastings algorithm then move on to explaining
Hamiltonian dynamics (i.e., the "advanced physics"), and finally discuss the HMC
algorithm along with some toy experiments I put together. Most of the material
is based on [1] and [2], which I've found to be great sources for their
respective areas.</p>
<p><a href="http://bjlkeng.github.io/posts/hamiltonian-monte-carlo/">Read more…</a> (52 min remaining to read)</p></div>BayesianHamiltonianmathjaxMCMCMonte Carlohttp://bjlkeng.github.io/posts/hamiltonian-monte-carlo/Fri, 24 Dec 2021 00:07:05 GMTLossless Compression with Latent Variable Models using Bits-Back Codinghttp://bjlkeng.github.io/posts/lossless-compression-with-latent-variable-models-using-bits-back-coding/Brian Keng<div><p>A lot of modern machine learning is related to this idea of "compression", or
maybe, to use a fancier term, "representations". Taking a high-dimensional space
(e.g. images of 256 x 256 x 3 pixels = 196608 dimensions) and somehow compressing it into
a 1000 or so dimensional representation seems like pretty good compression to
me! Unfortunately, it's not a lossless compression (or representation).
Somehow though, it seems intuitive that there must be a way to use what is learned in
these powerful lossy representations to help us better perform <em>lossless</em>
compression, right? Of course there is! (It would be too anti-climactic of a
setup otherwise.)</p>
<p>This post is going to introduce a method to perform lossless compression that
leverages the learned "compression" of a machine learning latent variable
model using the Bits-Back coding algorithm. Depending on how you first think
about it, this <em>seems</em> like it should either be (a) really easy or (b) not possible at
all. The reality is kind of in between with an elegant theoretical algorithm
that is brought down by the realities of discretization and imperfect learning
by the model. In today's post, I'll skim over some preliminaries (mostly
referring you to previous posts), go over the main Bits-Back coding algorithm
in detail, and discuss some of the implementation details and experiments that
I did while trying to write a toy version of the algorithm.</p>
<p><a href="http://bjlkeng.github.io/posts/lossless-compression-with-latent-variable-models-using-bits-back-coding/">Read more…</a> (25 min remaining to read)</p></div>asymmetric numeral systemsBits-BackcompressionlosslessmathjaxMNISTvariational autoencoderhttp://bjlkeng.github.io/posts/lossless-compression-with-latent-variable-models-using-bits-back-coding/Tue, 06 Jul 2021 16:00:00 GMTLossless Compression with Asymmetric Numeral Systemshttp://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/Brian Keng<div><p>During my undergraduate days, one of the most interesting courses I took was on
coding and compression. Here was a course that combined algorithms,
probability, and secret messages; what's not to like? <a class="footnote-reference brackets" href="http://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/#id2" id="id1">1</a> I ended up not going
down that career path, at least partially because communications systems had
their heyday around the 2000s with companies like Nortel and Blackberry and its
predecessors (some like to joke that all the major theoretical breakthroughs
were done by Shannon and his discovery of information theory around 1950). Fortunately, I
eventually wound up studying industrial applications of classical AI techniques
and then machine learning, which has really grown like crazy in the last 10
years or so. Which is exactly why I was so surprised that a <em>new</em> and <em>better</em>
method of lossless compression was developed in 2009 <em>after</em> I finished my
undergraduate degree when I was well into my PhD. It's a bit mind boggling that
something as well-studied as entropy-based lossless compression still had
(has?) totally new methods to discover, but I digress.</p>
<p>In this post, I'm going to write about a relatively new entropy-based encoding
method called Asymmetric Numeral Systems (ANS) developed by Jaroslaw (Jarek)
Duda [2]. If you've ever heard of Arithmetic Coding (probably best known for
its use in JPEG compression), ANS runs in a very similar vein. It can
generate codes that are close to the theoretical compression limit
(similar to Arithmetic coding) but is <em>much</em> more efficient. It's been used in
modern compression algorithms since 2014 including compressors developed
by Facebook, Apple and Google [3]. As usual, I'm going to go over some
background, some math, some examples to help with intuition, and finally some
experiments with a toy ANS implementation I wrote. I hope you're as
excited as I am. Let's begin!</p>
<p><a href="http://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/">Read more…</a> (32 min remaining to read)</p></div>Arithmetic Codingasymmetric numeral systemscompressionentropyHuffman codingmathjaxhttp://bjlkeng.github.io/posts/lossless-compression-with-asymmetric-numeral-systems/Sat, 26 Sep 2020 14:37:43 GMTModel Explainability with SHapley Additive exPlanations (SHAP)http://bjlkeng.github.io/posts/model-explanability-with-shapley-additive-explanations-shap/Brian Keng<div><p>One of the big criticisms of modern machine learning is that it's essentially
a black box -- data in, prediction out, that's it. And in some sense, how could
it be any other way? When you have a highly non-linear model with high degrees
of interactions, how can you possibly hope to have a simple understanding of
what the model is doing? Well, turns out there is an interesting (and
practical) line of research along these lines.</p>
<p>This post will dive into the ideas of a popular technique published in the last
few years called <em>SHapley Additive exPlanations</em> (or SHAP). It builds upon
previous work in this area by providing a unified framework to think
about explanation models as well as a new technique with this framework that
uses Shapley values. I'll go over the math, the intuition, and how it works.
No need for an implementation because there is already a nice little Python
package! Confused yet? Keep reading and I'll <em>explain</em>.</p>
<p><a href="http://bjlkeng.github.io/posts/model-explanability-with-shapley-additive-explanations-shap/">Read more…</a> (26 min remaining to read)</p></div>explainabilitygame theorymathjaxSHAPhttp://bjlkeng.github.io/posts/model-explanability-with-shapley-additive-explanations-shap/Wed, 12 Feb 2020 11:24:22 GMTA Note on Using Log-Likelihood for Generative Modelshttp://bjlkeng.github.io/posts/a-note-on-using-log-likelihood-for-generative-models/Brian Keng<div><p>One of the things that I find is usually missing from many ML papers is how
they relate to the fundamentals. There's always a throwaway line where it
assumes something that is not at all obvious (see my post on
<a class="reference external" href="http://bjlkeng.github.io/posts/importance-sampling-and-estimating-marginal-likelihood-in-variational-autoencoders/">Importance Sampling</a>). I'm the kind of person who likes to
understand things to a satisfactory degree (it's literally in the subtitle of
the blog) so I couldn't help myself investigating a minor idea I read about in
a paper.</p>
<p>This post investigates how to use continuous density outputs (e.g. a logistic
or normal distribution) to model discrete image data (e.g. 8-bit RGB values).
It seems like it might be something obvious such as setting the loss as the
average log-likelihood of the continuous density and that's <em>almost</em> the
whole story. But leaving it at that skips over so many (interesting) and
non-obvious things that you would never know if you didn't bother to look. I'm
a curious fellow so come with me and let's take a look!</p>
<p><a href="http://bjlkeng.github.io/posts/a-note-on-using-log-likelihood-for-generative-models/">Read more…</a> (15 min remaining to read)</p></div>generative modelslog-likelihoodmathjaxhttp://bjlkeng.github.io/posts/a-note-on-using-log-likelihood-for-generative-models/Tue, 27 Aug 2019 11:50:09 GMT