The Logic Behind the Maximum Entropy Principle
For a while now, I've really enjoyed diving deep to understand probability and related fundamentals (see here, here, and here). Entropy is a topic that comes up all over the place from physics to information theory, and of course, machine learning. I written about it in various different forms but always taken it as a given as the "expected information". Well I found a few of good explanations about how to "derive" it and thought that I should share.
In this post, I'll be showing a few of derivations of the maximum entropy principle, where entropy appears as part of the definition. These derivations will show why it is a reasonable and natural thing to maximize, and how it is determined from some well thought out reasoning. This post will be more math heavy but hopefully it will give you more insight into this wonderfully surprising topic.
Table of Contents
1 Motivation
To understand entropy, let's first concoct a situation where we might need a new tool beyond Bayesian probability. Suppose we have a die that has faces from 1 to 6 where we define the random variable \(X\) to be equal to the faceup number from a given die roll. If we are given samples of die rolls then, with some appropriate prior, we can iteratively apply Bayes' theorem to determine the probability \(P(X=i) = p_i\) where \(i\) is die face value. This type of calculation is relatively straight forward, maybe tedious, but straight forward. But what if we don't have explicit samples but a different type of observable information?
Now imagine we don't have samples of that die roll. Instead we only know its mean roll, that is \(E[X] = \mu_X\), what would our best guess be of the probability distribution of \(X\)? For example, if all the die faces had equal probability we would get:
So if \(\mu_X = 3.5\) then we might reasonably guess that \(X\) had a uniform distribution.
But what if \(\mu_X = 3\)? It could be that \(p_3=1.0\), but that seems unlikely. Maybe something like this is slightly more reasonable?
But still that zero probability for \(p_6\) seems kind of off. One way to approach it is to apply a prior to each \(p_i\) and marginalizing over all of them but ensuring it matches our information of \(E[X] = \mu_X\). But that only shifts the problem to finding the right priors to match our observation, not how to directly incorporate our information of \(E[X] = \mu_X\).
From this example, we have gleaned some important insights though. Somehow concentrating all the mass together with \(p_3=1.0\) is not exactly what we want, and while Equation 2 might seem slightly more reasonable, it also seems odd that we have \(p_6=0\). This gives us intuition that we want to be conservative and "spread out" our probability, and at the same time assigning an outcome with \(0\) probability only if it is truly ruled out. This spreading allows us to be noncommittal about the distribution. So in some sense we want to maximize the spread given the constraint, which already sounds like something familiar perhaps? Let's keep going and find out.
2 Entropy and The Principle of Maximum Entropy
Information theoretic entropy is typically interpreted as the average amount of "information", "surprise" or "uncertainty" in a given random variable. Given a random variable \(X\) that has \(m\) discrete outcomes, we can define entropy as:
where \(p_x := P(X=x)\).
The principle of maximum entropy states that given testable information, the probability distribution that best represents the current state of knowledge is the one with the largest (information) entropy. The testable information is a statement about the probability distribution where you can easily evaluate it to be true or false. It usually comes in the form of a mean, variance, or some moments of the probability distribution. This rule can be thought of expressing epistemic modesty, or maximal ignorance, because it makes the least strong claim on a distribution beyond being informed by the testable information.
See my previous post on maximum entropy distributions for a more detailed treatment.
3 Wallis Derivation
The Wallis derivation of maximum entropy [1 ,2] starts off with a conceptually simple argument, which through some calculation ends up with our tried and true measure of entropy. It's nice because it does not require defining information as \(\log p\) upfront and then making some argument that we want to maximize it's expected value but gets there directly instead.
To start, let's assume the problem from the previous section: we want to find the probability distribution of a discrete random variable that has \(m\) discrete values given some testable information about the distribution such as it's expected value or variance (or more generally one of its moments). Given any candidate probability distribution represented by a vector \({\bf p} = [p_1, \ldots, p_m]\), we can easily test to see if it satisfies our constraint. To this end, let's conduct the following thought experiment:
Distribute \(N\) quanta of probability (each worth \(\frac{1}{N}\)) uniformly randomly across the \(m\) possibilities for some large \(N\).
Once finished, check if the constraint is satisfied. If so, then that is your desired probability assignment. If not reject and go back to step 1.
If we do get an acceptance, our distribution will have \(p_i = \frac{n_i}{N}\) where \(\sum_{i=1}^m n_i = N\). Note: for any reasonably large \(N\) and skewed distribution, it will take an astronomically large number of iterations to accept, but it is only a thought experiment.
Now why is this a reasonable way to approach the problem? First, we're uniformly randomly distributing our quanta of probability in step 1. It's hard to argue that we're being biased in any way. Second, if we pick a large enough \(N\), the chances of getting a "weird" probability distribution (like the \(p_3=1.0\) from the previous section) over a more reasonable one becomes vanishing small. So even though we're stopping at the first one, chances are it's a pretty reasonable distribution.
Assuming we're happy with that reasoning, there is still the problem of picking a large enough \(N\) and running many iterations in order to get to an accepted distribution. Instead, let's just calculate the most probable result from this experiment, which should be the most reasonable choice anyways. We can see the probability of any particular assignment of our probability quanta is a multinomial distribution with the probability of a quanta being assigned to an outcome being a constant \(q_i = \frac{1}{m}\):
since \(m\) is a constant in this problem, it suffices to maximize the first factor, which we'll call the multiplicity of the outcome denoted by \(W\):
But we can equivalently maximize any monotonically increasing function of \(W\), so let's try it with \(\frac{1}{N}\log W\):
The factorials in Equation 6 are annoying to deal with but thankfully we can use Sterling's approximation:
With Equation 7 in hand, we can simplify Equation 6 and take the limit as \(N \to \infty\) so we reduce our dependence on finite \(N\):
Equation 8 shows that if we follow the logic of the above procedure, the "fair" probability distribution is equivalent to maximizing the entropy of the distribution. Notice that we did not mention "information", "surprise", or "uncertainty" here. We are simply doing the above thought experiment and it turns out we're maximizing \(E(\log X)\). In this manner, we might as well give a name to \(\log p\), which is the Shannon information of a particular event. This is nice because it doesn't require us to make any big leaps of assuming that \(\log p\) has any meaning.
4 Physical Derivation
This derivation is from [3] which is not exactly a derivation of the concept of entropy but the functional form. It starts out with an observation in physical systems involving a collection of equivalent elementary units where:
Elementary units (e.g. particles) have some associated probability \(p_j\) of taking on some numeric value \(j\) (e.g. energy level), i.e., random variables.
We observe some measurable quantity \(U\) of the entire system (e.g. average temperature).
The probability distribution of the elementary particles observed is the one maximizes the number of ways in which the particles can be arranged such that the system still measures \(U\) (hint: this is the multiplicity \(W\) from above, which is equivalent to maximum entropy).
We'll make this more precise, but first let's look at some examples.
Dice: Given a die with \(j=1,2,3,...,m\) faces, roll this die N times, compute the average value of the faces you see. What you will find is that the maximum entropy principle predicts the probabilities of rolling each face of the die. In general, this will be exponential or flat in the case of unbiased die.
Thermal system, canonical ensemble; temperature known: Given N particles in a thermodynamic system, the numeric value of each particle is the energy state \(\varepsilon_j\) of each particle. Given a temperature T, which is equivalent to knowing the average energy, maximum entropy predicts the Boltzmann distribution, \(p_j \propto \text{exp}[\varepsilon_i/(kT)]\), which is what we observe.
Waiting Time Processes: Consider you are watching cars pass by on a road and you measure the time between cars passing by as \(\tau_j\). After observing \(N\) cars, you measure the average waiting times between cars \(T/N = E(\tau)\) where \(T\) is the total waiting time. What you will observe is that again maximum entropy predicts that the wait times will be exponentially distributed \(\text{exp}(\lambda\tau_j)\).
In each of these situations maximum entropy is observed to be maximizing the number of ways you can arrange the elementary units such that the given constraint (\(U\)) is satisfied. In other words, we want to maximize the quantity \(W\) known as multiplicity which is the number of ways in which the system can realize the observable \(U\) from the elementary units.
4.1 Defining Multiplicity
Briefly repeating a variation of the previous section, if we have \(N\) elementary units, each of which can take on \(m\) different values, given a set of observations \(n_1, n_2, ... n_m\) where \(\sum_{i=1}^m n_i=N\), we can count the number of ways they can be arranged as the multiplicity (same as Equation 5):
Assuming that \(N\) is large, we would expect \(\frac{n_i}{N} \approx p_i\), the probability of each elementary unit taking on value \(i\). Using an alternate form of Sterling's approximation for large \(N\) (we drop \(\sqrt{2\pi n}\) factor since when we later take logarithms it is negligible):
Plugging this into Equation 9, we get:
Which you'll notice already resembles the exponentiated form of entropy we expect.
The definition of multiplicity in Equation 11 defines the number of ways the system can realize particular values of \(n_1, n_2, \ldots, n_m\). However, we don't just want an arbitrary configuration, we want the one that satisfies our observation \(U\) (e.g. expected value). That is, only count configurations that satisfy the constraint \(U\). We'll denote a multiplicity that satisfies \(U\) as \(W(p_1,\ldots, p_m, U)\) or simply \(W(U)\) when the context is clear.
Our goal now is to find the functional form of a new quantity we'll call entropy \(S[W(p_1,\ldots, p_m, U)]\), such that its extremum picks out the particular set of \(p_1,\ldots, p_m\) that maximize \(W(p_1,\ldots, p_m, U)\). From here, you can already see that the logarithm of Equation 11 will probably work out, but we'll show that this is actually the only choice that works.
4.2 Showing Entropy is Extensive
An extensive property \(P(\mathcal{R})\) of a system \(\mathcal{R}\) has these conditions:

Additivity: If the system \(\mathcal{R}\) can be divided into two subsystems \(\mathcal{R}_1\) and \(\mathcal{R}_2\) then:
\begin{equation*} P(\mathcal{R}) = P(\mathcal{R}_1) + P(\mathcal{R}_2) \tag{12} \end{equation*} 
Scalability: If the size of the system \(\mathcal{R}\) is scaled by a positive factor \(\alpha\) then:
\begin{equation*} P(\alpha \mathcal{R}) = \alpha P(\mathcal{R}) \tag{13} \end{equation*}
We'll start by showing the first property since the second one follows from our end result.
We wish to find entropy \(S(p_1, \ldots, p_m)\) that is maximal where \(W\) is maximal that also satisfies the following conditions:
where \(x_j\) is the \(j^{th}\) value of the random variable for each elementary unit. Equation 14 just says that \(p_j\) form a probability distribution and that the multiplicity satisfies our observed measurement  the average value of the observations (e.g. temperature).
Since we wish to find a maximum under constraints, we'll use Lagrange multipliers. Recall that we can set up the Lagrangian as:
where \(\alpha, \lambda\) are our Lagrange multipliers for the constraints in Equation 14, which also include the constants on the RHS. The extrema can be found by finding where each of the partial derivatives equals to zero. Taking the partial with respect to \(p_j\) and setting to zero gives us:
The total differential that gives the infinitesimal variation for \(S\) can be written by plugging in Equation 16:
Now here comes the argument for why entropy is extensive: Let's arbitrarily partition our system into two subsystems \(a\) and \(b\). Each subsystem will have \(N_a\) and \(N_b\) elementary units (e.g. particles), each of which can have multiplicity \(W_a(U_a)\) and \(W_b(U_b)\) respectively for given observations \(U_a\) and \(U_b\). To make it even more general, the number of different values for each subsystem can also be different with \(m_a\) and \(m_b\) potentially being different. Since it is a partition, we have \(N=N_a+N_B\) and \(W(U) = W_a(U_a)W_b(U_b)\) where the second one follows from simply counting all the combined possibilities.
Similarly, each subsystem will have constraints that mirror Equation 14/16 (probability constraint and observed average value). Thus, we can use Equation 17 to see that the total differential for each subsystem is:
But since the two subsystems are a partition of the total system, we can write the total differential for the entire system as a function of all the component parts \(S(p_{1a},\ldots, p_{ma}, p_{1b}, \ldots, p_{mb})\) with the four different constraints (two from each system):
Notice that we did not make any assumptions about the form of entropy, the only assumption we made is about the relation to a physical system. Equation 19 shows (with some integration) that entropy is additive:
where \(C\) is a constant. The scaling can be shown to be satisfied once we find out that our functional form is a logarithm since increasing the number of particles in a system by \(\alpha\) exponentiates the multiplicity \(W(U)^\alpha\). Thus entropy is extensive.
4.3 Deriving the Functional Form of Entropy
Once we have shown entropy is additive, we can do some manipulation to show it must have a logarithmic form. First, let's simply the notation \(u := W_a(U_a), v := W_b(U_b), r := W(U) = W_aW_b = uv\). Rewriting Equation 20 with this notation:
We can take the derivative of the left side with respect to \(v\):
Now taking the derivative of the right hand side of Equation 21, we get:
Equating Equation 22/23:
Symmetrically if we take the derivatives with respect to \(u\) in Equation 21, we also get:
Equating Equation 24/25 using \(\frac{dS}{dr}\), we have:
where \(k\) is a constant. The reason they are equal to a constant is the left side is a function only of \(u\), while the right hand side is only a function of \(v\), thus the only way two arbitrary functions of different variables can be equal is if they are equal to the same constant.
Taking one side, we can solve the differential equation:
where \(C_a\) is the constant of integration. You also get a similar result for the other side. Putting it together:
We are free to choose the constant of integration such as \(S(1) = 0\), which sets \(C'=0\). Finally, plugging back the expression for \(W\) from Equation 11 in:
We can define \(k'\) however we wish, so let's set it to \(k' = 1\), thus we get the our expected expression for entropy.
5 Jaynes' Derivation
Jaynes [1] has another derivation that is somewhat similar to the physical derivation except he starts with desiderata of what we would like from an entropy measure. Instead of elementary particles, he shows using the rules of probability that an event can be recursively broken down into "sub events" showing that entropy must be additive. From there, he is a bit more careful showing that entropy and the multiplicityequivalent variable would logarithmic if we assumed it to be continuous. But since the inputs are integers (because they are multiplicities), you also have to assume entropy is monotonically increasing with respect to the multiplicity. In the end he shows that entropy is indeed logarithmic as expected.
I won't go into the gory details because it's quite involved and I think it's a bit too technical to gain that much more intuition beyond the two derivations above. Please do check out [1] though if you're interested, it's always a pleasure reading Jaynes.
6 Conclusion
Well I'm glad I got that post out of the way. As soon as I read that appendix in [3], I knew I had to write about the derivation. Along the way I found Jaynes' derivation in [1], which upon closer inspection also included the Wallis derivation. As with every topic, you can go down an unlimited depth (and this one is a deep dive on an already "elementary" topic). For now, I'm satisfied with just explaining two derivations, which give me a better appreciation for the beauty and "surprise" of (maximum) entropy. Stayed tuned for more (short to medium sized) posts!
7 References
[1] E. T. Jaynes, "Probability Theory: The Logic of Science", Cambridge, 2006.
[2] Wikipedia: Principle of Maximum Entropy
[3] Dill, K. A., & Bromberg, S. (2011). Molecular Dynamics (Appendix E). CRC Press.