The Calculus of Variations
This post describes a specialized type of calculus called variational calculus. Analogous to the usual calculus we learn in university, it deals with functions of functions and how to minimize or maximize them. It's used extensively in physics, for example to find the minimum energy path a particle takes under certain conditions. As you can imagine, it's also used in machine learning/statistics, where you often want to find a density that optimizes an objective [1]. The explanation I'm going to use (at least for the first part) is heavily based upon Svetitsky's Notes on Functionals, which is the most intuitive explanation I've read so far. I'll try to follow Svetitsky's notes to give some intuition on how we arrive at variational calculus from regular calculus, with a bunch of examples along the way. Eventually we'll get to an application that relates back to probability. I think with the right intuition and explanation, it's actually not too difficult. Enjoy!
From multivariable functions to functionals
Consider a regular scalar function \(F(y)\): it maps a single value \(y\) to a single value \(F(y)\). You can differentiate it to get \(\frac{dF}{dy} = F'(y)\). Another way to think about it: starting at \(y_1\), move a tiny step \(dy\) away; then \(F\) will change by its differential:
\begin{equation*} dF = \frac{dF}{dy} \bigg|_{y_1} dy = F'(y_1)\, dy \tag{1} \end{equation*}
Now let's consider the same situation with a multivariable function \(F(y_1, y_2, \ldots, y_n)\). It maps a set of values \(y_1, \ldots, y_n\) to a single value \(F(y_1, y_2, \ldots, y_n)\). You can also differentiate it by taking partial derivatives: \(\frac{\partial F}{\partial y_1}, \ldots, \frac{\partial F}{\partial y_n}\). Similarly, if I move a tiny step \(dy_1, \ldots, dy_n\) in each direction, starting from the point \((y_1^1, \ldots, y_n^1)\), the total differential of the function is:
\begin{equation*} dF = \frac{\partial F}{\partial y_1} dy_1 + \frac{\partial F}{\partial y_2} dy_2 + \ldots + \frac{\partial F}{\partial y_n} dy_n \tag{2} \end{equation*}
So far we have only touched upon derivatives and differentials of a function \(F(y_1, y_2, \ldots, y_n)\) with independent variables \(y_1, y_2, \ldots, y_n\). If you squint hard enough, you may already start to see where we're going with this: we can view the input to the function not just as independent variables, but collectively as a group of variables derived from a function \(y\) (i.e. \(y_i = y(i), i \in \mathbb{N}\)). If we have an infinite set of these variables, then \(F\) is no longer just a function of independent variables but a function of a function \(y(i)\).
Of course this is just some intuition and not at all precise, but let's keep going to see where this can take us. Our independent points can be seen as sampling some function \(y\) on some interval \([a, b]\). Let's say we want to sample \(N\) points, each \(\epsilon\) distance apart, i.e. \(N\epsilon = b - a\). The \(n^{th}\) point will be at \(x_n = a + n\epsilon\), so our sample point becomes \(y_n = y(x_n) = y(a + n\epsilon)\). As \(N \rightarrow \infty, \epsilon \rightarrow 0\), our sampled points \(y_n\) become an increasingly accurate representation of our original function \(y(x)\).
Our original "function" \(F\) can now be thought of as a function of a set of variables \(\{y_n\}\) resulting in \(F(\{y_n\})\). As \(N \rightarrow \infty\), the original function \(F\) transforms into a function of an infinite set of variables, or in another way, a function of a function. The mapping is called a functional, which we generally write with square brackets \(F[y]\). So instead of taking a fixed number of independent variables and outputting a value, it takes an infinite number of variables, defined by \(y(x)\) on the interval \([a,b]\), and outputs a value!
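To make this limiting process concrete, here's a small numeric sketch (the Riemann-sum functional below is my own illustrative choice, not from Svetitsky's notes): as \(N\) grows, a plain multivariable function of \(N\) sampled variables \(\{y_n\}\) converges to a value determined by the whole function \(y(x)\).

```python
import numpy as np

a, b = 0.0, 1.0

def F_discrete(y_samples, eps):
    """A plain function of N variables: a Riemann sum over the sampled values."""
    return y_samples.sum() * eps

for N in (10, 100, 10_000):
    eps = (b - a) / N
    x = a + np.arange(N) * eps            # sample points x_n = a + n*eps
    print(N, F_discrete(np.sin(x), eps))  # approaches 1 - cos(1) ≈ 0.4597 as N grows
```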
Let's pause for a second and summarize:
A function \(y(x)\) takes as input a number \(x\) and returns a number.
A functional \(F[y]\) takes as input a function \(y(x)\) and returns a number.
Another way to say this is \(F[y(x)]\) takes all the values of \(y(x)\) at all the values of \(x\) (for that \(y\)) and maps it to a single number. Let's take a look at a few examples to make things concrete.
Example 1: A simple functional
The simplest functional just evaluates the input function at a particular value.
Define \(F[y] = y(3)\).
\(F[y=x] = 3\)
\(F[y=x^2] = (3)^2 = 9\)
\(F[y=\log_3(x)] = \log_3(3) = 1\)
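In code, a functional is just a higher-order function: it takes a function as input and returns a number. A minimal sketch of this example:

```python
import math

def F(y):
    """F[y] = y(3): evaluate the input function at x = 3."""
    return y(3.0)

print(F(lambda x: x))               # 3.0
print(F(lambda x: x**2))            # 9.0
print(F(lambda x: math.log(x, 3)))  # 1.0
```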
Example 2: An integral functional
Many useful functionals will take a definite integral of the input function as a means to map it to a number.
Define \(F[y] = \int_0^1 y(x) dx\).
\(F[y=x] = \int_0^1 x dx = \frac{1}{2}\)
\(F[y=x^2] = \int_0^1 x^2 dx = \frac{1}{3}\)
\(F[y=e^x] = \int_0^1 e^x dx = e - 1\)
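Numerically, an integral functional can be approximated with quadrature; the midpoint-rule helper below is my own sketch:

```python
import numpy as np

def integral(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    x = a + (np.arange(n) + 0.5) * h
    return f(x).sum() * h

def F(y):
    """F[y] = integral of y(x) over [0, 1]."""
    return integral(y, 0.0, 1.0)

print(F(lambda x: x))      # ≈ 1/2
print(F(lambda x: x**2))   # ≈ 1/3
print(F(np.exp))           # ≈ e - 1
```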
Example 3: Functionals with derivatives
Since we're dealing with functions as inputs, the functional can also involve the derivative of an input function.
A functional that defines the arc length of a curve:
\begin{equation*} F[y] = \int_a^b \sqrt{1 + y'(x)^2}\, dx \end{equation*}
Notice that we're using the derivative of \(y(x)\) instead of the function itself. Solving for \(a=0, b=1, y=x\), we get:
\begin{equation*} F[y=x] = \int_0^1 \sqrt{1 + 1^2}\, dx = \sqrt{2} \end{equation*}
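We can check the \(\sqrt{2}\) result numerically by approximating the curve with a fine polyline, whose total segment length converges to the arc-length integral:

```python
import numpy as np

def arc_length(f, a, b, n=100_000):
    """Approximate the arc-length functional by summing polyline segment lengths."""
    x = np.linspace(a, b, n)
    y = f(x)
    return np.sqrt(np.diff(x)**2 + np.diff(y)**2).sum()

print(arc_length(lambda x: x, 0.0, 1.0))  # ≈ sqrt(2) ≈ 1.41421
```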
A common form of functional that appears in many contexts is:
\begin{equation*} F[y] = \int_a^b L(x, y(x), y'(x))\, dx \tag{3} \end{equation*}
This is mostly just saying that \(y(x)\) is well behaved over \(x \in [a,b]\). In more detail, we want these conditions to be satisfied for any \(y(x)\) we plug in: \(y(x)\) is single-valued and smooth, so that \(y'(x)\) exists and the integral defined in Equation 3 is well defined, and the boundary conditions (\(x \in [a,b]\), \(a \leq b\), \(y(a)=\hat{y}_a\), \(y(b)=\hat{y}_b\)) are satisfied.
Functional Derivatives
Now it's finally time to do something useful with functionals! As with regular calculus, whose premier application is finding minima and maxima, we also want to be able to find the extrema of functionals. It turns out we can define something analogous to a derivative unsurprisingly called a functional derivative. Let's see how we can intuitively build it up from the same multivariable function from above.
Let's take another look at the total differential in Equation 2, but this time rewrite it as a sum:
\begin{equation*} dF = \sum_{i=1}^{N} \frac{\partial F}{\partial y_i} dy_i \tag{4} \end{equation*}
Again, if you squint hard enough, as \(N \rightarrow \infty\), \(y_i\) approximates our \(y(x)\), and our sum turns into an integral of a continuous function (recall the domain of our function \(y(x)\) was \(x \in [a,b]\)):
\begin{equation*} dF = \sum_{i=1}^{N} \left( \frac{1}{\epsilon} \frac{\partial F}{\partial y_i} \right) \delta y_i\, \epsilon \quad \xrightarrow{N \rightarrow \infty} \quad \delta F = \int_a^b \frac{\delta F}{\delta y(x)}\, \delta y(x)\, dx \tag{5} \end{equation*}
The meaning of Equation 5 is the same as Equation 4: a small change in \(F\) is proportional to a sum of small changes in each direction \(\delta y(x)\) (step size) multiplied by the derivative in each direction \(\frac{\delta F}{\delta y(x)}\) (slope), where we can think of \(x\) as a continuous index (analogous to \(i\)). As a result, the functional derivative is defined by:
\begin{equation*} \frac{\delta F}{\delta y(x_i)} = \lim_{\epsilon \rightarrow 0} \frac{1}{\epsilon} \frac{\partial F}{\partial y_i} \tag{6} \end{equation*}
This is analogous to the derivative at each of the "independent variables" \(y(x)\), which is the same concept as the gradient for multivariate functions.
Equation 5 then becomes a directional derivative, which we can interpret as the rate of change of \(F\) as we move through the "point" \(y^1(x)\) in the direction of \(\delta y(x)\) (check out this tutorial on directional derivatives for a good intuitive refresher on the subject).
This explanation takes us from gradients to functional derivatives, but we can also define the functional derivative in terms of limits. Using the directional-derivative analogy from above, if we are at the multivariate "point" \(y(x)\) moving in the multivariate "direction" of an arbitrary function \(\eta(x)\), then we can formulate the limit as:
\begin{equation*} \frac{dF[y + \epsilon \eta]}{d\epsilon} \bigg|_{\epsilon=0} = \lim_{\epsilon \rightarrow 0} \frac{F[y + \epsilon \eta] - F[y]}{\epsilon} = \int_a^b \frac{\delta F}{\delta y(x)}\, \eta(x)\, dx \tag{7} \end{equation*}
which, if you think about it hard enough, results in the same integral as Equation 5. This also means the functional derivative is itself a function (a natural extension of the multivariate analogy, where the number of "points" is infinite within an interval). We also define \(\delta y = \epsilon \eta(x)\) and call it the variation of \(y\).
Of course, there's no guarantee that the functional derivative exists. That's where formal definitions and rigorous mathematics come in, which is beyond the scope of this post. Also important to mention is that we can have higher order functional derivatives, which can be defined in a very similar way. For now, let's just focus on simple cases where everything plays nicely.
Why the name variational calculus?
A variation of a functional is the small change in a functional's value due to a small change in the functional's input. It's the analogous concept to a differential for regular calculus.
We've already seen an example of a variation in Equation 5, which is the first variation of the functional \(F\):
\begin{equation*} \delta F = \int_a^b \frac{\delta F}{\delta y(x)}\, \delta y(x)\, dx = \int_a^b \frac{\delta F}{\delta y(x)}\, \epsilon \eta(x)\, dx \tag{8} \end{equation*}
As mentioned above, the term \(\epsilon \eta(x)\) is also called a variation of the input \(y\), analogous to an infinitesimally small step \(dy\) in regular calculus.
The first variation and higher-order variations define the respective functional derivatives and can be derived by taking the coefficients of the Taylor series expansion of the functional. More details can be found in Advanced Variational Methods in Mechanics, Chapter 1: Variational Calculus Overview.
Example 4: Computing a simple functional derivative
Let's try to find the functional derivative of a simple functional:
\begin{equation*} F[y] = \int_a^b y(x)^2\, dx \tag{9} \end{equation*}
We can calculate this by going back to Equation 5 and computing the first variation \(dF = F[y + \delta y] - F[y]\). Start by computing \(F[y + \delta y]\):
\begin{equation*} F[y + \delta y] = \int_a^b \big( y(x) + \delta y(x) \big)^2 dx = \int_a^b y(x)^2\, dx + \int_a^b 2\, y(x)\, \delta y(x)\, dx + \int_a^b \delta y(x)^2\, dx \tag{10} \end{equation*}
The last term in Equation 10 is second order in \(\delta y\), so in the limit \(\delta y \rightarrow 0\) we can drop it. Finally, computing \(dF\) we have:
\begin{equation*} dF = F[y + \delta y] - F[y] = \int_a^b 2\, y(x)\, \delta y(x)\, dx \tag{11} \end{equation*}
By inspection, we can see Equation 11 resembles Equation 5, thus our functional derivative is \(\frac{\delta F}{\delta y(x)} = 2y(x)\).
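A numeric sanity check of this result (with an arbitrarily chosen \(y\) and direction \(\eta\), both my own choices): the limit definition should agree with plugging \(\frac{\delta F}{\delta y(x)} = 2y(x)\) into Equation 5.

```python
import numpy as np

a, b, n = 0.0, 1.0, 200_001
x = np.linspace(a, b, n)
dx = x[1] - x[0]

def trapz(v):
    """Trapezoid rule over the fixed grid."""
    return (v.sum() - 0.5 * (v[0] + v[-1])) * dx

def F(y):
    """F[y] = integral of y(x)^2 over [a, b]."""
    return trapz(y**2)

y = np.sin(x)           # an arbitrary "point" y(x)
eta = x * (1.0 - x)     # an arbitrary direction eta(x)
eps = 1e-6

lhs = (F(y + eps * eta) - F(y)) / eps   # the limit definition
rhs = trapz(2 * y * eta)                # integral of (delta F / delta y) * eta
print(lhs, rhs)                         # the two should agree closely
```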
Euler-Lagrange Equation
Armed with the definition of a functional derivative, we now know how to compute it from first principles. However, as with regular calculus, computing a derivative from the definition can get tedious. Fortunately, there is a result that can help us compute the functional derivative, called the Euler-Lagrange equation, which states (roughly):
For a given function \(y(x)\) with a real argument \(x\), the functional:
\begin{equation*} F[y] = \int_a^b L(x, y(x), y'(x)) dx \tag{12} \end{equation*}has functional derivative given by:
\begin{equation*} \frac{\delta F}{\delta y(x)} = \frac{\partial L}{\partial y} - \frac{d}{dx} \frac{\partial L}{\partial y'} \tag{13} \end{equation*}
You can derive this equation using the method we used above (and a few extra tricks) but I'll leave it as an exercise :) For simple functionals like Equation 12, this is a very handy way to compute functional derivatives. Let's take a look at a couple more complicated examples.
Example 5: Use the Euler-Lagrange Equation to find the functional derivative of \(F[y(x)] = \int_0^1 x^3 e^{y(x)} dx\)
Notice the second term in Equation 13 involves only \(y'\), which doesn't appear in our functional, so that term is 0. Thus, the functional derivative in this case just treats \(y\) as a variable and the usual rules of differentiation apply:
\begin{equation*} \frac{\delta F}{\delta y(x)} = \frac{\partial L}{\partial y} = x^3 e^{y(x)} \tag{14} \end{equation*}
Example 6: Use the Euler-Lagrange Equation to find the functional derivative of \(F[y(x)] = \int_0^1 x^2 y^3 y'^4 dx\)
Here we have to use all terms from Equation 13 and treat \(y\) and \(y'\) as "independent" variables:
\begin{align*} \frac{\delta F}{\delta y(x)} &= \frac{\partial L}{\partial y} - \frac{d}{dx} \frac{\partial L}{\partial y'} \\ &= 3x^2 y^2 y'^4 - \frac{d}{dx} \left( 4 x^2 y^3 y'^3 \right) \\ &= 3x^2 y^2 y'^4 - 8 x y^3 y'^3 - 12 x^2 y^2 y'^4 - 12 x^2 y^3 y'^2 y'' \\ &= -9 x^2 y^2 y'^4 - 8 x y^3 y'^3 - 12 x^2 y^3 y'^2 y'' \tag{15} \end{align*}
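This kind of algebra is easy to get wrong, so it's worth verifying with a computer algebra system. SymPy's `sympy.calculus.euler.euler_equations` helper evaluates the right-hand side of Equation 13 directly:

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
y = sp.Function('y')

# Integrand of Example 6: L = x^2 * y^3 * (y')^4
L = x**2 * y(x)**3 * sp.diff(y(x), x)**4

# euler_equations returns [Eq(dL/dy - d/dx dL/dy', 0)]
eqs = euler_equations(L, y(x), x)
print(sp.simplify(eqs[0].lhs))
```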
Extrema of Functionals
As an exercise this is interesting enough, but the real application is minimizing or maximizing a functional. In a similar way to how we find a point that is an extremum of a function, we can also find a function that is an extremum of a functional.
It turns out it's pretty much what you would expect: if we set the functional derivative to zero, we find a stationary point of the functional, where we possibly have a local minimum or maximum (i.e. a necessary condition for an extremum; we might also find a saddle point). In other words, this is a place where the "slope" is zero. Let's take a look at a classic example.
Example 7: Find the shortest possible curve between the points \((a,c)\) and \((b,d)\) for which the path length along the curve is defined by \(L[f] = \int_a^b \sqrt{1 + f'(x)^2} dx\)
First define the integrand of our functional:
\begin{equation*} L(x, y, y') = \sqrt{1 + y'^2} \tag{16} \end{equation*}
where \((x, y, y') = (x, f(x), f'(x))\). Precomputing the partial derivatives of \(L\):
\begin{align*} \frac{\partial L}{\partial y} &= 0 \\ \frac{\partial L}{\partial y'} &= \frac{y'}{\sqrt{1 + y'^2}} \tag{17} \end{align*}
Plugging them into Equation 13 and setting the functional derivative to zero, we can simplify the resulting differential equation:
\begin{align*} 0 - \frac{d}{dx} \frac{y'}{\sqrt{1 + y'^2}} &= 0 \\ \implies \frac{y'}{\sqrt{1 + y'^2}} &= C \\ \implies y' &= \frac{C}{\sqrt{1 - C^2}} = A \\ \implies y &= Ax + B \tag{18} \end{align*}
where we introduce a constant \(C\) after the first integration, define a new constant \(A\) as the constant expression involving \(C\), and introduce a new constant \(B\) from the second integration. As you would expect, the shortest distance between two points is a straight line (but now you can prove it!).
We can find the actual values of the constants \(A\) and \(B\) by using our boundary conditions \((a,c)\) and \((b,d)\), since we know the function has to pass through these points (i.e. compute the slope and intercept of the line):
\begin{align*} A &= \frac{d - c}{b - a} \\ B &= c - aA = \frac{bc - ad}{b - a} \tag{19} \end{align*}
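We can also confirm the straight-line solution numerically: discretize \(y\) on a grid, pin the endpoint values, and minimize the polyline length directly. A sketch (the endpoint values here are arbitrary illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

a, b, c, d = 0.0, 1.0, 0.0, 2.0   # endpoints (a, c) and (b, d)
n = 50
x = np.linspace(a, b, n)

def path_length(y_interior):
    """Discretized arc-length functional with fixed boundary values."""
    y = np.concatenate(([c], y_interior, [d]))
    return np.sqrt(np.diff(x)**2 + np.diff(y)**2).sum()

res = minimize(path_length, np.zeros(n - 2))
straight = c + (d - c) * (x[1:-1] - a) / (b - a)  # y = Ax + B through the endpoints
print(np.max(np.abs(res.x - straight)))           # ≈ 0: the minimizer is the straight line
```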
Now that we have a method to solve the general problem of finding extrema for a functional, we can add constraints in the mix. As you may have guessed, we can use the concept of Lagrange multipliers here (see my previous post on Lagrange Multipliers).
Given a functional in the form of Equation 12, we can add different types of constraints. The simplest type is a functional constraint of the form:
\begin{equation*} G[y] = \int_a^b M(x, y(x), y'(x))\, dx = C \tag{20} \end{equation*}
In this case, the solution resembles the usual method of Lagrange multipliers. We can solve this problem by building a new functional in the same vein as a Lagrangian:
\begin{equation*} \Lambda[y] = F[y] + \lambda \left( G[y] - C \right) \tag{21} \end{equation*}
Using the Euler-Lagrange equation, we can solve Equation 21 for a function \(y\) and a constant \(\lambda\), keeping in mind we are given boundary conditions at \(a\) and \(b\), as well as Equation 20, to help us solve for all the constants. This method also naturally extends to multiple constraints as you would expect.
The other type of constraint is a pointwise constraint on the actual function (similar to regular Lagrange multipliers):
\begin{equation*} g(x, y(x)) = 0 \tag{22} \end{equation*}
Here, a Lagrange multiplier function needs to be introduced, and the Lagrangian becomes:
\begin{equation*} \Lambda[y, \lambda] = \int_a^b \Big( L(x, y, y') + \lambda(x)\, g(x, y) \Big)\, dx \tag{23} \end{equation*}
Again, we can use the Euler-Lagrange equation to solve Equation 23, except we'll get a system of differential equations to solve (you need to take the functional derivative with respect to both \(y(x)\) and \(\lambda(x)\)).
And at long last, we can finally get to solving some interesting problems in probability! Let's take a look at a couple of examples of finding maximum entropy distributions under different constraints (check out my previous post on Maximum Entropy Distributions).
Example 8: Find the continuous maximum entropy distribution with support \([a,b]\).
This is actually the same example that appeared in my post on Maximum Entropy Distributions, but let's take another look.
First, since we're finding the maximum entropy distribution, we define the differential entropy functional (we use \(H\) here to denote entropy):
\begin{equation*} H[f] = -\int_a^b f(x) \ln f(x)\, dx \tag{24} \end{equation*}
Next, we define a functional constraint that our density must integrate to 1:
\begin{equation*} \int_a^b f(x)\, dx = 1 \tag{25} \end{equation*}
Now put together the Lagrangian equivalent:
\begin{equation*} \Lambda[f] = -\int_a^b f(x) \ln f(x)\, dx + \lambda \left( \int_a^b f(x)\, dx - 1 \right) \tag{26} \end{equation*}
Using the Euler-Lagrange equation to find the maximum and noticing we have no derivatives of \(f(x)\), we get:
\begin{align*} \frac{\delta \Lambda}{\delta f(x)} = -\ln f(x) - 1 + \lambda &= 0 \\ \implies f(x) &= e^{\lambda - 1} \tag{27} \end{align*}
Plugging this into our constraint in Equation 25:
\begin{equation*} \int_a^b e^{\lambda - 1}\, dx = (b - a)\, e^{\lambda - 1} = 1 \implies e^{\lambda - 1} = \frac{1}{b - a} \tag{28} \end{equation*}
Now substituting back into Equation 27, we get:
\begin{equation*} f(x) = \frac{1}{b - a} \tag{29} \end{equation*}
This is nothing more than a uniform distribution on the interval \([a,b]\). This means that given no other knowledge of a distribution (except its support), the principle of maximum entropy says we should assume it's a uniform distribution.
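A quick numeric illustration of this result (my own construction): discretize candidate densities on an interval, normalize each so it integrates to 1, and compare differential entropies; the uniform density should come out on top.

```python
import numpy as np

a, b, n = 0.0, 2.0, 10_000
h = (b - a) / n
x = a + (np.arange(n) + 0.5) * h   # midpoint grid on [a, b]

def entropy(f):
    """Differential entropy of f after normalizing it to integrate to 1."""
    p = f(x)
    p = p / (p.sum() * h)
    return -(p * np.log(p)).sum() * h

H_uniform = entropy(lambda x: np.ones_like(x))
print(H_uniform, np.log(b - a))                   # uniform entropy ≈ ln(b - a)
print(entropy(lambda x: x + 0.1) < H_uniform)     # True
print(entropy(lambda x: np.exp(-x)) < H_uniform)  # True
```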
Example 9: Find the continuous maximum entropy distribution with support \((-\infty, \infty)\), \(E[x] = \mu\) and \(E[(x-\mu)^2] = \sigma^2\).
You may already be able to guess what kind of distribution we should end up with when the mean and variance are specified; let's see if you're right.
We'll just transform the variance constraint into a second-moment constraint to make this a bit more symmetric:
\begin{equation*} E[(x - \mu)^2] = E[x^2] - \mu^2 = \sigma^2 \implies E[x^2] = \sigma^2 + \mu^2 \end{equation*}
Given this objective functional and associated constraints:
\begin{align*} H[f] &= -\int_{-\infty}^{\infty} f(x) \ln f(x)\, dx \\ \int_{-\infty}^{\infty} f(x)\, dx &= 1 \\ \int_{-\infty}^{\infty} x\, f(x)\, dx &= \mu \\ \int_{-\infty}^{\infty} x^2 f(x)\, dx &= \sigma^2 + \mu^2 \tag{30} \end{align*}
we can put together the Lagrangian functional:
\begin{align*} \Lambda[f] = -\int_{-\infty}^{\infty} f(x) \ln f(x)\, dx &+ \lambda_0 \left( \int_{-\infty}^{\infty} f(x)\, dx - 1 \right) + \lambda_1 \left( \int_{-\infty}^{\infty} x\, f(x)\, dx - \mu \right) \\ &+ \lambda_2 \left( \int_{-\infty}^{\infty} x^2 f(x)\, dx - \sigma^2 - \mu^2 \right) \tag{31} \end{align*}
Using the Euler-Lagrange equation again and setting it to 0:
\begin{align*} \frac{\delta \Lambda}{\delta f(x)} = -\ln f(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 x^2 &= 0 \\ \implies f(x) &= e^{\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2} \tag{32} \end{align*}
Now this is not going to work out so nicely when we plug it back into our constraints from Equation 30, because integrals involving \(e^{x^2}\) usually don't have nice antiderivatives. But one thing to notice is that this is basically the form of a Gaussian function (you'll have to do some legwork to complete the square though):
\begin{equation*} f(x) = e^{\lambda_0 - 1 + \lambda_1 x + \lambda_2 x^2} = \alpha\, e^{-\frac{(x - \beta)^2}{\gamma}} \tag{33} \end{equation*}
for some constants \(\alpha, \beta, \gamma\) determined by \(\lambda_0, \lambda_1, \lambda_2\).
Further, we know from the constraints in Equation 30 that the function is normalized to \(1\) with mean \(\mu\) and variance \(\sigma^2\), making this a normal distribution. Thus we can determine the values of the missing coefficients by just matching them against the definition of a normal distribution:
\begin{equation*} f(x) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \tag{34} \end{equation*}
So by the principle of maximum entropy, if we only know the mean and variance of a distribution with support along the real line, we should assume the distribution is normal.
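As a final numeric sketch (my own check, not from the derivation above), the Gaussian's entropy matches the closed form \(\frac{1}{2}\ln(2\pi e \sigma^2)\), and a Laplace density with the same mean and variance has strictly lower entropy:

```python
import numpy as np

mu, sigma = 0.0, 1.0
x = np.linspace(-20.0, 20.0, 400_001)
dx = x[1] - x[0]

def entropy(p):
    """Differential entropy of a density sampled on the grid."""
    return -(p * np.log(p)).sum() * dx

gauss = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
b = sigma / np.sqrt(2)                        # Laplace scale giving variance sigma^2
laplace = np.exp(-np.abs(x - mu) / b) / (2 * b)

print(entropy(gauss), 0.5 * np.log(2 * np.pi * np.e * sigma**2))  # should match
print(entropy(laplace) < entropy(gauss))                          # True
```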
Conclusion
For those of us who aren't math or physics majors (* cough * computer engineers), variational calculus is an important topic that we missed out on. Not only does it have a myriad of applications in physical domains (it's the most common type of problem when searching for "variational calculus"), it also has many applications in statistics and machine learning (you can expect a future post using this topic!). As with most things, once you know enough about the individual parts (multivariable calculus, Lagrange multipliers etc.) the actual topic (variational calculus) isn't too much of a stretch (at least when you're not trying to prove things formally!). I hope this post helps all the non-mathematicians and non-physicists out there.
Further Reading
Previous Posts: Lagrange Multipliers, Max Entropy Distributions
Wikipedia: Calculus of Variations, Functional Derivative, Directional Derivative, Differential of a function, Lagrange multipliers
Directional Derivatives, Paul Dawkins, Paul's Online Math Notes.
What is the practical difference between a differential and a derivative?, Arturo Magidin, Mathematics Stack Exchange.
"Notes on Functionals", B. Svetitsky
"Advanced Variational Methods In Mechanics", Chapter 1: Variational Calculus Overview, University of Colorado at Boulder
Variational Problems, Vietnamese-German University.
[1] As you have probably guessed, this is the primary reason I'm interested in this area of mathematics. A lot of popular ML/statistics techniques have the word "variational" in their name, which they get because they are somehow related to variational calculus.