Learning Lagrangians from Data

A few years ago, I watched in awe as Michael Schmidt gave a presentation on building a robot scientist based on his much-cited paper, “Distilling Free-Form Natural Laws from Experimental Data”. However, what wasn’t clear to me was how his algorithm discovered Lagrangians “without prior knowledge of physics”.

Fast-forward several years, a maths degree and several mathematical physics courses later, and I decided to read this paper a second time. On this second reading, it became clear to me that M. Schmidt and H. Lipson had to use a fitness function that directly incorporated knowledge of Lagrangians in order to find the variational ODEs of interest. In fact, in a carefully argued paper, researchers at MSRI provide a reasonable candidate fitness function for deducing Lagrangians from data as follows [2]:

  1. \mathcal{L} of a system with generalised coordinates x_i, \dot{x_i} (i=1,...,d) solves the Euler-Lagrange equations:

    \frac{d}{dt} (\frac{ \partial \mathcal{L}}{ \partial \dot{x_i}}) -\frac{\partial \mathcal{L}}{\partial x_i}=0

  2. By the chain rule \forall t \in [0,T] , we may develop this equation into:

    EL_i(\mathcal{L},t) := \sum\limits_{j=1}^{d} \frac{ \partial^2 \mathcal{L}}{ \partial x_j \partial \dot{x_i}}\frac{d x_j}{dt} +\sum\limits_{j=1}^{d} \frac{ \partial^2 \mathcal{L}}{ \partial \dot{x_j} \partial \dot{x_i}}\frac{d \dot{x_j}}{dt} - \frac{\partial \mathcal{L}}{\partial x_i}=0

  3. From this, Hillar and Sommer derive the following plausible fitness function:

    LFit(f) := -\frac{1}{N} \sum\limits_{k=1}^{N} \log \Big(1+\sum\limits_{i=1}^{d}|EL_i(f,t_k)|\Big)

Meanwhile, we still have some questions to answer:

  1. What’s a reasonable form for candidate Lagrangians?
  2. Does our fitness function place sufficient constraints on the Lagrangian?

I have no doubt that Hillar & Sommer had reasonable answers to these questions, but it’s D. Hillis et al. who provide a thorough response in their easy-to-read paper [1]. To the first question, Taylor’s theorem provides a reasonable answer:

  1. There exists an approximate polynomial representation for any Lagrangian, and we may assume that the Lagrangian may be represented by a polynomial in the coordinates and velocities.
  2. In order to place bounds on memory and computation, Hillis et al. use a reduced polynomial representation, i.e. there is a maximum power (m) for any individual coordinate or velocity, and a maximum degree (p) for any product of coordinates and velocities.
  3. As an example, for a single variable \theta with m=2,p=3 we have:

    \mathcal{L}(\theta,\dot{\theta}) = c_1 \dot{\theta}^2+c_2 \theta+c_3 \theta^2+c_4 \theta \dot{\theta}^2 
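Such a reduced basis is easy to enumerate. The sketch below is my own illustration (Hillis et al.’s actual code is in Clojure): it lists the exponent tuples for one coordinate and one velocity with m=2, p=3, and the four terms in the example above are a subset of this basis.

```python
from itertools import product

def reduced_poly_terms(n_vars, m, p):
    """Exponent tuples with each individual power <= m and total degree in 1..p."""
    return [e for e in product(range(m + 1), repeat=n_vars)
            if 0 < sum(e) <= p]

# one coordinate and one velocity (theta, theta_dot) with m=2, p=3
terms = reduced_poly_terms(2, 2, 3)
print(terms)  # 7 exponent pairs; (1, 2) is the theta * theta_dot^2 term
```

Each tuple (a, b) stands for the monomial \theta^a \dot{\theta}^b with its own coefficient c_i.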

Using Nelder-Mead to minimise a carefully chosen fitness function in terms of the coefficients c_i , Hillis et al. manage to find suitable Lagrangians for a variety of simple physical systems including the Duffing oscillator, the Penning trap and coupled harmonic oscillators. In fact, they go beyond polynomial representations: in their Generalisation section they explain how they use genetic programming to evolve complex functions from simple ones (sine, cosine, etc.), with a search algorithm that optimises the fitness score while also minimising the size of the ‘phylogenetic trees’.

From the results provided by M. Schmidt & H. Lipson, it appears that they used a very similar approach. However, unlike Hillis, their paper is very badly written and they don’t share the code they used. At best it might be considered a piece of marketing which would later be used to boost M. Schmidt’s data mining company, Nutonian.

Deception aside, I believe that it’s possible to produce innovative research in this area if researchers continue in the tradition of Hillis and Hillar. In particular, I think it might be possible to develop algorithms that find variational PDEs that describe a system’s behaviour and then use these variational PDEs to automatically classify physical systems. This, I believe, would be an important contribution to the physics community.

Note 1: It’s quite easy to search for Lagrangians with a restricted polynomial representation using Python with Scipy and autograd.
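As a minimal illustration of this note (substituting plain central differences for autograd, with harmonic-oscillator data of my own choosing), the Hillar–Sommer style fitness can be evaluated for candidate Lagrangians; the true Lagrangian scores strictly better than a mis-specified one:

```python
import numpy as np

def partial(f, x, v, i, h=1e-4):
    """Central difference of f(x, v) with respect to x (i=0) or v (i=1)."""
    if i == 0:
        return (f(x + h, v) - f(x - h, v)) / (2 * h)
    return (f(x, v + h) - f(x, v - h)) / (2 * h)

def el_residual(L, x, v, a):
    """1-D Euler-Lagrange residual: L_xv * v + L_vv * a - L_x."""
    L_v = lambda x_, v_: partial(L, x_, v_, 1)
    return partial(L_v, x, v, 0) * v + partial(L_v, x, v, 1) * a - partial(L, x, v, 0)

def lfit(L, x, v, a):
    """Fitness in the style of Hillar & Sommer: closer to 0 is better."""
    return -np.mean(np.log(1 + np.abs(el_residual(L, x, v, a))))

# simulated harmonic-oscillator trajectory: x'' = -x
t = np.linspace(0, 10, 200)
x, v, a = np.cos(t), -np.sin(t), -np.cos(t)

true_L  = lambda x, v: 0.5 * v**2 - 0.5 * x**2
wrong_L = lambda x, v: 0.5 * v**2 - 2.0 * x**2

print(lfit(true_L, x, v, a), lfit(wrong_L, x, v, a))  # true Lagrangian scores higher
```

A full search would simply maximise this score over the coefficients c_i (e.g. with scipy’s Nelder-Mead), which is where autograd can replace the finite differences.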

Note 2: Hillis et al. made their code available on Github. Although I don’t use Clojure, I learned a lot from going through the structure of their codebase.


1. D. Hillis et al., “An algorithm for discovering Lagrangians automatically from data”, PeerJ, 20 October 2015.
2. C. Hillar et al., “Comment on the article ‘Distilling Free-Form Natural Laws from Experimental Data’”, arXiv, 26 October 2012.
3. M. Schmidt and H. Lipson, “Distilling Free-Form Natural Laws from Experimental Data”, Science, 3 April 2009.

Are wheels optimal for locomotion?

Given the large number of wheeled robots coming out of modern factories, it’s natural to ask whether wheels are near-optimal for terrestrial locomotion. I believe this question may be partially addressed from a physicist’s perspective, although the general problem of finding near-optimal robot models appears to be intractable.


A reasonable scientific approach to this problem would involve a laboratory setting of some sort. The scientist may set up an inclined ramp and analyse the rolling motion of different compact shapes with the same mass and volume. In particular, it’s natural to ask: of all the possible shapes having the same material, mass and volume, which one reaches the bottom of the ramp first when released from the top?

If we may neglect dissipative forces, we may determine the time for a disk to reach the bottom of the ramp as follows:

  1. By the parallel axis theorem, the moment of inertia of the disk about its line of contact is I = \frac{1}{2}mr^2 + mr^2 = \frac{3}{2}mr^2 , so:

    \begin{aligned} \dot{\omega} = \frac{\tau}{I}=\frac{mgr\sin\theta}{\frac{3}{2} m r^2} = \frac{2g\sin\theta}{3r} \end{aligned}

    where \theta is the ramp’s angle of inclination, m is the mass of the disk and r is the radius of the disk.

  2. As a result, the linear acceleration of the disk is constant:

    \begin{aligned} \ddot{x}= r\dot{\omega} = \frac{2}{3}g\sin\theta \end{aligned}

  3. From this and l = \frac{1}{2}\ddot{x}T^2 we deduce that, given a ramp of length l , the total time to reach the bottom of the ramp is:

    \begin{aligned} T = \sqrt{\frac{3l}{g\sin\theta}} \end{aligned}

This analysis, although useful, is difficult to replicate for other shapes as the moment of inertia will actually depend on the axis chosen. Further, beyond the special case of certain symmetric bodies the total time taken to reach the bottom of the ramp can no longer be determined in an analytic manner. Barring a mathematical method I’m unaware of, we would need to simulate the motion of each body rolling/bouncing down the incline with high numerical precision and then perform an infinite number of comparisons.
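For the symmetric bodies where an analytic treatment does work, there is a standard result: a body with moment of inertia I = kmr^2 rolling without slipping has constant acceleration a = \frac{g\sin\theta}{1+k} , independent of mass and radius. A quick sketch (ramp length and angle are arbitrary choices of mine):

```python
import math

def ramp_time(l, theta, k, g=9.81):
    """Time to roll (without slipping) down a ramp of length l for a body
    with moment of inertia I = k * m * r^2: a = g * sin(theta) / (1 + k)."""
    a = g * math.sin(theta) / (1 + k)
    return math.sqrt(2 * l / a)  # from l = a * T^2 / 2

# I / (m r^2): solid sphere 2/5, solid disk 1/2, hoop 1
times = {shape: ramp_time(2.0, math.pi / 6, k)
         for shape, k in [("sphere", 2 / 5), ("disk", 1 / 2), ("hoop", 1.0)]}
print(times["sphere"] < times["disk"] < times["hoop"])  # True
```

With k = 1/2 for the disk this reproduces T = \sqrt{3l/(g\sin\theta)} , and the ordering sphere < disk < hoop shows that the mass distribution alone decides the race among these symmetric shapes.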


At this point it may seem that our task is impossible, but perhaps we can make progress by asking a different question: of all the possible shapes that move down the incline, which have the most predictable, and hence controllable, motion? The answer is of course the round bodies (e.g. the wheel), and it’s for this reason that engineers build cars with wheels that roll on roads. A shape that reached the bottom of the incline sooner than the wheel would be useless for locomotion if we couldn’t determine where it would land. For these reasons, our mathematical conundrum may be safely ignored.

In contrast, as natural terrain tends to be highly irregular, animal limbs must be highly versatile. I have yet to come across wheels that simultaneously allow grasping, climbing, swimming and jumping. Furthermore, even if there were a large region of land suitable for wheeled locomotion, wheeled animals would inevitably be confined to that region, which would be highly detrimental to their ability to explore new terrain.

Given these arguments, it’s clear that wheeled robots aren’t well suited for terrestrial locomotion. Having said this, can we build robots that are near-optimal for terrestrial locomotion? What would we mean by optimal?


I have thought more broadly about the possibility of using computers to design near-optimal robots for terrestrial locomotion, but the problem appears to be quite hard. The cost function isn’t well defined and the search space is very large. It would make sense to place a prior on the cost function and use a method similar to Bayesian Optimisation. However, the task calls for methods far more sophisticated than the Bayesian Optimisation used in industry today.

I’ll probably revisit this challenge in the future but I think it would make sense to first solve an intermediate challenge in Bayesian Optimisation applied to robotics.

perturbations that preserve constants of motion

simulation of three body problem with random initial conditions


Within the context of the gravitational n-body problem I’m interested in perturbations of the initial conditions  \{p_i(0),\dot{p}_i(0)\}_{i=1}^n    which leave all the constants of motion unchanged.

It’s clear to me that the linear momentum (P) is preserved under any transformation of the initial position vectors, and that the energy (H) is preserved under any isometry applied to the initial position vectors. However, when I add the constraint of conserving the angular momentum (L), the general nature of the transformations that leave (H, L, P) unchanged is no longer clear.

Granted, time translations would preserve the constants of motion but in a trivial manner. Now, I’m not sure whether the approach I’ve taken can be considerably improved but within the context of the three body problem in the plane, I transformed this problem into an optimisation problem as follows:

  1. We first calculate the constants of motion, namely the linear momentum, the energy and the angular momentum, prior to perturbing the position of one of the n bodies. Let’s suppose that these constants are given by C_1, C_2, C_3 , where C_1 , the linear momentum in the plane, is a vector.
  2. Assuming our bodies are restricted to move in the plane, there are a total of 4n scalars that determine the initial conditions of this n-body system. It follows that perturbing the position of one body (q ) generally requires searching for a 4n-2 dimensional vector x of the remaining positions and velocities. This can be done using Nelder-Mead, a derivative-free method, provided that we define a cost function. 
  3. The functions to be minimised are obtained directly from the linear momentum (P), angular momentum (L) and Hamiltonian (H) equations, combined into a single smooth cost function:\begin{aligned} C(x,q) =\|P(x,q)-C_1\|^2+(H(x,q)-C_2)^2+(L(x,q)-C_3)^2 \end{aligned}
  4. In order to increase the probability of convergence, the Nelder-Mead algorithm was randomly seeded for each perturbed position q .

I think this method can be easily extended to n-bodies.

point masses and Gauss’ Law of Gravity

My final year project involves searching for stable orbits for the gravitational n-body problem where we use Newton’s point mass approximation to describe the force field around masses assuming that these are approximately spherical. Now given that this model uses point masses we can’t model rotations. In fact, it’s generally assumed that they are negligible. 

However, I wondered whether this assumption can break down at some point. Consider the following:

  1. On the scale of millennia, the surface of our planet actually behaves like a fluid.
  2. If the distance between two planetary bodies is small and we assume zero rotation, this would imply a growing deviation from sphericity over time. It follows that the point-mass approximation would no longer make sense.
  3. In fact, our bodies would look something like blobs of gooey paint slowly dripping towards each other. 

While it’s true that the probability of exactly zero rotation is close to zero, the challenge of dealing with non-spherical masses is something that Newton’s point-mass approximation can’t handle. However, this is not quite the end of Newton’s model.

Given that the gravitational potential in his model holds for point masses, a natural solution is to integrate over the volume enclosed by a compact body. This is exactly what Gauss did when he discovered the divergence theorem, from which the Gauss Law of Gravity emerges as an immediate corollary. As this handles the case where our masses aren’t spherical, the rest of my blog post will focus on the statement and proof of this theorem.

The divergence theorem:

Given U as the gravitational potential and \vec{g} = -\nabla U as the gravitational field, the divergence theorem (from which Gauss’ Law of Gravitation follows) states:

\begin{aligned} \int_{V} (\nabla \cdot \vec{g}) dV =\int_{\partial V} \vec{g} \cdot d \vec{A}  \end{aligned}

In order to arrive at this result, I’ll first need to introduce two useful definitions:

Flux: the flux \phi of a vector field through a surface is the integral over the surface of the scalar product of the field at each point with the infinitesimal oriented surface element at that point:

\begin{aligned} \phi =  \int_{\partial V} \vec{g} \cdot d \vec{A} \end{aligned}

where \vec{g} denotes the gravitational field, \partial V the boundary of our mass and d \vec{A} is our infinitesimal oriented surface.

Divergence: the divergence represents the volume density of the outward flux of a vector field from an infinitesimal volume around a given point:

\begin{aligned} \nabla \cdot \vec{g} \end{aligned}

The flux and divergence of a field are actually related in a local manner. Consider a box with dimensions \delta x_1,\delta x_2,\delta x_3 . Using Cartesian coordinates, let’s suppose that (x_1,x_2,x_3) denotes a corner of this box. Then the net flux of \vec{g} through the two faces normal to \hat{x_3} is given by:

\begin{aligned} \phi_3 =(g_3(x_1,x_2,x_3+\delta x_3)-g_3(x_1,x_2,x_3)) \delta x_1 \delta x_2 \end{aligned}

For very small \delta x_3 ,

\begin{aligned} g_3(x_1,x_2,x_3+\delta x_3)-g_3(x_1,x_2,x_3) \approx \frac{\partial g_3}{\partial x_3} \delta x_3 \end{aligned}

where we get equality in the limit as \delta x_3 tends to zero.

As a result we have \forall i \in \{1,2,3 \}  :

\begin{aligned} \phi_i \approx \frac{\partial g_i}{\partial x_i} \delta x_1 \delta x_2 \delta x_3 =\frac{\partial g_i}{\partial x_i} \delta V  \end{aligned}

\begin{aligned} \phi =\sum_{i=1}^{3} \phi_i \end{aligned}

Now, we may deduce that the flux and divergence are related as follows:

\begin{aligned} \frac{\phi}{\delta V}\approx  \sum_{i=1}^{3} \frac{\partial g_i}{\partial x_i} \end{aligned}

\begin{aligned}\lim_{\delta V \to 0} \frac{\phi}{\delta V} = \nabla \cdot \vec{g} =  \sum_{i=1}^{3} \frac{\partial g_i}{\partial x_i} \end{aligned}

If we take an arbitrary solid and slice it into N compact volumes V_i with associated surfaces \partial V_i , the fluxes through shared interior surfaces cancel in pairs, so:

\begin{aligned} \phi =\sum_{i=1}^{N} \phi_i =\sum_{i=1}^{N} \int_{\partial V_i} \vec{g} \cdot d \vec{A} =\int_{\partial V} \vec{g} \cdot d \vec{A} \end{aligned}

In addition, using the relationship between flux and divergence we have:

\begin{aligned} \phi =\lim_{N \to \infty} \sum_{i=1}^{N} (\nabla \cdot \vec{g})\big|_{V_i} V_i = \int_{V} \nabla \cdot \vec{g} \, dV \end{aligned}

where each V_i shrinks to a point as N \to \infty .

By equating the last two expressions we conclude the proof of this marvellous result by Gauss which is succinctly summarised on Wolfram as follows:

in the absence of the creation or destruction of matter, the density within a region of space can change only by having it flow into or away from the region through its boundary
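The theorem is easy to verify numerically for a concrete field. Taking \vec{g}(\vec{x}) = \vec{x} (an illustrative choice of mine, not a gravitational field), the divergence is 3, so the volume integral over the unit ball is 4\pi ; on the unit sphere \vec{g}\cdot\hat{n} = 1 , so the outward flux is the sphere’s area, also 4\pi :

```python
import numpy as np

rng = np.random.default_rng(0)

# Volume side: g(x) = x has divergence 3 everywhere, so the integral over
# the unit ball is 3 * (4/3) * pi = 4 * pi. Estimate it by Monte Carlo
# sampling over the bounding box [-1, 1]^3.
N = 500_000
pts = rng.uniform(-1.0, 1.0, size=(N, 3))
inside = (pts**2).sum(axis=1) <= 1.0
vol_integral = 3.0 * 8.0 * inside.mean()  # divergence * box volume * hit fraction

# Surface side: on the unit sphere g . n = x . x = 1, so the outward flux
# is just the area of the sphere, 4 * pi.
surf_integral = 4.0 * np.pi

print(abs(vol_integral - surf_integral))  # small Monte Carlo error
```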

At this point it’s natural to ask how we can describe the gravitational field, which is defined in terms of the gravitational potential. In general, for non-spherical masses we can no longer use Newton’s usual inverse-distance gravitational potential.

In some sense we’ll need an infinite sum of perturbations in the gravitational potential which can be done using spherical harmonics, another wonderful mathematical invention that has many applications. This shall be the subject of my next blog post. 



dynamics of a rotating coin

Sometimes when I’m bored in a waiting area, I would take a coin out of my pocket and spin it on a table. I never really tried to figure out what was going on. But, recently I wondered about a few things:

1. Can I determine the critical rotational speed that determines the instant when vertical displacement is involved?
2. Assuming that the initial rotational impulse is known, can I predict the exact location where the coin lands?
3. Can two coins with the same dimensions but different mass stop spinning at the same instant assuming that they both receive the same rotational impulse?

More precisely :

We assume that a coin of radius R and height h , where R \gg h , is initially standing on a flat surface prior to receiving a rotational impulse in a plane parallel to the surface. Further, we assume that the coin has mass M with uniform mass distribution. Empirically, I observed that there are two regimes:

1. Sliding contact with the flat surface, while the rotational speed ||\dot{\theta}||\geq c where c is a constant which can be determined.
2. Rolling contact without slipping, when ||\dot{\theta}|| < c

Here’s my reasoning so far:

1. Assuming small rotational speeds, which is reasonable, drag is proportional to the first power of rotational speed:

\begin{aligned} F_D = C_d \dot{\theta} \frac{4R}{3 \pi} \end{aligned}

where C_d is the drag coefficient and \frac{4R}{3 \pi} is the distance of the centroid of each semicircular half of the coin from the coin’s centre. Now the work done by F_D is proportional to the distance travelled by the centroid on both halves of the coin, so the total energy dissipated at time t is given by:

\begin{aligned} \Delta E(\theta, \dot{\theta},t) = 2 \int_{0}^{t} F_D \theta \frac{4R}{3 \pi} dt = 2 C_d \big(\frac{4R}{3 \pi}\big)^2 \int_{0}^{t} \dot{\theta} \theta \, dt= C_d \big(\frac{4R}{3 \pi}\big)^2 {\theta (t)}^2 \end{aligned} 

So the energy dissipated is just an explicit function of the angle \theta and the Hamiltonian is given by:

\begin{aligned} H(\theta, \dot{\theta}) = \frac{1}{2} I \dot{\theta}^2+Mg\frac{h}{2}-\Delta E(\theta) \end{aligned} 

The equations of motion are then given by:

\begin{aligned} \dot{Q}=\frac{\partial H}{\partial \dot{\theta}} = I \dot{\theta} \end{aligned} 

\begin{aligned} \dot{P}=-\frac{\partial H}{\partial \theta} = 2 C_d(\frac{4R}{3 \pi})^2 \theta(t) \end{aligned} 

2. It’s not clear to me how I should interpret the equations of motion but my hunch is that the first phase has ended when the kinetic energy vanishes. This happens when the dissipated energy equals the kinetic energy:

\begin{aligned} \frac{1}{2} I \dot{\theta}^2=C_d \big(\frac{4R}{3 \pi}\big)^2 {\theta (t)}^2 \end{aligned} 

If we let C_1 = \frac{I}{2} and C_2 = C_d \big(\frac{4R}{3 \pi}\big)^2 , the solutions I found are of the form:

\begin{aligned} \frac{\dot{\theta}}{\theta} = \sqrt{\frac{C_2}{C_1}} \end{aligned}

But, this is problematic as the solution is meant to be a unique \theta . However, I assume that this difficulty can be resolved and I think it might be due to a problem that occurred earlier.

Now, it remains to explain why the coin enters a second phase and begins to roll without slipping instead of simply stopping completely. One reason has to do with experimental conditions. In practice the surface on which the coin spins is never completely flat, and the initial rotational impulse is never completely planar. A second reason is the minimum total potential energy principle which states that a system tries to minimise its total potential energy in order to maximise stability.

3. I could try to explain what happens in the second phase where the coin is rolling without slipping but there are already signs that my mathematical arguments for the first phase might have an error.

However, assuming that there are no dissipative forces the total energy is given by:

\begin{aligned} E = MgR\sin(\alpha) + \frac{1}{2}I \Omega^2 \sin^2(\alpha) \end{aligned}

where \alpha is the angle of inclination with respect to the vertical and \Omega is the precession rate. From this point I can go a bit further but I think this is a good place to stop for now.
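As a first stab at question 3: suppose, in line with the linear drag law above, that the effective drag torque is linear in \dot{\theta} with a constant k that depends on geometry but not on mass (this is my simplifying assumption, not the phase-one equations above). Then the toy model I\ddot{\theta} = -k\dot{\theta} decays exponentially, and of two coins with the same dimensions the heavier one (larger I ) retains its spin longer, suggesting the answer to question 3 is no:

```python
import math

def spin(t, omega0, I, k):
    """Solution of I * d(omega)/dt = -k * omega: omega(t) = omega0 * exp(-k t / I)."""
    return omega0 * math.exp(-k * t / I)

# same radius (hence same drag constant k), one coin twice as massive: I2 = 2 * I1
I1, I2, k, omega0 = 1.0, 2.0, 0.5, 10.0
print(spin(3.0, omega0, I2, k) > spin(3.0, omega0, I1, k))  # True: heavier coin spins longer
```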

Remark : I have also asked the first two questions on the physics stackexchange and I look forward to having a precise answer to all three questions by the end of this week.

References :

1. http://puhep1.princeton.edu/~kirkmcd/examples/rollingdisk.pdf
2. http://www.eulersdisk.com/PRE56610.pdf

Derivation of isoperimetric inequality from ideal gas equation

As I was going through my notes on statistical physics, it occurred to me that, from a classical perspective, I could derive the isoperimetric inequality from the ideal gas equation and its associated Maxwell distribution. If you’re unfamiliar with the isoperimetric inequality, it states that among all Jordan curves C with a fixed perimeter, the circle maximises the enclosed area. So my assertion is basically equivalent to saying that an ideal gas maximises the enclosed volume.
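As a quick sanity check on the 3-D analogue of this statement, a sphere and a cube with the same surface area can be compared directly (my own toy example):

```python
import math

def sphere_volume(area):
    """Volume of a sphere with the given surface area."""
    r = math.sqrt(area / (4 * math.pi))
    return (4 / 3) * math.pi * r**3

def cube_volume(area):
    """Volume of a cube with the given surface area."""
    s = math.sqrt(area / 6)
    return s**3

A = 1.0
print(sphere_volume(A) > cube_volume(A))  # True: for equal area the sphere holds more
```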

More formally, for any compact Euclidean manifold M in \mathbb{R}^3 with elastic boundary \partial M containing a macroscopic number of gas particles, a gradual increase in temperature ultimately results in a sphere. After showing that the volume must be bounded as a consequence of the Loomis-Whitney inequality, my method of proof is to take for granted that \partial M can sustain a maximum pressure P^* and to show that, if we gradually increase the temperature T of the particles inside M ,

attaining P^* everywhere on \partial M implies that M is spherical

So far I have a semi-rigorous mathematical argument which can be summarised as follows. Assuming that P^* has been attained uniformly on \partial M , the magnitude of the force on \partial M must be approximately constant all over \partial M . In consequence, by the Maxwell-Boltzmann distribution and momentum conservation, the distance between opposite extremities of the balloon must be approximately constant.

A more complete account of what I’ve attempted so far can be found in my essays section. Meanwhile, despite numerous Google searches, it appears that this result hasn’t been rigorously proven by mathematical physicists. In fact, I asked the question on MathOverflow and, from the responses, it’s clear that the result isn’t obvious. It might require very careful averaging arguments.

In fact, it might interest the reader to know that the isoperimetric inequality itself eluded many of the greatest mathematicians in history including Euler until the efforts of Jakob Steiner in the 19th century finally led to its proof.

microcanonical distributions

I’m currently trying to go through the material for my statistical physics course. For this I decided to start with an introductory text, ‘Statistical Physics’ by Daijiro Yoshioka. It’s very good for developing an intuition about the subject, but many important results are stated without proof, so I have to try and fill in the gaps myself. Below is my attempt to prove one of these results.

The scenario where every possible microscopic state is realised with equal probability is called the microcanonical distribution. Now, the author states without proof that in this scenario almost all microscopic states will be realised as the state of the system changes temporally. We are given the following facts:

1. We consider a system of solid, liquid, or gas enclosed by an adiabatic wall.
2. We assume fixed volume (V ), fixed number of molecules (N ), and fixed total energy (E ) up to an uncertainty \delta E .
3. The total number of microscopic states allowed under the macroscopic constraints is given by

\begin{aligned} W=W(E,\delta E,V,N) \end{aligned}  

Assuming that each micro-state is realised with equal probability, we have for any micro-state m_i :

\begin{aligned} P(m_i | t_n) = \frac{1}{W} \implies \sum_{n=1}^{\infty}P(m_i | t_n)=\infty \end{aligned}  

where we assume that the total number of micro-states is finite. We can then show, using the second Borel-Cantelli lemma (which requires independence), that for any m_i , the event E_n=\{m_i | t_n\} occurs infinitely often.

There’s something that bothers me about my proof. Somehow it assumes that the probability of transition between any two states, no matter how different, is always the same. For this reason, I now think that this proof might require more than the Borel-Cantelli theorem and that the ‘equal probability’ assumption might not hold for state transitions. It seems reasonable that in the immediate future some transitions would be more likely than others. Maybe I can use Borel-Cantelli for ‘large-enough’ time intervals.

Assuming that the idea of ‘large-enough’ time intervals can work, if I define a function f which maps time intervals to microscopic states I can construct a sequence of weakly correlated events that’s Lebesgue integrable:

\begin{aligned}  E_t = \{m_i \in f(I_t) \} \end{aligned}  

where some of the intervals I_t might be degenerate.

I wasn’t taught this variant in my measure theory class but after some reflection I managed to fill in the details. This integrable variant will be the subject of my next blog post.

Two interesting questions to consider after this one, assuming I resolve it soon, are whether the expected time to cover 50% or more of the state-space is finite and, if so, whether the rate at which the state-space is ‘explored’ is constant. My intuition tells me that the rate would probably follow a sigmoid curve (i.e. the rate eventually decreases).
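On the first of those questions: if successive micro-states really were independent uniform draws over the W states (the strong assumption questioned above), the expected time to visit every state, and a fortiori half of them, is finite: it is the coupon-collector quantity W \cdot H_W \approx W \ln W . A quick simulation sketch:

```python
import math
import random

def cover_time(W, rng):
    """Steps until i.i.d.-uniform draws over W states have visited every state."""
    seen, steps = set(), 0
    while len(seen) < W:
        seen.add(rng.randrange(W))
        steps += 1
    return steps

W = 50
rng = random.Random(0)
avg = sum(cover_time(W, rng) for _ in range(200)) / 200
expected = W * sum(1.0 / k for k in range(1, W + 1))  # coupon collector: W * H_W
print(avg, expected)  # the two agree to within sampling error
```

Correlated (non-independent) transitions would only slow this down, which is consistent with the sigmoid intuition above.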