Learning Lagrangians from Data

A few years ago, I watched in awe as Michael Schmidt gave a presentation on building a robot scientist based on his much-cited paper, “Distilling Free-Form Natural Laws from Experimental Data”. However, what wasn’t clear to me was how his algorithm discovered Lagrangians “without prior knowledge of physics”.

Fastforward several years in my maths degree and several mathematical physics courses; I decided to read this paper a second time. Upon a second reading, it became clear to me that M. Schmidt and H. Lipson had to use a fitness function that directly incorporated knowledge of Lagrangians in order to find the variational ODEs of interest. In fact, in a carefully argued paper, researchers at MSRI provide a reasonable candidate fitness function for deducing Lagrangians from data as follows[2]:

  1. \mathcal{L} of a system with generalised coordinates x_i, \dot{x_i} (i=1,...,d) solves the Euler-Lagrange equations:

    \frac{d}{dt} (\frac{ \partial \mathcal{L}}{ \partial \dot{x_i}}) -\frac{\partial \mathcal{L}}{\partial x_i}=0

  2. By the chain rule \forall t \in [0,T] , we may develop this equation into:

    EL_i(\mathcal{L},t) := \sum\limits_{j=1}^{d} \frac{ \partial^2 \mathcal{L}}{ \partial x_j \partial x_i}\frac{d x_j}{dt} +\sum\limits_{j=1}^{d} \frac{ \partial^2 \mathcal{L}}{ \partial \dot{x_j} \partial \dot{x_i}}\frac{d \dot{x_j}}{dt} \frac{\partial \mathcal{L}}{\partial x_i}=0

  3. From this, Hillar and Sommer derive the following plausible fitness function:

    LFit(f) := -\frac{1}{N} \sum\limits_{k=1}^{N} log(1+\sum\limits_{j=1}^{d}|EL_i(\mathcal{L},t)|)

Meanwhile, we still have some questions to answer:

  1. What’s a reasonable form for candidate Lagrangians?
  2. Does our fitness function place sufficient constraints on the Lagrangian?

I have no doubt that Hillar & Sommer probably had reasonable answers to these questions but it’s D. Hillis et al. who provide a thorough response in their easy-to-read paper[1]. To the first question, Taylor’s theorem provides a reasonable answer:

  1. There exists an approximate polynomial representation for any Lagrangian, and we may assume that the Lagrangian may be represented by a polynomial in the coordinates and velocities.
  2. In order to place bounds on memory and computation, Hillis et al. use a reduced polynomial representation. i.e. there is a maximum power(m) of any coordinate for any coordinate/velocity and there’s a maximum degree(p) for any combination of coordinates and velocities.
  3. As an example, for a single variable \theta with m=2,p=3 we have:

    \mathcal{L}(\theta,\dot{\theta}) = c_1 \dot{\theta}^2+c_2 \theta+c_3 \theta^2+c_4 \theta \dot{\theta}^2 

Using Nelder-Mead to minimise a carefully chosen fitness function in terms of the coefficients c_i , Hillis et al. manage to find suitable Lagrangians for a variety of simple physical systems including the Duffing oscillator, Penning trap and Coupled harmonic oscillator. In fact, they go beyond using polynomial representations and in their Generalisation section explain how they use genetic programming to evolve complex functions based on simple functions(sine,cosine…etc.) and use an appropriate search algorithm that optimises the fitness score and also minimises the size of the ‘phylogenetic trees’.

From the results provided by M. Schmidt & H. Lipson, it appears that they used a very similar approach. However, unlike Hillis, their paper is very badly written and they don’t share the code they used. At best it might be considered a piece of marketing which would later be used to boost M. Schmidt’s data mining company, Nutonian.

Deception aside, I believe that it’s possible to produce innovative research in this area if researchers continue in the tradition of Hillis and Hillar. In particular, I think it might be possible to develop algorithms that find variational PDEs that describe a system’s behaviour and then use these variational PDEs to automatically classify physical systems. This, I believe, would be an important contribution to the physics community.

Note 1: It’s quite easy to search for Lagrangians with a restricted polynomial representation using Python with Scipy and autograd.

Note 2: Hillis et al. made their code available on Github. Although I don’t use Clojure, I learned a lot from going through the structure of their codebase.

References:

1.D. Hillis et al. “An algorithm for discovering Lagrangians automatically from data” PeerJ 20. October 2015
2.  C. Hillar et al. “Comment on the article “Distilling Free-Form Natural Laws from Experimental Data”” Arxiv. 26 October 2012
3. M. Schmidt and H. Lipson.”Distilling Free-Form Natural Laws from Experimental Data” Science. 3 April 2009

Learning integer sequences

Last Friday night I had the chance to watch the 2011 presentation of Demis Hassabis on ‘Systems neuroscience and AGI’ which I found fascinating. One of the intermediate challenges he brought up around the 33.40 minute mark was the problem of predicting the next term in an integer sequence. Given my math background, I thought I’d take a closer look at this problem.

Fanciful arguments:

On the surface this problem appears attractive for several reasons:

  1. It appears that no prior knowledge is required as all the data is contained in the n visible terms in the sequence.
  2. Small data vs Big data: little data is required in order to make reasonable prediction.
  3. It connects machine learning with data compression. In particular, it appears to emphasize the simplest algorithm(i.e. Ockham’s razor).

I shall demonstrate that all of the above preconceptions are mainly false.  Before presenting my arguments, I’d like to point out that there is a Kaggle competition on integer sequence prediction based on the OEIS dataset. However, I’d take the leaderboard with a grain of salt as it’s very easy to collect the sequences from the OEIS database and simply run a pattern-matching algorithm on the set of known sequences.

Examples of sequences:

Assuming that possible integer sequences are of the form (a_n)_{n=1}^\infty,a_n \in \{0,1,2,...,9\}  , here are some examples:

  1. Fibonacci numbers: 0,1,1,2,3,5…
  2. Collatz sequence: 0,1,7,2,5,8,16,3,19,6…number of steps for n to reach 1 in the Collatz process.
  3. Catalan numbers: 1,1,2,5,14…where a_n = \frac{1}{n+1}{2n \choose n}
  4. Khinchin’s constant: 2,6,8,5,4,5…base-10 representation of Khinchin’s constant

These four sequences come from the areas of basic arithmetic, number theory, combinatorics and real analysis but it’s possible to construct an integer sequence from a branch of mathematics that hasn’t been invented yet so the space of possible integer sequences is actually quite large. In fact, the space of computable integer sequences is countably infinite.

Bayesian inference:

I’d like to make the case that this is actually a problem in Bayesian inference which can potentially involve a lot of data. Let me explain.

When trying to predict the next term in an integer sequence,a_{n+1} ,  what we’re actually trying to discover is the underlying number generator and in order to do this we need to discover structure in the data. However, in order to discover any structure in the data we must make the strong assumption that the data wasn’t generated in a random manner. This hypothesis can’t be justified by any single sequence.

Moreover, assuming that there is structure, knowing the author of the sequence is actually very important. If the sequence was beamed to my mobile phone by friendly aliens from a distant galaxy, it would be difficult for me to guessa_{n+1}  with a better than chance probability of being right whereas if the sequence was provided by the local number theorist my odds would be slightly better. The reason is that my prior knowledge of the number theorists’ mathematical training is very useful information.

A reasonable approach would be to define a program that returns a probability distribution over at most ten sequence generators which generate distinct a_{n+1}  conditioned on prior knowledge. The hard part is that the body of the program would necessarily involve an algorithm capable of learning mathematical concepts.

Randomness and Data compression:

At this point someone might try to bring up a clever argument centred around Ockham’s razor. Now, I think that such an argument definitely has some merit. As a general rule it’s reasonable to assume that the sequence generator can be encoded in a language such that its description in that language is as short or shorter than the length of the sequence. Further, if we assume that the author is an organism they are probably highly organised in a spatial and temporal manner. In other words, it’s very unlikely that the sequence is random.

However, this doesn’t constrain the problem space. The necessary and sufficient general approach would be to use an AGI capable of abstract conceptual reasoning beyond that allowed by current machine learning algorithms. This is the only way it could possibly learn and use mathematical concepts. Good luck building that for a Kaggle competition.

Conclusion:

The main lesson drawn from this analysis is that in order to make measurable progress in the field of machine learning it’s very important to choose your problems wisely. In fact, I think it’s fair to say that choosing the right machine learning problems to work on is at least as important as the associated algorithmic challenges.

Note: It’s not surprising that nobody is prepared to offer money for such a competition. What I do find surprising is that some have tried to make ‘progress’ on this challenge.