Learning Lagrangians from Data

A few years ago, I watched in awe as Michael Schmidt gave a presentation on building a robot scientist based on his much-cited paper, “Distilling Free-Form Natural Laws from Experimental Data”. However, what wasn’t clear to me was how his algorithm discovered Lagrangians “without prior knowledge of physics”.

Fast-forward several years, a maths degree and several mathematical physics courses later, I decided to read the paper a second time. On this second reading, it became clear to me that M. Schmidt and H. Lipson must have used a fitness function that directly incorporated knowledge of Lagrangians in order to find the variational ODEs of interest. In fact, in a carefully argued paper, researchers at MSRI provide a reasonable candidate fitness function for deducing Lagrangians from data, as follows [2]:

  1. \mathcal{L} of a system with generalised coordinates x_i, \dot{x_i} (i=1,...,d) solves the Euler-Lagrange equations:

    \frac{d}{dt} (\frac{ \partial \mathcal{L}}{ \partial \dot{x_i}}) -\frac{\partial \mathcal{L}}{\partial x_i}=0

  2. By the chain rule, \forall t \in [0,T], we may expand this equation into:

    EL_i(\mathcal{L},t) := \sum\limits_{j=1}^{d} \frac{\partial^2 \mathcal{L}}{\partial x_j \partial \dot{x_i}}\frac{d x_j}{dt} + \sum\limits_{j=1}^{d} \frac{\partial^2 \mathcal{L}}{\partial \dot{x_j} \partial \dot{x_i}}\frac{d \dot{x_j}}{dt} - \frac{\partial \mathcal{L}}{\partial x_i}=0

  3. From this, Hillar and Sommer derive the following plausible fitness function for a candidate Lagrangian f, evaluated at N sampled times t_k:

    LFit(f) := -\frac{1}{N} \sum\limits_{k=1}^{N} \log\Big(1+\sum\limits_{j=1}^{d}|EL_j(f,t_k)|\Big)

Meanwhile, we still have some questions to answer:

  1. What’s a reasonable form for candidate Lagrangians?
  2. Does our fitness function place sufficient constraints on the Lagrangian?

I have no doubt that Hillar & Sommer had reasonable answers to these questions, but it's D. Hillis et al. who provide a thorough response in their easy-to-read paper [1]. To the first question, Taylor's theorem provides a reasonable answer:

  1. By Taylor's theorem, any sufficiently smooth Lagrangian admits an approximate polynomial representation, so we may assume that the Lagrangian is a polynomial in the coordinates and velocities.
  2. In order to place bounds on memory and computation, Hillis et al. use a reduced polynomial representation: there is a maximum power (m) for any individual coordinate or velocity, and a maximum total degree (p) for any combination of coordinates and velocities.
  3. As an example, for a single variable \theta with m=2, p=3 we have:

    \mathcal{L}(\theta,\dot{\theta}) = c_1 \dot{\theta}^2+c_2 \theta+c_3 \theta^2+c_4 \theta \dot{\theta}^2 

Using Nelder-Mead to minimise a carefully chosen fitness function of the coefficients c_i, Hillis et al. manage to find suitable Lagrangians for a variety of simple physical systems, including the Duffing oscillator, the Penning trap and coupled harmonic oscillators. In fact, they go beyond polynomial representations: in their Generalisation section they explain how they use genetic programming to evolve more complex functions from simple primitives (sine, cosine, etc.), with a search algorithm that optimises the fitness score while also minimising the size of the 'phylogenetic trees'.

From the results provided by M. Schmidt & H. Lipson, it appears that they used a very similar approach. However, unlike Hillis et al., their paper is very badly written and they don't share the code they used. At best it might be considered a piece of marketing which would later be used to boost M. Schmidt's data-mining company, Nutonian.

Deception aside, I believe that it’s possible to produce innovative research in this area if researchers continue in the tradition of Hillis and Hillar. In particular, I think it might be possible to develop algorithms that find variational PDEs that describe a system’s behaviour and then use these variational PDEs to automatically classify physical systems. This, I believe, would be an important contribution to the physics community.

Note 1: It’s quite easy to search for Lagrangians with a restricted polynomial representation using Python with Scipy and autograd.

Note 2: Hillis et al. made their code available on Github. Although I don’t use Clojure, I learned a lot from going through the structure of their codebase.

References:

1. D. Hillis et al., "An algorithm for discovering Lagrangians automatically from data", PeerJ, 20 October 2015.
2. C. Hillar et al., "Comment on the article 'Distilling Free-Form Natural Laws from Experimental Data'", arXiv, 26 October 2012.
3. M. Schmidt and H. Lipson, "Distilling Free-Form Natural Laws from Experimental Data", Science, 3 April 2009.

The coordination problem

During the summer of 2016 I had the opportunity to intern at Hanson Robotics in Hong Kong. It was a stimulating experience which made me think carefully about the challenges posed by human-level AI. In particular, it led me to draw interesting parallels with the present day challenge of coordinating the actions of 7 billion human-level AIs. No pun intended. 

At Hanson Robotics the core belief is that by giving robots a human form they will experience the human condition as we do, and as a result our relationship with robots will be one of empathy and understanding. The androids would have similar goals to humans, and this would reduce the likelihood of conflict. In some sense the future wouldn't look too different from the present, and so to understand this version of the future I had to reflect upon modern-day reality.

What does goal alignment look like in modern human societies? Have we solved the problem of coordinating the actions of 7 billion human-level AIs? These questions are especially pertinent at a time when DIY weapons of mass destruction are on the horizon. By this I mean bioweapons that can potentially allow a few people to eliminate an entire ethnic group, effectively committing genocide.

It’s interesting to think about our attempted solutions to these problems in the context of the human-level AIs we’re familiar with given that the task of coordinating the actions of beyond-human AIs will probably be much harder. In general if your solution can’t solve a simple variant, it’s unlikely to solve the more general problem.

Here are two attempts to answer the questions I raised:

  1. The Point:

    This is a platform launched by Andrew Mason in 2006, with the help of Eric Lefkofsky, in order to 'kickstart' social good. It experienced modest success before morphing into Groupon and then inspiring the founders of Kickstarter. I think this is an important partial solution to the coordination problem as it allows people to update their beliefs in real time about the beliefs of other people on an issue they consider important. In some sense, it allows democratic institutions to be set up on the fly within small communities, and I believe this deals with the 'tyranny of the majority' problem that often arises in a democracy.

  2. Democracy Earth:

    The notion of a decentralised, peer-to-peer democracy sounds very attractive and I think it’s a solution that works with platforms like ‘The Point’ in the sense that it serves small communities very well. However, I don’t see how a peer-to-peer democracy would be able to address a national security risk, environmental disaster, or the issue of maintaining cultural institutions. I imagine that the founders of this group are reasonable and don’t expect full decentralisation.

Now, the issue I have with both 'The Point' and 'Democracy Earth' is that while they both attempt to tackle the problem of coordinating small communities, it's not clear to me how they would scale to massive communities. In fact, I don't believe they can, and this point is very important because without large-scale (i.e. global) coordination, conflict is inevitable.

Consider the European Union, for example. Many brilliant people, including Dr. Stiglitz and Dr. Varoufakis, have laid out careful plans for solving the European coordination problem. However, there's little political will among politicians to execute a trans-national strategy that would fundamentally change the way the EU functions. This might have devastating consequences, but I'm not here to encourage pessimism. On the contrary, I believe that a large-scale technological solution to goal alignment (i.e. coordination) might be possible.

The exact form of this solution isn’t clear to me right now although I believe that higher social network connectivity might be part of the solution. For this reason, I’d like to invite my readers to share their own solutions to this problem as I believe this would make it more likely that we discover a practical solution.

Are wheels optimal for locomotion?

Given the large number of wheeled robots coming out of modern factories, it's natural to ask whether wheels are near-optimal for terrestrial locomotion. I believe this question may be partially addressed from a physicist's perspective, although the general problem of finding near-optimal robot models appears to be intractable.

Physics:

A reasonable scientific approach to this problem would involve a laboratory setting of some sort. The scientist may set up an inclined ramp and analyse the rolling motion of different compact shapes with the same mass and volume. In particular, it's natural to ask: of all the possible shapes made of the same material, with the same mass and volume, which reaches the bottom of the ramp first when released from the top?

If we may neglect dissipative forces, we may determine the time for a disk to reach the bottom of the ramp as follows:

  1. Taking torques about the contact point and using the parallel axis theorem (for a uniform disk, I = \frac{1}{2}mr^2 + mr^2 = \frac{3}{2}mr^2), we have:

    \begin{aligned} \dot{\omega} = \frac{\tau}{I}=\frac{mgr\sin\theta}{\frac{3}{2} m r^2} = \frac{2g\sin\theta}{3r} \end{aligned}

    where \theta is the ramp's angle of inclination, m is the mass of the disk and r is the radius of the disk.

  2. As a result, the linear acceleration of the disk is constant:

    \begin{aligned} \ddot{x}= r\dot{\omega} = \frac{2}{3}g\sin\theta \end{aligned}

  3. From this we deduce that, given a ramp of length l, the total time to reach the bottom of the ramp (from l = \frac{1}{2}\ddot{x}T^2) is:

    \begin{aligned} T = \sqrt{\frac{2l}{\ddot{x}}} = \sqrt{\frac{3l}{g\sin\theta}} \end{aligned}

    (A numerical comparison with other idealised rolling shapes is sketched just below.)
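For concreteness, here is a small sketch (my own, with arbitrary ramp parameters) comparing descent times for a few idealised rolling bodies, using the standard result that a body with moment of inertia I = kmr^2 rolls down with constant acceleration a = g\sin\theta/(1+k); the solid-disk row reproduces the formula derived above.

```python
# Compare ramp descent times for rolling bodies with I = k*m*r^2.
# Ramp length and angle are arbitrary illustrative values.
import math

g, l, theta = 9.81, 2.0, math.radians(20)

shapes = {
    "solid sphere":    2.0 / 5.0,   # k for a uniform sphere
    "solid disk":      1.0 / 2.0,   # k for a uniform disk/cylinder
    "hollow cylinder": 1.0,         # k for a thin hoop
}

for name, k in shapes.items():
    a = g * math.sin(theta) / (1.0 + k)   # constant linear acceleration
    T = math.sqrt(2.0 * l / a)            # from l = 0.5 * a * T^2
    print(f"{name:15s}  a = {a:.2f} m/s^2   T = {T:.2f} s")
```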

This analysis, although useful, is difficult to replicate for other shapes as the moment of inertia will actually depend on the axis chosen. Further, beyond the special case of certain symmetric bodies the total time taken to reach the bottom of the ramp can no longer be determined in an analytic manner. Barring a mathematical method I’m unaware of, we would need to simulate the motion of each body rolling/bouncing down the incline with high numerical precision and then perform an infinite number of comparisons.

Engineering:

At this point it may seem that our task is impossible, but perhaps we can make progress by asking a different question: of all the possible shapes that move down the incline, which have the most predictable, and hence controllable, motion? The answer is of course bodies with a circular cross-section (e.g. the wheel), and it's for this reason that engineers build cars with wheels to roll on roads. A different shape that reached the bottom of the incline sooner than the wheel would be useless for locomotion if we couldn't determine where it would land. For these reasons, our mathematical conundrum may be safely ignored.

In contrast, as natural terrain tends to be highly irregular, animal limbs must be highly versatile. I have yet to come across wheels that simultaneously allow grasping, climbing, swimming and jumping. Furthermore, even if there were a large region of land suitable for wheeled locomotion, wheeled animals would inevitably be confined to that region, which would be highly detrimental to their ability to explore new terrain.

Given these arguments, it’s clear that wheeled robots aren’t well suited for terrestrial locomotion. Having said this, can we build robots that are near-optimal for terrestrial locomotion? What would we mean by optimal?

Conclusion:

I have thought more broadly about the possibility of using computers to design near-optimal robots for terrestrial locomotion, but the problem appears to be quite hard. Given that the cost function isn't well defined, the search space will be very large. It would make sense to place a prior on our cost function and try to use a method similar to Bayesian optimisation. However, the task calls for a method much more sophisticated than the Bayesian optimisation methods used in industry today.

I’ll probably revisit this challenge in the future but I think it would make sense to first solve an intermediate challenge in Bayesian Optimisation applied to robotics.

The case for compulsory public service

It occurred to me recently that as citizens we seldom discuss the requirements of a healthy social fabric, which is essential for a democracy. We somehow expect miracles from our governments, while the most we expect from ourselves as citizens is to vote every four years and pay our taxes. The working assumption is that improving a democracy is either impossible or the job of politicians. Everybody else can just sit back and watch Netflix. I'm not sure this can continue.

Perhaps it’s unfair to use the term ‘everybody’. After all, a significant fraction of the US and UK population volunteers in some form at least once a month. However, this fraction is still smaller than 30% in both countries. What if less than 30% of the population paid taxes?

At this point, you might wonder what difference volunteering can make. I’ll get to that, but first allow me to share an observation I’ve made while visiting many small towns around Britain.

In each town I would talk to people of all ages including elderly British people who would invariably complain about how the internet has changed their community. A particularly pernicious consequence of the internet culture is that it has allowed society to become transactional to the detriment of important social bonds. Where in the pre-internet era there might have been lively conversations between shopkeepers and customers, today people might get whatever they need on Amazon or have their supermarket goods home-delivered.

This results in social fragmentation which makes the notion of ‘shared values’ farcical. As a result, the populations of Europe and the USA are not only socially stratified in the present, but their future expectations of society are that socioeconomic divisions will deepen. Now, I believe that community service can offer an important partial solution to this problem. In particular, I believe that weekly community service should be a precondition for citizenship.

Personally, I’ve been helping kids learn programming for free at Prewired on a weekly basis but that’s just one way of building a sense of community that isn’t socially stratified in its outlook. I think a lot more needs to be done and the only way for a sense of community to be restored within cities is if every citizen acknowledged the importance of volunteering once a week within their community.

I understand that a lot of people will try to resist this idea. Many will complain that they are too busy. People will try to find all sorts of excuses and argue that there are bigger issues. It’s my fear that people will look for excuses until Western democracies become polarised beyond repair.

Note: I have a lot of respect for the DiEM25 movement led by Yanis Varoufakis and it’s my belief that community service is a necessary complement for this ambitious agenda.

The case against Universal Basic Income

Recently, Universal Basic Income has received a lot of publicity and the general belief is that it’s the solution to technological unemployment. While I agree that technological unemployment is an issue that must be dealt with I believe that a Negative Income Tax is a much more intelligent partial solution. I shall start by presenting the fundamental problems with Universal Basic Income before introducing the advantages of a Negative Income Tax.

Three fundamental problems with Universal Basic Income:

1. UBI pilot studies are fundamentally flawed:

  • Let’s define the variable Y, the actual change in behaviour under a permanent UBI.
  • Pilot studies measure a proxy variable X in a setting where all recipients know that the handouts will only last a fixed amount of time.
  • The deviation between X and Y may be arbitrarily large: handouts received during finite-time pilot studies are treated as capital investments, whereas under a permanent UBI these handouts are treated as a common good whose marginal utility diminishes with time.

For the above reasons, there’s no reason to believe that a person receiving regular cash benefits will behave in a particularly enterprising manner.

2. UBI is based on the assumption that a flat rate is a good idea. However, for any fixed amount, the population total will be prohibitively expensive. Consider the UK population (roughly 6 \times 10^7 people) and a basic rate of 1,000 pounds per month issued 10 months a year:

\begin{aligned} 6\times 10^7 \times 10^3 \times 10 = 6\times 10^{11} \end{aligned}

That’s 0.6 trillion pounds per annum, more than 15 times the UK military budget!

3. The impact of automation won’t be uniformly distributed. For this reason, the ‘universal’ assumption doesn’t make sense.

I’ve spent my last two summers in robotics labs in the UK and Hong Kong, and from these experiences I’ve learned a few things. First, the task of defining a reward function for a robot on a continuous action space is highly nontrivial, so we won’t have fully autonomous robots anytime soon. Second, concept learning is equally hard, so jobs that involve coming up with new concepts (e.g. R&D, product development) won’t disappear anytime soon. However, anything that is procedural or mainly requires probabilistic inference might disappear in the next ten years. This means that work won’t disappear, although certain jobs (e.g. law, medicine, accounting) will become highly automated.

So that’s what must be said about UBI. It’s a terrible system that isn’t cost-effective and it disincentivizes work while making the false assumption that automation will have a uniform impact on the job market.

Negative Income Tax:

The basic idea is to use the mechanism by which we now collect tax revenues from people with incomes above some minimum level to provide financial assistance to people with incomes below that level.

-Milton Friedman

  1. The NIT is a reasonable alternative to UBI which doesn’t create disincentives to work as citizens aren’t guaranteed a basic salary.
  2. In its most basic implementation, the NIT is very simple. Define:

    \begin{cases} m = \text{minimum salary} \\ S = \text{actual salary} \\ \alpha = \text{tax rate} \\ \Delta S = S-m \end{cases}

    If \Delta S \geq 0, the citizen incurs a normal tax, and when \Delta S < 0 the citizen gets paid -\alpha \Delta S.
  3. The NIT is much less expensive to implement. Let’s suppose half the UK population is living below a baseline salary of 10,000 pounds and \alpha = 0.5. Then, in the worst-case scenario (3\times 10^7 people each receiving \alpha m = 5{,}000 pounds), the NIT programme would cost:

    \begin{aligned} \frac{1}{4}\times 6\times 10^7\times 10^4=\frac{3}{2}\times 10^{11} \end{aligned}

    That’s 4 times less than the cost of implementing UBI! Moreover, this is a highly improbable scenario, and we can also define much more sophisticated NIT systems with an adaptive tax rate. (A minimal code sketch of the basic rule follows this list.)
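Here is that sketch: a hedged, illustrative implementation of the rule above. The flat tax above the minimum and the specific numbers are my own simplifications, not a policy proposal.

```python
# Illustrative sketch of the basic NIT rule described above.
# Assumption: income above the minimum is taxed at the same flat rate alpha,
# purely to keep the example simple.

def nit_transfer(salary, minimum=10_000, alpha=0.5):
    """Signed transfer under a basic NIT: positive means the state pays the
    citizen (-alpha * delta_s when delta_s < 0); negative means the citizen
    owes a tax of alpha * delta_s."""
    delta_s = salary - minimum
    return -alpha * delta_s

# worst-case cost from the text: half the UK population earning nothing
population = 6e7
cost = 0.5 * population * nit_transfer(0)          # 3e7 * 5e3 = 1.5e11
print(f"worst-case NIT cost: {cost:.1e} pounds")   # ~1.5e+11, i.e. 150 billion
```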

Discussion:

From an economic perspective, the Negative Income Tax has the advantage that it’s economically feasible and it doesn’t create disincentives to work, which is very important as a post-work future is at least several centuries away. However, an economic solution is only a partial solution to the problem of technological unemployment. There will also need to be robust programmes for re-educating and retraining a significant fraction of the population. Concrete examples include Xavier Niel’s free software engineering school and South Africa’s free WeThinkCode institution. This is something that governments should work on in partnership with companies that understand what the future of work will look like.

Conclusion:

In summary, we should take UBI pilot studies with a grain of salt, the implicit ‘universal’ assumption is dubious, and any UBI programme would be prohibitively expensive in practice. In comparison, the negative income tax, which has none of these defects, is a much more reasonable economic solution, although it shouldn’t be considered a complete answer to the problem of technological unemployment.

Democracy 2.0

September 11 was a global disaster by any measure. While it cost Al Qaeda half a million dollars to plan and execute this mission, the combined response of the US military and Homeland security exceeded 3 trillion dollars. That 6 million to 1 cost-benefit ratio captures the asymmetric nature of the Western ‘war on terror’ very well.

However, the profound psychological impact of terrorism is much more clearly reflected in the growing populist movements in Western Europe and the USA, which have chosen Muslim immigrants as the perfect scapegoat. As much as I would like to understand the real problems faced by those backing Le Pen and Trump, it’s clear that the populist ‘solution’ will only increase the probability of homegrown terrorism. Hereon, my intention is to analyse these problems and present a solution.

The underlying problems:

First, I must say that war is probably the dumbest possible solution to a very complex problem. Second, the current efforts to monitor homegrown terrorism won’t solve the underlying problem which is that the social fabric in Western countries is broken both socially and economically. While a democracy needs a strong middle class and citizens that have shared values, in Western countries both of these pillars are in peril today.

A foolish resistance:

In the USA, the symptoms of a broken social fabric are made nowhere clearer than by a ‘Resist’ movement, led by well-off intellectuals and businessmen, which defines itself by vilifying Trump supporters. No doubt globalisation and technological unemployment are playing an important role in this socioeconomic divide, and it’s only by addressing these underlying problems that we shall find a solution. Amplifying deep socioeconomic divisions via virulent rhetoric, as practised by the ‘Resistance’, will only lead to greater social unrest.

DIY weapons of mass destruction:

Meanwhile, the risks posed by important social unrest are growing as the development of DIY bioweapons becomes easier with each passing year. In 2016 alone, France had more than 10 terrorist attacks. Most of these involved knives but what if biological weapons were used instead? Furthermore, if we consider the number of mass shootings in the USA every year, it’s clear that the threat isn’t strictly Islamist in nature.

As technology becomes more powerful it also leaves a smaller margin for human error. While a knife might allow a person to harm a few others, a bioweapon can potentially allow a few people to eliminate an entire ethnic group, effectively committing genocide. Moreover, due to the dual-use nature of these technologies we can’t ban their development and for this reason we must work seriously towards greater social cohesion. We don’t have another option.

Democracy 2.0:

After thinking carefully about these problems, I propose the following:

  1. Negative income tax: this would address the problem of economic inequality posed by technological unemployment.
  2. Compulsory public service: this would address the problem of a decaying social fabric.

The first solution might seem self-explanatory but the second solution occurred to me empirically, while visiting small towns around Britain. In each town I would talk to shopkeepers who would invariably complain about how the internet had a negative effect on their business. However, a more pernicious consequence of the internet culture is that it has allowed society to become transactional to the detriment of important social bonds. Where in the pre-internet era there might have been lively conversations between shopkeepers and customers, today people might get whatever they need on Amazon or have their supermarket goods home-delivered.

While we expect miracles from our governments the most we expect from ourselves as citizens is to vote every four years and pay our taxes. Clearly, this can’t continue. If everybody volunteered an hour or two every week for community service, we could rebuild an important sense of community, and therein lies the solution. Personally, I’ve been helping kids learn programming at Prewired on a weekly basis but that’s just one way of building a sense of community that isn’t socially stratified in its outlook.

Conclusion:

Finally, I’d like to emphasise that it’s definitely within our ability as humans to solve these problems and I believe that these challenges will ultimately drive us to build stronger communities and become better people.

Note 1: If you consider my message important, please share it with your friends and colleagues, either by sharing my post or your version of it.

Note 2: I think that in order to address these issues we’ll have to work collectively, as humans, to address these considerable challenges in an intelligent manner.

Learning integer sequences

Last Friday night I had the chance to watch the 2011 presentation by Demis Hassabis on ‘Systems neuroscience and AGI’, which I found fascinating. One of the intermediate challenges he brought up, around the 33:40 mark, was the problem of predicting the next term in an integer sequence. Given my maths background, I thought I’d take a closer look at this problem.

Fanciful arguments:

On the surface this problem appears attractive for several reasons:

  1. It appears that no prior knowledge is required, as all the data is contained in the n visible terms of the sequence.
  2. Small data vs big data: little data is required in order to make reasonable predictions.
  3. It connects machine learning with data compression. In particular, it appears to favour the simplest explanation (i.e. Ockham’s razor).

I shall argue that all of the above preconceptions are mostly false. Before presenting my arguments, I’d like to point out that there is a Kaggle competition on integer sequence prediction based on the OEIS dataset. However, I’d take the leaderboard with a grain of salt, as it’s very easy to collect the sequences from the OEIS database and simply run a pattern-matching algorithm against the set of known sequences.

Examples of sequences:

Assuming that possible integer sequences are of the form (a_n)_{n=1}^\infty with a_n \in \{0,1,2,...,9\}, here are some examples:

  1. Fibonacci numbers: 0,1,1,2,3,5…
  2. Collatz sequence: 0,1,7,2,5,8,16,3,19,6… (the number of steps for n to reach 1 in the Collatz process)
  3. Catalan numbers: 1,1,2,5,14… where a_n = \frac{1}{n+1}{2n \choose n}
  4. Khinchin’s constant: 2,6,8,5,4,5… (the base-10 representation of Khinchin’s constant)

These four sequences come from the areas of basic arithmetic, number theory, combinatorics and real analysis but it’s possible to construct an integer sequence from a branch of mathematics that hasn’t been invented yet so the space of possible integer sequences is actually quite large. In fact, the space of computable integer sequences is countably infinite.

Bayesian inference:

I’d like to make the case that this is actually a problem in Bayesian inference which can potentially involve a lot of data. Let me explain.

When trying to predict the next term in an integer sequence, a_{n+1}, what we’re actually trying to discover is the underlying number generator, and in order to do this we need to discover structure in the data. However, in order to discover any structure we must make the strong assumption that the data wasn’t generated at random. This hypothesis can’t be justified by any single sequence.

Moreover, assuming that there is structure, knowing the author of the sequence is actually very important. If the sequence was beamed to my mobile phone by friendly aliens from a distant galaxy, it would be difficult for me to guess a_{n+1} with better than chance probability, whereas if the sequence was provided by the local number theorist my odds would be slightly better. The reason is that my prior knowledge of the number theorist’s mathematical training is very useful information.

A reasonable approach would be to define a program that returns a probability distribution over at most ten sequence generators, each generating a distinct a_{n+1}, conditioned on prior knowledge. The hard part is that the body of the program would necessarily involve an algorithm capable of learning mathematical concepts.
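To make the shape of the idea concrete, here is a toy sketch (my own, with arbitrarily chosen hypotheses): a prior over a handful of hand-written generators, a likelihood that is 1 if a generator reproduces the observed prefix and 0 otherwise, and a posterior over the next term. The hard part, learning the generators themselves, is precisely what this sketch leaves out.

```python
# Toy Bayesian prediction of the next term in an integer sequence.
# The three generators below are illustrative hypotheses, nothing more.

def fibonacci(n):          # 0, 1, 1, 2, 3, 5, ...
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def squares(n):            # 0, 1, 4, 9, ...
    return n * n

def naturals(n):           # 0, 1, 2, 3, ...
    return n

PRIOR = {fibonacci: 1/3, squares: 1/3, naturals: 1/3}

def posterior_next_term(prefix):
    """Posterior over the next term given the observed prefix."""
    # likelihood: 1 if the generator reproduces the prefix exactly, else 0
    consistent = {g: p for g, p in PRIOR.items()
                  if all(g(i) == x for i, x in enumerate(prefix))}
    z = sum(consistent.values())
    if z == 0:
        return {}          # none of our hypotheses explain the data
    posterior = {}
    for g, p in consistent.items():
        nxt = g(len(prefix))
        posterior[nxt] = posterior.get(nxt, 0.0) + p / z
    return posterior

print(posterior_next_term([0, 1, 1, 2, 3]))   # {5: 1.0} -- only Fibonacci survives
```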

Randomness and Data compression:

At this point someone might try to bring up a clever argument centred around Ockham’s razor, and I think such an argument definitely has some merit. As a general rule, it’s reasonable to assume that the sequence generator can be encoded in a language such that its description is as short as, or shorter than, the sequence itself. Further, if we assume that the author is an organism, they are probably highly organised in a spatial and temporal manner. In other words, it’s very unlikely that the sequence is random.

However, this doesn’t constrain the problem space. The necessary and sufficient general approach would be to use an AGI capable of abstract conceptual reasoning beyond that allowed by current machine learning algorithms. This is the only way it could possibly learn and use mathematical concepts. Good luck building that for a Kaggle competition.

Conclusion:

The main lesson drawn from this analysis is that in order to make measurable progress in the field of machine learning it’s very important to choose your problems wisely. In fact, I think it’s fair to say that choosing the right machine learning problems to work on is at least as important as the associated algorithmic challenges.

Note: It’s not surprising that nobody is prepared to offer money for such a competition. What I do find surprising is that some have tried to make ‘progress’ on this challenge.