Learning integer sequences

Last Friday night I had the chance to watch the 2011 presentation of Demis Hassabis on ‘Systems neuroscience and AGI’ which I found fascinating. One of the intermediate challenges he brought up around the 33.40 minute mark was the problem of predicting the next term in an integer sequence. Given my math background, I thought I’d take a closer look at this problem.

Fanciful arguments:

On the surface this problem appears attractive for several reasons:

  1. It appears that no prior knowledge is required as all the data is contained in the n visible terms in the sequence.
  2. Small data vs Big data: little data is required in order to make reasonable prediction.
  3. It connects machine learning with data compression. In particular, it appears to emphasize the simplest algorithm(i.e. Ockham’s razor).

I shall demonstrate that all of the above preconceptions are mainly false.  Before presenting my arguments, I’d like to point out that there is a Kaggle competition on integer sequence prediction based on the OEIS dataset. However, I’d take the leaderboard with a grain of salt as it’s very easy to collect the sequences from the OEIS database and simply run a pattern-matching algorithm on the set of known sequences.

Examples of sequences:

Assuming that possible integer sequences are of the form (a_n)_{n=1}^\infty,a_n \in \{0,1,2,...,9\}  , here are some examples:

  1. Fibonacci numbers: 0,1,1,2,3,5…
  2. Collatz sequence: 0,1,7,2,5,8,16,3,19,6…number of steps for n to reach 1 in the Collatz process.
  3. Catalan numbers: 1,1,2,5,14…where a_n = \frac{1}{n+1}{2n \choose n}
  4. Khinchin’s constant: 2,6,8,5,4,5…base-10 representation of Khinchin’s constant

These four sequences come from the areas of basic arithmetic, number theory, combinatorics and real analysis but it’s possible to construct an integer sequence from a branch of mathematics that hasn’t been invented yet so the space of possible integer sequences is actually quite large. In fact, the space of computable integer sequences is countably infinite.

Bayesian inference:

I’d like to make the case that this is actually a problem in Bayesian inference which can potentially involve a lot of data. Let me explain.

When trying to predict the next term in an integer sequence,a_{n+1} ,  what we’re actually trying to discover is the underlying number generator and in order to do this we need to discover structure in the data. However, in order to discover any structure in the data we must make the strong assumption that the data wasn’t generated in a random manner. This hypothesis can’t be justified by any single sequence.

Moreover, assuming that there is structure, knowing the author of the sequence is actually very important. If the sequence was beamed to my mobile phone by friendly aliens from a distant galaxy, it would be difficult for me to guessa_{n+1}  with a better than chance probability of being right whereas if the sequence was provided by the local number theorist my odds would be slightly better. The reason is that my prior knowledge of the number theorists’ mathematical training is very useful information.

A reasonable approach would be to define a program that returns a probability distribution over at most ten sequence generators which generate distinct a_{n+1}  conditioned on prior knowledge. The hard part is that the body of the program would necessarily involve an algorithm capable of learning mathematical concepts.

Randomness and Data compression:

At this point someone might try to bring up a clever argument centred around Ockham’s razor. Now, I think that such an argument definitely has some merit. As a general rule it’s reasonable to assume that the sequence generator can be encoded in a language such that its description in that language is as short or shorter than the length of the sequence. Further, if we assume that the author is an organism they are probably highly organised in a spatial and temporal manner. In other words, it’s very unlikely that the sequence is random.

However, this doesn’t constrain the problem space. The necessary and sufficient general approach would be to use an AGI capable of abstract conceptual reasoning beyond that allowed by current machine learning algorithms. This is the only way it could possibly learn and use mathematical concepts. Good luck building that for a Kaggle competition.


The main lesson drawn from this analysis is that in order to make measurable progress in the field of machine learning it’s very important to choose your problems wisely. In fact, I think it’s fair to say that choosing the right machine learning problems to work on is at least as important as the associated algorithmic challenges.

Note: It’s not surprising that nobody is prepared to offer money for such a competition. What I do find surprising is that some have tried to make ‘progress’ on this challenge.


Behavioral Syntax v 1.1

I’ve been contributing to a Python-based behavioral Turing Test that’s part of OpenWorm for a while now and Behavioral Syntax v 1.1 is almost ready so it’s time I announced its existence. The methods are based on the ‘Behavioral Syntax’ paper of Andre Brown et al. which attempts to describe the locomotion of C. Elegans in terms of a vocabulary of ‘worm shapes’ that are determined using K-means clustering.

the green segment designates the head

90 distinct postures are sufficient to account for 82% of postural variance so we
can map worm shapes from videos to these 90 postures with minimal loss of
information. And lo and behold! We now have a problem that amounts to comparing sequences and suddenly we can use methods from natural language processing and bioinformatics.

Now, I must add that this project was never intended to merely be a Python
version of Dr Brown’s Matlab code. From the beginning I planned to design
a fast behavioral Turing Test that would serve as a first pass filter. It would
automatically generate lab reports that would make it simple to assess whether
or not a simulated worm resembled a normal C. Elegans in its behavior.

The primary method I plan to use is bayesian classification using sub-models
such as minimal description length, grammatical inference, hierarchical
Markov models, and simpler things such as computing posture heat-map
similarity. A posture heat-map might be useful in a uniform setting such as
agar petri dishes off-food. Assuming that a standard search algorithm such
as Lévy flight search is in use, I would expect a certain pattern in the
occurrence of postures. You might think that Lévy flights would be off-limits
to my project since they represent continuous-time behavior but my plan is
actually to compare ‘continuous’ and ‘discrete’ models. It would be interesting
to see what kind of ‘false’ worms can actually fool a discrete Turing test.

Speaking of which, I’m actually planning to begin work on a ‘false worm’
repository. Initially these would be simple test cases to see whether the
Behavioral Turing test can actually be fooled by a random
piece-wise harmonic path on a flat surface. I can’t think of a better way to
see whether these tests actually have any value.

Stay tuned. In less than a week it should be possible for anybody to install the repository using pip.