Learning integer sequences

Last Friday night I had the chance to watch the 2011 presentation of Demis Hassabis on ‘Systems neuroscience and AGI’ which I found fascinating. One of the intermediate challenges he brought up around the 33.40 minute mark was the problem of predicting the next term in an integer sequence. Given my math background, I thought I’d take a closer look at this problem.

Fanciful arguments:

On the surface this problem appears attractive for several reasons:

  1. It appears that no prior knowledge is required as all the data is contained in the n visible terms in the sequence.
  2. Small data vs Big data: little data is required in order to make reasonable prediction.
  3. It connects machine learning with data compression. In particular, it appears to emphasize the simplest algorithm(i.e. Ockham’s razor).

I shall demonstrate that all of the above preconceptions are mainly false.  Before presenting my arguments, I’d like to point out that there is a Kaggle competition on integer sequence prediction based on the OEIS dataset. However, I’d take the leaderboard with a grain of salt as it’s very easy to collect the sequences from the OEIS database and simply run a pattern-matching algorithm on the set of known sequences.

Examples of sequences:

Assuming that possible integer sequences are of the form (a_n)_{n=1}^\infty,a_n \in \{0,1,2,...,9\}  , here are some examples:

  1. Fibonacci numbers: 0,1,1,2,3,5…
  2. Collatz sequence: 0,1,7,2,5,8,16,3,19,6…number of steps for n to reach 1 in the Collatz process.
  3. Catalan numbers: 1,1,2,5,14…where a_n = \frac{1}{n+1}{2n \choose n}
  4. Khinchin’s constant: 2,6,8,5,4,5…base-10 representation of Khinchin’s constant

These four sequences come from the areas of basic arithmetic, number theory, combinatorics and real analysis but it’s possible to construct an integer sequence from a branch of mathematics that hasn’t been invented yet so the space of possible integer sequences is actually quite large. In fact, the space of computable integer sequences is countably infinite.

Bayesian inference:

I’d like to make the case that this is actually a problem in Bayesian inference which can potentially involve a lot of data. Let me explain.

When trying to predict the next term in an integer sequence,a_{n+1} ,  what we’re actually trying to discover is the underlying number generator and in order to do this we need to discover structure in the data. However, in order to discover any structure in the data we must make the strong assumption that the data wasn’t generated in a random manner. This hypothesis can’t be justified by any single sequence.

Moreover, assuming that there is structure, knowing the author of the sequence is actually very important. If the sequence was beamed to my mobile phone by friendly aliens from a distant galaxy, it would be difficult for me to guessa_{n+1}  with a better than chance probability of being right whereas if the sequence was provided by the local number theorist my odds would be slightly better. The reason is that my prior knowledge of the number theorists’ mathematical training is very useful information.

A reasonable approach would be to define a program that returns a probability distribution over at most ten sequence generators which generate distinct a_{n+1}  conditioned on prior knowledge. The hard part is that the body of the program would necessarily involve an algorithm capable of learning mathematical concepts.

Randomness and Data compression:

At this point someone might try to bring up a clever argument centred around Ockham’s razor. Now, I think that such an argument definitely has some merit. As a general rule it’s reasonable to assume that the sequence generator can be encoded in a language such that its description in that language is as short or shorter than the length of the sequence. Further, if we assume that the author is an organism they are probably highly organised in a spatial and temporal manner. In other words, it’s very unlikely that the sequence is random.

However, this doesn’t constrain the problem space. The necessary and sufficient general approach would be to use an AGI capable of abstract conceptual reasoning beyond that allowed by current machine learning algorithms. This is the only way it could possibly learn and use mathematical concepts. Good luck building that for a Kaggle competition.


The main lesson drawn from this analysis is that in order to make measurable progress in the field of machine learning it’s very important to choose your problems wisely. In fact, I think it’s fair to say that choosing the right machine learning problems to work on is at least as important as the associated algorithmic challenges.

Note: It’s not surprising that nobody is prepared to offer money for such a competition. What I do find surprising is that some have tried to make ‘progress’ on this challenge.

perturbations that preserve constants of motion

simulation of three body problem with random initial conditions


Within the context of the gravitational n-body problem I’m interested in perturbations of the initial conditions  \{p_i(0),\dot{p}_i(0)\}_{i=1}^n    which leave all the constants of motion unchanged.

It’s clear to me that the linear momentum(P) is preserved under any transformation of the initial position vectors and energy(H) is preserved under any isometry applied to initial position vectors. However, when I add the constraint of conserving angular momentum(L), the general nature of these transformations which would leave (H,L, P) unchanged is not clear.

Granted, time translations would preserve the constants of motion but in a trivial manner. Now, I’m not sure whether the approach I’ve taken can be considerably improved but within the context of the three body problem in the plane, I transformed this problem into an optimisation problem as follows:

  1. We first calculate the constants of motion which are the linear momentum, energy and the angular momentum prior to perturbing the position of one of the n bodies. Let’s suppose that these constants are given by C_1, C_2, C_3 . 
  2. Assuming our bodies are restricted to move in the plane, there are a total of 4n scalars that determine the initial conditions for this n-body system. It follows that the perturbation of one body(q ) generally requires searching for a 4n-2  dimensional vector x  for positions and velocities. This can be done using Nelder-Mead, a derivative-free method, provided that we define a cost function. 
  3. The functions to be minimised are obtained directly from the linear momentum (P), angular momentum (L) and Hamiltonian (H) equations to obtain a single smooth cost function:\begin{aligned} C(x,q) =(P(x,q)-C_1)^2+(H(x,q)-C_2)^2+(L(x,q)-C_3)^2 \end{aligned}
  4. In order to increase the probability of convergence, the Nelder-Mead algorithm was randomly seeded for each perturbed position q .

I think this method can be easily extended to n-bodies.

point masses and Gauss’ Law of Gravity

My final year project involves searching for stable orbits for the gravitational n-body problem where we use Newton’s point mass approximation to describe the force field around masses assuming that these are approximately spherical. Now given that this model uses point masses we can’t model rotations. In fact, it’s generally assumed that they are negligible. 

However, I wondered whether this assumption can break down at some point. Consider the following:

  1. On the scale of millenia, the surface of our planet actually behaves like a fluid.
  2. If the distance between two planetary bodies is small and we assume zero rotation, this would imply a growing deviation from sphericity over time. It follows that the point-mass approximation would no longer make sense.
  3. In fact, our bodies would look something like blobs of gooey paint slowly dripping towards each other. 

While it’s true that the probability of zero rotation is close to zero the challenge of dealing with non-spherical masses is something that Newton wouldn’t have handled very well with his point-mass approximation. However, this is not quite the end of Newton’s model. 

Given that the gravitational potential in his model holds for point masses a natural solution would be to integrate over the volume enclosed by a compact body. This is exactly what Gauss did when he discovered the divergence theorem from which the Gauss Law of Gravity emerges as an immediate corollary. As this handles the case where our masses aren’t spherical the rest of my blog post will focus on the statement and proof of this theorem. 

Gauss Law of Gravitation:

Given U  as the gravitational potential and \vec{g} = -\nabla U  as the gravitational field, Gauss’ Law of Gravitation states:

\begin{aligned} \int_{V} (\nabla \cdot \vec{g}) dV =\int_{\partial V} \vec{g} \cdot d \vec{A}  \end{aligned}

In order to arrive at this result, I’ll first need to introduce two useful definitions:

flux:The flux \phi of a vector field through a surface is the integral of the scalar product of the field in a given point times the infinitesimal oriented surface at that point:

\begin{aligned} \phi =  \int_{\partial V} \vec{g} \cdot d \vec{A} \end{aligned}

where \vec{g} denotes the gravitational field, \partial V the boundary of our mass and d \vec{A} is our infinitesimal oriented surface.

divergence:The divergence represents volume density of the outward flux of a vector field from an infinitesimal volume around a given point:

\begin{aligned} \nabla \cdot \vec{g} \end{aligned}

The flux and divergence of a field are actually related in a local manner. Consider a box with dimensions \delta x_1,\delta x_2,\delta x_3 . Using Cartesian coordinates, let’s suppose that (x_1,x_2,x_3) denotes a corner on this box. Then the flux of \vec{g} projected onto \hat{x_3} is given by:

\begin{aligned} \phi_3 =(g_3(x_1,x_2,x_3+\delta x_3)-g_3(x_1,x_2,x_3)) \delta x_1 \delta x_2 \end{aligned}

For very small \delta x_3 ,

\begin{aligned} g_3(x_1,x_2,x_3+\delta x_3)-g_3(x_1,x_2,x_3) \approx \frac{\partial g_3}{\partial x_3} \delta x_3 \end{aligned}

where we get equality in the limit as \delta x_3 tends to zero.

As a result we have \forall i \in \{1,2,3 \}  :

\begin{aligned} \phi_i \approx \frac{\partial g_i}{\partial x_i} \delta x_1 \delta x_2 \delta x_3 =\frac{\partial g_i}{\partial x_i} \delta V  \end{aligned}

\begin{aligned} \phi =\sum_{i=1}^{3} \phi_i \end{aligned}

Now, we may deduce that the flux and divergence are related as follows:

\begin{aligned} \frac{\phi}{\delta V}\approx  \sum_{i=1}^{3} \frac{\partial g_i}{\partial x_i} \end{aligned}

\begin{aligned}\lim_{\delta V \to 0} \frac{\phi}{\delta V} = \nabla \cdot \vec{g} =  \sum_{i=1}^{3} \frac{\partial g_i}{\partial x_i} \end{aligned}

If we take an arbitrary solid and slice it into compact volumes V_i with associated surfaces \partial V_i we have:

\begin{aligned} \phi =\sum_{i=1}^{N} \phi_i =\sum_{i=1}^{N} \int_{\partial V_i} \vec{g_i} \cdot d \vec{A} =\int_{\partial V} \vec{g} \cdot d \vec{A} \end{aligned}

In addition, using the relationship between flux and divergence we have:

\begin{aligned} \phi =\lim_{N \to \infty} \lim_{V_i \to 0} \sum_{i=1}^{N} \nabla \cdot \vec{g_i} V_i = \int_{V} \nabla \cdot \vec{g} dV \end{aligned}

By equating the last two expressions we conclude the proof of this marvellous result by Gauss which is succinctly summarised on Wolfram as follows:

in the absence of the creation or destruction of matter, the density within a region of space can change only by having it flow into or away from the region through its boundary

At this point it’s natural to ask how can we describe the gravitational field which is defined in terms of the gravitational potential. In general, for non-spherical masses we can no longer use the usual inverse radius gravitational potential due to Newton.

In some sense we’ll need an infinite sum of perturbations in the gravitational potential which can be done using spherical harmonics, another wonderful mathematical invention that has many applications. This shall be the subject of my next blog post. 



The Ten Cupcake Problem

During my Christmas holidays I went through the regular ritual of eating cake with family relations. Occasionally I would observe crumbs fall off the sides of the cake and then think of the relative share each person got. These rituals continued until for the first time, I found an interesting math problem based on the challenge of dividing cake into equal volumes without leaving any crumbs. In fact it led me to an interesting conjecture at the intersection of geometry, topology and measure theory. 


A normal cake can be considered a compact euclidean manifold M \subset \mathbb{R}^3 homeomorphic to the sphere. If this manifold is also convex:

For any point on the boundary of our manifold, there exists a hyperplane H through this point such that we may decompose our manifold into three path-connected components that satisfy: 

\begin{aligned} M=M_1 \cup \{M \cap H\} \cup M_2 \end{aligned} 

\begin{aligned} Vol(M_1)=Vol(M_2)=Vol(M)/2 \end{aligned} 

The challenge is to show that the above volume-splitting properties hold for any compact manifold that is also convex but doesn’t hold otherwise. By this I mean that if our manifold happens to be non-convex then the above properties don’t hold for every point on its boundary. In other words we would observe crumbs.

This might seem like a relatively simple problem but it’s actually a more complex variant of the Ham Sandwich theorem which I learned from other users of the math stackexchange when I first shared the problem.

I leave it as an exercise to the reader to show that the volume-splitting property holds for any compact and convex manifold. The hard part is showing that for any compact and non-convex manifold the volume-splitting property doesn’t hold for all points on the boundary.

Right now it’s not clear to me whether this problem might exist under another name. If it happens to be unsolved I promise ten cupcakes to the first person that manages to solve it before me. 

Note: I have written a more structured article based on this blog post here. 

Differentiating under the integral sign

Back when I was in high-school I really enjoyed reading about Feynman’s entertaining numerical exploits. In particular, I remember integration challenges where he would use the Leibniz method of differentiating under the integral sign. I wasn’t taught this method at university but this detail resurfaced in my mind recently when I tackled a problem in hamiltonian dynamics where I had to differentiate under the integral sign. After using this method I decided to take a closer look at its mathematical justification.

Leibniz method:
For an integral of the form:

\begin{aligned} u(x) = \int_{y_0}^{y_1} f(x,y) dy \end{aligned} 


For all x in some open interval, the derivative of this integral is expressible as

\begin{aligned} u'(x) = \int_{y_0}^{y_1} f_x(x,y) dy  \end{aligned}

provided that f and f_x are both continuous over a region [x_0,x_1] \text{x}[y_0,y_1]



\begin{aligned} u(x) = \int_{y_0}^{y_1} f(x,y) dy \end{aligned}

\begin{aligned} u'(x) = \lim_{h\to 0} \frac{u(x+h)-u(x)}{h} \end{aligned}  

we may deduce:

\begin{aligned} u'(x) = \lim_{h\to 0} \frac{\int_{y_0}^{y_1} f(x+h,y) dy-\int_{y_0}^{y_1} f(x,y) dy}{h} = \lim_{n\to \infty} \int_{y_0}^{y_1} \frac{f(x+\frac{1}{n},y)-f(x,y)}{\frac{1}{n}} dy \end{aligned}

Now, if we define:

\begin{aligned} f_n (y) = \frac{f(x+\frac{1}{n},y)-f(x,y)}{\frac{1}{n}} \end{aligned}

it’s clear that we may apply the Bounded convergence theorem where:

\begin{aligned} \int_{y_0}^{y_1} \lim_{n\to \infty} f_n(y) dy = \lim_{n\to \infty} \int_{y_0}^{y_1} f_n(y) dy \end{aligned}

This is justified as the existence and continuity of f_x(x,y)  combined with the compactness of closed intervals implies that f_x(x,y) is uniformly bounded.

Note 1: I plan to share the generalization of this result to higher dimensions in the near future.

Note 2: This blog post is almost identical to the Wikipedia entry with some modifications and clarifications which I find useful.

Bounded convergence theorem

Today as I was trying to learn about the Leibniz method of differentiating under the integral sign, I came across the Bounded convergence theorem which happens to be very useful towards proving the one-dimensional case of the Leibniz method. It happens to be simpler to prove than the Lebesgue dominated convergence theorem but its still a very powerful theorem.

First, let’s introduce the notion of convergence in measure which is a useful generalization of the notion of convergence in probability.

Convergence in measure:

f_n \rightarrow f  in measure if \forall \epsilon > 0 ,

\begin{aligned} \mu(\{x \in X: |f_n(x)-f(x)| \geq \epsilon \}) \rightarrow 0 \text{ as } n \rightarrow \infty \end{aligned} 

Bounded convergence theorem:

If \{f_n \} are measurable functions that are uniformly bounded and f_n \rightarrow f in measure as n \rightarrow \infty , then

\begin{aligned} \lim_{n\to\infty} \int f_n dP = \int f dP \end{aligned}



\begin{aligned} | \int f_n dP -\int f dP| = | \int (f_n-f) dP| \leq  \int |f_n-f| dP \end{aligned}

we only need to prove that if f_n \rightarrow f in measure and |f_n| \leq M , then \int |f_n-f| dP \rightarrow 0 \text{ as } n \rightarrow \infty . So we have:

\begin{aligned} \int |f_n-f| dP = \int_{|f_n-f|\leq \epsilon} |f_n-f| dP + \int_{|f_n-f| > \epsilon} |f_n-f| dP \end{aligned}

From which we deduce:

\begin{aligned} \int |f_n-f| dP \leq \epsilon + MP \{x: |f_n(x)-f(x)|> \epsilon \} \end{aligned}

However, f_n \rightarrow f in measure so the RHS of the above inequality tends to zero as n \rightarrow \infty . From this the theorem follows.

Constructive proof of Kronecker’s density theorem

A couple months ago I shared a blog post on Dirichlet’s approximation theorem which we may use to prove that integer angles are dense in the unit circle. Now, I tried to go further and show that \forall k \in \mathbb{N}, e^{i k\mathbb{N}} is dense in the unit circle. This particular result is interesting because it leads one to deduce that \forall N \in \mathbb{N}, {\bigcap}_{n=0}^N sin( p_n\mathbb{N}) is dense in [-1,1]  where p_n is the nth prime.

This week I learned about the Weyl equidistribution theorem from which this particular result follows as an immediate corollary. However, I wondered whether there might be a simpler proof and after searching I came across a short paper on Kronecker’s density theorem which had exactly the result I wanted and a bit more. The purpose of this blog post is to present the proof contained in this paper.

Kronecker’s density theorem:
If \theta \in \mathbb{R} \setminus \pi \mathbb{Q} , then \{ e^{i \theta n} | n \in \mathbb{Z} \} is dense in the unit circle.


1. \mathbb{Q} \pi is dense in \mathbb{R} so it’s enough to prove that:

\begin{aligned} \forall t \in \mathbb{R} \forall \epsilon \in \pi \mathbb{Q} \exists n \in \mathbb{Z}, |e^{it}-e^{in\theta}|<\epsilon \end{aligned} 

Equivalently we have:

\begin{aligned} \forall t \in \mathbb{R} \forall \epsilon \in \mathbb{Q} \pi \exists p,q  \in \mathbb{Z}, |p\theta -t+2q\pi|<\epsilon \end{aligned}

2. Without loss of generality, we may assume that 0<\theta< 2 \pi since given that \theta \notin \mathbb{Q} \pi \exists k \in \mathbb{Z} ,

\begin{aligned} 0< \theta - 2k \pi < 2 \pi \end{aligned} 

It’s sufficient to prove the case t=0 since if we have found p,q \in \mathbb{Z} such that

\begin{aligned} |p\theta +2q\pi|<\epsilon \end{aligned}

for any t \in \mathbb{R} we can find k \in \mathbb{Z} such that:

\begin{aligned} |k - \frac{t}{p\theta+2q \pi}| <1 \end{aligned}

To obtain k it’s sufficient to take the integer part of the fraction in the inequality.

Now, it follows that we have:

\begin{aligned} kp\theta -t+2kq\pi|= |p\theta+2q\pi| |k - \frac{t}{p\theta+2q \pi}|< |p\theta -t+2q\pi|<\epsilon \end{aligned}

3. For fixed \epsilon \in \mathbb{Q} \pi the proof strategy is as follows:

a) starting at e^{i\theta} we move anti-clockwise around the unit circle in steps of arc length \theta until we cross the positive x-axis.

b) \theta \notin \mathbb{Q} \pi so we reach e^{i\theta_1}  where 0< \theta_1<\theta and

\begin{aligned} \exists k \in \mathbb{N}, k\theta=2\pi+\theta_1 \end{aligned}

c) if \theta_1-\theta > \epsilon we’re done. Otherwise, we repeat the procedure with \theta := \theta_1 .

d) The remainder of this proof will show that we can find an upper-bound for the number of iterations of \theta_n required.

5. Before proceeding, it’s useful to note that there exists \{k_n \}_{n=1}^{\infty} \subset \mathbb{N} where at the first iteration we have:

\begin{aligned} k_1\theta = \theta_1 + 2\pi \implies \theta_1 = k_1 \theta mod 2\pi < \theta \end{aligned}

So at the nth iteration we have:

\begin{aligned} \theta_n = \theta \prod_{i=1}^{n} k_i mod 2 \pi \end{aligned}

6. If \theta_i - \theta_{i-1} > -\epsilon , then we’re done as we have:

\begin{aligned} 0< \theta_{i-1}-\theta_{i}= (k_i - 1) \theta \prod_{j=1}^{i-1} k_i mod 2 \pi < \epsilon \end{aligned}

Otherwise, let’s define the smallest M \in \mathbb{N} such that:

\begin{aligned} M \epsilon > \theta \end{aligned}

If the process continued after the Mth iteration we would have:

\begin{aligned} 0< \theta_M < \theta_{M-1} -\epsilon < \theta_{M-2}-\epsilon-\epsilon < ...< \theta-M\epsilon < 0 \end{aligned}

a contradiction.

Note 1: The arguments in this blog post are taken almost entirely from the original paper with some slight modifications.

Note 2: The fact that the proof is constructive is very nice as Kronecker was a very strong advocate of constructive proofs.