Introduction:

Within the realm of computer science, we are generally interested in the properties of reusable programs (i.e. programs with loops). Moreover, when modelling rare events we are implicitly concerned with complex systems that exhibit recurrent behaviour, i.e. rare events that occur with some regularity (e.g. earthquakes, or car crashes at a traffic intersection). In both cases, scaling laws for the asymptotic complexity of the requisite software systems serve as useful proxy measures of the complexity of the problem being addressed.

In the special case of rare-event modelling, where the initial conditions of complex systems are generally uncertain and only partially observable, an asymptotic analysis of the expected memory requirements of the optimal machine learning model yields scaling laws for the expected information gained from correctly predicting the occurrence of rare events.

In this article, I will show that this generally corresponds to the Expected Kolmogorov Complexity \(\mathbb{E}[K(\cdot)]\) of a dataset \(X\) relative to a machine learning model \(\Omega\), so \(\mathbb{E}[K(\cdot)]\) serves the role of an algorithmic Occam’s razor.

From Bayes’ theorem to Occam’s razor:

When modelling rare events, the datasets are typically small, so we can't rely on popular methods for large-scale machine learning and must be especially rigorous in the modelling process. Moreover, given that we generally want a causal model that we may use to design policies for avoiding undesirable outcomes such as forest fires or landslides, we would ideally use a method that allows us to approximate the data-generating process.

In this setting, Occam's razor is very useful for model selection as it is biased towards accurate models that are as simple as possible. This allows us to favour accurate models that don't contain false or irrelevant axioms, and to avoid complex models that may be used to explain anything and everything. For this reason, Occam's razor is also known as the Minimum Description Length principle [1]; it may also be derived directly from Bayes' theorem.

Let’s suppose we have a data stream \(X=\{x_i\}_{i=1}^n\) and we want to solve the system identification problem:

\begin{equation} x_{n+1} = f \circ x_n \end{equation}

where \(f\) is unknown, so we would like to approximate it using a machine learning model \(f_{\theta}\). Taking \(f_{\theta}\) as our hypothesis, Bayes' theorem gives us the posterior:

\begin{equation} P(f_{\theta}|X) = \frac{P(X|f_{\theta}) \cdot P(f_{\theta})}{P(X)} \end{equation}

Maximising this posterior is equivalent to minimising \(-\log P(X|f_{\theta}) - \log P(f_{\theta})\), since \(P(X)\) doesn't depend on the choice of model. Hence the Minimum Description Length of the dataset \(X\) in terms of the best model \(\Omega\) may be expressed as the sum of two complementary Shannon entropies:

\begin{equation} \text{MDL}(X) = \mathbb{E}[-\log P(X|\Omega)] + \mathbb{E}[-\log P(\Omega)] = H(X|\Omega) + H(\Omega) \end{equation}

where \(H(X|\Omega)\) is a measure of model accuracy and \(H(\Omega)\) is a measure of model complexity. Now, given that almost all physical systems satisfy the principle of locality, we should generally expect \(H(\Omega)\) to be small.
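To make this concrete, here is a minimal sketch of MDL-style model selection in Python, assuming a crude two-part code of my own choosing: polynomial models of increasing degree are scored by a Gaussian data-encoding term (accuracy) plus a fixed number of bits per coefficient (complexity). The encoding choices are illustrative assumptions, not the refined codes of the MDL literature [1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a quadratic signal plus Gaussian noise.
x = np.linspace(-1, 1, 40)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0, 0.1, x.size)

def description_length(degree, bits_per_param=16, precision=1e-3):
    """Crude two-part code: H(X|model) is approximated by the cost of
    encoding the residuals to a fixed precision under a Gaussian model,
    and H(model) by bits_per_param bits per fitted coefficient.
    Both choices are illustrative assumptions."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(residuals.var(), precision**2)
    accuracy_bits = x.size * (0.5 * np.log2(2 * np.pi * np.e * sigma2)
                              - np.log2(precision))
    complexity_bits = bits_per_param * (degree + 1)
    return accuracy_bits + complexity_bits

for d in range(6):
    print(f"degree {d}: description length ≈ {description_length(d):.1f} bits")
# The minimum is typically attained near the true degree (2), illustrating
# the bias towards models that are accurate yet as simple as possible.
```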

Curiously, an insightful derivation of Occam’s razor is also possible via Algorithmic Information Theory.

A dynamical systems approach to Expected Kolmogorov Complexity:

We may define the Kolmogorov Complexity of a string \(x\) as the size of the smallest program \(p\) (measured in bits) that may be simulated on a Universal Turing Machine \(U\) such that \(U(p)=x\):

\begin{equation} K_U(x) := \min_{p} \{l(p): U(p) = x\} \end{equation}

where \(K_U(x)\) is defined up to an additive constant which doesn’t depend on \(x\), by the invariance theorem [4]. Within the context of the Planck-Church-Turing thesis, this constant is exactly the size of the computing device that is simulating the Observable Universe itself [5].
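Although \(K_U(x)\) is uncomputable, it can be bounded from above by any lossless compressor, since the compressed string together with a fixed-size decompressor is a program that prints \(x\). A minimal Python sketch, using zlib purely as an illustrative upper bound:

```python
import os
import zlib

def kolmogorov_upper_bound(x: bytes) -> int:
    """Length in bits of a zlib encoding of x; this upper-bounds K(x)
    up to the constant-size decompressor."""
    return 8 * len(zlib.compress(x, 9))

structured = b"ab" * 5000         # a highly regular 10,000-byte string
random_like = os.urandom(10000)   # incompressible with high probability

print(kolmogorov_upper_bound(structured))   # far fewer than 80,000 bits
print(kolmogorov_upper_bound(random_like))  # close to (or above) 80,000 bits
```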

In this context, it may be shown that an alternate derivation of Occam’s razor is possible [6]:

\begin{equation} \mathbb{E}[K(X)] = H(X|\Omega) + H(\Omega) \end{equation}

where \(\Omega\) is the machine learning model that minimises the description length of the data-generating process from which \(X\) was sampled.
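A heuristic sketch of why such a decomposition is plausible (assuming a computable model \(\Omega\) whose description cost is captured by \(H(\Omega)\), as in [6]): encoding any \(x\) with a Shannon-Fano code relative to \(P(\cdot|\Omega)\), prefixed by a description of \(\Omega\) itself, yields a program for \(x\) of length roughly \(-\log P(x|\Omega)\) plus the cost of describing \(\Omega\), so that up to additive constants:

\begin{equation} \mathbb{E}[K(X)] \lesssim \mathbb{E}[-\log P(X|\Omega)] + H(\Omega) = H(X|\Omega) + H(\Omega) \end{equation}

Conversely, since minimal programs form a prefix-free code, Shannon's source-coding theorem guarantees that no such code can beat the entropy on average, which is why the bound is essentially tight.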

The relation between Bayesianism and Algorithmic Information Theory:

Given a dataset of observations \(X = \{x_i\}_{i=1}^n\), if we want to solve the system identification problem of identifying \(f\) where:

\begin{equation} x_{n+1} = f \circ x_n \end{equation}

we may introduce a Bayesian model \(f_{\theta}\) where \(P(x_{n+1} \lvert f_{\theta}\circ x_n)\) quantifies the degree of belief of \(f_{\theta}\) that \(x_{n+1} = f \circ x_n\). So we may define:

\begin{equation} H(X|f_{\theta}) = - \sum_{k=1}^{n-1} P(x_{k+1}| f_{\theta} \circ x_k) \ln P(x_{k+1}| f_{\theta} \circ x_k) \end{equation}

and if \(\text{Dom}(f_{\theta}) = X_{\theta}\) we have:

\begin{equation} H(f_{\theta}) = - \sum_{x_n, x_{n+1} \in X_{\theta}} P(x_{n+1}| f_{\theta} \circ x_n) \ln P(x_{n+1}| f_{\theta} \circ x_n) \end{equation}

which is a measure of the confidence of \(f_{\theta}\) in its predictions as well as a measure of the complexity of \(f_{\theta}\).
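As a minimal sketch of these two quantities, consider a toy model over a finite state space, where the hypothetical \(f_{\theta}\) is simply a row-stochastic transition matrix standing in for an arbitrary Bayesian model of the dynamics:

```python
import numpy as np

# Hypothetical model f_theta over the finite state space X_theta = {0, 1, 2}:
# a row-stochastic matrix where P[i, j] is the model's degree of belief
# that state i is followed by state j.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

# An observed data stream X = {x_i}.
X = [0, 0, 1, 1, 2, 2, 2, 0, 0, 1]

def plogp(p):
    return p * np.log(p) if p > 0 else 0.0

# H(X | f_theta): the sum of -p*ln(p) over the observed transitions,
# following the definition in the text.
H_X_given_model = -sum(plogp(P[a, b]) for a, b in zip(X[:-1], X[1:]))

# H(f_theta): the same quantity summed over every pair of states in the
# model's domain, a measure of the overall confidence of f_theta.
H_model = -sum(plogp(P[a, b]) for a in range(3) for b in range(3))

print(f"H(X|f_theta) ≈ {H_X_given_model:.3f} nats")
print(f"H(f_theta)   ≈ {H_model:.3f} nats")
# Their sum upper-bounds the Expected Kolmogorov Complexity of X relative
# to this particular model, as in the inequality below.
```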

In summary, for any machine learning model \(f_{\theta}\):

\begin{equation} \mathbb{E}[K(X)] \leq H(X|f_{\theta}) + H(f_{\theta}) \end{equation}

It is not a coincidence that Bayesianism and Algorithmic Information Theory offer complementary perspectives on the Minimum Description Length principle, as the motivations of Bayes and Kolmogorov were similar; they both wanted to characterise unique phenomena. Kolmogorov sought a measure of the complexity of particular programs. Bayes, on the other hand, wanted a method for quantifying the probability of events which might occur only once. This led him to the notion of probability as a measure of a scientist's degree of belief.

Given that computations are observer-dependent and Bayesianism requires that we quantify degrees of belief, it is fair to say that the scientific method presupposes the existence of conscious observers. This is an idea which shall be developed in more detail in a future article.

Asymptotic analysis of the Expected Kolmogorov Complexity:

The Shannon entropy \(H(\cdot)\) is additive for a sequence of independent random variables \(x_i\), so we have:

\begin{equation} H(\Omega) \leq \mathbb{E}[K(\{x_i\}_{i=1}^n)] \leq n \cdot H(X) \end{equation}
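As an informal numerical check on the upper bound, a lossless compressor can again stand in for \(K(\cdot)\): for an i.i.d. uniform source, the compressed size grows roughly linearly in \(n\) at close to \(H(X)\) bits per symbol (a rough sketch, not a proof):

```python
import os
import zlib

# i.i.d. uniform bytes: H(X) = 8 bits per symbol, so the upper bound
# E[K({x_i}_{i=1}^n)] <= n * H(X) should be nearly tight.
for n in [1_000, 10_000, 100_000]:
    data = os.urandom(n)
    compressed_bits = 8 * len(zlib.compress(data, 9))
    print(f"n = {n:>7}: {compressed_bits / n:.2f} bits per symbol (H(X) = 8)")
```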

Now, given that computations are observer-dependent and that Turing's fixed point theorem [3] places strong epistemological constraints on what may be known a priori about the attractors of a dynamical system, for chaotic systems with discrete state spaces \(X\) the situation is quite dramatic. Even if we know the transition function \(f\):

\begin{equation} x_{n+1} = f \circ x_n \end{equation}

if we have even the smallest doubt about the initial condition \(x_0\), then for large \(n\):

\begin{equation} \mathbb{E}[K(\{x_i\}_{i=1}^n | f,x_0)] \sim n \cdot \ln |X| \end{equation}

This scaling relationship is crucial, as it represents the expected information gained from correctly predicting each state in the sequence \(\{x_i\}_{i=1}^n\), and most rare-event processes are hard to model precisely because they are chaotic. There are several reasons for this general result: chaotic systems are highly sensitive to initial conditions, they exhibit strong mixing properties, and randomness is nothing more than a placeholder for unpredictable phenomena.
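A minimal numerical illustration, assuming the logistic map in its chaotic regime as a stand-in for such a system, with a two-cell coarse-graining playing the role of the discrete state space \(X\): two trajectories whose initial conditions differ by \(10^{-10}\) decorrelate within a few dozen steps, and the empirical entropy of the coarse-grained symbols approaches \(\ln |X| = \ln 2\):

```python
import numpy as np

def logistic(x):
    return 4.0 * x * (1.0 - x)   # the logistic map in its chaotic regime

# Sensitive dependence: two initial conditions differing by 1e-10.
x, y = 0.3, 0.3 + 1e-10
for step in range(1, 61):
    x, y = logistic(x), logistic(y)
    if abs(x - y) > 0.1:
        print(f"trajectories diverged by more than 0.1 after {step} steps")
        break

# Coarse-grain the trajectory into |X| = 2 cells; the empirical entropy
# of the symbols approaches the maximum-entropy value ln|X| = ln 2.
x, symbols = 0.3, []
for _ in range(100_000):
    x = logistic(x)
    symbols.append(0 if x < 0.5 else 1)
p = np.mean(symbols)
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
print(f"empirical entropy ≈ {entropy:.4f} nats (ln 2 ≈ {np.log(2):.4f})")
```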

To say that long-term prediction of chaotic dynamical systems is hard is an understatement. It would be more precise to say that the best machine learning model for predicting the long-term behaviour of chaotic systems is a maximum entropy distribution over all possible trajectories.

In the long run, the best model is no model, so we have to accept that for chaotic systems we can rarely look more than a few steps ahead.

What about the recurrence time of deterministic systems?

At this point, a good scientist would do well to observe that if \(f\) determines the evolution of a deterministic dynamical system whose state space \(X\) is finite, so that \(|X| < \infty\), then:

\begin{equation} \exists \tau \in \mathbb{N} \,\, \forall k \in [0,|X|], \, x_{\tau + k} = x_k \end{equation}

where the smallest such \(\tau\), the recurrence time of the system, is less than or equal to \(\lvert X \rvert\).

However, chaotic systems are typically ergodic (indeed, strongly mixing), so \(\tau \sim |X|\), and they are highly sensitive to initial conditions. So if we make the reasonable assumptions that \(\text{dim}(X) > 10\) and that each observation \(x \in X\) is measured with a precision of five decimal places:

\begin{equation} |X| > (10^{5})^{10} = 10^{50} \end{equation}
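For intuition about how the recurrence time scales, one can sample invertible dynamics on a finite state space uniformly at random (random permutations, an idealised stand-in for a strongly mixing deterministic system) and measure the return time of a random initial state; its average grows linearly with \(|X|\), consistent with \(\tau \sim |X|\):

```python
import random

def average_recurrence_time(n, trials=200, seed=0):
    """Average number of steps before a random initial state recurs
    under a uniformly random permutation of {0, ..., n-1}."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        x0 = rng.randrange(n)
        x, steps = perm[x0], 1
        while x != x0:
            x, steps = perm[x], steps + 1
        total += steps
    return total / trials

for n in [100, 1_000, 10_000]:
    print(f"|X| = {n:>6}: average recurrence time ≈ "
          f"{average_recurrence_time(n):.0f} (about |X|/2)")
```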

Even if data storage is not an issue, there are other, ethical challenges involved in modelling rare events. For concreteness, let's suppose you model car crashes at several traffic intersections in a big city. You would have a hard time convincing the department of transportation to collect data for more than a decade before you start building a useful predictive model, because behind each data point is a life that could have been saved.

A potential solution:

In spite of what has been said so far, I think a general system for modelling rare events is possible, on the condition that we only try to reason a few steps ahead. We need look no further than the human brain for inspiration, as it was designed to model rare and hard-to-predict events. If not for this ability, humans would not have been able to adapt to a variety of complex environments.

In particular, I think that a key step in identifying information-theoretic constraints on the space of reasonable solutions is an analysis of the information geometry of self-organised criticality in the brain. This is important for a couple of reasons.

First, information geometry provides us with robust tools for constructing and comparing statistical models. Second, self-organised criticality appears to play a crucial role in cognitive flexibility [2].

References:

  1. Peter D. Grünwald. The Minimum Description Length Principle. MIT Press. 2007.
  2. Luca Cocchi et al. Criticality in the brain: A synthesis of neurobiology, models and cognition. 2017.
  3. Aidan Rocke. Turing's Fixed Point Theorem. 2021.
  4. Marcus Hutter. Algorithmic information theory. Scholarpedia. 2007.
  5. Aidan Rocke. Understanding the Physical Church-Turing thesis and its implications. CSTheory StackExchange (version: 2021-02-20). https://cstheory.stackexchange.com/q/48450
  6. Aidan Rocke. An ergodic approach to Occam's razor. 2021.