# An ergodic approach to Occam's razor

## Introduction:

In this article, I will show how an algorithmic Occam’s razor may be derived using the Expected Kolmogorov Complexity of a discrete random variable. Unlike mainstream Algorithmic Information Theory we won’t appeal to dubious mathematical notions such as the ‘Universal Distribution’ which actually lead to an incorrect answer.

Instead, we shall rely upon a combination of Bayesian and ergodic perspectives that are both necessary and sufficient.

## A Bayesian perspective on computation:

As computations are observer-dependent, computation is fundamentally Bayesian. In particular, the uncertainty associated with a discrete random variable is defined with respect to a predictive model. It follows that Occam’s razor for a discrete random variable \(X\) is a measure of epistemic uncertainty or the memory requirements of the ideal model for predicting the behaviour of \(X\).

In this setting, the minimum description length of \(X\) is given by the most parsimonious model \(\Omega\) for predicting the behaviour of \(X\):

\begin{equation} \mathbb{E}[K(X)] = \mathbb{E}[-\ln P(X\lvert \Omega)P(\Omega)] = H(X \lvert \Omega) + H(\Omega) \end{equation}

where \(H(\Omega)\) is the complexity of the model \(\Omega\) and \(H(X \lvert \Omega)\) is the expected information gained by \(\Omega\) from observing \(X\).

The reason why the expression in (1) has a probabilistic representation is that the behaviour of a discrete random variable is described by its probability distribution. From the perspective of Algorithmic Information Theory, civilisations within the Turing limit use probabilistic models and therefore the Expected Kolmogorov Complexity to describe incompressible data-generating processes because their understanding of algorithmically random processes is inherently probabilistic.

From an ergodic perspective, these probabilities also have a natural frequentist interpretation.

## An ergodic analysis:

Given that an event that occurs with frequency \(p\) generally requires modelling a sequence of length \(\sim \frac{1}{p}\), in order to encode the structure of such an event, the machine would generally need a number of bits proportional to:

\begin{equation} \ln(\frac{1}{p}) = - \ln(p) \end{equation}

But, how should we define the constant of proportionality? If we assume that the memory of the machine is finite and that the data-generating process is ergodic, an optimal encoding would use the expected number of bits:

\begin{equation} -p \cdot \ln(p) \end{equation}

in order to encode an event that occurs with frequency \(p\).

Regarding the assumptions, I may make a couple remarks. First, the ergodic assumption is equivalent to the premise that scientific experiments are repeatable in the natural sciences. As for finite memory, all Turing machines have finite memory.

## Discussion:

I would like to point out that mainstream Algorithmic Information Theory insists that the Expected Kolmogorov Complexity of a discrete random variable equals its Shannon entropy:

\begin{equation} \mathbb{E}[K(X)] = H(X) \end{equation}

by appealing to contrived mathematical notions such ‘Universal distributions’ which may ‘solve’ the No Free Lunch problem [1,2]. The same AIT researchers complain that there aren’t many scientists using Algorithmic Information Theory for predictive modelling.

The implicit error in (4) is to assume that computations are observer-independent. Since we can’t remove the observer in (1), we have:

\begin{equation} X = \Omega \implies \mathbb{E}[K(\Omega)] = H(\Omega) \end{equation}

So information that is truly incompressible is self-referential.

## References:

- Peter Grünwald and Paul Vitányi. Shannon Information and Kolmogorov Complexity. 2010.
- Tom Everitt, Tor Lattimore, Marcus Hutter. Free Lunch for Optimisation under the Universal Distribution. 2016.