Science can’t solve the ultimate mystery of nature. And that is because, in the last analysis, we ourselves are a part of the mystery that we are trying to solve.-Planck


Any application of the principles of Algorithmic Information Theory, Occam’s razor in particular, implicitly assumes the Physical Church-Turing thesis. Moreover, as Occam’s razor has had unrivalled success as a general method for model selection in the natural sciences, its success merits closer investigation. Upon closer inspection, an algorithmic analysis of Occam’s razor yields a description in terms of the Expected Kolmogorov Complexity of an observed data stream.

If \(\Omega\) is identified with what physicists call an observer i.e. a Turing Machine that collects data \(X\) and tests hypotheses concerning the probabilistic structure of \(X\) in order to discover \(P_X\) then we find that relative to this observer:

\begin{equation} \mathbb{E}[K(X)] = H(X|\Omega) + H(\Omega) \end{equation}

where \(H(\Omega)\) is the inextricable epistemic uncertainty of \(\Omega\) concerning its own operation and \(H(X|\Omega)\) is the inextricable epistemic uncertainty of \(\Omega\) relative to \(P_X\). It may be demonstrated that \(\mathbb{E}[K(X)]\) is simultaneously a measure of incompressibility, algorithmic randomness and incompleteness where incompleteness is understood as a measure of inextricable epistemic uncertainty.

Within the context of the Physical Church-Turing thesis, this relation has several non-trivial implications:

  1. The scientific method whose epistemic limits are defined by (1) is not observer-independent. It follows that algorithmic randomness in particular is not observer-independent.

  2. The most fundamental theories in physics, Quantum Field theories, will appear to describe algorithmically random phenomena.

  3. If theoretical physics ultimately has an appropriate axiomatisation, such an axiomatisation is beyond the reach of Turing Machines.

  4. Consciousness is potentially within the scope of the scientific method if it may be approximated by an algorithmic description, but if it does then such a description is beyond the scope of the scientific method.

  5. Every Universal Turing Machine must contain an encoding of arithmetic and therefore an encoding of the natural numbers. However, an optimal encoding(i.e. the distribution of prime numbers) is incompressible and therefore algorithmically random.

In fact, all these results are implied by the fact that:

\begin{equation} \mathbb{E}[K(X)] \geq H(\Omega) \end{equation}

which implies that all computations are fundamentally Bayesian, all observations are computations, and therefore if \(\Omega\) contains a necessary and sufficient model for fundamental physics then such a model is beyond the knowledge of \(\Omega\).

If the reader considers the Physical Church-Turing thesis somewhat eccentric, it is worth noting that this thesis is implicitly assumed by most theoretical physicists as well as a large number of applied mathematicians who take its unreasonable effectiveness for granted every day. Having said that, the reader may desire an analysis of a specific case.

We may begin out analysis by considering the algorithmic randomness that is observed in the distribution of prime numbers. From (2) we may deduce that the distribution of the prime numbers is incompressible, i.e. algorithmically random, from the perspective of a Turing Machine. Let’s assume that is the case and consider its implications.

An information-theoretic derivation of the prime number theorem:

If we know nothing about the primes in the worst case we may assume that each prime number less than or equal to \(N\) is drawn uniformly from \([1,N]\). So our source of primes is:

\begin{equation} X \sim U([1,N]) \end{equation}

where \(H(X) = \ln(N)\) is the Shannon entropy of the uniform distribution.

Now, given a strictly increasing integer sequence of length \(N\), \(U_N = \{u_i\}_{i=1}^N\) where \(u_i = i\) we may define the prime encoding of \(U_N\) as the binary sequence \(X_N = \{x_i\}_{i=1}^N\) where \(x_i =1\) if \(u_i\) is prime and \(x_i=0\) otherwise. With no prior knowledge, given that each integer is either prime or not prime, we have \(2^N\) possible prime encodings(i.e. arrangements of the primes) in \([1,N] \subset \mathbb{N}\).

If there are \(\pi(N)\) primes less than or equal to \(N\) then the average number of bits per arrangement gives us the average amount of information gained from correctly identifying each prime number in \(U_N\) as:

\begin{equation} S_c = \frac{\log_2 (2^N)}{\pi(N)}= \frac{N}{\pi(N)} \end{equation}

Furthermore, if we assume a maximum entropy distribution over the primes then we would expect that each prime is drawn from a uniform distribution as in (1) so we would expect:

\begin{equation} S_c = \frac{N}{\pi(N)} \sim \ln(N) \end{equation}

As for why the natural logarithm appears in (3), we may first note that the base of the logarithm in the Shannon Entropy may be freely chosen without changing its properties. Moreover, given the assumptions if we define \((k,k+1] \subset [1,N]\) the average distance between consecutive primes is given by the sum of weighted distances \(l\):

\begin{equation} \sum_{k=1}^{N-1} \frac{1}{k} \lvert (k,k+1] \rvert = \sum_{k=1}^{N-1} \frac{1}{k} \approx \sum_{l=1}^\lambda l \cdot P_l \approx \ln(N) \tag{4} \end{equation}

where \(P_l = \frac{1}{l} \cdot \sum_{k= \frac{l \cdot (l-1)}{2}}^{\frac{l\cdot (l-1)}{2}+l-1} \frac{1}{k+1}\) and \(\lambda = \frac{\sqrt{1+8(N+1)}-1}{2}\).

This is consistent with the maximum entropy assumption in (1) as there are \(k\) distinct ways to sample uniformly from \([1,k]\) and a frequency of \(\frac{1}{k}\) associated with the event that a prime lies in \((k-1,k]\). The computation (4) is also consistent with Boltzmann’s notion of entropy as a measure of possible arrangements.

There is another useful interpretation of (4). If we break \(\sum_{k=1}^{N} \frac{1}{k}\) into \(\pi(N)\) disjoint blocks of size \([p_k,p_{k+1}]\) where \(p_k,p_{k+1} \in \mathbb{P}\) and \(p_1 = 1\):

\begin{equation} \sum_{k=1}^{N} \frac{1}{k} \approx \sum_{k=1}^{\pi(N)} \sum_{b=p_k}^{p_{k+1}} \frac{1}{b} = \sum_{k=1}^{\pi(N)} (p_{k+1}-p_k)\cdot P(p_k) \approx \ln(N) \tag{5} \end{equation}

where \(P(p_k)= \frac{1}{(p_{k+1}-p_k)} \sum_{b=p_k}^{p_{k+1}} \frac{1}{b}\). So we see that (4) also approximates the expected number of observations per prime where \(P(p_k)\) may be interpreted as the probability of a successful observation in a frequentist sense. This is consistent with John Wheeler’s it from bit interpretation of entropy [5], where the entropy measures the average number of bits(i.e. yes/no questions) per prime number.

Now, given (3),(4) and (5) this implies that the average number of bits per prime number is given by:

\begin{equation} \pi(N) \sim \frac{N}{\ln(N)} \end{equation}

which is in complete agreement with the predictions of the prime number theorem.

By the Shannon source coding theorem, we may also infer that \(\pi(N)\) primes can’t be compressed into fewer than \(\pi(N) \cdot \ln(N)\) bits. So the primes are incompressible. In fact, given that the expected Kolmogorov Complexity equals the Shannon entropy for computable probability distributions for a prime encoding of length \(N\) we must asymptotically obtain:

\begin{equation} \mathbb{E}[K(X_N)] \sim \pi(N) \cdot \ln(N) \sim N \end{equation}

where \(K(\cdot)\) is the Kolmogorov Complexity. This result has two interesting implications.

Exploring the Monte Carlo Hypothesis with state of the art machine learning methods:

Within the epistemological framework of modern machine learning any scientific problem may be formulated as a machine learning problem. If we take this framework for granted then the epistemic limits of the scientific method is given by (1) from which we derived the maximum entropy model for the distribution of prime numbers. In particular, my model implies that independently of the amount of data and computational resources at their disposal, if the best machine learning model predicts the next \(N\) primes to be at \(\{a_i\}_{i=1}^N \in \mathbb{N}\) then for large \(N\) their model’s statistical performance will converge to an accuracy that is no better than:

\begin{equation} \frac{1}{N}\sum_{i=1}^N \frac{1}{a_i} \end{equation}

This implies that the true positive rate of any prime formula with bounded algorithmic information content converges to zero.

Another verifiable prediction is that prime encodings are algorithmically random since:

\begin{equation} \mathbb{E}[K(X_N)] \sim N \end{equation}

and so they would pass the next-bit test.

In order to allow for a structured examination of competing hypotheses, I created neutral territory; a machine learning challenge where different models may be rigorously evaluated. The reader is welcome to participate in this challenge which starts on the 17th of March and ends on the 17th of August. Results will be announced on the 17th of September.

Meanwhile, several important remarks are in order:

  1. It is worth noting that this hypothesis is not reducible to the Prime Number Theorem as it makes predictions based on principles from algorithmic information theory. Nor could I have come up with this hypothesis by means of large-scale machine learning.

  2. That said, this hypothesis could not have been formulated without a good understanding of machine learning methods. Their scope and their limits, as defined by the observer-dependence theorem (1).

  3. This analysis suggests that the Prime Number Theorem is a fundamental theorem in algorithmic information theory that was first discovered by number theorists. It is fundamental in the sense that every Universal Turing Machine must contain an encoding of arithmetic and therefore an encoding of the natural numbers.


The reader may wonder why the Monte Carlo Hypothesis is so-named. There is a natural reason for that.

The Monte Carlo Hypothesis implies that civilisations within the Turing limit will have, at best, a probabilistic understanding of the distribution of prime numbers. Moreover, investigating the source of randomness in the distribution of prime numbers represents nothing less than a scientific investigation into the origins of randomness in the natural world. I would like to add that while the machine learning challenge has not started, I would like to elaborate the implications of a theory that goes beyond the Monte Carlo hypothesis.

An alternative hypothesis would have to be both incompatible with Algorithmic Information Theory and provide a general method for model selection that is more powerful than Occam’s razor. In fact, this machine learning challenge represents a fundamental test of both algorithmic information theory and the Physical Church-Turing thesis.


  1. Dániel Schumayer and David A. W. Hutchinson. Physics of the Riemann Hypothesis. Arxiv. 2011.
  2. Doron Zagier. Newman’s short proof of the Prime Number Theorem. The American Mathematical Monthly, Vol. 104, No. 8 (Oct., 1997), pp. 705-708
  3. Peter D. Grünwald. The Minimum Description Length Principle . MIT Press. 2007.
  4. M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Graduate Texts in Computer Science. Springer. 1997.
  5. Peter Shor. Shannon’s noiseless coding theorem. lecture notes. 2010.
  6. Jerome H. Friedman, Robert Tibshirani, and Trevor Hastie. The Elements of Statistical Learning. Springer. 2001. AlphaGo Zero. Kepler Lounge. 2019.
  7. Andrew Chi-Chih Yao. Theory and applications of trapdoor functions. In Proceedings of the 23rd IEEE Symposium on Foundations of Computer Science, 1982.
  8. Sullivan, Walter (September 29, 1968). “The Universe Is Not Ours Alone; From the edge of the galaxy”. The New York Times.
  9. Alexander L. Zaitsev. Rationale for METI. Arxiv. 2011.
  10. Jérôme Durand-Lose. Black hole computation: implementations with signal machines. 2008.
  11. Aidan Rocke. An ergodic approach to Occam’s razor. 2021.
  12. Aidan Rocke (, information-theoretic derivation of the prime number theorem, URL (version: 2021-02-20):
  13. Aidan Rocke.