In the following synthesis, we present Kolmogorov’s theory of Algorithmic Probability which defines the fundamental limits of Machine Learning.

Today, as most math undergrads can attest, the early Kolmogorov is best known for his revolutionary measure-theoretic axiomatisation of probability theory in 1933:

Kolmogoroff, A. Grundbegriffe der Wahrscheinlichkeitsrechnung. Ergebnisse der Mathematik und ihrer Grenzgebiete 2, Nr. 3. Berlin: Julius Springer. IV + 62 S. (1933).

However, the late Kolmogorov was also responsible for a counter-revolution, in which he envisioned the final theory of the Calculus of Probabilities. Kolmogorov formulated this vision in ‘Combinatorial foundations of information theory and the calculus of probabilities’, written in connection with his 1970 presentation at the International Congress of Mathematicians in Nice. While Kolmogorov admits that his treatment is incomplete, he presents strong arguments for the following conclusions:

Information theory must precede probability theory and not be based on it. By the very essence of this discipline, the foundations of information theory have a finite combinatorial character.

The applications of probability theory can be put on a uniform basis. It is always a matter of consequences of hypotheses about the impossibility of reducing in one way or another the complexity of the description of the objects in question.

In the last statement, Kolmogorov is implicitly referring to the Minimum Description Length or Kolmogorov Complexity of an object.

To this day, Kolmogorov’s theory of Algorithmic Probability remains poorly understood, although it defines the scope of machine epistemology and underpins the most powerful theory of Artificial General Intelligence, known as AIXI [1]. For these reasons, I wrote a self-contained synthesis of Kolmogorov’s theory of Algorithmic Probability as it was envisioned by Kolmogorov and completed by his PhD student, Leonid Levin.

**Acknowledgements:** I would like to thank Ioannis
Kontoyiannis, Cristian Calude, Hector Zenil, Marcus Hutter, Steve
Brunton, Alexander Kolpakov, and Igor Rivin for constructive feedback in
the preparation of this article.

Using Kolmogorov’s theory of Algorithmic Probability, we may apply Occam’s razor to any problem of scientific induction including the sequential game of \(20\) questions. However, it is easy to forget that this requires overcoming a seemingly insurmountable scientific obstacle which dates back to von Neumann:

My greatest concern was what to call it. I thought of calling it ‘information,’ but the word was overly used, so I decided to call it ‘uncertainty.’ When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.’ - Claude Shannon

This was accomplished through an ingenious combination of Shannon’s Theory of Communication with Alan Turing’s Theory of Computation. What emerged is the most powerful generalisation of Shannon’s theory: Kolmogorov Complexity and Shannon Entropy share the same units, and Kolmogorov Complexity elucidates the Shannon Entropy of a random variable as its Expected Kolmogorov Complexity. Furthermore, assuming that the Physical Church-Turing thesis is true, Kolmogorov’s theory of Algorithmic Probability formalizes Occam’s razor as Levin’s Universal Distribution.

In the following synthesis we formally state the four fundamental theorems of Kolmogorov’s theory of Algorithmic Probability before using the fourth theorem, on the asymptotic equivalence of Shannon Entropy and Expected Kolmogorov Complexity, to derive the Erdős-Euclid theorem which implies that the information content of finitely many primes is insufficient to generate all the integers.

Finally, we include formal proofs for each fundamental theorem.

Let \(U\) be a Turing–complete
language that is used to simulate a universal Turing machine. Let \(p\) be an input of \(U\) that produces a given binary string
\(x \in \{0,1\}^*\). Then the
*Kolmogorov Complexity* (or *Minimal Description Length*)
of \(x\) is defined as:

\[\begin{equation} K_U(x) = \min_{p} \{|p|: U\circ p = x\} \end{equation}\]

where \(U \circ p\) denotes the output of \(U\) on input \(p\).

Kolmogorov’s Invariance Theorem states that the above definition is invariant, up to an additive constant, of the choice of \(U\). Namely, any other Turing–complete language (or, equivalently, another universal Turing machine) \(U'\) satisfies:

\[\begin{equation} \forall x \in \{0,1\}^* : \lvert K_U(x)-K_{U'}(x) \rvert \leq O(1) \end{equation}\]

Hence,

\[\begin{equation} \forall x \in \{0,1\}^* : - c(U, U') \leq K_U(x) - K_{U'}(x) \leq c(U, U') \end{equation}\] for some positive constant \(c(U, U')\) that depends only on \(U\) and \(U'\).

The minimal description \(p\) such that \(U \circ p = x\) serves as a natural representation of the string \(x\) relative to the Turing–complete language \(U\). However, it turns out that \(p\), and thus \(K_U(x)\), is not computable.
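While \(K_U(x)\) cannot be computed, it can be upper-bounded: any lossless compressor provides a legitimate description of \(x\), up to the constant cost of the decompressor. A minimal sketch, using zlib purely as an illustrative stand-in for a universal language:

```python
import random
import zlib

def description_length_bits(x: bytes) -> int:
    # Length in bits of one particular description of x: a computable
    # upper bound on K(x), up to the constant size of the decompressor.
    return 8 * len(zlib.compress(x, 9))

regular = b"01" * 5000                                    # highly regular string
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(10000))   # incompressible-looking

print(description_length_bits(regular) < description_length_bits(noisy))  # True
```

The gap between the two bounds reflects the gap in Kolmogorov Complexity: the regular string admits a short description, while a typical random-looking string does not.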

The *Algorithmic Probability* of a binary string \(x\) may be defined as the probability of
\(x\) being generated by \(U\) on random input \(p\), where we consider \(p\) to be a binary string generated by fair
coin flips: \[\begin{equation}
P(x) = \sum_{p \,:\, U\circ p = x} 2^{-|p|}
\end{equation}\]

However, this quantity is not well–defined: given one such input \(p\), for every integer \(k \geq 1\) we may pad \(p\) into a program \(p'\) that is about \(\log_2 k\) bits longer than \(p\) and such that \(U\) produces the same binary string: \(U\circ p' = x\). Since each such \(p'\) contributes roughly \(2^{-|p|}/k\) to the sum, we find that:

\[\begin{equation} P(x) \geq 2^{-|p|}\, \sum_k \frac{1}{k} \end{equation}\]

where \(k\) ranges over the positive integers. As the harmonic series diverges, we can’t guarantee that \(P(x)\) converges.
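The obstruction is exactly the divergence of the harmonic series: padding a single witness program by about \(\log_2 k\) bits contributes roughly \(2^{-|p|}/k\) for every \(k\), and these contributions have no finite sum. A quick numerical sketch:

```python
# Partial sums of the harmonic series grow without bound (like ln n),
# so the lower bound on the naive P(x) can be made arbitrarily large.
def harmonic(n: int) -> float:
    return sum(1.0 / k for k in range(1, n + 1))

for n in (10, 10_000, 1_000_000):
    print(n, harmonic(n))   # grows roughly like ln(n) + 0.577
```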

Levin’s idea effectively formalizes Occam’s razor: we need to consider prefix–free Turing–complete languages only. Such languages are easy to imagine: if we agree that all documents end with the instruction `\end{document}`, which cannot appear anywhere else, then we have a prefix–free language.

Given that any prefix–free code satisfies the Kraft-McMillan inequality, we obtain the Universal Distribution:

\[\begin{equation} 2^{-K_U(x)} \leq P(x) = \sum_{p \,:\, U\circ p = x} 2^{-|p|} \leq 1 \end{equation}\]

where from now on we consider \(U\) to be prefix–free, and \(K_U\) now corresponds to *prefix–free Kolmogorov Complexity*.
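To make the inequality concrete, here is a minimal sketch verifying both the prefix property and the Kraft sum for a small hypothetical code (the code itself is illustrative, not derived from any particular \(U\)):

```python
from itertools import permutations

# A hypothetical prefix-free code over the binary alphabet.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
words = list(code.values())

# No codeword is a prefix of any other...
assert all(not u.startswith(v) for u, v in permutations(words, 2))

# ...so the Kraft sum cannot exceed 1 (here it is exactly 1: a complete code).
kraft_sum = sum(2 ** -len(w) for w in words)
print(kraft_sum)  # 1.0
```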

**Interpretation:**

The above facts may also be given a physical interpretation: in a computable Universe, given a phenomenon with encoding \(x \in \{0,1\}^*\) generated by a physical process, the probability of that phenomenon is well–defined and equal to the sum over the probabilities of distinct and independent causes. The prefix–free criterion is precisely what guarantees causal independence.

Furthermore, Levin’s Universal Distribution formalizes Occam’s razor: the most likely cause of the process is provided by the minimal description of \(x\), while lengthier descriptions are less probable causes.

In the setting of prefix–free Kolmogorov complexity, Levin’s Coding theorem states that

\[\begin{equation} -\log_2 P(x) = K_U(x) - O(1) \end{equation}\]

or, in other terms,

\[\begin{equation} P(x) = \Theta\left( 2^{-K_U(x)} \right) \end{equation}\]

**Interpretation:**

Relative to a prefix–free Turing–complete language \(U\) (or, equivalently, a universal prefix–free Turing machine), the number of fair coin flips required to generate the shortest program that outputs \(x\) is on the order of \(K_U(x)\). Thus, from a frequentist perspective, the entropy of the Universal ‘a priori’ Probability that we observe the event \(x\) is on the order of:

\[\begin{equation} -\log_2 P(x) \sim K_U(x) \end{equation}\]

It follows that Levin’s Coding theorem allows us to formalize the
notion of *entropy of an event*.

Given a discrete random variable \(X\) with computable probability distribution \(P\), it holds that

\[\begin{equation} \mathbb{E}[K_U(X)] = \sum_{x \in X} P(x) \cdot K_U(x) = H(X) + O(1) \end{equation}\]

where \(H(X)\) is the Shannon Entropy of \(X\) in base \(2\).

**Interpretation:**

The Shannon Entropy of a random variable in base \(2\) equals its Expected Kolmogorov Complexity up to a constant that becomes negligible in asymptotic analyses. This provides us with a precise answer to Von Neumann’s original question. Moreover, machine learning systems that minimise the \(KL\)–Divergence are implicitly applying Occam’s razor. But what exactly do we mean by a random variable?
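Since \(K_U\) is uncomputable, the constant gap in the theorem can be illustrated with Shannon code lengths \(\lceil -\log_2 P(x) \rceil\), which satisfy the Kraft inequality and are therefore realisable as prefix-free description lengths. A sketch with a hypothetical dyadic distribution:

```python
import math

# A computable distribution on a four-letter alphabet.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

H = -sum(p * math.log2(p) for p in P.values())            # Shannon entropy
# Expected Shannon code length: a realisable prefix-free description length.
L = sum(p * math.ceil(-math.log2(p)) for p in P.values())

print(H, L)  # dyadic probabilities: both equal 1.75; in general H <= L <= H + 1
```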

In a computable Universe, the sample space of a random variable \(X\) represents the state–space of a Turing Machine with unknown dynamics whose output sequence is computable. As the generated sequence is computable, it is finite–state incompressible in the worst case, i.e. a normal number. Hence, a random variable corresponds to a stochastic source that is finite–state random.

This definition comes from a well–known correspondence between finite–state machines and normal numbers due to Lempel and Ziv (1978), which establishes that a sequence is normal if and only if it is incompressible relative to any finite–state machine.
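A minimal sketch of the Lempel-Ziv (1978) incremental parsing makes finite-state compressibility tangible: a periodic sequence parses into far fewer phrases than a pseudo-random one of the same length. The sequences below are illustrative:

```python
import random

def lz78_phrase_count(s: str) -> int:
    # Incremental parsing: extend the current phrase until it is new,
    # then add it to the dictionary and start a fresh phrase.
    seen, phrase, count = set(), "", 0
    for ch in s:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    return count + (1 if phrase else 0)   # possible dangling final phrase

periodic = "01" * 2048                    # finite-state compressible
rng = random.Random(0)
noisy = "".join(rng.choice("01") for _ in range(4096))

print(lz78_phrase_count(periodic), lz78_phrase_count(noisy))
```

The phrase count, times the bits needed per phrase, governs the length of the LZ78 code, so fewer phrases means a shorter finite-state description.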

*Here we consider a non-trivial application due to Paul Erdős, who
used combinatorial arguments, and Ioannis Kontoyiannis, who found an
information-theoretic formulation of Erdős’ proof [7]. In essence, this
proof demonstrates that the information content of finitely many primes
is insufficient to generate all the integers.*

Let \(\pi(N)\) be the number of primes less than or equal to a given natural number \(N\). Suppose that the set of primes \(\mathbb{P}\) is finite, so that \(\mathbb{P}=\{p_i\}_{i=1}^{\pi(N)}\), where \(\pi(N)\) is constant for \(N\) big enough. Then we can define a uniform integer–valued random variable \(Z \sim U([1,N])\), such that

\[\begin{equation} Z = \big(\prod_{i=1}^{\pi(N)} p_i^{X_i}\big) \cdot Y^2 \tag{1} \end{equation}\]

for some integer–valued random variables \(1 \leq Y \leq N\) and \(X_i \in \{0,1\}\), such that \(Z/Y^2\) is square–free. In particular, we have that \(Y \leq \sqrt{N}\), thus the upper bound for Shannon’s Entropy from Jensen’s inequality implies:

\[\begin{equation} H(Y) \leq \log_2 \sqrt{N} = \frac{1}{2} \log_2 N \tag{2} \end{equation}\]
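The entropy bound invoked here follows from Jensen's inequality applied to the concave \(\log_2\), since the expectation of \(1/P(Y)\) counts the support of \(Y\):

\[ H(Y) = \mathbb{E}\left[\log_2 \frac{1}{P(Y)}\right] \leq \log_2 \mathbb{E}\left[\frac{1}{P(Y)}\right] = \log_2 \lvert \mathrm{supp}(Y) \rvert \leq \log_2 \sqrt{N} \]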

Also, since \(X_i\) is a binary variable, we have \(H(X_i) \leq 1\).

Using Kolmogorov’s definition of Entropy, we obtain the asymptotic relation:

\[\begin{equation} \mathbb{E}[K_U(Z)] \sim H(Z) = \log_2 N \tag{3} \end{equation}\]

and we may readily obtain the following inequality:

\[\begin{equation} H(Z) = H(Y, X_1, \ldots, X_{\pi(N)}) \leq H(Y) + \sum^{\pi(N)}_{i=1} H(X_i) \leq \frac{1}{2} \log_2 N + \pi(N) \tag{4} \end{equation}\]

which implies:

\[\begin{equation} \pi(N) \geq \frac{1}{2} \log_2 N \tag{5} \end{equation}\]

This clearly contradicts the assumption that \(\pi(N)\) is a constant for any natural \(N\), and provides us with an effective lower bound on the prime counting function.
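Both ingredients of the argument can be checked numerically: the decomposition (1) of any integer into a square-free part times a perfect square, and the resulting bound (5) on the prime counting function. A sketch (function names are illustrative):

```python
import math

def primes_upto(n: int) -> list:
    # Sieve of Eratosthenes.
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, math.isqrt(n) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [p for p in range(2, n + 1) if sieve[p]]

def squarefree_times_square(z: int):
    # Write z = q * y**2 with q square-free: q records the exponent
    # parities X_i, and y plays the role of Y in equation (1).
    y, d = 1, 2
    while d * d <= z:
        while z % (d * d) == 0:
            z //= d * d
            y *= d
        d += 1
    return z, y

# Every integer in [1, 1000] decomposes as claimed, with Y^2 <= Z.
for z in range(1, 1001):
    q, y = squarefree_times_square(z)
    assert q * y * y == z and y * y <= z

# The information-theoretic bound (5): pi(N) >= (1/2) * log2(N).
N = 10_000
assert len(primes_upto(N)) >= 0.5 * math.log2(N)   # 1229 vs ~6.64
```

Of course, the bound (5) is far weaker than the Prime Number Theorem; its value lies in how little machinery the entropy argument requires.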

*The following is taken from [5].*

From the theory of compilers, it is known that for any two Turing-Complete languages \(U_1\) and \(U_2\), there exists a compiler \(\Lambda_1\) expressed in \(U_1\) that translates programs expressed in \(U_2\) into functionally-equivalent programs expressed in \(U_1\).

It follows that if we let \(p\) be the shortest program that prints a given string \(x\) then:

\[\begin{equation} K_{U_1}(x) \leq |\Lambda_1| + |p| \leq K_{U_2}(x) + \mathcal{O}(1) \tag{1} \end{equation}\]

where \(|\Lambda_1| = \mathcal{O}(1)\), and by symmetry we obtain the opposite inequality.

This is an immediate consequence of the Kraft-McMillan inequality.

Kraft’s inequality states that, given desired codeword lengths \(\{k_i\}_{i=1}^n\), there exists a prefix code with codewords \(\{\sigma_i\}_{i=1}^n\) satisfying \(\forall i, |\sigma_i|=k_i\) if and only if:

\[\begin{equation} \sum_{i=1}^n s^{-k_i} \leq 1 \tag{1} \end{equation}\]

where \(s\) is the size of the alphabet \(S\).

Without loss of generality, let’s suppose we may order the \(k_i\) such that:

\[\begin{equation} k_1 \leq k_2 \leq ... \leq k_n \tag{2} \end{equation}\]

Now, a prefix code exists if and only if at each step \(j\) there is at least one codeword to choose that does not contain any of the previous \(j-1\) codewords as a prefix. For each codeword \(\sigma_i\) chosen at a previous step \(i<j\), exactly \(s^{k_j-k_i}\) candidate codewords of length \(k_j\) are forbidden, as they contain \(\sigma_i\) as a prefix. It follows that a prefix code exists if and only if:

\[\begin{equation} \forall j \geq 2, s^{k_j} > \sum_{i=1}^{j-1} s^{k_j - k_i} \tag{3} \end{equation}\]

Since both sides of (3) are integers, the strict inequality is equivalent to \(1 + \sum_{i=1}^{j-1} s^{k_j - k_i} \leq s^{k_j}\). Dividing by \(s^{k_j}\) and taking \(j = n\), we find:

\[\begin{equation} \sum_{i=1}^n s^{-k_i} \leq 1 \tag{4} \end{equation}\]

QED.

Let’s suppose that i.i.d. data \(x_{1:n} \in \mathcal{A}^{(n)}\) are generated by sampling from the distribution \(P_X\) on the finite set \(\mathcal{A}\). Then it may be demonstrated that Expected Kolmogorov Complexity equals Shannon Entropy up to an additive constant:

\[\begin{equation} \sum_{x_{1:n} \in \mathcal{A}^{(n)}} P(X^{(n)}=x_{1:n}) \cdot K_U(X^{(n)} = x_{1:n}) = H(X^{(n)}) + \mathcal{O}(1) \tag{1} \end{equation}\]

Now, by definition:

\[\begin{equation} H(X^{(n)}) = -\sum_{x_{1:n} \in \mathcal{A}^n} P(X^{(n)}=x_{1:n}) \cdot \log_2 P(X^{(n)}=x_{1:n}) \tag{2} \end{equation}\]

where \(P\) is a Universal Distribution that holds for our particular Observable Universe so that: \[\begin{equation} -\log_2 P(X^{(n)}=x_{1:n}) = K_U(X^{(n)}=x_{1:n}) -\mathcal{O}(1) \tag{3} \end{equation}\]

and Unitarity is guaranteed through an Oracle that identifies it with the Universal Wave Function, as \(U\) is a computer that simulates the Observable Universe.

If we carefully consider the Asymptotic Equipartition Theorem,

\[\begin{equation} \lim_{n \to \infty} P(x_{1:n} \in \mathcal{A_{\epsilon}}^{(n)}) = \lim_{n \to \infty} P\Big(\Big\lvert \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{P(X_i)}-H(X) \Big\rvert \leq \epsilon \Big) = 1 \tag{4} \end{equation}\]

we may define a natural measure on the atypical set \(E_n = \mathcal{A}^{(n)} \setminus \mathcal{A_{\epsilon}}^{(n)}\):

\[\begin{equation} \mu_n = \frac{1}{\lvert E_n \rvert} \sum_{x_{1:n} \in E_n} P(X^{(n)}=x_{1:n}) \tag{5} \end{equation}\]

Thus, we may observe that \(\lim_{n \to \infty} \mu_n = 0\) so there must exist a positive exponent \(\alpha > 1\) such that as \(n \to \infty\):

\[\begin{equation} \mu_n^{-1} = \mathcal{O}\big(\lvert E_n \rvert^{\alpha} \big) \tag{6} \end{equation}\]

Moreover, given that \(K_U(X^{(n)}=x_{1:n}) \leq \log_2 |E_n|\) for \(x_{1:n} \in E_n\):

\[\begin{equation} \sum_{x_{1:n} \in E_n} P(X^{(n)}=x_{1:n}) \cdot K_U(X^{(n)}=x_{1:n}) \leq \mu_n \cdot |E_n| \cdot \log_2 |E_n| = \mathcal{O}\big(\lvert E_n \rvert^{-\alpha} \big) \tag{7} \end{equation}\]

Hence, we may deduce:

\[\begin{equation} \forall x_{1:n} \in \mathcal{A_{\epsilon}}^{(n)}, K_U(X^{(n)}=x_{1:n}) \sim \log_2 |\mathcal{A_{\epsilon}}^{(n)}| \implies \mathbb{E}[K_U(X^{(n)})] \sim \log_2 |\mathcal{A_{\epsilon}}^{(n)}| \tag{8} \end{equation}\]

Finally, due to the Asymptotic Equipartition Theorem we may determine that:

\[\begin{equation} H(X^{(n)}) \sim \mathbb{E}[K_U(X^{(n)})] \sim \log_2 |\mathcal{A_{\epsilon}}^{(n)}| \tag{9} \end{equation}\]

QED.

If we carefully consider propositions (8) and (9), we may deduce:

\[\begin{equation} \forall x_{1:n} \in \mathcal{A_{\epsilon}}^{(n)}, -\log_2 P(X^{(n)}=x_{1:n}) \sim \log_2 |\mathcal{A_{\epsilon}}^{(n)}| \tag{10} \end{equation}\]

\[\begin{equation} \forall x_{1:n} \in \mathcal{A_{\epsilon}}^{(n)}, K_U(X^{(n)}=x_{1:n}) \sim \log_2 |\mathcal{A_{\epsilon}}^{(n)}| \tag{11} \end{equation}\]

Thus, we may derive Levin’s Coding theorem for i.i.d. time-series data:

\[\begin{equation} -\log_2 P(X^{(n)}=x_{1:n}) \sim K_U(X^{(n)}=x_{1:n}) \sim \log_2 |\mathcal{A_{\epsilon}}^{(n)}| \tag{12} \end{equation}\]

which may be readily generalised to non-stationary data via the Asymptotic Equipartition Theorem.

Why would Levin’s Universal Distribution effectively describe the Universal ‘A Priori’ Probability that an event (with a prefix-free encoding) is observed in the Physical Sciences? Markus Müller, a physicist at the Vienna Institute of Quantum Information, provides us with a convincing explanation. In *Law without Law* [8], Müller uses Kolmogorov’s theory of Algorithmic Probability to derive a combinatorial reformulation of the Universal Wave Function where Levin’s Universal Distribution predicts the statistical behavior of Physical Reality.

Considering Markus Müller’s sound analysis, it is worth carefully considering the priority of the Universal Wave Function as the fundamental reason for the reliability of Occam’s razor as a scientific method.

1. Marcus Hutter et al. (2007). Algorithmic probability. Scholarpedia, 2(8):2572.
2. R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1–22 and 224–254, 1964.
3. L. A. Levin. Laws of information conservation (non-growth) and aspects of the foundation of probability theory. Problems of Information Transmission, 10(3):206–210, 1974.
4. Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 2019.
5. A. N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russian Math. Surveys, 1983.
6. J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 1978.
7. Ioannis Kontoyiannis. Some information-theoretic computations related to the distribution of prime numbers. In Peter Grünwald, Petri Myllymäki, Ioana Tabus, Marcelo Weinberger, and Bin Yu, editors, Festschrift in Honor of Jorma Rissanen, pages 135–143. Tampere University Press, May 2008.
8. Markus Müller. Law without law: from observer states to physics via algorithmic information theory. arXiv, 2020.

For attribution, please cite this work as

Rocke (2023, Aug. 2). Kepler Lounge: Kolmogorov's theory of Algorithmic Probability. Retrieved from keplerlounge.com

BibTeX citation

@misc{rocke2023kolmogorov,
  author = {Rocke, Aidan},
  title = {Kepler Lounge: Kolmogorov's theory of Algorithmic Probability},
  url = {keplerlounge.com},
  year = {2023}
}