An introduction to Kolmogorov’s theory of Algorithmic Probability, which clarifies the notion of entropy, and its application to the game of 20 questions.

My greatest concern was what to call it. I thought of calling it ‘information,’ but the word was overly used, so I decided to call it ‘uncertainty.’ When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.’-Claude Shannon

You meet a French mathematician at the Reykjavik airport. He has a million things on his mind, but at any given moment he is thinking of only one of them. What is the minimum number of yes/no questions, asked in sequence, that you would need to determine what he is thinking about?

\[\begin{equation} \log_2 (10^6) \approx 20 \tag{1} \end{equation}\]
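A quick sanity check of estimate (1):

```python
import math

# Twenty yes/no answers distinguish 2**20 alternatives, which
# comfortably covers a million possible thoughts.
bits_needed = math.log2(10**6)
print(bits_needed)     # ≈ 19.93
print(2**20 >= 10**6)  # True
```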

So we might as well play a game of 20 questions. Indeed, the popularity of this game suggests that any human concept can be described using about 20 bits of information. If we can solve this particular inductive problem, might it be possible to solve the general problem of scientific induction?

Using Kolmogorov’s theory of Algorithmic Probability, we may apply Occam’s razor to any problem of scientific induction, including the sequential game of 20 questions. However, it is easy to forget that this required overcoming a seemingly insurmountable scientific obstacle, one which dates back to Von Neumann’s remark, quoted above, that no one really knows what entropy is.

This was accomplished through an ingenious combination of Shannon’s theory of Communication and Alan Turing’s theory of Computation. What emerged from this synthesis is the most powerful generalisation of Shannon’s theory: Kolmogorov Complexity and Shannon Entropy share the same units, and Kolmogorov Complexity elucidates the Shannon Entropy of a random variable as its Expected Description Length. Furthermore, by assuming that the Physical Church-Turing thesis is true, Kolmogorov’s theory of Algorithmic Probability formalizes Occam’s razor as it is applied in the natural sciences.

In the case of our game, we may formulate Occam’s razor using the three fundamental theorems of Algorithmic Probability, and then solve the game of 20 questions by approximating Kolmogorov Complexity, which is only limit-computable, with Huffman Coding.

The implicit assumption here is that you, the second player, are able to encode the knowledge of the first player because you share a similar education and culture.

*All proofs are in the Appendix.*

Kolmogorov’s Invariance theorem clarifies that the Kolmogorov Complexity of a dataset is invariant to the choice of Turing-Complete language used to simulate a Universal Turing Machine:

\[\begin{equation} \forall x \in \{0,1\}^*, \lvert K_U(x)-K_{U'}(x) \rvert \leq \mathcal{O}(1) \tag{2} \end{equation}\]

where \(K_U(x) = \min_{p} \{|p|: U(p) = x\}\).

The minimal description \(p\) such that \(U \circ p = x\) serves as a natural representation of the string \(x\) relative to the Turing-Complete language \(U\). Moreover, as \(x\) can’t be compressed further, \(p\) is an incompressible and hence algorithmically random string. This corresponds to a physicist’s intuition for randomness and clarifies why Kolmogorov Complexity is not computable.

It follows that any piece of data has a necessary and sufficient representation in terms of a random string.

Levin’s Universal Distribution effectively formalizes Occam’s razor which solves the age-old problem of scientific induction. Given that any uniquely-decodable code satisfies the Kraft-McMillan inequality, prefix-free Kolmogorov Complexity allows us to derive the Universal Distribution:

\[\begin{equation} P(x) = \sum_{U \circ p = x} 2^{-K_U(p)} \leq 1 \tag{3} \end{equation}\]

where the fact that \(U\) may simulate a prefix-free UTM implies that for two distinct descriptions \(p\) and \(p'\), \(p\) isn’t a substring of \(p'\) and \(p'\) isn’t a substring of \(p\).

In a Computable Universe, given a phenomenon with encoding \(x \in \{0,1\}^*\) generated by a physical process, the probability of that phenomenon is well-defined and equal to the sum of the probabilities of its distinct and independent causes. The prefix-free criterion is precisely what guarantees causal independence.

Furthermore, Levin’s Universal Distribution formalizes Occam’s razor: the most likely cause of the process is provided by the minimal description of \(x\), while lengthier explanations are less probable causes.

Using Levin’s Coding theorem, the Algorithmic Probability of a data stream \(x \in \{0,1\}^*\) is defined in terms of its Kolmogorov Complexity:

\[\begin{equation} -\log_2 P(x) = K_U(x) + \mathcal{O}(1) \tag{4} \end{equation}\]

and due to Kolmogorov’s Invariance theorem this probability is independent of our choice of the reference Universal Turing Machine, or Turing-Complete language, used to describe \(x \in \{0,1\}^*\).

In a Computable Universe, all probabilities are of a deterministic and frequentist nature and so all stochastic systems have ergodic dynamics. In fact, Levin’s Coding theorem is the most concise summary of Occam’s razor as it tells us that the typical cause of a process \(x\) is provided by the minimal description of \(x\). Hence, the most probable scientific hypothesis ought to be as simple as possible but no simpler.

Although Algorithmic Probability is not computable, since Kolmogorov Complexity is not computable, Levin’s Coding theorem provides us with a framework for machine learning as data compression: we may predict the future behaviour of a physical system by compressing its historical data.

Given a discrete random variable \(X\) with computable probability distribution \(f\), it may be shown that:

\[\begin{equation} \mathbb{E}[K_U(X)] = \sum_{x \sim f(X)} f_x \cdot K_U(x) = H(X) + \mathcal{O}(1) \tag{5} \end{equation}\]

where \(H(X)\) is the Shannon Entropy of \(X\) in base-2.

The Shannon Entropy of a random variable in base-2 is therefore the Expected Description Length of this random variable. Hence, machine learning systems that minimise the KL-Divergence are implicitly applying Occam’s razor.
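The link between KL-Divergence minimisation and description length rests on the decomposition of cross-entropy into entropy plus KL-Divergence, which is easy to check numerically (the two distributions below are illustrative values, not taken from the text):

```python
import math

# Two illustrative distributions over the same three outcomes.
f = [0.5, 0.25, 0.25]   # "true" distribution
g = [0.4, 0.4, 0.2]     # model distribution

H_f = -sum(p * math.log2(p) for p in f)               # entropy H(f)
cross = -sum(p * math.log2(q) for p, q in zip(f, g))  # cross-entropy H(f, g)
kl = sum(p * math.log2(p / q) for p, q in zip(f, g))  # D_KL(f || g)

# Cross-entropy = entropy + KL-Divergence, so minimising the KL-Divergence
# minimises the expected description length of data drawn from f.
assert abs(cross - (H_f + kl)) < 1e-12
```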

The game of 20 questions is played between Alice and Bob who are both assumed to be trustworthy and rational. Thus, Alice and Bob both perform sampling and inference using Levin’s Universal Distribution \(P\) over a shared set of symbols \(\mathcal{A}=\{a_i\}_{i=1}^n\).

For the sake of convenience, we shall assume that \(\mathcal{A}\) represents entries in the Encyclopædia Britannica and \(\log_2 n \approx 20\).

Bob selects an object \(a \in \mathcal{A}\) and Alice determines the object by asking Yes/No queries, encoded using a prefix-code, sampled from \(P(\mathcal{A})\).

Alice’s goal is to minimize the expected number of queries which is equivalent to determining: \[\begin{equation} X \sim P(\mathcal{A}), \mathbb{E}[K_U(X)] = H(X) + \mathcal{O}(1) \tag{*} \end{equation}\]

In this setting, Shannon Entropy may be understood as a measure of hidden information and we shall show that (*) has a solution using Huffman Coding.

The game of 20 questions is a special case of Huffman Coding:

**Input:**

A set \(\mathcal{A}=\{a_i\}_{i=1}^n\) of symbols and a discrete probability distribution \(P=\{p_i\}_{i=1}^n\) which describes the typical frequency of each symbol.

**Output:**

A prefix-free code \(C(P)=\{c_i\}_{i=1}^n\) where \(c_i \in \{0,1\}^*\) is the codeword for \(a_i\).

**Goal:**

Let \(\mathcal{L}(C(P))= \sum_{i=1}^n p_i \cdot \lvert c_i \rvert\) be the weighted length of code \(C\). We want to find \(C\) such that \(\mathcal{L}(C(P))\) is minimized.

If the Universal Description Language \(U\) is prefix-free, we may note that:

\[\begin{equation} X \sim P(\mathcal{A}), H(X) \sim \mathbb{E}[K_U(X)] = \sum_{i=1}^n p_i \cdot K_U(c_i) \leq \sum_{i=1}^n p_i \cdot \lvert c_i \rvert \tag{6} \end{equation}\]

so we have:

\[\begin{equation} X \sim P(\mathcal{A}), H(X) \leq \sum_{i=1}^n p_i \cdot \lvert c_i \rvert \tag{7} \end{equation}\]

Hence, Huffman Coding is an entropy coding method that provides us with a robust approximation to the Expected Kolmogorov Complexity.
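A small numerical check of the entropy bound (7), using an illustrative dyadic distribution for which the bound is met exactly:

```python
import math

# A toy distribution over four symbols (illustrative values, not from the text).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon Entropy H(X) in base-2.
H = -sum(q * math.log2(q) for q in p.values())

# Codeword lengths ceil(-log2 p_i) always satisfy the Kraft-McMillan
# inequality, so a prefix code with these lengths exists; its weighted
# length upper-bounds H(X).
lengths = {s: math.ceil(-math.log2(q)) for s, q in p.items()}
expected_length = sum(q * lengths[s] for s, q in p.items())

print(H)                # 1.75
print(expected_length)  # 1.75 — dyadic probabilities achieve the bound exactly
assert H <= expected_length
```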

The technique works by building a binary tree of nodes whose leaf nodes represent the symbols in the input data.

A node may be either a leaf node or an internal node.

Initially, all nodes are leaf nodes which represent the symbol itself and its typical frequency.

Internal nodes represent links to two child nodes and the sum of their frequencies.

As a convention, bit ‘0’ represents following the left child and bit ‘1’ represents following the right child.

The simplest construction uses a priority queue where the node with the lowest probability is given the highest priority:

- Create a leaf node for each symbol and add it to the priority queue.
- While there is more than one node in the queue:
  - Remove the two nodes of highest priority (i.e. lowest probability) from the queue.
  - Create a new internal node with these two nodes as children, whose probability is the sum of the two nodes’ probabilities.
  - Add the new node to the queue.
- The remaining node is the root node and the tree is complete.

The reader may verify that this algorithm runs in \(\mathcal{O}(n \log n)\) time, as each of the \(\mathcal{O}(n)\) priority-queue operations costs \(\mathcal{O}(\log n)\).
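The procedure above may be sketched in Python (an illustrative implementation; the symbol probabilities are made-up values):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free code from {symbol: probability} via Huffman's algorithm."""
    # Each heap entry is (probability, tie-breaker, tree), where a tree is
    # either a symbol (leaf node) or a (left, right) pair (internal node).
    tiebreak = count()
    heap = [(p, next(tiebreak), sym) for sym, p in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Remove the two lowest-probability nodes and merge them.
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))
    # Walk the tree: bit '0' follows the left child, bit '1' the right child.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # degenerate single-symbol case
    _, _, root = heap[0]
    walk(root, "")
    return codes

codes = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(codes)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

Note that the most probable symbol receives the shortest codeword, exactly as Occam’s razor demands.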

Does the solution found via Huffman Coding agree with our intuitions?

Assuming that internal nodes are given labels \(v \in [1, 2 \cdot \lvert \mathcal{A} \rvert]\) while leaf nodes are given labels \(c_i \in \{0,1\}^*\) the information gained from any sequence of queries may be determined from the entropy formula:

\[\begin{equation} S \subset [1, 2 \cdot \lvert \mathcal{A} \rvert], H(S) = \sum_{i \in S} - f_i \cdot \log_2 f_i \tag{8} \end{equation}\]

where the order of the internal nodes may be determined by sorting the vertices \(i \in S\) with respect to their un-normalised entropies \(\{-\log_2 f_i \}_{i \in S}\). In principle, children of a parent node represent refinements of a similar concept so tree depth represents depth of understanding. This degree of understanding may be measured in terms of entropy \(- f_i \cdot \log_2 f_i\). Hence, we have a satisfactory solution to the Game of 20 questions.

Zooming out, we may consider the ultimate impact of Kolmogorov’s formalisation of scientific induction which Kolmogorov foretold [2]:

Using his brain, as given by the Lord, a mathematician may not be interested in the combinatorial basis of his work. But the artificial intellect of machines must be created by man, and man has to plunge into the indispensable combinatorial mathematics.-Kolmogorov

In fact, Kolmogorov’s theory of Algorithmic Probability may be viewed as a complete theory of machine epistemology. As for what may potentially limit the scope of machine epistemology relative to human epistemology, I recommend considering the *big questions* section of [8].

*The following is taken from [4].*

From the theory of compilers, it is known that for any two Turing-Complete languages \(U_1\) and \(U_2\), there exists a compiler \(\Lambda_1\) expressed in \(U_1\) that translates programs expressed in \(U_2\) into functionally-equivalent programs expressed in \(U_1\).

It follows that if we let \(p\) be the shortest program, expressed in \(U_2\), that prints a given string \(x\), then:

\[\begin{equation} K_{U_1}(x) \leq |\Lambda_1| + |p| \leq K_{U_2}(x) + \mathcal{O}(1) \tag{9} \end{equation}\]

where \(|\Lambda_1| = \mathcal{O}(1)\), and by symmetry we obtain the opposite inequality.

This is an immediate consequence of the Kraft-McMillan inequality.

Kraft’s inequality states that, given a sequence of desired codeword lengths \(\{k_i\}_{i=1}^n\), there exists a prefix code with codewords \(\{\sigma_i\}_{i=1}^n\) satisfying \(\forall i, |\sigma_i|=k_i\) if and only if:

\[\begin{equation} \sum_{i=1}^n s^{-k_i} \leq 1 \tag{10} \end{equation}\]

where \(s\) is the size of the alphabet \(S\).

Without loss of generality, let’s suppose we may order the \(k_i\) such that:

\[\begin{equation} k_1 \leq k_2 \leq ... \leq k_n \tag{11} \end{equation}\]

Now, there exists a prefix code if and only if at each step \(j\) there is at least one codeword to choose that does not contain any of the previous \(j-1\) codewords as a prefix. Due to the existence of a codeword at a previous step \(i<j\), \(s^{k_j-k_i}\) codewords of length \(k_j\) are forbidden as they contain \(\sigma_i\) as a prefix. It follows that in general a prefix code exists if and only if:

\[\begin{equation} \forall j \geq 2, s^{k_j} > \sum_{i=1}^{j-1} s^{k_j - k_i} \tag{12} \end{equation}\]

Since both sides are integers, the strict inequality implies \(s^{k_j} \geq 1 + \sum_{i=1}^{j-1} s^{k_j-k_i}\). Dividing both sides by \(s^{k_j}\) and taking \(j=n\), we find:

\[\begin{equation} \sum_{i=1}^n s^{-k_i} \leq 1 \tag{13} \end{equation}\]

QED.
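Kraft’s inequality is easy to check numerically for concrete codeword lengths; a small sketch with illustrative binary lengths:

```python
def kraft_sum(lengths, s=2):
    """Left-hand side of Kraft's inequality for codeword lengths over an s-ary alphabet."""
    return sum(s ** (-k) for k in lengths)

# Lengths of the binary prefix code {0, 10, 110, 111}:
# the sum equals exactly 1, so the code is complete.
assert kraft_sum([1, 2, 3, 3]) <= 1

# Lengths [1, 1, 2] cannot come from any binary prefix code.
assert kraft_sum([1, 1, 2]) > 1
```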

We may begin by observing that \(P(x)\) defines the ‘a priori’ frequency with which the prefix-free UTM \(U\) outputs \(x\) when its input is a program of length \(K_U(p)\) generated by an Oracle capable of Universal Levin Search. From the vantage point of \(U\), the description of this program is incompressible, or algorithmically random, so it may as well be generated by:

\[\begin{equation} K_U(x) \leq K_U(p) \leq K_U(x) + \mathcal{O}(1) \tag{14} \end{equation}\]

coin flips, where the \(\mathcal{O}(1)\) term comes from Kolmogorov’s Invariance theorem.

Hence, we may observe the upper-bound:

\[\begin{equation} -\log_2 P(x) \leq K_U(x) + \mathcal{O}(1) \tag{15} \end{equation}\]

Likewise, we have the lower-bound:

\[\begin{equation} -\log_2 P(x) \geq K_U(x) \tag{16} \end{equation}\]

Thus, we may deduce that:

\[\begin{equation} -\log_2 P(x) = K_U(x) + \mathcal{O}(1) \tag{17} \end{equation}\]

Conversely, if we consider the natural correspondence between entropy (or information content) and algorithmic probability:

\[\begin{equation} P(x) = \sum_{U \circ p = x} P(U \circ p = x) = \sum_{U \circ p =x} 2^{-K_U(p)} \tag{18} \end{equation}\]

we may derive this result using the Shannon source coding theorem by considering the entropy of the source distribution \(P(x)\).

While the entropy or information content of \(x\) is given by the shortest program of length \(K_U(x)\) that contains the necessary and sufficient axioms for generating \(x\), any program \(p\) satisfying \(U \circ p = x\) where:

\[\begin{equation} |p| - K_U(x) > 0 \tag{19} \end{equation}\]

must have a finite number of additional axioms or hypotheses that are independent of \(U\) and \(x\).

As these additional axioms may be necessary with respect to a different Turing-Complete language \(U'\), we may bound the information content of \(p\) by applying Kolmogorov’s Invariance theorem:

\[\begin{equation} |p| - K_U(x) \leq \mathcal{O}(1) \tag{20} \end{equation}\]

Hence, if we run this coin flipping experiment infinitely many times by sampling each program \(p\) in an i.i.d. manner from Levin’s Universal Distribution:

\[\begin{equation} \forall X \sim P(x), -\log_2 P(x) = \lim_{N \to \infty} \frac{-\log_2 P(X_1,...,X_N)}{N} = \lim_{N \to \infty} \frac{\sum_{i=1}^N -\log_2 P(X_i)}{N} = K_U(x) + \mathcal{O}(1) \tag{21} \end{equation}\]

QED.

Let’s suppose we have a computable probability distribution,

\[\begin{equation} x \sim f(X), \sum_x f(X=x) = 1 \tag{22} \end{equation}\]

In a computable Universe, where all physical laws may be simulated by a UTM, \(f(X=x)\) is equivalent to the a priori frequency with which we observe the event \(X=x\). Hence, if \(P(x)\) is the Algorithmic Probability of \(x\) defined via Levin’s Coding theorem:

\[\begin{equation} -\log_2 P(x) = K_U(x) + \mathcal{O}(1) \tag{23} \end{equation}\]

then the combination of Kolmogorov’s Invariance theorem and the Shannon Source coding theorem allows us to determine that \(-\log_2 f(x)\) is independent of the choice of UTM that simulates the Universe.

In fact, by sampling over independent causes \(p\) in an i.i.d. manner from Levin’s Universal Distribution we may deduce that the a priori frequency must satisfy:

\[\begin{equation} \forall X \sim P(x), -\log_2 f(x) = \lim_{N \to \infty} \frac{-\log_2 P(X_1,...,X_N)}{N} = K_U(x) + \mathcal{O}(1) \tag{24} \end{equation}\]

Using the last two results, it follows that the Expected Kolmogorov Complexity of a random variable \(X\) is equivalent to its Shannon Entropy up to an additive constant:

\[\begin{equation} \mathbb{E}[K_U(X)] = \sum_{x \sim f(X)} f(x) \cdot K_U(X=x) = -\sum_{x \sim f(X)} f(x) \cdot \log_2 f(x) + \mathcal{O}(1) = H(X) + \mathcal{O}(1) \tag{25} \end{equation}\]

QED.

Assuming that the evolution of the Quantum State of the Universe may be simulated by the Schrödinger equation, Kolmogorov’s theory of Algorithmic Probability provides us with an elegant mathematical description of what a particular physicist observes during a Quantum Measurement. Interestingly, this description of non-computable measurements is in qualitative agreement with the Von Neumann-Wigner theory of Quantum Measurement.

- Through the paradox of the Von Neumann chain, the Von Neumann-Wigner interpretation of Quantum Mechanics posits that a conscious observer must lie beyond Quantum Computations:

If an observer is a purely physical object, a more comprehensive wave function may now be expressed which encompasses both the state of the Quantum system being observed and the state of the observer. The various possible measurements are now in a superposition of states, representing different observations. However, this leads to a problem: you would now need another measuring device to collapse this larger wave function but this would develop into a superposition state. Another device would be needed to collapse this state ad infinitum. This problem-the Von Neumann chain-is an infinite regression of measuring devices whose stopping point is presumed to be the conscious mind. -Aeowyn Kendall

Von Neumann’s theory of Quantum measurement may thus be summarised as follows: (1) The Quantum state of a system generally evolves smoothly as dictated by the Schrödinger wave equation. (2) Otherwise, the Quantum State of this system collapses suddenly and sharply due to a conscious observer.

If we consider that there is a Quantum State associated with the Universe and combine this with the Kantian view that the mind interprets and generates the world, then what a person observes may be defined by the Algorithmic Probability \(P(x|\hat{x})= 2^{-K_U(x|\hat{x})}\) where \(\hat{x}\) denotes the Qualia or conscious state of a person and \(x\) denotes the observations of this person or one of many mutually exclusive Quantum branches of the Universal Wave Function. As Kolmogorov Complexity is not computable, what a particular physicist observes during a Quantum experiment may not be determined by a computable function such as the Schrödinger Wave equation.

This has a number of important consequences:

- Computability defines the epistemological constraint on a Quantum Physicist that can only determine the average frequencies of experimental outcomes. Hence, the process of conscious observation must lie outside the realm of computable functions including the Schrödinger wave equation.
- On a planet of eight billion people there are eight billion Universes, each evolving with their own Universal Wave Function.

- A hypothetical Quantum Computer would need an Oracle in the loop.

[1] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.

[2] A. N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russian Math. Surveys, 1983.

[3] Peter Grünwald and Paul Vitányi. Shannon Information and Kolmogorov Complexity. arXiv, 2004.

[4] Peter Grünwald and Paul Vitányi. Kolmogorov Complexity and Information Theory: With an Interpretation in Terms of Questions and Answers. Journal of Logic, Language, and Information, 2003.

[5] Peter Grünwald and Paul Vitányi. Algorithmic Information Theory. arXiv, 2008.

[6] L. A. Levin. Laws of information conservation (non-growth) and aspects of the foundation of probability theory. Problems Inform. Transmission, 10:206–210, 1974.

[7] Marcus Hutter et al. Algorithmic probability. Scholarpedia, 2(8):2572, 2007.

[8] Walter Kirchherr, Ming Li and Paul Vitányi. The Miraculous Universal Distribution. The Mathematical Intelligencer, 1997.

[9] Yuval Dagan, Yuval Filmus et al. Twenty (simple) questions. arXiv, 2017.

[10] Marcus Hutter. A Theory of Universal Intelligence based on Algorithmic Complexity. 2000.

[11] Shane Legg and Marcus Hutter. Universal Intelligence: A Definition of Machine Intelligence. 2007.

[12] Aeowyn Kendall. Quantum Mechanics and its Broader Implications: The Von Neumann-Wigner interpretation. 2019.

[13] Hugh Everett. The Theory of the Universal Wave Function. Princeton University Press, 1957.

For attribution, please cite this work as

Rocke (2022, Oct. 6). Kepler Lounge: Kolmogorov's theory of Algorithmic Probability. Retrieved from keplerlounge.com

BibTeX citation

@misc{rocke2022kolmogorov's, author = {Rocke, Aidan}, title = {Kepler Lounge: Kolmogorov's theory of Algorithmic Probability}, url = {keplerlounge.com}, year = {2022} }