Kolmogorov’s theory of Algorithmic Probability

An introduction to Kolmogorov’s theory of Algorithmic Probability, which clarifies the notion of entropy, and its application to the game of 20 questions.

Aidan Rocke https://github.com/AidanRocke
10-06-2022

My greatest concern was what to call it. I thought of calling it ‘information,’ but the word was overly used, so I decided to call it ‘uncertainty.’ When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.’ - Claude Shannon

Motivation:

You meet a French mathematician at Reykjavik airport who has a million things on his mind, but at any moment he is only thinking of one of them. What is the minimum number of questions, asked in sequential order, that you would need to determine what he is thinking about? Since each Yes/No answer yields at most one bit of information, the answer is approximately:

\[\begin{equation} \log_2 (10^6) \approx 20 \tag{1} \end{equation}\]

So we might as well play a game of 20 questions. Moreover, the enduring popularity of this game suggests that any human concept may be described using roughly 20 bits of information.
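As a quick sanity check, the bound may be computed directly. This is a minimal sketch in Python; the figure of one million candidate thoughts is, of course, a caricature:

```python
import math

# Each Yes/No answer yields at most one bit of information, so
# distinguishing among a million equally likely possibilities requires:
num_questions = math.ceil(math.log2(10**6))
print(num_questions)  # 20
```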

Kolmogorov’s theory of Algorithmic Probability:

But how should we go about scientific induction? Using Kolmogorov’s theory of Algorithmic Probability, we may apply Occam’s razor to any problem of scientific induction, including the sequential game of 20 questions.

In fact, assuming that the Physical Church-Turing thesis is true, Kolmogorov’s theory of Algorithmic Probability formalizes Occam’s razor as it is applied in the natural sciences. This was accomplished through an ingenious combination of Shannon’s theory of Communication with Alan Turing’s theory of Computation.

Using Kolmogorov’s theory, we may formulate Occam’s razor via the three fundamental theorems of Algorithmic Probability, and then approximate Kolmogorov Complexity, which is limit-computable but not computable, using Huffman Coding in order to solve our game of 20 questions.

The implicit assumption here is that you, the second player, are able to encode the knowledge of this French mathematician because you have been through similar education systems.

The fundamental theorems of Algorithmic Probability:

The first fundamental theorem of Algorithmic Probability:

Kolmogorov’s Invariance theorem clarifies that the minimal description length of a dataset is invariant, up to an additive constant, to the choice of Universal Description Language:

\[\begin{equation} \forall x \in \{0,1\}^*, \lvert K_U(x)-K_{U'}(x) \rvert \leq \mathcal{O}(1) \tag{2} \end{equation}\]

where \(K_U(x) = \min_{p} \{|p|: U(p) = x\}\).
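A brief sketch of the standard argument may be useful here. Since \(U\) is universal, there exists an interpreter program \(\Lambda\), whose length does not depend on \(x\), such that \(U(\Lambda p) = U'(p)\) for every program \(p\). Hence:

\[K_U(x) \leq K_{U'}(x) + \lvert \Lambda \rvert\]

By symmetry, there is an interpreter \(\Lambda'\) in the opposite direction, so \(\lvert K_U(x) - K_{U'}(x) \rvert \leq \max(\lvert \Lambda \rvert, \lvert \Lambda' \rvert)\), a constant independent of \(x\).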

The second fundamental theorem of Algorithmic Probability:

Levin’s Universal Distribution formalizes Occam’s razor and thereby addresses the age-old problem of scientific induction. Given that any uniquely-decodable code satisfies the Kraft-McMillan inequality, prefix-free Kolmogorov Complexity allows us to derive the Universal Distribution:

\[\begin{equation} P(x) = \sum_{U \circ p = x} 2^{-\lvert p \rvert} \leq 1 \tag{3} \end{equation}\]

where the fact that the domain of \(U\) is prefix-free implies that for two distinct programs \(p\) and \(p'\) satisfying \(U \circ p = U \circ p' = x\) with \(\lvert p \rvert \leq \lvert p' \rvert\), \(p\) is not a prefix of \(p'\).
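For completeness, here is a sketch of how the Kraft-McMillan inequality enters. Because the halting programs of \(U\) form a prefix-free set, their lengths satisfy:

\[\sum_{p} 2^{-\lvert p \rvert} \leq 1\]

Summing over the programs that output a particular \(x\) therefore gives \(P(x) \leq 1\), and summing over all \(x\) shows that \(P\) is a semi-measure:

\[\sum_{x \in \{0,1\}^*} P(x) = \sum_{x \in \{0,1\}^*} \; \sum_{U \circ p = x} 2^{-\lvert p \rvert} \leq 1\]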

The third fundamental theorem of Algorithmic Probability:

Using Levin’s Coding theorem, we may deduce that the typical frequency of simple concepts is exponentially greater than the typical frequency of complex concepts:

\[\begin{equation} -\log_2 P(x) = K_U(x) + \mathcal{O}(1) \tag{4} \end{equation}\]

In fact, the Coding theorem is also the most concise summary of Occam’s razor, as it tells us that the most probable scientific hypothesis ought to be as simple as possible, but no simpler. Concretely, a concept whose minimal description requires ten extra bits is roughly \(2^{10} \approx 1000\) times less probable.

Maximum Entropy via Occam’s razor:

Given a discrete random variable \(X\) with computable probability distribution \(P\), it may be shown that:

\[\begin{equation} \mathbb{E}[K_U(X)] = \sum_{x} P(x) \cdot K_U(x) = H(X) + \mathcal{O}(1) \tag{5} \end{equation}\]

where \(H(X)\) is the Shannon Entropy of \(X\) in base-2. Thus, learning systems that minimise the KL-Divergence are implicitly applying Occam’s razor.
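A sketch of why (5) holds, following the correspondence between Shannon Information and Kolmogorov Complexity surveyed by Grünwald and Vitányi; the constant absorbed into \(\mathcal{O}(1)\) depends on the complexity of \(P\) itself, not on \(x\):

\[H(X) \leq \sum_{x} P(x) \cdot K_U(x) \leq \sum_{x} P(x) \cdot \big(-\log_2 P(x)\big) + K_U(P) + \mathcal{O}(1) = H(X) + K_U(P) + \mathcal{O}(1)\]

The lower bound follows from the Kraft-McMillan inequality, since \(K_U\) is the length function of a prefix-free code; the upper bound follows because \(U\), given a description of the computable distribution \(P\), can simulate a Shannon-Fano code for \(P\).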

The game of 20 questions:

  1. The game of 20 questions is played between Alice and Bob who are both assumed to be trustworthy and rational. Thus, Alice and Bob both perform sampling and inference using Levin’s Universal Distribution \(P\) over a shared set of symbols \(\mathcal{A}=\{a_i\}_{i=1}^n\).

  2. For the sake of convenience, we shall assume that \(\mathcal{A}\) represents entries in the Encyclopedia Britannica and \(\log_2 n \approx 20\).

  3. Bob samples an object \(a \in \mathcal{A}\) from \(P(\mathcal{A})\) and Alice determines the object by asking Yes/No queries, whose answers are encoded using a prefix-free code.

  4. Alice’s goal is to minimize the expected number of queries, which is equivalent to determining: \[\begin{equation} X \sim P(\mathcal{A}), \mathbb{E}[K_U(X)] = H(X) + \mathcal{O}(1) \tag{*} \end{equation}\]

  5. In this setting, Shannon Entropy may be understood as a measure of hidden information and we shall show that (*) has a solution using Huffman Coding.

The game of 20 questions, or Huffman Coding as an approximation of the Universal Distribution:

The game of 20 questions is a special case of Huffman Coding:

Input:

A set \(\mathcal{A}=\{a_i\}_{i=1}^n\) of symbols and a discrete probability distribution \(P=\{p_i\}_{i=1}^n\) which describes the typical frequency of each symbol.

Output:

A prefix-free code \(C(P)=\{c_i\}_{i=1}^n\) where \(c_i \in \{0,1\}^*\) is the codeword for \(a_i\).

Goal:

Let \(\mathcal{L}(C(P))= \sum_{i=1}^n p_i \cdot \lvert c_i \rvert\) be the weighted length of code \(C\). We want to find \(C\) such that \(\mathcal{L}(C(P))\) is minimized.

If the Universal Description Language \(U\) is prefix-free, we may note that:

\[\begin{equation} X \sim P(\mathcal{A}), H(X) \sim \mathbb{E}[K_U(X)] = \sum_{i=1}^n p_i \cdot K_U(a_i) \leq \sum_{i=1}^n p_i \cdot \lvert c_i \rvert \tag{6} \end{equation}\]

so we have:

\[\begin{equation} X \sim P(\mathcal{A}), H(X) \leq \sum_{i=1}^n p_i \cdot \lvert c_i \rvert \tag{7} \end{equation}\]

Hence, Huffman Coding is an entropy coding method that provides us with a computable approximation to the expected Kolmogorov Complexity.
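As a worked example under an assumed dyadic distribution (all probabilities are powers of two), the bound (7) is attained with equality:

\[P = \{\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\}, \quad C(P) = \{0, 10, 110, 111\}\]

\[H(X) = -\sum_{i=1}^4 p_i \log_2 p_i = 1.75 = \sum_{i=1}^4 p_i \cdot \lvert c_i \rvert = \mathcal{L}(C(P))\]

For non-dyadic distributions the bound is no longer tight, but Huffman Coding still guarantees \(\mathcal{L}(C(P)) < H(X) + 1\).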

Huffman Coding:

A description of Huffman Coding:

  1. The technique works by creating a binary tree of nodes, where the leaf nodes represent the symbols of the input alphabet.

  2. A node may be either a leaf node or an internal node.

  3. Initially, all nodes are leaf nodes which represent the symbol itself and its typical frequency.

  4. Internal nodes represent links to two child nodes and the sum of their frequencies.

  5. As a convention, bit ‘0’ represents following the left child and bit ‘1’ represents following the right child.

The simplest solution to Huffman Coding:

The simplest construction uses a priority queue where the node with the lowest probability is given the highest priority:

  1. Create a leaf node for each symbol and add it to the priority queue.

  2. While there is more than one node in the queue:

      1. Remove the two nodes of highest priority (i.e. lowest probability) from the queue.

      2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes’ probabilities.

      3. Add the new node to the queue.

  3. The remaining node is the root node and the tree is complete.

The reader may verify that, using a binary heap as the priority queue, the time complexity of this algorithm is \(\mathcal{O}(n \log n)\); it drops to \(\mathcal{O}(n)\) only if the symbols are already sorted by probability. A minimal sketch of the construction is given below.
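The following is a minimal, illustrative Python sketch of the priority-queue construction described above; the alphabet and its frequencies are hypothetical stand-ins for encyclopedia entries.

```python
import heapq
import math

def huffman_code(frequencies):
    """Build a prefix-free code from a dict mapping symbols to probabilities."""
    # Each heap entry is (probability, tie_breaker, tree); a tree is either
    # a symbol (leaf node) or a pair of subtrees (internal node).
    heap = [(p, i, symbol) for i, (symbol, p) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Remove the two nodes of lowest probability from the queue ...
        p1, _, left = heapq.heappop(heap)
        p2, _, right = heapq.heappop(heap)
        # ... and merge them into an internal node whose probability is the sum.
        heapq.heappush(heap, (p1 + p2, counter, (left, right)))
        counter += 1
    _, _, root = heap[0]

    # Walk the tree: bit '0' follows the left child, bit '1' the right child.
    codewords = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codewords[tree] = prefix or "0"
    walk(root, "")
    return codewords

# Hypothetical dyadic distribution over four encyclopedia entries:
P = {"molecule": 0.5, "theorem": 0.25, "volcano": 0.125, "glacier": 0.125}
C = huffman_code(P)
weighted_length = sum(P[s] * len(C[s]) for s in P)
entropy = -sum(p * math.log2(p) for p in P.values())
print(C)                          # e.g. {'molecule': '0', 'theorem': '10', ...}
print(weighted_length, entropy)   # both equal 1.75 for this dyadic P
```

Alice’s querying strategy then amounts to walking down this tree: each Yes/No question asks whether Bob’s object lies in the left subtree of the current node, so the expected number of questions equals the weighted codeword length.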

A discussion of Information Gained:

Does the solution found via Huffman Coding agree with our intuitions?

Assuming that internal nodes are given labels \(v \in [1, 2 \cdot \lvert \mathcal{A} \rvert]\) while leaf nodes are given labels \(c_i \in \{0,1\}^*\), and writing \(f_i\) for the probability mass of node \(i\), the information gained from any sequence of queries may be determined from the entropy formula:

\[\begin{equation} S \subset [1, 2 \cdot \lvert \mathcal{A} \rvert], H(S) = \sum_{i \in S} - f_i \cdot \log_2 f_i \tag{8} \end{equation}\]

where the order of the internal nodes may be determined by sorting the vertices \(i \in S\) with respect to their surprisals (un-normalised entropies) \(\{-\log_2 f_i \}_{i \in S}\).

In principle, the children of a parent node represent refinements of a similar concept, so tree depth represents depth of understanding. This degree of understanding may be measured in terms of the entropy contribution \(- f_i \cdot \log_2 f_i\). A small numerical illustration follows.
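To make this concrete, the sketch below tracks the probability mass of each node along a root-to-leaf path in the hypothetical dyadic example used earlier; the surprisal \(-\log_2 f\) grows by exactly the number of bits gained from each answer.

```python
import math

# Probability masses along the root-to-leaf path for the codeword '111'
# (the hypothetical 'glacier' entry) in the dyadic example: each answer
# halves the remaining probability mass, gaining one bit of information.
path_masses = [1.0, 0.5, 0.25, 0.125]   # root -> internal -> internal -> leaf
for depth, f in enumerate(path_masses):
    print(depth, f, math.log2(1 / f))   # surprisal -log2(f) in bits: 0, 1, 2, 3
# Total information gained over the path: -log2(0.125) = 3 bits,
# which equals the depth of the leaf (the length of its codeword).
```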

References:

  1. A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.

  2. A. N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russian Math. Surveys, 1983.

  3. Peter Grünwald and Paul Vitányi. Shannon Information and Kolmogorov Complexity. arXiv, 2004.

  4. L. A. Levin. Laws of information conservation (non-growth) and aspects of the foundation of probability theory. Problems of Information Transmission, 10:206–210, 1974.

  5. Marcus Hutter et al. Algorithmic probability. Scholarpedia, 2(8):2572, 2007.

  6. Walter Kirchherr, Ming Li and Paul Vitányi. The Miraculous Universal Distribution. The Mathematical Intelligencer, 1997.

  7. Yuval Dagan, Yuval Filmus et al. Twenty (simple) questions. arXiv, 2017.

Citation

For attribution, please cite this work as

Rocke (2022, Oct. 6). Kepler Lounge: Kolmogorov's theory of Algorithmic Probability. Retrieved from keplerlounge.com

BibTeX citation

@misc{rocke2022kolmogorov's,
  author = {Rocke, Aidan},
  title = {Kepler Lounge: Kolmogorov's theory of Algorithmic Probability},
  url = {keplerlounge.com},
  year = {2022}
}