An information-theoretic derivation of the Prime Number Theorem, using the Law of Conservation of Information.
The only implicit assumption in this derivation is that all the information in the Universe is conserved, as it is only in such Universes that Occam’s razor and its corollary, Maximum Entropy inference, are generally applicable.
If we know nothing about the distribution of primes, in the worst case we may assume that each prime less than or equal to \(N\) is drawn uniformly from \([1,N]\). So our source of primes is:
\[\begin{equation} X \sim U([1,N]) \tag{1} \end{equation}\]
where \(H(X) = \ln N\) is the Shannon entropy of the uniform distribution.
Now, we may define the prime encoding of \([1,N]\) as the binary sequence \(X_N = \{x_n\}_{n=1}^N\) where \(x_n = 1\) if \(n\) is prime and \(x_n = 0\) otherwise. With no prior knowledge, given that each integer is either prime or not prime, we have \(2^N\) possible prime encodings in \([1,N] \subset \mathbb{N}\).
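For concreteness, here is a minimal Python sketch of a prime encoding, computed with a Sieve of Eratosthenes; the helper name `prime_encoding` and the choice \(N = 30\) are illustrative, not part of the derivation.

```python
def prime_encoding(N: int) -> list[int]:
    """Return the prime encoding [x_1, ..., x_N], with x_n = 1 iff n is prime."""
    is_prime = [True] * (N + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(N ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, N + 1, p):
                is_prime[multiple] = False
    return [int(is_prime[n]) for n in range(1, N + 1)]

X = prime_encoding(30)
print(X)       # [0, 1, 1, 0, 1, 0, 1, ...] since x_1 = 0 and x_2 = x_3 = 1
print(sum(X))  # pi(30) = 10
```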
If there are \(\pi(N)\) primes less than or equal to \(N\), then dividing the \(\log_2(2^N) = N\) bits needed to specify an arbitrary encoding by the number of primes gives the average amount of information gained from correctly identifying each prime in \([1,N]\):
\[\begin{equation} S_c = \frac{\log_2 (2^N)}{\pi(N)} = \frac{N}{\pi(N)} \tag{2} \end{equation}\]
and if we assume a maximum entropy distribution over the primes then we would expect that each prime is drawn from a uniform distribution. So we would have:
\[\begin{equation} S_c = \frac{N}{\pi(N)} \sim \ln N \tag{3} \end{equation}\]
so the average information gained from observing a prime number coincides with the average distance between consecutive primes.
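This claim is easy to check numerically. The sketch below compares the average gap between consecutive primes up to \(N\) with \(\ln N\); it assumes the sympy library for prime generation, and the \(N\) values are arbitrary illustrative choices.

```python
import math
from sympy import primerange  # sympy is an assumed dependency

# Average gap between consecutive primes up to N, compared with ln N.
for N in (10**4, 10**6, 10**7):
    primes = list(primerange(2, N + 1))
    gaps = [q - p for p, q in zip(primes, primes[1:])]
    print(N, sum(gaps) / len(gaps), math.log(N))
```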
In fact, given these assumptions, the expected information gained from observing each prime in \([1,N]\) is on the order of:
\[\begin{equation} \sum_{k=1}^{N-1} \frac{1}{k} \cdot |(k,k+1]| = \sum_{k=1}^{N-1} \frac{1}{k} \approx \ln N \tag{4} \end{equation}\]
as there are \(k\) distinct ways to sample uniformly from \([1,k]\) and a frequency of \(\frac{1}{k}\) associated with the event that \(k \in \mathbb{P}\). So we are effectively modelling each \(x_k \in X_N\) as a Bernoulli trial \(x_k \sim \textbf{Ber}(\frac{1}{k})\).
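This Bernoulli model is straightforward to simulate. The sketch below (plain Python; the seed and \(N\) are arbitrary choices) draws each \(x_k \sim \textbf{Ber}(\frac{1}{k})\) and counts successes; the count concentrates around the harmonic sum \(\sum_{k=1}^{N} \frac{1}{k} \approx \ln N\), in line with (4).

```python
import math
import random

# Simulate the Bernoulli model x_k ~ Ber(1/k). The expected number of
# successes is the harmonic sum H_N = sum_{k=1}^N 1/k ~ ln N.
random.seed(0)
N = 10**6
successes = sum(1 for k in range(1, N + 1) if random.random() < 1 / k)
print(successes, math.log(N))  # both on the order of ~14 for N = 10^6
```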
This implies that the average number of nats per prime number is given by \(\frac{N}{\pi(N)} \sim \ln N\). Rearranging, we find:
\[\begin{equation} \frac{\pi(N)}{N} \sim \frac{1}{\ln N} \tag{5} \end{equation}\]
which is in complete agreement with the Prime Number Theorem.
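As a quick numerical illustration (not part of the argument), the sketch below compares \(\pi(N)/N\) with \(1/\ln N\); it assumes the sympy library, and the \(N\) values are arbitrary. The ratio of the two quantities tends to 1, though the convergence is famously slow.

```python
import math
from sympy import primepi  # sympy is an assumed dependency

# Equation (5) checked directly: pi(N)/N against 1/ln N.
for N in (10**4, 10**6, 10**7):
    density = int(primepi(N)) / N
    print(N, density, 1 / math.log(N), density * math.log(N))  # last column -> 1
```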
By the Shannon source coding theorem, we may infer that the locations of the \(\pi(N)\) primes in \([1,N]\) can’t be compressed into fewer than \(\pi(N) \cdot \ln N\) nats. Furthermore, as the expected Kolmogorov Complexity converges to the Shannon entropy for computable probability distributions:
\[\begin{equation} \mathbb{E}[K(X_N)] \sim \pi(N) \cdot \ln N \sim N \tag{6} \end{equation}\]
where the identification of \(\mathbb{E}[K(X_N)]\) with \(\pi(N) \cdot \ln N\) follows because \(K(X_N)\) measures the information gained from observing \(X_N\), and \(\pi(N) \cdot \ln N\) is the expected value of that information under the model above.
From a complementary perspective, the Asymptotic Equipartition Property allows us to infer that the Algorithmic Probability of a prime number of unknown magnitude occurring in the interval \([1,N]\) is on the order of \(\frac{1}{N}\).
Given that prime encodings may be modeled as a sequence of Bernoulli trials, we may use the Asymptotic Equipartition Property to infer the Typical Probability of observing a prime \(\widehat{x_p} \in [1,N] \cap \mathbb{P}\) of unknown magnitude. To be precise, given the prime encoding \(X_N = \{x_k\}_{k=1}^N\):
\[\begin{equation} -\ln P(\widehat{x_p}) \sim \frac{-\ln P(x_1,...,x_N)}{N} \tag{7} \end{equation}\]
for large \(N\).
Given (4), this expression simplifies to:
\[\begin{equation} \frac{-\ln P(x_1,...,x_N)}{N} = \frac{-\ln \prod_{k=1}^N P(x_k)}{N} = \frac{-\ln \prod_{k=1}^N \frac{1}{k}}{N} = \frac{\ln N!}{N} \sim \ln N \tag{8} \end{equation}\]
where the last step follows from Stirling’s formula, \(\ln N! = N \ln N - N + O(\ln N)\).
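The asymptotics in (8) can be verified numerically; in the sketch below, Python’s `math.lgamma(N + 1)` computes \(\ln N!\), and the \(N\) values are arbitrary.

```python
import math

# Check equation (8): ln(N!)/N ~ ln N. By Stirling's formula,
# ln(N!)/N = ln N - 1 + O(ln N / N), so the gap between the
# two printed columns stays close to 1.
for N in (10**2, 10**4, 10**6):
    print(N, math.lgamma(N + 1) / N, math.log(N))
```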
Hence, the Algorithmic Probability of a prime number of unknown magnitude in the interval \([1,N]\) converges to:
\[\begin{equation} -\ln P(\widehat{x_p}) \sim \ln N \implies P(\widehat{x_p}) \sim \frac{1}{N} \tag{9} \end{equation}\]
It follows that the probability of observing all \(\pi(N)\) primes at \(\{\widehat{x}_{p_i}\}_{i=1}^{\pi(N)}\), where \(\widehat{x}_{p_i} \sim U([1,N])\), is given by:
\[\begin{equation} P(\widehat{x}_{p_1},...,\widehat{x}_{p_{\pi(N)}}) \sim \big(\frac{1}{N}\big)^{\pi(N)} = e^{-\pi(N) \cdot \ln N} = e^{-N + o(N)} \tag{10} \end{equation}\]
Moreover, as each prime \(p \in \mathbb{P}\) occurs with typical frequency \(\frac{1}{p}\), we may deduce that:
\[\begin{equation} P(\widehat{x}_{p_1},...,\widehat{x}_{p_{\pi(N)}}) \sim \prod_{n=1}^{\pi(N)} \frac{1}{p_n} = e^{-N + o(N)} \tag{11} \end{equation}\]
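The exponent in (11) is Chebyshev’s function \(\vartheta(N) = \sum_{p \leq N} \ln p\), and the asymptotic \(\vartheta(N) \sim N\) can be checked numerically with the sketch below (again assuming the sympy library; the \(N\) values are arbitrary).

```python
import math
from sympy import primerange  # sympy is an assumed dependency

# Check the exponent in equation (11): theta(N) = sum_{p <= N} ln p,
# with theta(N)/N -> 1, so prod_{p <= N} 1/p = e^{-theta(N)} = e^{-N + o(N)}.
for N in (10**4, 10**6, 10**7):
    theta = sum(math.log(p) for p in primerange(2, N + 1))
    print(N, theta / N)  # ratio tends to 1
```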
As an immediate corollary of (10), we may recover the expected information gained using a second application of the Asymptotic Equipartition Property:
\[\begin{equation} -\ln P(\widehat{x}_{p_1},...,\widehat{x}_{p_{\pi(N)}}) \sim -\ln \big(\frac{1}{N}\big)^{\pi(N)} = \pi(N) \cdot \ln N \sim N \tag{12} \end{equation}\]
which provides us with an important insight into the incompressibility of prime encodings, as the last formula implies that prime numbers of unknown magnitude behave like independent random variables drawn from a uniform distribution.
Aidan Rocke (https://mathoverflow.net/users/56328/aidan-rocke), information-theoretic derivation of the prime number theorem, URL (version: 2021-04-08): https://mathoverflow.net/q/384109
Patrick Billingsley. Prime Numbers and Brownian Motion. The American Mathematical Monthly, 1973.
John A. Wheeler. Information, physics, quantum: The search for links. In W. Zurek (ed.), Complexity, Entropy, and the Physics of Information. Redwood City, CA: Addison-Wesley, 1990.
Andrew Chi-Chih Yao. Theory and applications of trapdoor functions. In Proceedings of the 23rd IEEE Symposium on Foundations of Computer Science, 1982.
Marcus Hutter et al. Algorithmic probability. Scholarpedia, 2(8):2572, 2007.
Rocke (2022, Jan. 3). Kepler Lounge: Occam’s razor. Retrieved from keplerlounge.com
Rocke (2022, Jan. 7). Kepler Lounge: The Algorithmic Probability of a Prime Number. Retrieved from keplerlounge.com
Peter Grünwald and Paul Vitányi. Shannon Information and Kolmogorov Complexity. 2010.
For attribution, please cite this work as
Rocke (2022, Jan. 12). Kepler Lounge: An information-theoretic derivation of the Prime Number Theorem. Retrieved from keplerlounge.com
BibTeX citation
@misc{rocke2022an,
  author = {Rocke, Aidan},
  title = {Kepler Lounge: An information-theoretic derivation of the Prime Number Theorem},
  url = {keplerlounge.com},
  year = {2022}
}