Motivation:

Given the set of prime numbers \(\mathbb{P} \subset \mathbb{N}\), the Copeland–Erdős constant \(\mathcal{C}\) is defined as [1]:

\begin{equation} \mathcal{C} = \sum_{n=1}^\infty p_n \cdot 10^{-\left(n+ \sum_{k=1}^n \lfloor \log_{10} p_k \rfloor\right)} \end{equation}

where \(p_n\) is the \(n\)th prime number.
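
As a concrete sanity check, here is a minimal Python sketch (my own illustration, not part of the original construction) that builds the expansion both ways: by concatenating the decimal digits of the primes, and by summing the series above with exact rational arithmetic.

```python
from fractions import Fraction

def first_primes(count):
    """Return the first `count` primes by trial division (fine for small counts)."""
    ps, candidate = [], 2
    while len(ps) < count:
        if all(candidate % p for p in ps):
            ps.append(candidate)
        candidate += 1
    return ps

ps = first_primes(10)

# Digits of C by direct concatenation of the primes.
print("0." + "".join(str(p) for p in ps))      # 0.2357111317192329

# The same digits from the series: the exponent n + sum of floor(log10 p_k)
# is just the total number of decimal digits written so far.
total, digits_so_far = Fraction(0), 0
for p in ps:
    digits_so_far += len(str(p))               # = floor(log10 p) + 1
    total += Fraction(p, 10 ** digits_so_far)
print(total)                                   # 2357111317192329 / 10**16
```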

Now, it is generally known that \(\mathcal{C}\) is normal and that normal numbers are finite-state incompressible. Since machine learning systems are finite-state machines, it occurred to me that prime formulas are not approximable by machine learning systems, regardless of their computational power [5]. While there are probably several approaches to this question, compression bounds driven by the Shannon source coding theorem seem natural: the theorem shows that, in the asymptotic limit, i.i.d. data cannot be compressed to fewer bits per symbol, on average, than the Shannon entropy of the source.

As I could not find an information-theoretic definition of the normal numbers in the literature, I considered the problem until an effective definition occurred to me, one which allows us to derive compression bounds for any normal number.

An information-theoretic approach to normal numbers:

An information-theoretic definition of normal numbers:

Given an alphabet \(\Sigma\) with \(\lvert \Sigma \rvert = \alpha\) and an infinite string \(X \in \Sigma^{\infty}\), we may define:

\begin{equation} \Sigma^n \cap X_N := \bigcup_{w \in \Sigma^n} \{w\} \cap \{x_i\}_{i=1}^{N-n+1} \end{equation}

where \(X_N\) denotes the first \(N\) digits of \(X\), \(x_i = X[i, i+n-1]\), and \(\lim\limits_{N \to \infty} X_N = X\).

In this context, \(X\) is normal to base \(\alpha\) if, for every \(n \geq 1\) and \(Z_N \sim U(\Sigma^n \cap X_N)\) with \(N \gg n\) (that is, \(Z_N\) is the \(n\)-gram read at a uniformly chosen position of \(X_N\)), the amount of information gained from observing \(Z_N\) converges to:

\begin{equation} \log_2 \lvert \Sigma^n \rvert = n \cdot \log_2 \lvert \Sigma \rvert \end{equation}

as \(N \to \infty\).
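
To make the definition tangible, the following Python sketch (names such as `ngram_distribution` are my own) enumerates the overlapping \(n\)-grams \(x_i\) of a finite prefix \(X_N\) and returns the empirical distribution of \(Z_N\); its support is the set denoted \(\Sigma^n \cap X_N\) above.

```python
from collections import Counter

def ngram_distribution(X_N: str, n: int):
    """Empirical distribution of the overlapping n-grams x_i = X[i, i+n-1] of X_N."""
    windows = [X_N[i:i + n] for i in range(len(X_N) - n + 1)]
    counts = Counter(windows)
    total = len(windows)                     # N - n + 1 window positions
    return {w: c / total for w, c in counts.items()}

# Example: X_N = the concatenated digits of the primes below 120, with n = 2.
X_N = "".join(str(p) for p in range(2, 120) if all(p % d for d in range(2, p)))
dist = ngram_distribution(X_N, 2)
print(len(X_N), "digits,", len(dist), "distinct 2-grams out of", 10 ** 2, "possible")
```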

Proof of equivalence with the usual definition:

Suppose first that \(X\) is normal in the usual (frequency) sense. From a frequentist perspective, we may then define the probabilities:

\begin{equation} \forall w_i \in \Sigma^n, p_{w_i} = \lim_{N \to \infty} \frac{\mathcal{N}(X_N,w_i)}{N-n+1} = \frac{1}{\lvert\Sigma^n\rvert} \end{equation}

where \(\mathcal{N}(X_N,w_i)\) counts the number of times the string \(w_i\) appears as a substring of \(X_N\), with overlapping occurrences counted.
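
As a small worked example of the counting itself (not of normality): for the base-2 prefix \(X_6 = 110110\) and \(n = 2\), the windows are \(11, 10, 01, 11, 10\), so \(\mathcal{N}(X_6, 11) = 2\) out of \(N - n + 1 = 5\) windows, an empirical frequency of \(2/5\).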

We have a uniform distribution over \(\Sigma^n\) in the sense that:

\begin{equation} \forall w_i, w_j \in \Sigma^n,\quad p_{w_i} = p_{w_j} \end{equation}

Now, we define the random variable \(Z_N \sim U(\Sigma^n \cap X_N)\) with \(P(Z_N = w_i) = \frac{\mathcal{N}(X_N,w_i)}{N-n+1}\), whose Shannon entropy is given by:

\begin{equation} H(Z_N) = - \sum_{i=1}^{\lvert \Sigma^n \rvert} P(Z_N = w_i) \log_2 P(Z_N = w_i) \end{equation}

which is well-defined for all sufficiently large \(N\), since every \(w_i \in \Sigma^n\) eventually occurs in \(X_N\) and its empirical frequency converges:

\begin{equation} \forall w_i \in \Sigma^n \;\; \forall \epsilon > 0 \;\; \exists m \in \mathbb{N} \;\; \forall N \geq m, \quad \Bigl\lvert \frac{\mathcal{N}(X_N,w_i)}{N-n+1} - \frac{1}{\lvert\Sigma^n\rvert} \Bigr\rvert < \epsilon \end{equation}

and therefore, by continuity of the entropy as a function of the probabilities, we have:

\begin{equation} \lim_{N \to \infty} H(Z_N) = \log_2 \lvert \Sigma^n \rvert = n \cdot \log_2 \lvert \Sigma \rvert \end{equation}

Conversely, since the entropy of a distribution on \(\Sigma^n\) attains its maximum \(\log_2 \lvert \Sigma^n \rvert\) only at the uniform distribution, this limit forces the empirical frequencies to converge to \(\frac{1}{\lvert \Sigma^n \rvert}\), recovering the usual definition.
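
A quick numerical illustration of the limit above (my own sketch, not a proof): estimating \(H(Z_N)\) for \(n = 2\) on growing prefixes of \(\mathcal{C}\) and comparing with \(n \cdot \log_2 10 \approx 6.64\) bits.

```python
from collections import Counter
from math import log2

def empirical_entropy(X_N: str, n: int) -> float:
    """Shannon entropy (bits) of the empirical n-gram distribution of the prefix X_N."""
    windows = Counter(X_N[i:i + n] for i in range(len(X_N) - n + 1))
    total = len(X_N) - n + 1
    return -sum((c / total) * log2(c / total) for c in windows.values())

def primes_below(limit):
    """Sieve of Eratosthenes."""
    sieve = bytearray([1]) * limit
    sieve[:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
    return [i for i in range(limit) if sieve[i]]

digits = "".join(str(p) for p in primes_below(10 ** 5))   # ~46,000 digits of C
n = 2
for N in (10 ** 3, 10 ** 4, len(digits)):
    print(N, round(empirical_entropy(digits[:N], n), 3), "target:", round(n * log2(10), 3))
```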

Corollary:

For large \(N\), if we break \(X_N\) into \(\lfloor \frac{N}{n}\rfloor\) non-overlapping segments of length \(n\), the total information gained from observing all of these segments is approximately:

\begin{equation} H(X_N) \approx \big\lfloor \frac{N}{n}\big\rfloor \cdot H(Z_N) \approx N \cdot \log_2 \lvert \Sigma \rvert \end{equation}
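
For instance (a worked instance of this estimate, not an additional claim), with \(\alpha = 10\) and \(N = 10^6\) this gives \(H(X_N) \approx 10^6 \cdot \log_2 10 \approx 3.32 \times 10^6\) bits, i.e. roughly \(3.32\) bits per digit.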

Application to the Copeland–Erdős constant:

Given that \(\mathcal{C}\) is normal in base 10, it is finite-state incompressible. In particular, if \(\mathcal{A}\) is the description language for all finite-state automata (which includes all learnable programs), then we may deduce that for large \(N\) [6]:

\begin{equation} \mathbb{E}[K_{\mathcal{A}}(\mathcal{C}_N)] \sim N \cdot \log_2 (10) \end{equation}

where \(\mathcal{C}_N\) denotes the first \(N\) digits of \(\mathcal{C}\) and \(K_{\mathcal{A}}(\cdot)\) denotes prefix-free Kolmogorov complexity relative to \(\mathcal{A}\). This is consistent with the fact that almost all strings are incompressible [7].

It follows that a prime formula is not approximable by a finite-state machine, and hence not by a machine learning system.
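
As a rough empirical companion to the bound above (my own experiment, and only suggestive, since an off-the-shelf compressor is neither optimal nor prime-aware), the following Python sketch compresses a few hundred thousand digits of \(\mathcal{C}\) with zlib and reports the bits used per digit, to be compared with \(\log_2 10 \approx 3.32\):

```python
import zlib
from math import log2

def primes_below(limit):
    """Sieve of Eratosthenes."""
    sieve = bytearray([1]) * limit
    sieve[:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
    return [i for i in range(limit) if sieve[i]]

# Concatenated decimal digits of the primes below 10**6 (~460,000 digits of C).
digits = "".join(str(p) for p in primes_below(10 ** 6))
compressed = zlib.compress(digits.encode("ascii"), 9)

print(f"{len(digits)} digits -> {8 * len(compressed) / len(digits):.3f} bits per digit")
print(f"log2(10) = {log2(10):.3f} bits per digit")
```

zlib is of course not an exact proxy for the description language \(\mathcal{A}\), so this illustrates the order of magnitude rather than testing the bound itself.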

Relation to the Champernowne constant:

A mathematician may point out that this argument extends to the Champernowne constant, which is also normal, and that we may therefore argue that the successor function:

\begin{equation} \forall n \in \mathbb{N}, S(n) = n+1 \end{equation}

is not approximable. Surely, this argument must be absurd.

However, the successor function is non-trivial for large integers relative to finite-state machines (i.e. machines with finite memory). In fact, almost all positive integers are incompressible, since:

\begin{equation} \forall n \in \mathbb{N}^* \;\; \forall k < n, \quad \lvert \{x \in \{0,1\}^n : K(x) \geq n - k \} \rvert \geq 2^n(1-2^{-k}) \end{equation}

where \(\{0,1\}^n\) denotes the binary strings of length \(n\); such a string may be understood as the machine-code representation of an integer. (12) may be proven using the pigeonhole principle, as was done in [7].
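
As a worked instance of (12): taking \(n = 64\) and \(k = 8\), at least \(2^{64}(1 - 2^{-8})\), i.e. more than \(99.6\%\) of all 64-bit strings, satisfy \(K(x) \geq 56\).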

From an information-theoretic perspective, an integer \(N\) is incompressible if we need \(\sim \log_2 N\) bits to describe it. Given that every integer has a unique prime factorisation:

\begin{equation} \log_2 N = \log_2 \prod_i p_i^{\alpha_i} = \sum_i \alpha_i \log_2 p_i \end{equation}

this is equivalent to saying that all the information contained in the integers is contained in the prime numbers (about which we know relatively little).
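
As a small worked example: \(N = 360 = 2^3 \cdot 3^2 \cdot 5\), so \(\log_2 360 = 3\log_2 2 + 2\log_2 3 + \log_2 5 \approx 3 + 3.17 + 2.32 \approx 8.49\) bits, and a description of \(360\) amounts to a description of the primes \(2, 3, 5\) together with their exponents.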

References:

  1. A. H. Copeland and P. Erdős. Note on Normal Numbers. Bull. Amer. Math. Soc., 52:857–860, 1946.
  2. A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.
  3. Olivier Rioul. This is IT: A Primer on Shannon’s Entropy and Information. Séminaire Poincaré, 2018.
  4. Edward Witten. A Mini-Introduction To Information Theory. 2019.
  5. Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
  6. Peter Grünwald and Paul Vitányi. Shannon Information and Kolmogorov Complexity. 2010.
  7. Lance Fortnow. Kolmogorov Complexity. 2000.