Introduction:

The following invariance theorem for statistical learning in the setting of algorithmically random data occurred to me yesterday. The theorem indicates that algorithmic randomness is a transformation-invariant property. This feature may be used to perform empirical tests, via machine learning, of whether a dataset was sampled from an algorithmically random process, as well as to test claims that a machine learning model can efficiently simulate algorithmically random processes.

There is one important caveat. I am implicitly interested in natural signals which exhibit algorithmic randomness, and the transformations $$T$$ are arbitrary to the extent that they don’t change the phase-space dimension of the processes responsible for generating natural signals.

Invariance theorems for algorithmically random data:

The learning problem:

Let’s suppose $$F_{\theta}$$ is a function space that is useful for statistical learning (e.g. deep neural networks), and let $$D_N = (X_N,Y_N)$$ be a dataset where $$X_N=\{X_i\}_{i=1}^{2N}$$, the $$X_i$$ are sampled i.i.d. from a discrete random variable $$X$$, and $$Y_N=\{Y_i\}_{i=1}^{2N} \in \{0,1\}^{2N}$$ are the class labels.

If $$D_N$$ is split into balanced training and test sets $$D_N^\text{train}$$ and $$D_N^\text{test}$$ then we have $$D_N^\text{train}=(X_N^\text{train},Y_N^\text{train})$$ and $$D_N^\text{test}=(X_N^\text{test},Y_N^\text{test})$$ such that $$\lvert X_N^\text{train} \rvert = \lvert X_N^\text{test} \rvert = N$$ and $$X_N^\text{train} \cup X_N^\text{test} = X_N$$.

Furthermore, we shall assume that the training and test sets are balanced so we have:

$$\frac{1}{N} \sum_i Y_i^\text{train} = \frac{1}{N} \sum_i Y_i^\text{test} = \frac{1}{2}$$
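As a concrete sketch of this setup (assuming, purely for illustration, that $$X$$ is uniform on a small alphabet and that the labels are assigned independently of the features; `make_balanced_dataset` is a hypothetical helper, not part of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_balanced_dataset(N, n_symbols=16):
    """Sample 2N points i.i.d. from a discrete uniform X and attach
    exactly-balanced binary labels to each half of the split."""
    assert N % 2 == 0
    X = rng.integers(0, n_symbols, size=2 * N)   # X_i ~ X, i.i.d.
    half = np.repeat([0, 1], N // 2)             # exactly N/2 positives per half
    Y = np.concatenate([rng.permutation(half), rng.permutation(half)])
    return (X[:N], Y[:N]), (X[N:], Y[N:])

(X_train, Y_train), (X_test, Y_test) = make_balanced_dataset(1000)
# both halves satisfy the balance condition: the mean label is exactly 1/2
```

Here the labels are deliberately independent of the features, which is one way to realise a dataset that is random relative to any learner.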

Defining algorithmic randomness:

Now, $$X$$ is algorithmically random relative to a class of universal function approximators $$F_{\theta}$$ if $$X$$ is incompressible relative to $$F_{\theta}$$, in the sense that for any choice of train-test split we have:

$$\mathbb{E}[K_{F_{\theta}}(X_N)] = \min_{f \in F_{\theta}} [H(X_N \mid f) + H(f)] \sim 2N \cdot \ln \lvert X \rvert$$

where I used the fact that the expected Kolmogorov complexity equals the Shannon entropy up to an additive constant, and applied the Shannon source-coding theorem to i.i.d. data.
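For intuition, the source-coding rate can be checked numerically with a plug-in entropy estimate (a sketch assuming $$X$$ uniform on an 8-symbol alphabet; `empirical_codelength_nats` is an illustrative helper, not a standard API):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def empirical_codelength_nats(samples):
    """Plug-in Shannon codelength of an i.i.d. sample, in nats."""
    n = len(samples)
    counts = Counter(samples.tolist())
    return -sum(c * np.log(c / n) for c in counts.values())

N, A = 100_000, 8
X = rng.integers(0, A, size=N)   # i.i.d. uniform: an incompressible source
ratio = empirical_codelength_nats(X) / (N * np.log(A))
# the plug-in codelength approaches the N * ln|X| source-coding bound from below
```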

The theorem:

In consequence, the incompressibility condition above implies that for a particular sample $$X_N \sim X$$ it may be possible to find $$f \in F_{\theta}$$ such that $$\forall i \in [1,2N], f(X_i) = Y_i$$, but only on the condition that the Kolmogorov complexity of $$f$$ satisfies:

$$K(f) \sim 2N$$

as this would amount to memorisation rather than the discovery of regularities in $$D_N$$, since $$X$$ is algorithmically random relative to $$F_{\theta}$$. To be precise, if an arbitrary mathematical transformation $$T$$ (e.g. a random matrix) is applied to $$D_N$$ so that:

$$X'_N = T \circ X_N$$

given that $$f(X_i)=Y_i \iff \delta_{f(X_i),Y_i} = 1$$, where $$\delta$$ denotes the Kronecker delta, the solution to the empirical risk minimisation problem may be formulated as follows:

$$\hat{f} = \arg\max_{f \in F_{\theta}} \frac{1}{N} \sum_{i=1}^N \delta_{f(X'_i),Y_i}$$

and for large $$N$$, the expected performance on the test set is bounded by chance level:

$$\frac{1}{N} \sum_{i=N+1}^{2N} \delta_{\hat{f}(X'_i),Y_i} \leq \frac{1}{2}$$
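A minimal simulation of this claim (a sketch, not a proof: the lookup table plays the role of a maximally flexible $$f$$ with $$K(f) \sim 2N$$, and the affine map is one arbitrary choice of $$T$$):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5_000

# X_i i.i.d. from a large discrete alphabet; labels are fair coin flips,
# so the labels carry no structure that any learner could exploit
X = rng.integers(0, 2**30, size=2 * N)
Y = rng.integers(0, 2, size=2 * N)

# one arbitrary choice of T: a fixed invertible affine map modulo 2**30
def T(x):
    return (2654435761 * x + 12345) % 2**30

Xp = T(X)

# "ERM" by pure memorisation: a lookup table over the transformed training set
table = dict(zip(Xp[:N].tolist(), Y[:N].tolist()))
def f_hat(x):
    return table.get(x, 0)          # constant guess for points off the table

train_acc = np.mean([f_hat(x) == y for x, y in zip(Xp[:N].tolist(), Y[:N])])
test_acc = np.mean([f_hat(x) == y for x, y in zip(Xp[N:].tolist(), Y[N:])])
# train accuracy is close to 1 (memorisation); test accuracy stays near 1/2
```

The same near-chance test accuracy is observed whether or not $$T$$ is applied, which is the invariance the theorem asserts.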

A corollary for sequential prediction problems:

This theorem is readily generalised to the sequential prediction problem:

$$\exists f \in F_{\theta}, X_{n+1} = f \circ X_n$$

where $$X_n$$ is a k-bit encoding of a natural signal. So if we define $$X_N^\text{train} = \{X_n\}_{n=1}^N$$, $$X_N^\text{test} = \{X_n\}_{n=N+1}^{2N}$$, and note that $$X_{n+1} = f \circ X_n \implies \delta_{f(X_{n}),X_{n+1}} = 1$$, then for $$X_n$$ sampled i.i.d. from an incompressible discrete random variable $$X$$, for any solution to the empirical risk minimisation problem:

$$\hat{f} = \arg\max_{f \in F_{\theta}} \frac{1}{Nk} \sum_{n=1}^{N} \sum_{i=1}^k \delta_{f(X_{n}^i),X_{n+1}^i}$$

we would observe for large $$N$$:

$$\frac{1}{Nk} \sum_{n=N+1}^{2N} \sum_{i=1}^k \delta_{\hat{f}(X_{n}^i),X_{n+1}^i} \approx \frac{1}{2}$$

and this result would hold for any transformation $$T$$ applied to the data that does not change the phase-space dimension of the data-generating process.
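The corollary can likewise be checked with a toy predictor (a sketch under stated assumptions: a conditional-frequency table stands in for $$\hat{f}$$, and the source is i.i.d. uniform $$k$$-bit words):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(2)
N, k = 20_000, 8

# each X_n is a k-bit word sampled i.i.d. and uniformly: an incompressible source
X = rng.integers(0, 2, size=(2 * N + 1, k))

# frequency-table "learner": majority vote over observed successors of each word
succ = defaultdict(lambda: np.zeros(k))
count = defaultdict(int)
for n in range(N):
    key = X[n].tobytes()
    succ[key] += X[n + 1]
    count[key] += 1

def f_hat(word):
    key = word.tobytes()
    if count[key] == 0:
        return np.zeros(k, dtype=int)      # unseen word: constant guess
    return (succ[key] / count[key] > 0.5).astype(int)

# per-bit accuracy on the held-out half hovers around chance level 1/2
test_acc = np.mean([f_hat(X[n]) == X[n + 1] for n in range(N, 2 * N)])
```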

Discussion:

It is worth noting that the first theorem generalises easily to imbalanced data by defining:

$$\hat{Y}_i :=\hat{f}(X’_i)$$

$$\phi_i := \delta_{\hat{Y}_i, Y_i}$$

$$N_1 = \sum_i \delta_{Y_i,1}$$

$$N_0 = \sum_i \delta_{Y_i,0}$$

and so for large $$N$$, the test-set bound becomes:

$$\frac{1}{2} \sum_i \Big(\frac{\delta_{Y_i,1} \cdot \phi_i}{N_1} + \frac{\delta_{Y_i,0} \cdot \phi_i}{N_0} \Big) \leq \frac{1}{2}$$

which implies that the true-positive rate obeys the same chance-level bound:

$$\sum_i \Big(\frac{\delta_{Y_i,1} \cdot \phi_i}{N_1} \Big) \leq \frac{1}{2}$$
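A quick numerical check of the balanced-accuracy bound (a sketch; `balanced_accuracy` is an illustrative helper, and the predictor is deliberately independent of the labels):

```python
import numpy as np

rng = np.random.default_rng(3)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies, (TPR + TNR) / 2, as in the corollary."""
    phi = (y_true == y_pred)
    tpr = phi[y_true == 1].mean()
    tnr = phi[y_true == 0].mean()
    return 0.5 * (tpr + tnr)

# imbalanced random labels (20% positives) and a label-independent predictor
y = (rng.random(100_000) < 0.2).astype(int)
y_hat = rng.integers(0, 2, size=100_000)
bal = balanced_accuracy(y, y_hat)
# bal stays near 1/2 regardless of the class imbalance
```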

and I will note that the sequential prediction corollary may be generalised in a similar manner.

As for a concrete application area, this theorem may be used to detect whether alien civilisations are advanced enough to know about the distribution of prime numbers. Unlike most natural signals, such a signal would appear incompressible to us due to the observer-dependence theorems. The theorem may also be used for more prosaic applications, such as detecting whether a machine learning model outperforms the best probabilistic primality-testing algorithms.
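For the prosaic application, any candidate model would have to beat a probabilistic primality test; a standard Miller–Rabin baseline looks like this (textbook algorithm, sketched for reference):

```python
import random

def miller_rabin(n, rounds=20):
    """Probabilistic primality test; error probability at most 4**-rounds."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # write n - 1 as d * 2**r with d odd
    d, r = n - 1, 0
    while d % 2 == 0:
        d, r = d // 2, r + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)                   # modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                   # a is a witness of compositeness
    return True
```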

If designed in an intelligent manner such a machine learning benchmark may significantly improve our understanding of the Riemann Hypothesis.

References:

1. Peter Grünwald and Paul Vitányi. Shannon Information and Kolmogorov Complexity. 2010.
2. Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Springer. 2001.
3. Andrew Chi-Chih Yao. Theory and applications of trapdoor functions. In Proceedings of the 23rd IEEE Symposium on Foundations of Computer Science, 1982.