Compressibility implies learnability:

Let’s suppose we have a dynamical system:

\begin{equation} \forall x_n \in X, x_{n+1} = f \circ x_n \end{equation}

from which we may collect samples:

\begin{equation} X_N = \{x_n\}_{n=1}^N \end{equation}

We say that the data-stream \(X_N\) is compressible if its algorithmic information content is bounded:

\begin{equation} \exists k,C \in \mathbb{N} \forall N \in \mathbb{N}, K(X_N) \leq k \cdot \log(C) \end{equation}

where \(\log(C)\) denotes the average number of bits required to encode a state in the state-space \(X\) and \(X_k\) represents the maximum number of samples required by the optimal learning algorithm \(\mathcal{A}\) in order to produce a predictive model \(M\) that can reconstruct \(X_N\) with negligible error. It follows that the sample-complexity of \(\mathcal{A}\) is bounded by \(k\) and the state-transition map \(f\) is PAC-learnable.

Moreover, if we define the expectation:

\begin{equation} \mathbb{E}[K(X_N)] = \lim_{n \to \infty} \frac{1}{n} \sum_{N=1}^n K(X_N) \end{equation}

then we may express (3) succinctly by the asymptotic compressibility relation:

\begin{equation} \mathbb{E}[K(X_N)] \leq k \cdot \log(C) \end{equation}

which may be understood to mean that the average size of a PAC-learnable machine learning model \(M\) that approximates \(f\) is not greater than \(k \cdot \log(C)\).

Note: It is worth noting that (5) implicitly assumes that \(f\) simulates an ergodic process. In fact, \(f\) would not be learnable otherwise.

State-transition maps that are not learnable are not compressible:

On the other hand, let’s suppose that instead of the situation in (3) that \(X_N\) is Schnorr-random so no martingale succeeds on it and so the sample-complexity of \(\mathcal{A}\) scales with \(N\):

\begin{equation} \forall N \in \mathbb{N}, K(X_N) \sim N \end{equation}

hence the sample-complexity is unbounded and therefore \(f\) is not learnable.

From (6) we may deduce the asymptotic incompressibility relation:

\begin{equation} \mathbb{E}[K(X_N)] \sim N \end{equation}

so we may conclude that \(X_N\) is generally incompressible.


\(1.\) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. 2014.

\(2.\) Kakade, Sham (2003), On the Sample Complexity of Reinforcement Learning (PDF), PhD Thesis, University College London: Gatsby Computational Neuroscience Unit.

\(3.\) Leslie Valiant. Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World. Basic Books. 2013.

\(4.\) Peter D. Grünwald. The Minimum Description Length Principle . MIT Press. 2007.

\(5.\) Schnorr, C. P. (1971). “A unified approach to the definition of a random sequence”. Mathematical Systems Theory.

\(6.\) A. N. Kolmogorov Three approaches to the quantitative definition of information. Problems of Information and Transmission, 1(1):1–7, 1965