# An information-theoretic approach to function-approximation

## Introduction:

Within the setting of rare-event modelling, the ability to predict the long-term behaviour of complex systems is essential. Since identifying the appropriate hypothesis requires an algorithmic formulation of Occam’s razor, it is natural to develop such a methodology within the context of algorithmic information theory.

In what follows, \(K(\cdot)\) denotes prefix-free Kolmogorov complexity with respect to an appropriate Turing-complete programming language.

## Approximable dynamical systems:

A dynamical system \(f\) that evolves on a compact metric space \((X,\lVert \cdot \rVert)\):

\begin{equation} \forall x_n \in X, x_{n+1} = f \circ x_n \end{equation}

\begin{equation} f^{n+1} = f \circ f^n \end{equation}

is said to be \(\epsilon\)-approximable if it satisfies the following conditions:

\(1.\) There exists a finite set \(\Sigma \subset X\) such that:

\begin{equation} \forall x \in X \, \exists x' \in \Sigma, \lVert x-x' \rVert < \epsilon \end{equation}

i.e. \(\Sigma\) is a finite \(\epsilon\)-net for \(X\), so a sufficiently large dataset \(\Sigma\) is a suitable approximation of \(X\).
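As a concrete numerical sketch, assume \(X\) is the unit cube under the sup-norm; a grid with spacing \(\epsilon\) then yields a finite \(\epsilon\)-net. The helper names `epsilon_net` and `nearest_dist` are hypothetical, chosen for illustration:

```python
import itertools
import math

def epsilon_net(dim, eps):
    """A finite eps-net Sigma for the unit cube [0, 1]^dim under the
    sup-norm: grid points spaced eps apart, so every point of the cube
    lies within eps/2 < eps of some element of the net."""
    ticks = [i * eps for i in range(int(math.ceil(1.0 / eps)) + 1)]
    return list(itertools.product(ticks, repeat=dim))

def nearest_dist(x, net):
    # Sup-norm distance from x to its closest element of the net.
    return min(max(abs(a - b) for a, b in zip(x, p)) for p in net)

net = epsilon_net(dim=2, eps=0.25)
```

The size of the net, \(\lvert \Sigma \rvert \sim \epsilon^{-\dim}\), is exactly the quantity whose logarithm enters the complexity bound below.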

\(2.\) The algorithmic complexity of \(f\) is bounded:

\begin{equation} \exists N \in \mathbb{N}, N \cdot \log_2 \lvert \Sigma \rvert \leq \max_{x \in \Sigma} K(\{f^n \circ x\}_{n=1}^{\infty}) \leq (N+1) \cdot \log_2 \lvert \Sigma \rvert \end{equation}

which implies that given an initial state it is possible to simulate \(f\) on a computer with finite memory. As a concrete example, if \(f\) is computable and periodic then the shortest program whose length is \(K(\{f^n \circ x\}_{n=1}^{\infty})\) can’t be longer than a minimal description of the pair \((f, x)\) together with an instruction to iterate indefinitely.
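Although \(K\) is uncomputable, any lossless compressor bounds it from above. As a rough empirical sketch (zlib standing in for the shortest-program length, and `orbit_bits` a hypothetical helper name), the complexity proxy of a periodic orbit's prefix grows far more slowly than the prefix itself, consistent with the finite-memory bound:

```python
import zlib

def orbit_bits(f, x, n):
    """Upper-bound proxy for K({f^k(x)}_{k=1..n}): the zlib-compressed
    size, in bits, of the orbit serialized as text."""
    orbit = []
    for _ in range(n):
        x = f(x)
        orbit.append(x)
    return 8 * len(zlib.compress(",".join(map(str, orbit)).encode(), 9))

# A periodic map on Sigma = {0, 1, 2}: the complexity proxy of a
# 100x longer orbit prefix grows only modestly.
succ_mod3 = lambda x: (x + 1) % 3
short, long_ = orbit_bits(succ_mod3, 0, 30), orbit_bits(succ_mod3, 0, 3000)
```

A compressor only ever gives an upper bound on \(K\), so this kind of experiment can show compressibility but never certify incompressibility.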

\(3.\) There exists a space of computable functions \(F\) with \(\hat{f} \in F\) satisfying \(K(\hat{f}) \sim N \cdot \log_2 \lvert \Sigma \rvert\) and:

\begin{equation} \forall x \in \Sigma \, \forall N \in \mathbb{N}, \frac{\sum_{n=1}^N \lVert \hat{f}^n \circ x - f^n \circ x \rVert}{N} < \epsilon \end{equation}

which bounds the time-averaged approximation error, i.e. \(\hat{f}\) need only track \(f\) to finite precision. This is a fundamental constraint on real-world machine learning, as the correct model can’t be too sensitive to, for example, the choice of random seed. In addition, (5) implies that a suitable loss function \(\mathcal{L}(f,\hat{f})\) may be readily derived from the metric \(\lVert \cdot \rVert\).
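One natural choice, suggested by (5), is the time-averaged trajectory discrepancy itself. A minimal sketch, assuming a one-dimensional system with \(\lvert \cdot \rvert\) as the metric (`trajectory_loss` is a hypothetical name):

```python
def trajectory_loss(f, f_hat, x0, N):
    """L(f, f_hat): the time-averaged error
    (1/N) * sum_n |f_hat^n(x0) - f^n(x0)|, as suggested by (5)."""
    x, x_hat, total = x0, x0, 0.0
    for _ in range(N):
        x, x_hat = f(x), f_hat(x_hat)
        total += abs(x_hat - x)
    return total / N

# Example: a logistic map in its stable period-4 regime, compared with
# a slightly perturbed candidate model; the averaged error stays small.
f = lambda x: 3.5 * x * (1 - x)
f_hat = lambda x: 3.5000001 * x * (1 - x)
```

For chaotic parameter values the same perturbation would grow exponentially along the orbit, which is why (5) averages over \(N\) rather than bounding each term.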

## Theorem 1: Approximable systems are ergodic

Given the bounded complexity criterion for approximability, we may define:

\begin{equation} u_N = K(\{f^n \circ x\}_{n=1}^N) \end{equation}

and since \(u_N\) is non-decreasing in \(N\) (up to an additive constant) and bounded above by (4), the limit \(u = \lim\limits_{N \to \infty} u_N\) exists. It follows that \(f\) is ergodic since:

\begin{equation} \exists m \in \mathbb{N} \, \forall N \geq m, \lvert u - u_N \rvert < \epsilon \end{equation}

i.e. the statistical properties of \(f\) are deducible from a sufficiently large number of samples.
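Ergodicity can be illustrated numerically by comparing time averages against a known ensemble mean. As an illustration only (not itself a claim about approximable systems), the fully chaotic logistic map \(x \mapsto 4x(1-x)\) is ergodic with invariant mean \(1/2\), and its time averages converge accordingly:

```python
def time_average(f, x0, N, burn=100):
    # Time average (1/N) * sum of the observable x along the orbit,
    # after discarding a short transient.
    x = x0
    for _ in range(burn):
        x = f(x)
    total = 0.0
    for _ in range(N):
        x = f(x)
        total += x
    return total / N

# Fully chaotic logistic map; its invariant density is the arcsine
# distribution on [0, 1], whose mean is 1/2.
logistic = lambda x: 4.0 * x * (1.0 - x)
avg = time_average(logistic, 0.2, 200000)
```

In floating point this orbit is only a finite-precision shadow of the true dynamics, but the time average still lands near \(1/2\), matching the spirit of condition (5).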

## Theorem 2: The Successor function is not approximable

The successor function may be identified with the dynamical system:

\begin{equation} \forall N \in \mathbb{N}, N+1 = S \circ N \end{equation}

However, \(S\) is not approximable since almost all integers are incompressible [3]:

\begin{equation} K(N) \sim \log_2 N \end{equation}

which violates the condition that \(S\) may be simulated on a computer with finite memory.
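The incompressibility of typical integers can be observed empirically: any lossless compressor upper-bounds \(K\), and on random data it cannot beat the raw length. A small sketch with zlib as the proxy (`compressed_bits` is a hypothetical helper name):

```python
import os
import zlib

def compressed_bits(data: bytes) -> int:
    # Upper-bound proxy for K: zlib-compressed size in bits (level 9).
    return 8 * len(zlib.compress(data, 9))

# Almost all bit strings are incompressible: on uniformly random bytes
# the proxy cannot do better than the raw length plus framing overhead,
# mirroring K(N) ~ log2(N) for typical integers N.
raw = os.urandom(4096)
```

This contrasts with the periodic-orbit example above, where the compressed size grows far more slowly than the data itself.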

## Discussion:

One corollary worth developing is the direct relationship between the algorithmic complexity of a model \(\hat{f}\) within \(\epsilon\) of \(f\) and the number of samples \(N\) required to approximate \(f\). Roughly, we have the following scaling law:

\begin{equation} K(\hat{f}) \sim N \cdot \log_2 \lvert \Sigma \rvert \end{equation}

provided that we have the right function space \(F\) and provided that we can construct the right optimisation algorithm \(\mathcal{A}\) which exploits the loss landscape in an intelligent manner.

## References:

1. A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of Information Transmission, 1(1):1–7, 1965.
2. Peter Grünwald and Paul Vitányi. Shannon Information and Kolmogorov Complexity. 2010.
3. Lance Fortnow. Kolmogorov Complexity. 2000.