From the perspective of a computing device, function definitions are effectively equivalent to function evaluations, because the data stream:

\begin{equation} x_{n+1} = f \circ x_n \end{equation}

implies that \(f \circ x_n\) and \(x_{n+1}\) have identical representations with respect to the computer.
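
To make this concrete, here is a minimal sketch in which the update rule \(f\) and the seed \(x_0\) are hypothetical choices of my own: storing the rule together with the seed reproduces the trajectory exactly, so the definition and the evaluated stream carry the same information for the machine.

```python
# Hypothetical update rule f and seed x0, chosen purely for illustration.
def f(x):
    return (3 * x + 1) % 17

def stream(x0, n):
    """Generate x_0, x_1, ..., x_{n-1} by iterating f."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(f(xs[-1]))
    return xs

# The trajectory is fully determined by (f, x0): re-evaluating the definition
# reproduces the data stream bit for bit.
assert stream(x0=2, n=10) == stream(x0=2, n=10)
print(stream(x0=2, n=10))
```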

It follows that, from the perspective of a universal computer capable of simulating any dynamical system, the expected minimum description length of a data stream \(X\) modelled by a probabilistic program \(\Omega\):

\begin{equation} \mathbb{E}[K(X)] = H(X|\Omega) + H(\Omega) \end{equation}

is proportional to the average amount of energy required to run \(\Omega\) on an arbitrary input \(x \in \text{Dom}(\Omega)\).
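
As a toy illustration of this two-part decomposition (the Bernoulli model, its parameter values and the fixed precision below are my own assumptions, not something taken from the argument above), we can count the bits needed to encode a binary stream as the cost of the data given the model plus the cost of the model itself:

```python
import math

# Toy two-part code: total description length of a binary stream xs under a
# Bernoulli model with parameter p, written down at a fixed precision.
def two_part_code_length(xs, p, precision_bits=8):
    model_bits = precision_bits                                          # the H(Omega) term
    data_bits = -sum(math.log2(p if x == 1 else 1.0 - p) for x in xs)    # the H(X|Omega) term
    return data_bits + model_bits

xs = [1, 1, 0, 1, 1, 1, 0, 1]
print(two_part_code_length(xs, p=0.75))  # a better-fitting model yields a shorter total code
print(two_part_code_length(xs, p=0.50))
```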

Given the Planck-Church-Turing thesis, this implies that minimising \(\mathbb{E}[K(\cdot)]\) is equivalent to the Minimum Energy Principle, or what some call the Principle of Least Action. It follows that Occam’s razor is a search algorithm over the space of probabilistic programs in an appropriate information geometry.

In fact, just as the Principle of Least Action has a variational formulation, we may note that if we define the functional:

\begin{equation} \forall f \in F_{\theta}, J[f] = H(X|f) + H(f) \end{equation}

where \(f\) is a probabilistic program parametrised by a neural network, we may recover the expected description length \(\mathbb{E}[K(X)]\) above, attained at \(\Omega\), via the vanishing functional derivative:

\begin{equation} \frac{\delta J}{\delta f} \Big|_{f = \Omega} = 0 \end{equation}
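
As a sketch of this variational picture over a concrete one-parameter family (the sigmoid parametrisation and the quadratic complexity penalty are illustrative stand-ins of my own, not a claim about what \(H(f)\) should be), we can treat \(H(X|f)\) as a negative log-likelihood, \(H(f)\) as a description-length penalty, and locate the stationary point of \(J\) numerically:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def J(theta, xs, complexity_weight=0.5):
    """Penalised objective: negative log-likelihood in bits (the H(X|f) term)
    plus a crude description-length penalty on the parameter (standing in for H(f))."""
    p = sigmoid(theta)
    nll = -np.sum(xs * np.log2(p) + (1.0 - xs) * np.log2(1.0 - p))
    return nll + complexity_weight * theta ** 2

xs = np.array([1, 1, 0, 1, 1, 1, 0, 1], dtype=float)
thetas = np.linspace(-4.0, 4.0, 801)
values = [J(t, xs) for t in thetas]
best = thetas[int(np.argmin(values))]
print(best, sigmoid(best))  # the stationary point of J over this one-parameter family
```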

The only problem is that \(J\) does not generally define a convex optimisation problem, so in practice we may use a modified search procedure: (1) choose the smallest \(f \in F_{\theta}\) (2) that minimises \(D_{KL}[P_X \| P_f]\). Alternatively, as proposed by Schmidhuber [1], we may reverse the procedure: first find the most accurate model, then shrink it in order to remove irrelevant axioms.
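
Here is how the reversed, Schmidhuber-style procedure might look in a deliberately simplified setting (the linear family, synthetic data and pruning tolerance are my own assumptions): fit the most accurate model in the family first, then shrink it by discarding parameters that do not noticeably change the fit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0, 0, 0])  # only two relevant features
y = X @ true_w + 0.1 * rng.normal(size=200)

# Step 1: the most accurate model in the family (ordinary least squares).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

# Step 2: shrink it -- zero out the smallest weights while the fit stays within
# a tolerance, removing "irrelevant axioms" from the fitted model.
tolerance = 2.0 * mse(w)
for i in np.argsort(np.abs(w)):
    trial = w.copy()
    trial[i] = 0.0
    if mse(trial) <= tolerance:
        w = trial

print(np.nonzero(w)[0])  # the pruned model keeps only the relevant parameters
```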

I will note that, regardless of the heuristic we use, since we ultimately want to minimise the divergence \(D_{KL}[P_X \| P_f]\), we will typically rely on an approximation to the natural gradient [2].
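
For a one-parameter family the natural-gradient update can be written out exactly; the sketch below (a Bernoulli model of my own choosing, not a neural network) preconditions the ordinary gradient by the inverse Fisher information, so the step follows the model's information geometry rather than the raw parameter space:

```python
import numpy as np

def fit_bernoulli_natural(xs, p=0.2, lr=0.1, steps=200):
    """Natural-gradient descent on the negative log-likelihood of a Bernoulli model."""
    mean = float(np.mean(xs))
    for _ in range(steps):
        grad = -(mean / p - (1.0 - mean) / (1.0 - p))  # ordinary gradient of the NLL per sample
        fisher = 1.0 / (p * (1.0 - p))                 # Fisher information of Bernoulli(p)
        p -= lr * grad / fisher                        # precondition by the inverse Fisher metric
        p = float(np.clip(p, 1e-6, 1.0 - 1e-6))
    return p

print(fit_bernoulli_natural([1, 1, 0, 1, 1, 1, 0, 1]))  # converges towards the MLE, 0.75
```

In this particular case the preconditioned update simplifies to \(p \leftarrow p + \eta(\bar{x} - p)\), which moves directly towards the maximum-likelihood estimate rather than along the raw gradient.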

References:

  1. J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.

  2. S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

  3. P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.