From the perspective of a computation device, function definitions are generally equivalent to function evaluations because the data stream:

\begin{equation} x_{n+1} = f \circ x_n \end{equation}

implies that $$f \circ x_n$$ and $$x_{n+1}$$ have identical representations with respect to the computer.

It follows that from the perspective of a universal computer capable of simulating any dynamical system, the minimum description length of a probabilistic program $$\Omega$$ modelling a data stream $$X$$:

\begin{equation} \mathbb{E}[K(X)] = H(X|\Omega) + H(\Omega) \end{equation}

is proportional to the average amount of energy required to run it on an arbitrary input $$x \in \text{Dom}(\Omega)$$.

Given the Planck-Church-Turing thesis, this implies that $$\mathbb{E}[K(\cdot)]$$ is equivalent to the Minimum Energy Principle or what some call the Principle of Least Action. It follows that Occam’s razor is a search algorithm over the space of probabilistic programs in an appropriate information geometry.

In fact, just as the Principle of Least action has a variational formulation we may note that if we define the functional:

\begin{equation} \forall f \in F_{\theta}, J[f] = H(X|f) + H(f) \end{equation}

where $$f$$ is a probabilistic program parametrised by a neural network, we may recover (1) via the functional derivative:

\begin{equation} \frac{\delta J}{\delta f} \Big|_{f = \Omega} = 0 \end{equation}

The only problem is that $$J$$ does not generally define a convex optimisation problem so in practice we may use a modified search algorithm: (1) choose the smallest $$f \in F_{\theta}$$ (2) that minimises $$D_{KL}[P_X|P_f]$$. Alternatively, as proposed by Schmidhuber, we may reverse the procedure by finding the most accurate model then shrinking it in order to remove irrelevant axioms.

I will note that regardless of the heuristic we use, as we generally want to minimise the divergence $$D_{KL}[P_X|P_f]$$ we will generally use an approximation to the natural gradient.

1. J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.

2. Shun-ichi Amari. Natural Gradient Works Efficiently in Learning. Neural Computation 10, 251–276 (1998)

3. Peter D. Grünwald. The Minimum Description Length Principle . MIT Press. 2007.