The Information Geometry of Occam's Razor
From the perspective of a computing device, function definitions are generally equivalent to function evaluations, because the data stream:
\begin{equation} x_{n+1} = f(x_n) \end{equation}
implies that \(f(x_n)\) and \(x_{n+1}\) have identical representations with respect to the computer.
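To make the iteration concrete, here is a minimal Python sketch in which \(f\) is a toy update rule chosen purely for illustration (any computable map would do); note that evaluating \(f(x_n)\) and producing \(x_{n+1}\) are literally the same operation for the machine:

```python
# Minimal sketch: the update rule f below is an arbitrary toy stand-in
# for a computable dynamical system.
def f(x):
    return (3 * x + 1) % 17

def stream(x0, n):
    """Generate x_1, ..., x_n by iterating x_{k+1} = f(x_k)."""
    x = x0
    for _ in range(n):
        x = f(x)   # evaluating f(x_k) is how the machine represents x_{k+1}
        yield x

print(list(stream(2, 6)))
```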
It follows that, from the perspective of a universal computer capable of simulating any dynamical system, the expected minimum description length of a data stream \(X\) modelled by a probabilistic program \(\Omega\):
\begin{equation} \mathbb{E}[K(X)] = H(X|\Omega) + H(\Omega) \end{equation}
is proportional to the average amount of energy required to run it on an arbitrary input \(x \in \text{Dom}(\Omega)\).
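As a rough numerical illustration of the two-part decomposition, here is a minimal Python sketch using a hypothetical Bernoulli model class standing in for \(\Omega\): the \(H(X|\Omega)\) term is approximated by the negative log-likelihood in bits, and the \(H(\Omega)\) term by a fixed parameter-encoding budget (`param_bits` is an illustrative choice, not something fixed by the argument above):

```python
import numpy as np

def two_part_code_length(x, p, param_bits=16):
    """Bits to describe a binary sequence x under a Bernoulli(p) model:
    H(X|Omega) ~ negative log-likelihood, H(Omega) ~ parameter description."""
    x = np.asarray(x)
    eps = 1e-12
    nll_bits = -np.sum(x * np.log2(p + eps) + (1 - x) * np.log2(1 - p + eps))
    return nll_bits + param_bits

x = np.random.binomial(1, 0.7, size=1000)
print(two_part_code_length(x, p=0.7))  # shorter description...
print(two_part_code_length(x, p=0.5))  # ...than under a poorly chosen model
```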
Given the Planck-Church-Turing thesis, this proportionality between description length and energy implies that minimising \(\mathbb{E}[K(\cdot)]\) is equivalent to the Minimum Energy Principle, or what some call the Principle of Least Action. It follows that Occam's razor is a search algorithm over the space of probabilistic programs in an appropriate information geometry.
In fact, just as the Principle of Least Action has a variational formulation, we may define the functional:
\begin{equation} \forall f \in F_{\theta}, J[f] = H(X|f) + H(f) \end{equation}
where \(f\) is a probabilistic program parametrised by a neural network, and recover \(\Omega\), and with it the expected description length above, via the functional derivative:
\begin{equation} \frac{\delta J}{\delta f} \Big|_{f = \Omega} = 0 \end{equation}
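To spell out the step being asserted here (assuming \(\Omega \in F_{\theta}\) and that this stationary point is indeed the global minimiser over \(F_{\theta}\)), evaluating \(J\) at \(\Omega\) recovers the expected description length above:
\begin{equation} \Omega = \arg\min_{f \in F_{\theta}} \big( H(X|f) + H(f) \big) \implies J[\Omega] = H(X|\Omega) + H(\Omega) = \mathbb{E}[K(X)] \end{equation}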
The only problem is that \(J\) does not generally define a convex optimisation problem, so in practice we may use a modified search algorithm: (1) choose the smallest \(f \in F_{\theta}\) (2) that minimises \(D_{KL}[P_X \| P_f]\). Alternatively, as proposed by Schmidhuber, we may reverse the procedure: find the most accurate model first, then shrink it in order to remove irrelevant axioms.
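Here is a minimal Python sketch of that size-versus-fit trade-off, using a hypothetical histogram model class in place of \(F_{\theta}\): the fit term stands in for \(D_{KL}[P_X \| P_f]\) (up to the entropy of \(P_X\)), the model term grows with the number of bins, and `bits_per_bin` is an illustrative encoding cost rather than anything prescribed by the argument:

```python
import numpy as np

def fit_and_score(x, n_bins, bits_per_bin=8):
    """Two-part score for a hypothetical histogram model with n_bins bins:
    negative log-likelihood in bits (fit) plus bits to describe the model."""
    counts, edges = np.histogram(x, bins=n_bins)
    widths = np.diff(edges)
    probs = (counts + 1) / (counts.sum() + n_bins)          # Laplace smoothing
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    nll_bits = -np.sum(np.log2(probs[idx] / widths[idx]))   # ~ H(X|f) / fit term
    model_bits = n_bins * bits_per_bin                       # ~ H(f) term
    return nll_bits + model_bits

x = np.random.normal(size=2000)
scores = {n: fit_and_score(x, n) for n in (2, 4, 8, 16, 32, 64, 128)}
print(min(scores, key=scores.get))   # the smallest model that still fits well
```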
I will note that, regardless of the heuristic we use, since we ultimately want to minimise the divergence \(D_{KL}[P_X \| P_f]\), in practice we will typically use an approximation to the natural gradient.
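To make that last point concrete, here is a minimal Python sketch of one approximate natural-gradient step, using a damped diagonal empirical Fisher estimate (one common approximation; Amari's formulation preconditions by the full Fisher information matrix, and the Gaussian-mean model below is just a toy example):

```python
import numpy as np

def natural_gradient_step(theta, grad_log_p, data, lr=0.5, damping=1e-4):
    """One step of theta <- theta - lr * F^{-1} grad(NLL), with F approximated
    by a damped diagonal of the empirical Fisher information."""
    scores = np.stack([grad_log_p(theta, x) for x in data])  # per-example scores
    grad = -scores.mean(axis=0)                  # gradient of the average NLL
    fisher_diag = (scores ** 2).mean(axis=0) + damping
    return theta - lr * grad / fisher_diag       # precondition by F^{-1}

# Toy example (hypothetical model): fit the mean of a unit-variance Gaussian.
def grad_log_p(theta, x):
    return np.array([x - theta[0]])              # d/d(mu) log N(x | mu, 1)

data = np.random.normal(3.0, 1.0, size=500)
theta = np.array([0.0])
for _ in range(100):
    theta = natural_gradient_step(theta, grad_log_p, data)
print(theta)   # converges to (approximately) the sample mean, ~3.0
```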
References:

- J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.
- Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251–276, 1998.
- Peter D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.