Kepler Lounge: A physicist's approach to Information Geometry

Aidan Rocke

Summary of Ariel Caticha’s vision:

Ariel Caticha’s thesis is that the goal of physics is not to provide a direct and faithful image of nature but to provide a framework for processing information and making inferences. This imposes severe restrictions on physical models because the tools and methods of physics must inevitably reflect their inferential origins. Bayesian and entropic methods are therefore the relevant tools, since they are designed for assigning and updating probabilities.

Following E.T. Jaynes, we use the principle of maximum entropy to model ‘physical’ three-dimensional space as a curved statistical manifold. This may be readily generalised to n-dimensional space where \(n \geq 3\). The key idea is that the points of space are not defined with perfect resolution. They are not disconnected and structureless dots.

This thesis may be evaluated and refined in the domain of Quantum Statistical Mechanics.

Distance and Volume in curved spaces:

The fundamental notion behind differential geometry derives from the observation that a curved space is locally flat. Hence, curvature may be neglected within a sufficiently small region.

The idea is that within the neighborhood of any point \(x\) of the n-dimensional space we may transform from the original coordinates \(x^a\) to new coordinates \(\hat{x}^\alpha = \hat{x}^\alpha(x^1,...,x^n)\) that are locally Cartesian.

An infinitesimal displacement is then given by:

\[\begin{equation} d\hat{x}^\alpha = X_{a}^{\alpha} dx^a \tag{1} \end{equation}\]

where \(X_{a}^{\alpha} = \frac{\partial \hat{x}^\alpha}{\partial x^a}\).

The corresponding infinitesimal distance may be computed using the Pythagorean theorem:

\[\begin{equation} dl^2 = \delta_{\alpha \beta} d\hat{x}^{\alpha} d\hat{x}^{\beta} \tag{2} \end{equation}\]

Changing back to the original frame:

\[\begin{equation} dl^2 = \delta_{\alpha \beta} d\hat{x}^{\alpha} d\hat{x}^{\beta} = \delta_{\alpha \beta} X_{a}^{\alpha} X_{b}^{\beta} dx^a dx^b \tag{3} \end{equation}\]

and if we define the metric tensor:

\[\begin{equation} g_{ab} = \delta_{\alpha \beta} X_{a}^{\alpha} X_{b}^{\beta} \tag{4} \end{equation}\]

we may express the infinitesimal Pythagorean theorem in generic coordinates \(x^a\) as:

\[\begin{equation} dl^2 = g_{ab} dx^a dx^b \tag{5} \end{equation}\]
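As a concrete check of equation (4), here is a minimal sketch that builds \(g_{ab}\) numerically from the Jacobian \(X_a^{\alpha}\). It assumes flat three-dimensional space described in spherical coordinates \((r, \theta, \phi)\), so the Cartesian frame is available globally rather than only locally; the function names and the finite-difference step are illustrative choices, not part of Caticha's presentation.

```python
import numpy as np

def cartesian_from_spherical(q):
    """Cartesian coordinates (x, y, z) from q = (r, theta, phi)."""
    r, theta, phi = q
    return np.array([
        r * np.sin(theta) * np.cos(phi),
        r * np.sin(theta) * np.sin(phi),
        r * np.cos(theta),
    ])

def jacobian(f, q, h=1e-6):
    """Central-difference estimate of X[alpha, a] = d x_hat^alpha / d x^a."""
    q = np.asarray(q, dtype=float)
    cols = []
    for a in range(q.size):
        step = np.zeros_like(q)
        step[a] = h
        cols.append((f(q + step) - f(q - step)) / (2.0 * h))
    return np.stack(cols, axis=1)

def metric(f, q):
    """g_ab = delta_{alpha beta} X^alpha_a X^beta_b, i.e. the matrix X^T X."""
    X = jacobian(f, q)
    return X.T @ X

q = np.array([2.0, 0.7, 1.3])
g = metric(cartesian_from_spherical, q)

# Textbook result for spherical coordinates: diag(1, r^2, r^2 sin^2(theta)).
expected = np.diag([1.0, q[0] ** 2, (q[0] * np.sin(q[1])) ** 2])
print(np.allclose(g, expected, atol=1e-6))
```

The numerical metric agrees with the familiar line element \(dl^2 = dr^2 + r^2 d\theta^2 + r^2 \sin^2\theta \, d\phi^2\) up to finite-difference error.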

The Metric Tensor and the Volume Form:

The following analysis is motivated by the fact that the absolute value of a volume form is a volume element.

Under a coordinate transformation, \(g_{ab}\) transforms according to:

\[\begin{equation} g_{ab} = X_{a}^{a'} X_{b}^{b'} g_{a' b'} \tag{6} \end{equation}\]

where \(X_{a}^{a'} = \frac{\partial x^{a'}}{\partial x^a}\) so the infinitesimal distance \(dl\) is independent of the choice of coordinates.

To find the finite length between two points along a curve \(x(\lambda)\), parameterized by \(\lambda\), we may integrate along the curve:

\[\begin{equation} l = \int dl = \int_{\lambda_1}^{\lambda_2} \Big(g_{ab} \frac{dx^a}{d \lambda} \frac{dx^b}{d \lambda}\Big)^{\frac{1}{2}} d \lambda \tag{7} \end{equation}\]
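The length integral can be approximated by summing the line element \(dl = (g_{ab} dx^a dx^b)^{\frac{1}{2}}\) of equation (5) over small steps in \(\lambda\). The sketch below assumes the unit 2-sphere with coordinates \((\theta, \phi)\) and metric \(\text{diag}(1, \sin^2\theta)\) as a test case; the helper names are hypothetical.

```python
import numpy as np

def sphere_metric(x):
    """Metric of the unit 2-sphere in coordinates x = (theta, phi)."""
    theta, _ = x
    return np.diag([1.0, np.sin(theta) ** 2])

def curve_length(curve, metric, lam1, lam2, n=2000):
    """Sum dl = sqrt(g_ab dx^a dx^b) over n sub-intervals of [lam1, lam2]."""
    edges = np.linspace(lam1, lam2, n + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        dx = curve(hi) - curve(lo)            # coordinate displacement
        g = metric(curve(0.5 * (lo + hi)))    # metric at the midpoint
        total += np.sqrt(dx @ g @ dx)
    return total

theta0 = np.pi / 3

def circle(lam):
    """Circle of constant polar angle theta0, parameterized by lam = phi."""
    return np.array([theta0, lam])

length = curve_length(circle, sphere_metric, 0.0, 2.0 * np.pi)
print(length, 2.0 * np.pi * np.sin(theta0))   # the two values should agree
```

For this curve the exact answer is \(2\pi \sin\theta_0\), the circumference of a circle of latitude, which makes the result easy to verify.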

Once we have a measure of distance, we may also measure angles, areas, volumes and all sorts of geometrical quantities. To find an expression for the n-dimensional volume element \(dV_n\) we may transform to locally Cartesian coordinates so the volume element is given by the product:

\[\begin{equation} dV_n = d\hat{x}^1 d\hat{x}^2 ... d\hat{x}^n \tag{8} \end{equation}\]

and transform back to the original coordinates \(x^a\) via:

\[\begin{equation} dV_n = \Big\lvert \frac{\partial \hat{x}}{\partial x} \Big \rvert dx^1 dx^2 ... dx^n = \lvert \text{det} X_a^{\alpha}\rvert d^n x \tag{9} \end{equation}\]

The Basics of Information Geometry:

The transformation of the metric from its Euclidean form \(\delta_{\alpha \beta}\) to \(g_{ab}\) is the product of three matrices. Taking the determinant, we find:

\[\begin{equation} g \equiv \text{det}(g_{ab}) = \lvert \text{det} X_a^{\alpha} \rvert^2 \tag{10} \end{equation}\]

so we have:

\[\begin{equation} \lvert \text{det}(X_a^{\alpha}) \rvert = g^{\frac{1}{2}} \tag{11} \end{equation}\]

We have thus succeeded in expressing the volume element in terms of the metric \(g_{ab}(x)\) in the original coordinates \(x^a\). As a result, we have:

\[\begin{equation} d V_n = g^{\frac{1}{2}}(x) d^n x \tag{12} \end{equation}\]

The volume of any extended region on the Manifold is given by:

\[\begin{equation} V_n = \int d V_n = \int g^{\frac{1}{2}} d^n x \tag{13} \end{equation}\]

As a consequence, a uniform distribution over such a statistical manifold may be expressed as follows:

\[\begin{equation} p(x) d^n x \propto g^{\frac{1}{2}} d^n x \tag{14} \end{equation}\]

which implies that equal volumes are assigned equal probabilities.
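Equations (12) to (14) can be illustrated on the unit 2-sphere, where \(g_{ab} = \text{diag}(1, \sin^2\theta)\) and hence \(g^{\frac{1}{2}} = \sin\theta\). The following sketch, which assumes a simple midpoint rule on a coordinate grid, integrates \(g^{\frac{1}{2}}\) to recover the total area and then normalizes it to obtain the uniform density.

```python
import numpy as np

def sqrt_det_metric(theta, phi):
    """g^(1/2) for the unit 2-sphere, with g_ab = diag(1, sin^2(theta))."""
    g = np.diag([1.0, np.sin(theta) ** 2])
    return np.sqrt(np.linalg.det(g))

def total_volume(n_theta=200, n_phi=200):
    """Midpoint-rule estimate of V = int g^(1/2) d(theta) d(phi)."""
    d_theta = np.pi / n_theta
    d_phi = 2.0 * np.pi / n_phi
    thetas = (np.arange(n_theta) + 0.5) * d_theta
    phis = (np.arange(n_phi) + 0.5) * d_phi
    total = 0.0
    for t in thetas:
        for p in phis:
            total += sqrt_det_metric(t, p) * d_theta * d_phi
    return total

area = total_volume()

def uniform_density(theta, phi):
    """p(theta, phi) = g^(1/2) / V, uniform with respect to area."""
    return sqrt_det_metric(theta, phi) / area

print(area, 4.0 * np.pi)   # the estimate should be close to 4*pi
```

Note that the density is not constant in the coordinates \((\theta, \phi)\): it is largest at the equator, where a coordinate cell covers the most area.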

Derivation of the Fisher Information Metric:

Let’s suppose we would like a quantitative measure of the extent to which two neighboring distributions \(p(x|\theta)\) and \(p(x|\theta + d \theta)\) are distinguishable. This approach allows us to interpret the metric as a measure of uncertainty and distinguishability.

If we consider the relative difference,

\[\begin{equation} \Delta = \frac{p(x|\theta + d\theta) - p(x|\theta)}{p(x|\theta)} = \frac{\partial \log p(x| \theta)}{\partial \theta^{a}} d \theta^{a} \tag{15} \end{equation}\]

the expected value of the relative difference, \(\langle \Delta \rangle\), actually vanishes identically:

\[\begin{equation} \langle \Delta \rangle = \int p(x| \theta) \frac{\partial \log p(x| \theta)}{\partial \theta^{a}} d \theta^{a} dx = d\theta^a \frac{\partial}{\partial \theta^a} \int p(x|\theta) dx = 0 \tag{16} \end{equation}\]

Hence, \(\langle \Delta \rangle\) isn’t a good candidate for a measure of distinguishability.

On the other hand, the variance does not vanish:

\[\begin{equation} dl^2 = \langle \Delta^2 \rangle = \int p(x|\theta) \frac{\partial \log p(x|\theta)}{\partial \theta^a} \frac{\partial \log p(x|\theta)}{\partial \theta^b} d\theta^a d\theta^b dx \tag{17} \end{equation}\]

which is the measure of distinguishability we seek as a small value of \(dl^2\) implies that \(\Delta\) is negligible and that the points \(\theta\) and \(\theta + d\theta\) are indistinguishable.

This suggests defining the matrix \(g_{ab}\):

\[\begin{equation} g_{ab}(\theta) = \int p(x|\theta) \frac{\partial \log p(x|\theta)}{\partial \theta^a} \frac{\partial \log p(x|\theta)}{\partial \theta^b} dx \tag{18} \end{equation}\]

so we have:

\[\begin{equation} dl^2 = g_{ab} d\theta^a d\theta^b \tag{19} \end{equation}\]
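To make equations (16) and (18) concrete, the sketch below evaluates both integrals numerically for a one-dimensional Gaussian family with \(\theta = (\mu, \sigma)\). The choice of family, the grid and the finite-difference step are assumptions made for illustration; for this family the standard closed form is \(g_{ab} = \text{diag}(1/\sigma^2, 2/\sigma^2)\).

```python
import numpy as np

def log_p(x, theta):
    """Log-density of a Gaussian with theta = (mu, sigma)."""
    mu, sigma = theta
    return (-0.5 * ((x - mu) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))

def score(x, theta, h=1e-5):
    """Central-difference estimate of d log p / d theta^a, shape (2, len(x))."""
    theta = np.asarray(theta, dtype=float)
    rows = []
    for a in range(theta.size):
        step = np.zeros_like(theta)
        step[a] = h
        rows.append((log_p(x, theta + step) - log_p(x, theta - step)) / (2.0 * h))
    return np.stack(rows)

def score_mean_and_metric(theta, half_width=12.0, n=20001):
    """Return (<d log p / d theta^a>, g_ab) by summing over a uniform x grid."""
    mu, sigma = theta
    x = np.linspace(mu - half_width * sigma, mu + half_width * sigma, n)
    dx = x[1] - x[0]
    p = np.exp(log_p(x, theta))
    s = score(x, theta)
    mean_score = (s * p).sum(axis=1) * dx                        # equation (16)
    g = (s[:, None, :] * s[None, :, :] * p).sum(axis=2) * dx     # equation (18)
    return mean_score, g

theta = np.array([0.5, 2.0])
mean_score, g = score_mean_and_metric(theta)
print(np.round(mean_score, 6))
print(np.round(g, 6))
```

The first moment vanishes to numerical precision while the second moment does not, which is why the variance rather than the mean serves as the measure of distinguishability.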

It was Rao who recognised that \(g_{ab}\) is a metric in the space of probability distributions. As the coordinates \(\theta\) are arbitrary, we may freely reparametrize the points in the manifold. We may then check that \(g_{ab}\) are the components of a tensor and that \(dl^2\) is a geometric invariant.

In fact, we find that a change of coordinates:

\[\begin{equation} \theta^{a'} = f^{a'}(\theta^1,...,\theta^n) \tag{20} \end{equation}\]

leads us to:

\[\begin{equation} d\theta^a = \frac{\partial \theta^a}{\partial \theta^{a'}} d \theta^{a'} \tag{21} \end{equation}\]

Hence, we have the pair of equations:

\[\begin{equation} d\theta^a d\theta^b = \frac{\partial \theta^a}{\partial \theta^{a'}} \frac{\partial \theta^b}{\partial \theta^{b'}} d \theta^{a'} d\theta^{b'} \tag{22} \end{equation}\]

\[\begin{equation} dl^2 = g_{a' b'} d\theta^{a'} d\theta^{b'} \tag{23} \end{equation}\]

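and therefore, substituting (22) into (19) and comparing with (23), which holds because \(\Delta\) in (15) does not depend on the choice of coordinates, we find:

\[\begin{equation} g_{a' b'} = \frac{\partial \theta^a}{\partial \theta^{a'}} \frac{\partial \theta^b}{\partial \theta^{b'}} g_{ab} \tag{24} \end{equation}\]

which, after inverting the Jacobian, is equivalent to: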

\[\begin{equation} g_{ab} = \frac{\partial \theta^{a'}}{\partial \theta^a} \frac{\partial \theta^{b'}}{\partial \theta^b} g_{a' b'} \tag{25} \end{equation}\]

QED.
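The transformation law (25) can also be checked numerically. The sketch below assumes a Gaussian family in the coordinates \(\theta = (\mu, \sigma)\) and a hypothetical reparametrization \(\theta' = (\mu + \sigma, \log \sigma)\), chosen only because its Jacobian is not diagonal. The metric is computed from definition (18) independently in each coordinate system and the two results are then compared.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    return (-0.5 * ((x - mu) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))

def log_p_old(x, th):
    """Family in the coordinates th = (mu, sigma)."""
    return log_gauss(x, th[0], th[1])

def log_p_new(x, thp):
    """Same family in the coordinates thp = (mu + sigma, log sigma)."""
    sigma = np.exp(thp[1])
    return log_gauss(x, thp[0] - sigma, sigma)

def to_new(th):
    return np.array([th[0] + th[1], np.log(th[1])])

def fisher(log_p, th, x, h=1e-5):
    """Equation (18) on a uniform x grid, with central-difference scores."""
    th = np.asarray(th, dtype=float)
    dx = x[1] - x[0]
    p = np.exp(log_p(x, th))
    rows = []
    for a in range(th.size):
        step = np.zeros_like(th)
        step[a] = h
        rows.append((log_p(x, th + step) - log_p(x, th - step)) / (2.0 * h))
    s = np.stack(rows)
    return (s[:, None, :] * s[None, :, :] * p).sum(axis=2) * dx

th = np.array([0.5, 2.0])
x = np.linspace(th[0] - 24.0, th[0] + 24.0, 20001)

g_old = fisher(log_p_old, th, x)            # g_ab
g_new = fisher(log_p_new, to_new(th), x)    # g_a'b'

# J[a', a] = d theta^{a'} / d theta^a for the reparametrization above.
J = np.array([[1.0, 1.0],
              [0.0, 1.0 / th[1]]])
g_back = J.T @ g_new @ J                    # right-hand side of (25)
print(np.allclose(g_back, g_old, atol=1e-6))
```

Although \(g_{a'b'}\) has off-diagonal entries in the new coordinates, contracting it with the Jacobian recovers the diagonal metric of the original coordinates.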

Technical Summary:

A parametric family of probability distributions is a set of distributions \(p_{\theta}(x)\) labeled by parameters \(\theta = (\theta^1,...,\theta^n)\).
Such a family forms a statistical manifold, namely a space in which each point labeled by coordinates \(\theta\) represents a probability distribution \(p_{\theta}(x)\).
Statistical manifolds possess a unique notion of distance, the information metric. In fact, geometry is intrinsic to the structure of statistical manifolds.
The distance \(dl\) between two neighboring points \(\theta\) and \(\theta + d\theta\) is given by the Pythagorean theorem, which is expressed in terms of the metric tensor \(g_{ab}\) as:

\[\begin{equation} dl^2 = g_{ab} d\theta^a d\theta^b \end{equation}\]

Having a notion of distance means we have a notion of volume which implies that there is a unique notion of a distribution that is uniform over the space of parameters. Equal volumes are assigned equal probabilities.
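A small example may help here. For the one-parameter Bernoulli family with success probability \(\theta\), the information metric is \(g(\theta) = 1/(\theta(1-\theta))\), and the substitution \(\theta = \sin^2\varphi\) gives \(dl = 2 d\varphi\), so the total volume is \(\pi\). The sketch below, whose function names and sample size are arbitrary choices, builds the density that is uniform in information volume and samples from it.

```python
import numpy as np

def sqrt_g(theta):
    """g^(1/2) for a Bernoulli(theta) model, g = 1 / (theta (1 - theta))."""
    return 1.0 / np.sqrt(theta * (1.0 - theta))

def information_volume(theta_lo, theta_hi):
    """int g^(1/2) d(theta) in closed form, using theta = sin^2(phi)."""
    return 2.0 * (np.arcsin(np.sqrt(theta_hi)) - np.arcsin(np.sqrt(theta_lo)))

total_volume = information_volume(0.0, 1.0)   # equals pi

def uniform_density(theta):
    """p(theta) = g^(1/2) / V, uniform with respect to information volume."""
    return sqrt_g(theta) / total_volume

# Sampling: phi is uniform in volume, so draw phi uniformly and map to theta.
rng = np.random.default_rng(0)
phi = rng.uniform(0.0, np.pi / 2.0, size=200_000)
samples = np.sin(phi) ** 2

frac_sampled = np.mean(samples < 0.25)
frac_volume = information_volume(0.0, 0.25) / total_volume
print(frac_sampled, frac_volume)   # both should be close to 1/3
```

The interval \(\theta < 0.25\) holds a third of the total volume, not a quarter, because distributions near \(\theta = 0\) and \(\theta = 1\) are easier to tell apart and therefore occupy more volume per unit of \(\theta\).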

References:

Ariel Caticha. Geometry from Information Geometry. MaxEnt 2015, the 35th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering. 2015.
Lay Kuan Loh & Mihovil Bartulovic. Efficient Coding Hypothesis and an Introduction to Information Theory. 2014.
