The information theory behind UMAP

An overview of fundamental notions from information theory that underlie the information geometry of UMAP.

Aidan Rocke https://github.com/AidanRocke
04-06-2023

Introduction:

The following analysis builds upon the work of Sebastian Damrich and Fred A. Hamprecht, who derived the effective loss function of UMAP [1]. In particular, they found that UMAP actually searches for a binarised approximation of the high-dimensional similarities.

Further investigation led us to information-theoretic foundations for UMAP that deviate considerably from the original Uniformity assumption proposed by the authors of UMAP [2]. Finally, we show that the binary cross-entropy formula equals the KL Divergence between the high- and low-dimensional probability distributions over edges plus a constant term.

This brief analysis, in particular the equivalence of KL Divergence and the Binary Cross-Entropy, prepares the ground for more sophisticated analyses using the framework of Information Geometry.

The Uniformity assumption:

What follows is a quasi-verbatim summary of the Uniformity assumption as described in [2].

  1. If we assume the data to be uniformly distributed on the Riemannian manifold \(M\) with respect to the metric \(g\), then, away from any boundaries, any ball of fixed volume should contain the same number of points of \(X\) regardless of where on the manifold it is centred.

  2. Given finite data and small enough local neighborhoods, this crude approximation should be accurate enough even for data samples near manifold boundaries.

  3. Conversely, a ball centred at \(X_i\) that contains exactly the k-nearest neighbors of \(X_i\) should have approximately fixed volume regardless of the choice of \(X_i \in X\).

In essence, by creating a custom distance for each \(X_i\), we can ensure the validity of the assumption of a uniform distribution on the manifold, though the challenge remains to merge these local metrics into a consistent global structure.
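To make this concrete, the following minimal sketch (our own illustration, not code from [2]) samples a point cloud with two very different densities and shows that the radius of the ball containing the \(k\) nearest neighbours varies strongly from point to point; this is precisely what the per-point rescaling of distances is meant to compensate for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-uniformly sampled data: a dense cluster and a sparse cluster.
dense = rng.normal(loc=0.0, scale=0.1, size=(200, 2))
sparse = rng.normal(loc=3.0, scale=1.0, size=(200, 2))
X = np.vstack([dense, sparse])

k = 15

# Pairwise Euclidean distances.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Radius of the ball containing the k nearest neighbours of each point
# (column 0 of the sorted distances is the point itself, at distance 0).
knn_radius = np.sort(D, axis=1)[:, k]

print("k-NN ball radius, dense region :", knn_radius[:200].mean())
print("k-NN ball radius, sparse region:", knn_radius[200:].mean())
# The two radii differ by roughly an order of magnitude here, so a single
# global metric cannot make fixed-volume balls contain ~k points;
# UMAP instead rescales distances separately around each X_i.
```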

Revisiting the Uniformity assumption:

Upon closer inspection, their uniformity assumption relies upon additional assumptions that aren’t carefully discussed:

  1. The data lies on a locally Euclidean manifold, and there is no pathological concentration-of-measure phenomenon in high-dimensional spaces.

  2. Due to Occam’s razor, local distances encode the statistical distribution of the data across the manifold, so information geometry is the appropriate setting for analysing UMAP.

  3. The local geometry determines local connectivity and hence local entropy. Moreover, a particular volume constraint is used in their algorithm:

\[\begin{equation} \sum_j p_{ij} = \log_2 k \tag{1} \end{equation}\]

which is derived through a population of \(k\) spatial Gaussian distributions. These determine a form of metric-based importance sampling from Bernoulli distributions that share the same rescaling coefficient \(\sigma_i\). However, this volume constraint deviates considerably from the volume constraint in their Uniformity assumption.

In fact, the authors of [1] found that the degree of each node is approximately equal to \(\log_2 k\). This topological volume constraint presumes that local connectivity determines local curvature through a finite subcover.

Unfortunately, this requires delicate proofs of conditional results that the UMAP authors have omitted from [2].

High-dimensional probabilities:

  1. Probabilities are not normalised because the goal of the algorithm is to recover the binarised nearest-neighbor graph as an approximation of the local geometry.

  2. To be precise, UMAP uses independent Gaussians in high dimensions that define distinct Bernoulli distributions:

\[\begin{equation} p_{i|j} = e^{-\frac{d(x_i,x_j)-\rho_i}{\sigma_i}} \tag{2} \end{equation}\]

where \(\rho_i\) denotes the distance from the \(i\)th data point to its nearest neighbor.

This implies that the UMAP authors consider distinct pair-wise relations to be independent of each other. As a side note, the absence of conventional normalisation dramatically reduces the time for computing the high-dimensional graph, as computing normalisation constants over the whole dataset is computationally expensive.
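To illustrate Eqs. (1) and (2) together, here is a minimal sketch that solves for \(\sigma_i\) by binary search so that the conditional probabilities of each point sum to \(\log_2 k\). The function name `high_dim_probabilities` is our own, and the reference umap-learn implementation handles boundary cases and interpolation differently, so this should be read as an illustrative approximation rather than the actual algorithm.

```python
import numpy as np

def high_dim_probabilities(dists, k, n_iter=64):
    """Conditional probabilities of Eq. (2), with sigma_i chosen by binary
    search so that sum_j p_{i|j} = log2(k), as in Eq. (1).

    `dists` is an (n, k) array of distances from each point to its k
    nearest neighbours (self excluded), sorted in increasing order.
    """
    n = dists.shape[0]
    target = np.log2(k)
    rho = dists[:, 0]                       # distance to the closest neighbour
    P = np.zeros_like(dists)

    for i in range(n):
        lo, hi = 1e-12, 1e3
        for _ in range(n_iter):             # binary search for sigma_i
            sigma = 0.5 * (lo + hi)
            p = np.exp(-np.maximum(dists[i] - rho[i], 0.0) / sigma)
            if p.sum() > target:
                hi = sigma                  # too much probability mass: shrink sigma
            else:
                lo = sigma                  # too little: grow sigma
        P[i] = p
    return P
```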

Using entropy to count nearest neighbors:

  1. UMAP uses the following symmetrisation of high-dimensional probability:

\[\begin{equation} p_{ij} = p_{i|j} + p_{j|i} - p_{i|j}\cdot p_{j|i} \implies p_{ij} = p_{ji} \tag{3} \end{equation}\]

which is necessary since, after UMAP glues points with locally varying metrics (due to the local parameters \(\rho_i\) and \(\sigma_i\)), the degree of belief in the edge \(A \rightarrow B\) need not equal the degree of belief in \(B \rightarrow A\).

  2. Since each edge in the binarised nearest-neighbor graph carries approximately one bit of information, the number of neighbors can be recovered from the total edge probability:

\[\begin{equation} k = 2^{\sum_j p_{ij}} \tag{4} \end{equation}\]

where \(\sum_j p_{ij} = \log_2 k\) is an entropy term and the probabilities \(p_{ij}\) approximate Boolean truth values (a numerical sketch of Eqs. (3) and (4) follows below).
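The sketch below (reusing the hypothetical `high_dim_probabilities` helper from the previous block) applies the symmetrisation of Eq. (3) on a random dataset, verifies that the result is symmetric, and compares the node degrees with \(\log_2 k\), the value suggested by Eq. (4) and the observation cited from [1].

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
k = 15

# Pairwise distances and the k nearest neighbours of each point.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
order = np.argsort(D, axis=1)
nbr_idx = order[:, 1:k + 1]                      # exclude self (column 0)
nbr_dist = np.take_along_axis(D, nbr_idx, axis=1)

# Directed graph of conditional probabilities (helper sketched above).
P_cond = np.zeros((len(X), len(X)))
np.put_along_axis(P_cond, nbr_idx, high_dim_probabilities(nbr_dist, k), axis=1)

# Eq. (3): probabilistic union ("fuzzy OR") of the two directed edges.
P = P_cond + P_cond.T - P_cond * P_cond.T

degrees = P.sum(axis=1)                          # node degrees of the symmetrised graph
print("max |p_ij - p_ji|:", np.abs(P - P.T).max())   # symmetry check for Eq. (3)
print("log2(k)          :", np.log2(k))
print("mean node degree :", degrees.mean())          # compare with log2(k), cf. Eq. (4) and [1]
```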

Low-dimensional probabilities:

If \(Y\) denotes the stochastic geometry of the low-dimensional space, whose ambient dimension \(n \leq 3\) is chosen so that dataset visualization tools are applicable, then the UMAP authors use the following approximation to Student’s t-distribution:

\[\begin{equation} \big(1+a\|y_i-y_j\|^{2b}\big)^{-1} \tag{5} \end{equation}\]

which is approximately \(1.0\) when \(\|y_i-y_j\| \leq d_{\text{min}}\) and decays towards zero for larger distances, when the hyper-parameters are given by the constants \(a=1.93\), \(b=0.79\) and \(d_{\text{min}} = 0.001\).

We may note that Student’s t-distribution is a sensible choice when the variance of the distribution is unknown.
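As a small illustration, and taking the quoted constants \(a = 1.93\), \(b = 0.79\) at face value, the low-dimensional kernel of Eq. (5) can be evaluated directly to see its near-binary behaviour; `low_dim_similarity` is our own name for this function.

```python
import numpy as np

def low_dim_similarity(dist, a=1.93, b=0.79):
    """Eq. (5): smooth surrogate for the indicator 'closer than d_min'."""
    return 1.0 / (1.0 + a * dist ** (2 * b))

for d in [0.0, 0.001, 0.1, 1.0, 5.0]:
    print(f"distance {d:>6}: similarity {low_dim_similarity(d):.4f}")
# Distances at or below d_min ~ 0.001 give similarities very close to 1,
# while distances of a few units give similarities close to 0.
```

As far as we understand, umap-learn actually fits \(a\) and \(b\) to the chosen `min_dist` by curve fitting, so the constants above should be treated as representative values rather than universal ones.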

On the equivalence of Binary Cross-Entropy and the KL Divergence:

If \(E\) denotes the set of all possible 1-simplices (i.e. the set of possible edges), with Bernoulli probability \(w_h(e)\) for the 1-simplex \(e\) in the high-dimensional case and Bernoulli probability \(w_l(e)\) in the low-dimensional case, then the cross-entropy becomes:

\[\begin{equation} \sum_{e \in E} w_h(e) \log \big(\frac{w_h(e)}{w_l(e)}\big) + \sum_{e \in E} (1-w_h(e)) \log \big(\frac{1-w_h(e)}{1-w_l(e)}\big) = -\sum_{e \in E} w_h(e) \log w_l(e) - \sum_{e \in E} (1-w_h(e)) \log (1-w_l(e)) + \text{Cst} \tag{6} \end{equation}\]

where the constant term \(\text{Cst} = \sum_{e \in E} \big[ w_h(e) \log w_h(e) + (1-w_h(e)) \log (1-w_h(e)) \big]\) depends only on the high-dimensional weights, which are defined and fixed before the optimisation of the low-dimensional probabilities begins.

By setting \(f_X(e) := w_h(e)\) and \(f_Y(e) := w_l(e)\), and letting the sum run over both outcomes of each Bernoulli edge variable (edge present and edge absent), we may note that the binary cross-entropy reduces to the cross-entropy:

\[\begin{equation} H(X,Y) = -\sum_{e \in E} f_X(e) \log f_Y(e) \tag{7} \end{equation}\]

where the entropy \(H(X) = -\sum_{e \in E} f_X(e) \log f_X(e)\) is constant under the same convention, so we may deduce that:

\[\begin{equation} D_{KL}(X\|Y) = H(X,Y) - H(X) = H(X,Y) + \text{Cst} \tag{8} \end{equation}\]

Thus, minimising the KL Divergence and minimising the binary cross-entropy induce equivalent learning dynamics under Gradient Descent, since the two objectives differ only by a constant.
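A quick numerical sanity check of Eqs. (6)–(8), with randomly generated edge probabilities standing in for the actual graphs, confirms that the binary cross-entropy and the KL Divergence differ by a term that does not depend on the low-dimensional weights:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 1e-12

w_h = rng.uniform(eps, 1 - eps, size=1000)   # fixed high-dimensional weights
w_l = rng.uniform(eps, 1 - eps, size=1000)   # candidate low-dimensional weights

def bce(w_h, w_l):
    """Binary cross-entropy over edges: right-hand side of Eq. (6)."""
    return -np.sum(w_h * np.log(w_l) + (1 - w_h) * np.log(1 - w_l))

def kl(w_h, w_l):
    """Sum of per-edge Bernoulli KL divergences: left-hand side of Eq. (6)."""
    return np.sum(w_h * np.log(w_h / w_l)
                  + (1 - w_h) * np.log((1 - w_h) / (1 - w_l)))

# Negative entropy of the high-dimensional weights: the constant in Eq. (6).
neg_entropy = np.sum(w_h * np.log(w_h) + (1 - w_h) * np.log(1 - w_h))

print(np.isclose(kl(w_h, w_l), bce(w_h, w_l) + neg_entropy))   # True

# Changing w_l changes both objectives by the same amount,
# so gradient descent on either produces identical updates.
w_l2 = rng.uniform(eps, 1 - eps, size=1000)
print(np.isclose(kl(w_h, w_l2) - kl(w_h, w_l),
                 bce(w_h, w_l2) - bce(w_h, w_l)))              # True
```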

Discussion:

This brief analysis, in particular the equivalence of KL Divergence and the Binary Cross-Entropy, prepares the ground for more sophisticated analyses using the framework of Information Geometry.

The key lemma that builds such an important bridge between the world of information theory and differential geometry is the characterisation of the Fisher Information metric as the second derivative (Hessian) of the KL Divergence, combined with the empirical fact that adaptive variants of Stochastic Gradient Descent (e.g. Adagrad) rely on diagonal approximations of the Hessian. Zooming out, it is no surprise that the field of Information Geometry has gained considerable traction within deep learning.
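To make this bridge concrete for the simplest possible case, the following sketch checks by finite differences that the second derivative of \(\theta \mapsto D_{KL}(p_{\theta_0} \| p_{\theta})\) for a Bernoulli family, evaluated at \(\theta = \theta_0\), recovers the Fisher information \(1/(\theta_0(1-\theta_0))\):

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta0 = 0.3
h = 1e-4

# Second derivative of theta -> KL(theta0 || theta) at theta = theta0,
# estimated by a central finite difference.
second_derivative = (kl_bernoulli(theta0, theta0 + h)
                     - 2.0 * kl_bernoulli(theta0, theta0)
                     + kl_bernoulli(theta0, theta0 - h)) / h ** 2

fisher_information = 1.0 / (theta0 * (1.0 - theta0))

print(second_derivative)     # ~ 4.7619
print(fisher_information)    # 4.7619...
```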

References:

  1. Sebastian Damrich and Fred A. Hamprecht. On UMAP’s true loss function. arXiv preprint, 2021.

  2. L. McInnes, J. Healy, and J. Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

  3. Nikolay Oskolkov. How Exactly UMAP Works. Towards Data Science, 2019.

Citation

For attribution, please cite this work as

Rocke (2023, April 6). Kepler Lounge: The information theory behind UMAP. Retrieved from keplerlounge.com

BibTeX citation

@misc{rocke2023the,
  author = {Rocke, Aidan},
  title = {Kepler Lounge: The information theory behind UMAP},
  url = {keplerlounge.com},
  year = {2023}
}