An artist's rendition of trajectories in latent space(source: CERN)

# Introduction:

Let’s suppose that one day you could somehow visualize what was going on inside deep neural networks(with relu activation of course) while they were being fed data at an insane rate by GPUs. What would you see?

If you embed the neural network calculations in some high-dimensional affine space, i.e. lacking an origin, then you would probably see the equivalent of high-dimensional particles hurtling at an insane rate. Information collapsing into dot products at the nodes of one layer before fanning out again in unpredictable directions due to the affine transformations at the next layer. Unpredictable to yourself because you are an external observer that has no idea what’s going on…but with this extrinsic perspective it’s still possible to infer a lot even if your choice of ambient space might be a nonsensical parametrization with respect to the neural network.

As a result of careful analysis you might realise the following:

1. The latent space of a deep rectifier network $F_\theta$ is an Orthogonal Function space and functions by de-correlating input signals $x \sim X$. This became clear to me a few weeks ago.

2. Besides the algebraic structure you might also notice how lower levels of the network may be identified with re-usable geometric transformations that are used exponentially more often than expressions at the higher levels of the network. So the deep rectifier network may be identified with a special kind of geometric decision tree.

3. In higher-dimensional spaces geometric transformations correspond to Jacobians if we can justify the transition from discrete-time to continuous time-and-space geometric decision trees.

4. From the perspective of function approximation I show that this transition is justified and also that function approximation implies manifold learning. I also explain that information concerning the manifold must be encoded by sequences of affine transformations which are trajectories of information in latent space.

5. Finally, I show that the trajectory formalism leads to a natural statistical relation between linear interpolations in parameter space $\theta$ and the non-convexity of $F_\theta$.

Given that many readers of this article may not have read my two previous articles on deep rectifier networks, my objective is to first summarize the Function Space and Geometric Decision Tree perspective before motivating the trajectory-in-latent-space perspective. I then show that the structure of trajectories in latent space natural statistical relation between linear interpolations in parameter space $\theta$ and the non-convexity of $F_\theta$. This explanation is accompanied by calculations using sequences of random matrices in Julia Lang.

# The Orthogonal Function Space structure of the latent space:

Mathematically, we may introduce deep rectifier networks as follows:

where the parameter space $\theta$ is defined as follows:

$$\theta = \big\{\ W_l \in \mathbb{R}^{ n_l \times n_{l-1}},b_l \in \mathbb{R}^{ n_l}: l \in [L] \big\}$$

so if there are $L$ layers and $N = \sum_{i=1}^L n_i$ nodes you may observe that:

1. The ReLU activation serves as a gating mechanism for a deep network with $N$ nodes.
2. This gating mechanism decomposes the latent space of a deep rectifier network into $m \leq 2^N$ linear feature maps $\phi_i$.
3. Each of these feature maps $\phi_i$ have compact support and their domains $X_i$ are pair-wise disjoint.
4. It follows that the latent space of $F_\theta$ forms an orthogonal function space.
5. The continuity and compact support of the $\phi_i$ implies that the $\phi_i$ are square integrable so the latent space of $F_\theta$ forms a Hilbert space.

This is interesting because many natural signals, ex. sounds, inhabit Hilbert spaces.

# Deep Rectifier Networks as Geometric Decision trees:

The Hilbert Space perspective is insightful as it shows that deep rectifier networks function by de-correlating input signals $x \sim X$ but this algebraic perspective doesn’t explain why depth or width of a rectifier network may be important. For this we need to venture beyond algebraic structure and think in terms of geometry.

A colleague suggested that I have a look at ‘Tropical Geometry of Deep Neural Networks’ [2] which supposedly explains why deeper networks are exponentially more expressive but as I have little familiarity with tropical geometry I decided to try and follow my own reasoning first. Here’s what I found:

1. A deep rectifier network may be identified with an $N$ level decision tree for solving geometric problems in a sequential manner with lower layers of the network forming re-usable sub-expressions.

2. Each expression is an affine transformation and all geometric transformations in Euclidean space are affine. It follows that deep rectifier networks may be viewed as geometric programs where each sub-expression is a geometric operation.

3. At each layer of the network the candidate expressions are a subset of the power set of distinct nodes at that layer. Hence the importance of network width for versatile geometric reasoning.

4. Furthermore, each sub-expression at layer $i$ may be re-used at most $2^{N-\sum_{k=1}^i n_k}$ times by sub-expressions at higher levels of the decision tree. Montufar [1] gives a similar argument but he identifies expressions with ‘folds’ which is incorrect in my opinion.

5. The importance of the last statement is clear when we think in terms of geometric transformations. Deep rectifier networks permit an exponential number of possible sequences of geometric transformations of length $N$ and therefore the complexity of the set of possible transformations of the latent space is proportional to network depth.

In summary depth gives us geometric complexity and width gives us versatility. But, how should we think of geometric transformation in higher-dimensional space?

# Geometric transformations in higher-dimensional space:

The best way to understand spatial deformations in higher-dimensional affine spaces is to go to the continuous space limit where we may think in terms of the Jacobian and the determinant of the Hessian, which tells us something about local curvature.

Given an affine transformation $T$ from $Aff(\mathbb{R}^3)$:

$$x \overset{T}\mapsto Ax + b$$

the Jacobian of $T$ is simply $J(T)=A$.

The reason why this continuous-space approximation is valid is that as network depth increases we may think in terms of smooth trajectories in latent space. The geometry of these trajectories is essential for both function approximation and manifold learning.

# From discrete-time to continuous-time-and-space geometric decision trees:

Sequences of affine transformations, which channel trajectories in latent space, may not be reduced to a single affine transformation because the former tells you how space was traversed whereas the latter tells you only the origin and destination of the sequence of affine transformations. We can make this argument more precise by quantifying the information gained at each level of the geometric decision tree.

Given that the output at level $i$ tells you everything you need to know about the $N- \sum_{k=1}^i n_k$ remaining nodes, the information gained at the ith level of the geometric decision tree is exactly:

$$\log_2(2^{N- \sum_{k=1}^i n_k})= N- \sum_{k=1}^i n_k$$

For these reasons we may argue that trajectories capture important relative information and in the next section I explain that they encode information concerning the manifold of $X$.

# Function approximation as manifold learning, or how to see the forest from the trees:

Thinking in terms of functions and function spaces is generally more useful than reasoning about networks with particular parametrizations. One reason for this is that if we assume that a fully-connected network has $N$ layers and $n_i$ nodes per layer, there are:

$$\prod_{i=1}^N n_i !$$

layer-wise permutations that result in functions equivalent to $F_\theta$. This is clear when you think about how dot products encode no information about summation order.

Another reason why the function approximation perspective is insightful is that it allows us to reason abou the deformation in latent spaces as we increase network depth. These intricate deformations in latent space are responsible for the usefulness of the Hilbert Space structure of the latent space.

In this context, let $T_n$ denote an affine transformation and $X_0 \sim X$ denote a signal randomly sampled from $X$. Then:

$$X_n := T_n \circ X_{n-1}$$

denotes an element from an affine space.

Using (6) we may define a sequence of affine transformations in latent space:

$$T^N := T_N \circ T_{N-1} \circ … \circ T_1$$

If we assume that function approximation capacity improves with network depth these sequences converge point-wise:

$$\lim_{N \to \infty} T^N(X_0) = F(X_0)=Y_0$$

where $F$ is the function to be approximated.

Now, from a statistical perspective function approximation requires minimising the conditional entropy:

$$H(Y|X) = H(Y,X) - H(X) \geq 0$$

and this means that learning the joint distribution $P(Y,X)$ requires learning everything there is to know about the structure of $X$. Intuitively this makes sense as exploiting the structure of spatio-temporal signals makes it easier to de-correlate the signals $x \sim X$ in a useful way. It follows that discovering informative trajectories in latent space is equivalent to discovering a useful but not necessarily unique Hilbert Space structure for the latent space. This is how I see the forest from the trees.

Having motivated the importance of trajectories in latent space, I’d like to show how they explain why deep networks with $N > 1$ layers have increasingly non-convex learning behaviour as $N$ becomes large.

# The statistical relation between linear variations in parameter space $\theta$ and the non-convexity of $F_\theta$:

Let’s suppose $\phi_i$ denotes a particular feature map in an $N$ layer deep rectifier network:

$$\phi_i := T_{N} \circ T_{N-1} \circ … \circ T_1$$

and let’s suppose $\hat{T}$ denotes an affine mapping:

$$x \overset{\hat{T}}\mapsto Ax + b$$

If we consider the scenario that $\forall x \in \bar{X} \subset X, \phi(x_0) = \hat{T}(x_0)$ can we conclude that:

$$\phi_i \rvert_{x_0 \in \bar{X}} \equiv \hat{T} \rvert_{x_0 \in \bar{X}}$$

well, the answer is no for two important reasons:

1. First, we lose essential information about the trajectory of $x_0$ in latent space due to the sequence of affine transformations in $\phi_i$.
2. Second, if we perturb the parameters of $\hat{T}$ we will get a linear effect with probability 1.0, so $\hat{T}$ is convex w.r.t. its parameters, whereas a slight perturbation of the parameters of $\phi_i$ using backprop for example leads to a non-linear effect with probability tending to 1.0 as $N$ becomes large.

The second point implies that $\phi_i$ is unlikely to be convex with respect to linear interpolations in paramter space $\theta$ as $N$ becomes large. Using Julia Lang, I found this to be the case by constructing chains of Xavier-initialised 3x3 matrices and checking whether a weak version of the usual convexity inequality was satisfied:

$$\forall \theta_1, \theta_2 \in \theta, \lVert T_{t \theta_1 + (1-t)\theta_2}(n) \rvert_{x \in X} \rVert_{\infty} \leq \lVert tT_{\theta_1}(n)\rvert_{x \in X} + (1-t)T_{\theta_2}(n)\rvert_{x \in X} \rVert_{\infty}$$

where $n \in \mathbb{N}$ and $T_\theta$ is identified with a sequence of Xavier-initialised, i.e. random, affine transformations where the bias term was set to a constant vector of 0.1:

$$T_\theta := X_{n} \circ X_{n-1} \circ … \circ X_1$$

and as expected the probability that this inequality was satisfied decreased quickly as the length of the sequence $n$ was allowed to increase from $10$ to $200$.

# Discussion:

I think that there is a false axiom embedded in this whole convex vs. non-convex discussion. We are somehow assuming that there exists a canonical parametrization of neural networks and that this parametrization is linear. Yet, this extrinsic view merely reflects the observer’s Euclidean bias.

It makes more sense to take an intrinsic view and understand how the neural network re-parametrizes the latent space in order to make it linear with respect to the intrinsic geometry that is appropriate for the manifold on which $X$ happens to reside. The learning process essentially involves approximating the structure of spatio-temporal signals.

In fact, this line of reflection leads me to the following conjecture concerning deep learning:

1. Convexification: The early stages of learning involves exploring and selecting useful parametrizations from a large number of possible parametrizations of the space.
2. Optimisation: The later stages of learning involve fine-tuning a chosen parametrization.

Furthermore, I suspect that these two learning regimes have different dynamics that can be analysed and that such analysis is key to developing a powerful theory of intrinsic geometry that can be applied to a variety of spatio-temporal signals. This will certainly lead to more powerful learning algorithms and statistical models whose learning and inferential mechanisms we shall actually understand. The reason being that we have a sequence of geometric transformations that may potentially lead to arbitrarily complex trajectories in latent space.

I’d like to end this discussion by noting that the history of physics has made considerable progress largely by identifying suitable parametrizations for general classes of natural signals. In the case of macroscopic motions through space Galilean reference frames were sufficiently general. For General Relativity on the other hand, Riemannian geometry proved to be essential. But, that only became clear because Einstein, Lorentz and others made an enormous effort to understand what was going on.

# References:

1. Montufar, G. et al. On the Number of linear Regions of Deep Neural Networks. 2014.
2. L. Zhang, G. Naitzat & Lek-Heng Lim. Tropical Geometry of Deep Neural Networks. 2018.