Introduction:

After thinking about deep rectifier networks over the weekend it occurred to me that a function space interpretation may be useful. Briefly speaking:

  1. The ReLU activation serves as a gating mechanism for a deep network with $N$ nodes.
  2. This gating mechanism decomposes the latent space of a deep rectifier network into affine feature maps $\phi_i$.
  3. Each of these feature maps $\phi_i$ has a domain $X_i$, and these domains are pair-wise disjoint such that $\bigcup_i X_i = X$.
  4. As a result the latent space of a deep rectifier network is a space of orthogonal functions, so a deep rectifier network operates by de-correlating input signals $x \in X$.
  5. Furthermore, if we consider that input signals transform the latent space, this leads to a natural definition of independently-controllable features.

One particular reason why I find the function space perspective useful is that it generalises easily to all networks with ReLU activations. Furthermore, it leads to several natural conjectures.

Analysis of the ReLU activation:

Derivation from the sigmoid:

The softplus $\ln(1+e^x)$ can be derived directly from the integral of the sigmoid $\sigma$:

\begin{equation} \int \sigma(x) dx = \ln(1+e^x) + C \end{equation}

so we may view the softplus as the accumulated output of an infinite number of shifted copies of the sigmoid.

Now, from the softplus we may derive the ReLU, which we abbreviate as $g$:

\begin{equation} \ln(1+e^x) \approx g(x) = x \cdot (x \geq 0) \end{equation}

when $|x| \gg 0$. Mathematically, this has the effect of being a strict gating mechanism: an input is either passed through unchanged or zeroed out.
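As a quick numerical sanity check, here is a minimal sketch in Python/NumPy (the argument above doesn't depend on any code; the 99 sigmoid terms and the evaluation grid are arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10.0, 10.0, 5)

# Softplus: the integral of the sigmoid, up to a constant.
softplus = np.log1p(np.exp(x))

# Accumulation of shifted sigmoids sigma(x - i + 0.5); 99 terms stands in for 'infinite'.
shifted_sum = sum(sigmoid(x - i + 0.5) for i in range(1, 100))

# ReLU: the strict gate g(x) = x * (x >= 0).
relu = x * (x >= 0)

print(np.round(softplus, 3))     # [ 0.     0.007  0.693  5.007 10.   ]
print(np.round(shifted_sum, 3))  # closely tracks the softplus
print(np.round(relu, 3))         # agrees with both once |x| is large
```

The three printouts agree closely away from the origin, which is exactly where the gating interpretation is sharp.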

Algebraic properties of ReLU:

For any vector $v$ or matrix $W$, or tensors in general, we may use $\hat{g}$ to denote the application of $g$ to each component of $v$ or $W$ respectively. From this simple definition we note that:

$\hat{g}$ distributes over addition only in special circumstances:

\begin{equation} \hat{g}(W + \Delta W) = \hat{g}(W) + \hat{g}(\Delta W) \end{equation}

which holds, in particular, when $W$ and $\Delta W$ are both non-negative tensors; for tensors of mixed sign the identity generally fails.

Iterated application of $\hat{g}$ is equivalent to a single application of $\hat{g}$:

\begin{equation} \hat{g}^n = Id \circ \hat{g} = \hat{g} \end{equation}

where $Id$ is the identity mapping.

Existence of the inverse $\hat{g}^{-1}$:

The inverse $\hat{g}^{-1}$ exists only when $\hat{g}$ is restricted to the union of sets on which it acts as the identity. Naturally, these sets must be non-negative.
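All three algebraic properties are easy to verify numerically; a minimal sketch, with arbitrary example tensors:

```python
import numpy as np

def g_hat(W):
    # Component-wise ReLU: g applied to every entry of the tensor W.
    return np.maximum(W, 0.0)

rng = np.random.default_rng(0)

# 1. Distributivity over addition holds for non-negative tensors ...
W, dW = rng.random((3, 3)), rng.random((3, 3))           # entries in [0, 1)
assert np.allclose(g_hat(W + dW), g_hat(W) + g_hat(dW))

# ... but generally fails for tensors of mixed sign.
V, dV = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
print(np.allclose(g_hat(V + dV), g_hat(V) + g_hat(dV)))  # False (almost surely)

# 2. Idempotence: repeated application of g_hat is the same as a single application.
assert np.allclose(g_hat(g_hat(g_hat(V))), g_hat(V))

# 3. Restricted inverse: on non-negative inputs g_hat acts as the identity,
#    which is exactly where an inverse can exist.
assert np.allclose(g_hat(W), W)
```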

Orthogonal function spaces:

Mathematical definition of deep rectifier networks:

A deep rectifier network $f: X \subset \mathbb{R}^{n_0} \rightarrow \mathbb{R}^{n_L}$ is typically introduced as a composition of simple functions which defines a non-linear mapping:

\begin{equation} f(x) = \hat{g}\big(W_L\, \hat{g}(W_{L-1} \cdots \hat{g}(W_1 x + b_1) \cdots + b_{L-1}) + b_L\big) \end{equation}

where the parameter space $\theta$ is defined as follows:

\begin{equation} \theta = \big\{\ W_l \in \mathbb{R}^{ n_l \times n_{l-1}},b_l \in \mathbb{R}^{ n_l}: l \in [L] \big\} \end{equation}
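For concreteness, a minimal sketch of this definition in NumPy (the layer widths are arbitrary, and applying $\hat{g}$ to the final layer is one of several possible modelling choices):

```python
import numpy as np

def g_hat(x):
    # Component-wise ReLU.
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]  # n_0, ..., n_L with L = 3 (arbitrary example)
theta = [(rng.standard_normal((n_l, n_prev)), rng.standard_normal(n_l))
         for n_prev, n_l in zip(widths[:-1], widths[1:])]

def f(x, theta):
    # Alternating affine maps W_l h + b_l and component-wise ReLU gates.
    h = x
    for W_l, b_l in theta:
        h = g_hat(W_l @ h + b_l)
    return h

x = rng.standard_normal(widths[0])
print(f(x, theta))
```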

$f$ as a function space:

From the definition, we may deduce that the total number of nodes in the deep rectifier network is given by:

\begin{equation} N = \sum_{l=1}^L n_l \end{equation}

and from this it follows that we may interpret a deep rectifier network as a space of at most $2^N$ functions, since each node in $f$ may be ‘on’ or ‘off’.
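This ‘on/off’ view can be probed empirically by sampling inputs and counting the distinct binary activation patterns that actually occur; a rough sketch, where the widths and sampling distribution are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [2, 8, 8, 1]  # N = 8 + 8 + 1 = 17 nodes
theta = [(rng.standard_normal((n_l, n_prev)), rng.standard_normal(n_l))
         for n_prev, n_l in zip(widths[:-1], widths[1:])]

def activation_pattern(x):
    # Binary vector recording which nodes are 'on' (pre-activation >= 0).
    h, bits = x, []
    for W_l, b_l in theta:
        z = W_l @ h + b_l
        bits.append(z >= 0)
        h = np.maximum(z, 0.0)
    return tuple(np.concatenate(bits))

patterns = {activation_pattern(x) for x in rng.standard_normal((20000, widths[0]))}
N = sum(widths[1:])
print(f"distinct patterns observed: {len(patterns)} (upper bound 2^{N} = {2 ** N})")
```

With only two input dimensions most of the $2^{17}$ patterns are never realised, which is consistent with the region-counting results of Montufar et al. (2014).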

$\hat{g}$ partitions the domain of $f$:

It’s useful to note that if the domain of $f$ is $X \subset \mathbb{R}^{n_0}$, the associated feature maps $\phi_i$ partition $X$ into pair-wise disjoint compact sets $X_i$:

\begin{equation} X_i \cap X_{j \neq i} = \emptyset \end{equation}

so $\bigcup_{i=1}^m X_i = X$.

An orthogonal function space:

Given that the feature maps $\phi_i$ in the latent space of $f$ have pair-wise disjoint domains $X_i$:

\begin{equation} \phi_i(X \setminus X_i) = 0 \end{equation}

and therefore we may define an inner-product on the latent space:

\begin{equation} \forall \phi_i, \phi_{j \neq i} , \langle \phi_i, \phi_j \rangle = \int_{x \in X} \phi_i(x)\phi_j(x) dx = 0 \end{equation}

so the latent space of $f$ may be represented as follows:

\begin{equation} \phi(x) = \sum_{i=1}^m \phi_i(x) \end{equation}

From this we may deduce that a deep rectifier network finds structure in $X$ by de-correlating the signals $x \in X$.
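The orthogonality claim can also be checked numerically: extend each region's affine map by zero outside its region, as in $\phi_i(X \setminus X_i) = 0$ above, and pointwise products of distinct feature maps vanish. A minimal sketch with a single hidden layer and a Monte-Carlo stand-in for the integral:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal(1)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def region(x):
    # Index of the set X_i containing x: the hidden layer's on/off pattern.
    return tuple(W1 @ x + b1 >= 0)

def phi(i, x):
    # The feature map phi_i: affine (equal to f) on X_i, zero on X \ X_i.
    return f(x) if region(x) == i else np.zeros(1)

xs = rng.standard_normal((5000, 2))
visited = sorted({region(x) for x in xs})
i, j = visited[0], visited[1]

# Monte-Carlo estimate of <phi_i, phi_j>: zero term by term, since the supports are disjoint.
print(np.mean([phi(i, x) * phi(j, x) for x in xs]))  # 0.0
```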

From independent feature maps to independently controllable features:

  1. We can think of the data $x \in X$ as transforming the latent space of $f$ in such a way that pair-wise disjoint subsets of $X$ leave a particular $\phi_i$ invariant.
  2. We may make this notion explicit by denoting the union of these subsets by $\tilde{X}_i$, so we might say that $\phi_i$ is invariant to the transformations induced by $\tilde{X}_i$.
  3. Naturally, these invariants correspond to symmetries, and if we think of sequences of input signals then graph traversals on $X$, represented as a graph in which each node corresponds to a particular $X_i$, form a natural generating set for this group of transformations (see the sketch below).
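As a rough illustration of the last point, one can push a sequence of inputs through a small network and record which region each input lands in; transitions between consecutive regions give the edges of the graph described above. Everything in this sketch (the random walk standing in for a sequence of input signals, the single hidden layer) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)

def region(x):
    # Each graph node is an activation pattern, i.e. one of the sets X_i.
    return tuple(W1 @ x + b1 >= 0)

# A sequence of input signals: here, a small random walk in input space.
xs = np.cumsum(0.2 * rng.standard_normal((200, 2)), axis=0)

# Traversal: an edge whenever consecutive inputs land in different regions.
edges = {(region(a), region(b))
         for a, b in zip(xs[:-1], xs[1:]) if region(a) != region(b)}
print(f"{len({region(x) for x in xs})} regions visited, {len(edges)} distinct transitions")
```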

Conjectures:

  1. Bernoulli dropout with probability half maximises the effective size of the latent space.
  2. The number of distinct feature maps $m$ is an approximate measure of the intrinsic dimension of $X$.

References:

  1. Nair, V. & Hinton, G. Rectified Linear Units Improve Restricted Boltzmann Machines. 2010.
  2. Montufar, G. et al. On the Number of Linear Regions of Deep Neural Networks. 2014.
  3. Srivastava, N. et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. 2014.
  4. Srivastava, R. et al. Understanding Locally Competitive Networks. 2014.
  5. Sharpee, T., Rust, N. & Bialek, W. Analyzing Neural Responses to Natural Signals: Maximally Informative Dimensions. 2004.
  6. Peyré, G. Manifold Models for Signals and Images. 2009.
  7. Schwartz, O. & Simoncelli, E. Natural Signal Statistics and Sensory Gain Control. 2001.