After thinking about deep rectifier networks over the weekend, it occurred to me that a function space interpretation may be useful. Briefly speaking:

  1. The ReLU activation serves as a gating mechanism for a deep network with $N$ nodes.
  2. This gating mechanism decomposes the latent space of a deep rectifier network into affine feature maps $\phi_i$.
  3. Each of these feature maps $\phi_i$ has a domain $X_i$; these domains are pair-wise disjoint such that $\bigcup_{i=1}^m X_i = X$.
  4. As a result, the latent space of a deep rectifier network is a space of orthogonal functions, so a deep rectifier network operates by de-correlating input signals $x \in X$.
  5. Furthermore, if we consider that input signals transform the latent space, this leads to a natural definition of independently-controllable features.

One particular reason why I find the function space perspective useful is that it generalises easily to all networks with ReLU activations. Furthermore, it leads to several natural conjectures.

Analysis of the ReLU activation:

Derivation from the sigmoid:

The softplus $\ln(1+e^x)$ can be derived directly from the integral of the sigmoid $\sigma(x) = \frac{1}{1+e^{-x}}$:

\begin{equation} \int \sigma(x) dx = \ln(1+e^x) + C \end{equation}

so we may view the softplus as accumulating input from an infinite number of perturbed sigmoids.
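
A minimal numerical check of this relationship, assuming numpy; the grid, interval, and the choice of the constant $C$ are arbitrary:

```python
import numpy as np

# Integrating the sigmoid recovers the softplus: approximate the
# antiderivative of sigma(x) = 1/(1+exp(-x)) with a cumulative
# trapezoidal rule and compare against ln(1+exp(x)).
xs = np.linspace(-10.0, 10.0, 2001)
sigma = 1.0 / (1.0 + np.exp(-xs))

dx = xs[1] - xs[0]
integral = np.concatenate([[0.0], np.cumsum((sigma[1:] + sigma[:-1]) * dx / 2)])
softplus = np.log1p(np.exp(xs))
integral += softplus[0]  # fix the constant C by matching at x = -10

print(np.max(np.abs(integral - softplus)))  # tiny (~1e-5): the curves coincide
```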

Now, from the softplus we may derive the ReLU, which we abbreviate as $g$:

\begin{equation} \ln(1+e^x) \approx g(x) = x \cdot \mathbb{1}[x \geq 0] = \max(0, x) \end{equation}

when $\lvert x \rvert \gg 0$. Mathematically, this makes $g$ a strict gating mechanism: the input either passes through unchanged or is silenced.
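
To see how quickly the approximation tightens, one can check that the gap is $\ln(1+e^x) - \max(0, x) = \ln(1+e^{-\lvert x \rvert})$, which peaks at $\ln 2$ at the origin and decays exponentially. A quick numerical check:

```python
import numpy as np

# The softplus-to-ReLU approximation is tight away from the origin:
# ln(1+e^x) - max(0, x) = ln(1 + e^{-|x|}), which decays exponentially.
for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    softplus = np.log1p(np.exp(x))
    relu = max(0.0, x)
    print(f"x={x:+5.1f}  softplus={softplus:.5f}  relu={relu:.5f}  gap={softplus - relu:.5f}")
# The gap peaks at ln(2) ~ 0.693 at x = 0 and vanishes as |x| grows.
```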

Algebraic properties of ReLU:

For any vector $v$ or matrix $W$, or tensors in general, we may use $\hat{g}$ to denote the application of $g$ to each component of $v$ or $W$ respectively. From this simple definition we note that:

$\hat{g}$ distributes over addition only in special circumstances:

\begin{equation} \hat{g}(W + \Delta W) = \hat{g}(W) + \hat{g}(\Delta W) \end{equation}

if and only if $W$ and $\Delta W$ agree in sign component-wise (i.e. $W \odot \Delta W \geq 0$); in particular, whenever both are non-negative tensors.
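
A sketch of both cases (numpy assumed; shapes and seeds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda t: np.maximum(t, 0.0)  # component-wise ReLU, i.e. \hat{g}

# Non-negative W and dW: \hat{g} distributes over the sum.
W, dW = rng.uniform(0, 1, (3, 3)), rng.uniform(0, 1, (3, 3))
print(np.allclose(g(W + dW), g(W) + g(dW)))  # True

# Mixed signs: distributivity fails (almost surely for random tensors).
W, dW = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
print(np.allclose(g(W + dW), g(W) + g(dW)))  # False
```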

Iterated application of $\hat{g}$ is equivalent to a single application of $\hat{g}$, i.e. $\hat{g}$ is idempotent:

\begin{equation} \hat{g}^n = Id \circ \hat{g} = \hat{g} \end{equation}

where $Id$ is the identity mapping.

Existence of the inverse $\hat{g}^{-1}$:

The inverse $\hat{g}^{-1}$ exists only when $\hat{g}$ is restricted to the union of sets on which it acts as the identity. Naturally, these sets must be non-negative.
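
Both properties are easy to verify numerically (again a sketch; the test vector is arbitrary):

```python
import numpy as np

g = lambda t: np.maximum(t, 0.0)
v = np.random.default_rng(1).normal(size=5)

# Idempotence: applying \hat{g} twice (or n times) equals applying it once.
print(np.allclose(g(g(v)), g(v)))  # True

# Restricted invertibility: on non-negative inputs \hat{g} acts as the
# identity, so it is trivially invertible there; negative components are
# collapsed to 0 and cannot be recovered.
print(np.allclose(g(np.abs(v)), np.abs(v)))  # True: \hat{g} = Id on x >= 0
```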

Orthogonal function spaces:

Mathematical definition of deep rectifier networks:

A deep rectifier network $f_\theta: \mathbb{R}^{n_0} \rightarrow \mathbb{R}^{n_L}$ is typically introduced as a composition of simple functions which defines a non-linear mapping:

\begin{equation} f_\theta(x) = \hat{g}\big(W_L \, \hat{g}(W_{L-1} \cdots \hat{g}(W_1 x + b_1) \cdots + b_{L-1}) + b_L\big) \end{equation}

where the parameter space $\theta$ is defined as follows:

\begin{equation} \theta = \big\{\ W_l \in \mathbb{R}^{ n_l \times n_{l-1}},b_l \in \mathbb{R}^{ n_l}: l \in [L] \big\} \end{equation}
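
A minimal forward pass matching this definition; the widths, Gaussian initialisation, and seed are illustrative assumptions, not anything prescribed above:

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]  # n_0 = 4 inputs, two hidden layers, n_L = 2 outputs

# theta = {W_l, b_l : l in [L]}, with W_l of shape (n_l, n_{l-1}).
theta = [(rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
         for n_in, n_out in zip(widths[:-1], widths[1:])]

def f(x, theta):
    """Forward pass: h_l = g(W_l h_{l-1} + b_l), applied layer by layer."""
    h = x
    for W, b in theta:
        h = np.maximum(W @ h + b, 0.0)
    return h

print(f(rng.normal(size=4), theta))
```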

$f_\theta$ as a function space:

From the definition, we may deduce that the total number of nodes in the deep rectifier network is given by:

\begin{equation} N = \sum_{l=1}^L n_l \end{equation}

and from this it follows that we may interpret a deep rectifier network as a space of at most $2^N$ functions, since each node in $f_\theta$ may be ‘on’ or ‘off’.
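
A sketch that reads off the ‘on’/‘off’ pattern per node and counts how many of the at most $2^N$ patterns a random network realises on sampled inputs (widths, initialisation, and sampling distribution are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]  # N = 8 + 8 + 2 = 18 nodes
theta = [(rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
         for n_in, n_out in zip(widths[:-1], widths[1:])]

def pattern(x):
    """The binary gating pattern: one 'on'/'off' bit per node."""
    bits, h = [], x
    for W, b in theta:
        pre = W @ h + b
        bits.append(pre > 0)        # which nodes fire on this input
        h = np.maximum(pre, 0.0)
    return tuple(np.concatenate(bits))

samples = {pattern(rng.normal(size=4)) for _ in range(10_000)}
print(f"distinct gating patterns seen: {len(samples)} (bound: 2^18 = 262144)")
```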

$\hat{g}$ partitions the domain of $f_\theta$:

It’s useful to note that if the domain of $f_\theta$ is $X$, the associated feature maps $\phi_i$ partition $X$ into pair-wise disjoint sets $X_i$:

\begin{equation} X_i \cap X_{j \neq i} = \emptyset \end{equation}

so $\bigcup_{i=1}^m X_i = X$.
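
To see the affine structure concretely, we can fold the 0/1 gating masks into the weights and check that the result reproduces the forward pass on that region (a sketch; same illustrative setup as above):

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]
theta = [(rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
         for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(x):
    masks, h = [], x
    for W, b in theta:
        pre = W @ h + b
        masks.append((pre > 0).astype(float))
        h = np.maximum(pre, 0.0)
    return h, masks

x = rng.normal(size=4)
y, masks = forward(x)

# On the region containing x, f collapses to a single affine map A x + c:
# fold each 0/1 mask into its layer, i.e. h_l = D_l (W_l h_{l-1} + b_l).
A, c = np.eye(4), np.zeros(4)
for (W, b), m in zip(theta, masks):
    A, c = (m[:, None] * W) @ A, m * (W @ c + b)
print(np.allclose(y, A @ x + c))  # True: f is affine on this region
```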

An orthogonal function space:

Given that the feature maps $\phi_i$ in the latent space of $f_\theta$ have pair-wise disjoint domains:

\begin{equation} \phi_i(X \setminus X_i) = 0 \end{equation}

and therefore we may define an inner product on the latent space:

\begin{equation} \forall i \neq j: \quad \langle \phi_i, \phi_j \rangle = \int_{X} \phi_i(x)\,\phi_j(x)\, dx = 0 \end{equation}

so the latent space of $f_\theta$ may be represented as follows:

\begin{equation} \phi(x) = \sum_{i=1}^m \phi_i(x) \end{equation}

From this we may deduce that a deep rectifier network finds structure in $X$ by de-correlating the signals $x \in X$.
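
A Monte Carlo sketch of this orthogonality: take $\phi_i(x) = f_\theta(x)$ when $x$ falls in region $X_i$ and $0$ otherwise; since the supports are disjoint, every cross term vanishes pointwise, not just on average (the setup is again illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]
theta = [(rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
         for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(x):
    bits, h = [], x
    for W, b in theta:
        pre = W @ h + b
        bits.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return h, tuple(np.concatenate(bits))  # output and region key

xs = rng.normal(size=(2000, 4))
outs, keys = zip(*(forward(x) for x in xs))
i, j = keys[0], next(k for k in keys if k != keys[0])  # two distinct regions

# phi_i(x) = f(x) on X_i, zero elsewhere; likewise for phi_j.
phi_i = np.array([y if k == i else np.zeros_like(y) for y, k in zip(outs, keys)])
phi_j = np.array([y if k == j else np.zeros_like(y) for y, k in zip(outs, keys)])
print(np.sum(phi_i * phi_j))  # 0.0 exactly: disjoint supports
```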

From independent feature maps to independently controllable features:

  1. We can think of the data $x \in X$ as transforming the latent space of $f_\theta$ in such a way that pair-wise disjoint subsets of $X$ leave a particular $\phi_i$ invariant.
  2. We may make this notion explicit by denoting the union of these subsets by $X_i$, so we might say that $\phi_i$ is invariant to the transformations induced by $X_i$.
  3. Naturally, these invariants correspond to symmetries, and if we think of sequences of input signals then graph traversals on $X$, represented as a graph where each node corresponds to a particular $X_i$, form a natural generating set for this group of transformations (see the sketch after this list).
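
A hypothetical sketch of such a graph: feed a smooth input trajectory through an illustrative network, record the sequence of regions it visits, and keep the transitions as edges. The random-walk trajectory and all parameters are assumptions for illustration, a crude stand-in for the generating set discussed above:

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]
theta = [(rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out))
         for n_in, n_out in zip(widths[:-1], widths[1:])]

def region(x):
    """Identify the region X_i containing x by its gating pattern."""
    bits, h = [], x
    for W, b in theta:
        pre = W @ h + b
        bits.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return tuple(np.concatenate(bits))

# A smooth input trajectory (a random walk) and its induced region sequence.
walk = np.cumsum(rng.normal(scale=0.1, size=(500, 4)), axis=0)
visits = [region(x) for x in walk]
edges = {(a, b) for a, b in zip(visits, visits[1:]) if a != b}
print(f"regions visited: {len(set(visits))}, transitions: {len(edges)}")
```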


Natural conjectures:

  1. Bernoulli dropout with probability one half maximises the effective size of the latent space (a heuristic sketch follows this list).
  2. $m$, the number of realised feature maps, is an approximate measure of the intrinsic dimension of $X$.
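
One heuristic in support of the first conjecture (a sketch, not a proof): under Bernoulli dropout each of the $N$ nodes is kept independently with probability $p$, so the entropy of the induced distribution over the $2^N$ gating configurations is

\begin{equation} H(p) = -N \big( p \ln p + (1-p) \ln(1-p) \big) \end{equation}

which is maximised at $p = \tfrac{1}{2}$, where every one of the $2^N$ configurations is equally likely.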


References:

  1. Nair, V. & Hinton, G. Rectified Linear Units Improve Restricted Boltzmann Machines. 2010.
  2. Montúfar, G. et al. On the Number of Linear Regions of Deep Neural Networks. 2014.
  3. Srivastava, N. et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. 2014.
  4. Srivastava, R. et al. Understanding Locally Competitive Networks. 2014.
  5. Sharpee, T., Rust, N. & Bialek, W. Analyzing Neural Responses to Natural Signals: Maximally Informative Dimensions. 2004.
  6. Peyré, G. Manifold Models for Signals and Images. 2009.
  7. Schwartz, O. & Simoncelli, E. Natural Signal Statistics and Sensory Gain Control. 2001.