# Introduction:

After thinking about deep rectifier networks over the weekend it occurred to me that a function space interpretation may be useful. Briefly speaking:

1. The ReLU activation serves as a gating mechanism for a deep network with $N$ nodes.
2. This gating mechanism decomposes the latent space of a deep rectifier network into $m \leq 2^N$ affine feature maps $\phi_i$.
3. Each feature map $\phi_i$ has a domain $X_i$; these domains are pair-wise disjoint and $X=\cup_{i=1}^m X_i$.
4. As a result, the latent space of a deep rectifier network is a space of orthogonal functions, so a deep rectifier network operates by de-correlating input signals $x \sim X$.
5. Furthermore, if we consider that input signals $x \sim X$ transform the latent space this leads to a natural definition of independently-controllable features.

One particular reason why I find the function space perspective useful is that it generalises easily to all networks with ReLU activations. Furthermore, it leads to several natural conjectures.

# Analysis of the ReLU activation:

## Derivation from the sigmoid:

The softplus, $f(x) = \ln(1+e^x)$, can be derived directly from the integral of the sigmoid $\sigma(x) = 1/(1+e^{-x})$:

$$\int \sigma(x) dx = \ln(1+e^x) + C$$

so we may view the softplus as being the accumulation of input from an infinite number of perturbed sigmoids.
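As a quick numerical sanity check of the identity above, the sketch below integrates the sigmoid with a midpoint rule and compares the result with a difference of softplus values. The helper name `integrate_sigmoid`, the interval and the step count are illustrative choices, not part of the original argument.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x: float) -> float:
    # log1p(e^x) equals ln(1 + e^x) and stays accurate when e^x is small
    return math.log1p(math.exp(x))


def integrate_sigmoid(a: float, b: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of sigmoid over [a, b]."""
    h = (b - a) / steps
    return h * sum(sigmoid(a + (k + 0.5) * h) for k in range(steps))


a, b = -3.0, 2.0
numeric = integrate_sigmoid(a, b)
exact = softplus(b) - softplus(a)  # the constant C cancels in the difference
print(abs(numeric - exact) < 1e-6)
```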

Now, from the softplus we may derive the ReLU which we abbreviate as $g$:

$$\ln(1+e^x) \approx g(x) = x \cdot (x \geq 0)$$

when $\lvert x \rvert > 5$, where the gap $\ln(1+e^{-\lvert x \rvert})$ between the two is below $0.007$. Mathematically, the ReLU acts as a strict gating mechanism.
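To get a feel for how good the approximation is, the following sketch evaluates the gap between the softplus and the ReLU at a few points. The function name `gap` is a hypothetical helper introduced only for this illustration.

```python
import math


def softplus(x: float) -> float:
    return math.log1p(math.exp(x))


def relu(x: float) -> float:
    # g(x) = x * (x >= 0)
    return x if x >= 0 else 0.0


def gap(x: float) -> float:
    """Difference softplus(x) - relu(x); analytically ln(1 + e^{-|x|})."""
    return softplus(x) - relu(x)


for x in (-5.0, 0.0, 5.0):
    print(x, gap(x))
```

The gap is largest at $x = 0$, where it equals $\ln 2$, and decays exponentially in $\lvert x \rvert$.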

## Algebraic properties of ReLU:

For any vector $v$ or matrix $M$, or tensors in general, we may use $\hat{g}$ to denote the application of $g$ to each component of $v$ or $M$ respectively. From this simple definition we note that:

### $\hat{g}$ distributes over addition only in special circumstances:

$$\hat{g}(W + \Delta W) = \hat{g}(W) + \hat{g}(\Delta W)$$

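if and only if, component-wise, the entries of $W$ and $\Delta W$ never have opposite signs. In particular, this holds when $W$ and $\Delta W$ are both non-negative tensors.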
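A small sketch of this property on plain Python lists: the identity holds for a pair of non-negative vectors and fails for a pair with mixed signs. The helpers `add` and `distributes` are hypothetical names used only here, and the example values are arbitrary.

```python
from typing import List


def relu_hat(v: List[float]) -> List[float]:
    """Component-wise ReLU."""
    return [max(x, 0.0) for x in v]


def add(u: List[float], v: List[float]) -> List[float]:
    return [a + b for a, b in zip(u, v)]


def distributes(W: List[float], dW: List[float]) -> bool:
    """True if relu_hat(W + dW) == relu_hat(W) + relu_hat(dW)."""
    return relu_hat(add(W, dW)) == add(relu_hat(W), relu_hat(dW))


print(distributes([1.0, 0.5, 2.0], [0.25, 0.0, 1.0]))  # True: both non-negative
print(distributes([1.0, -2.0], [-0.5, 3.0]))           # False: mixed signs
```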

### Iterated application of $\hat{g}$ is equivalent to $Id \circ \hat{g}$:

$$\hat{g}^n = Id \circ \hat{g} = \hat{g}$$

where $Id$ is the identity mapping and $n \geq 1$. In other words, $\hat{g}$ is idempotent.

### Existence of the inverse $\hat{g}^{-1}$:

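Since $\hat{g}$ sends every negative entry to zero, it is not injective, so the inverse exists only when $\hat{g}$ is restricted to sets $A$ where $\hat{g}(A)=A$. Naturally, these sets contain only non-negative tensors, on which $\hat{g}$ acts as the identity.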
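Both the idempotence of $\hat{g}$ and the restriction needed for an inverse can be seen on a few hand-picked vectors. This is only a minimal illustration; the example values are arbitrary.

```python
from typing import List


def relu_hat(v: List[float]) -> List[float]:
    """Component-wise ReLU."""
    return [max(x, 0.0) for x in v]


# Idempotence: applying relu_hat twice changes nothing after the first pass.
v = [-2.0, 0.0, 3.5]
once = relu_hat(v)
twice = relu_hat(once)
print(once == twice)  # True

# No inverse on the full space: two different inputs share one image.
print(relu_hat([-1.0, 2.0]) == relu_hat([-5.0, 2.0]))  # True

# On non-negative vectors relu_hat is the identity, hence trivially invertible.
a = [0.0, 1.5, 4.0]
print(relu_hat(a) == a)  # True
```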

# Orthogonal function spaces:

## Mathematical definition of deep rectifier networks:

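A deep rectifier network is typically introduced as a composition of simple functions which defines a non-linear mapping. With $\hat{g}$ applied after every affine layer, a standard form consistent with the notation below is:

$$F_\theta = f_L \circ \dots \circ f_1, \qquad f_l(x) = \hat{g}(W_l x + b_l)$$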

where the parameter space $\theta$ is defined as follows:

$$\theta = \big\{\ W_l \in \mathbb{R}^{ n_l \times n_{l-1}},b_l \in \mathbb{R}^{ n_l}: l \in [L] \big\}$$
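The parameter space can be made concrete with a minimal sketch in plain Python. It assumes that each layer has the form $x \mapsto \hat{g}(W_l x + b_l)$, including the last one, and that weights are drawn from a standard normal distribution; both are assumptions for illustration, and `init_theta` and `forward` are hypothetical names.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Theta = List[Tuple[Matrix, Vector]]  # [(W_1, b_1), ..., (W_L, b_L)]


def relu_hat(v: Vector) -> Vector:
    """Component-wise ReLU."""
    return [max(x, 0.0) for x in v]


def init_theta(sizes: List[int], seed: int = 0) -> Theta:
    """Draw W_l in R^{n_l x n_{l-1}} and b_l in R^{n_l} for sizes = [n_0, ..., n_L]."""
    rng = random.Random(seed)
    theta = []
    for n_prev, n_l in zip(sizes[:-1], sizes[1:]):
        W = [[rng.gauss(0.0, 1.0) for _ in range(n_prev)] for _ in range(n_l)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_l)]
        theta.append((W, b))
    return theta


def forward(theta: Theta, x: Vector) -> Vector:
    """Assumed layer form: x -> relu_hat(W_l x + b_l), applied for l = 1..L."""
    for W, b in theta:
        x = relu_hat([sum(w * xi for w, xi in zip(row, x)) + bi
                      for row, bi in zip(W, b)])
    return x


theta = init_theta([2, 3, 1])  # n_0 = 2, n_1 = 3, n_2 = 1
shapes = [(len(W), len(W[0]), len(b)) for W, b in theta]
print(shapes)  # [(3, 2, 3), (1, 3, 1)]
y = forward(theta, [0.5, -1.0])
print(y)
```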

## $F_\theta$ as a function space:

From the definition, we may deduce that the total number of nodes in the deep rectifier network is given by:

$$N = \sum_{l=1}^L n_l$$

and from this it follows that we may interpret a deep rectifier network as a space of $m \leq 2^N$ functions $f_i$, since each node in $F_\theta$ may be ‘on’ or ‘off’.
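The bound $m \leq 2^N$ can be probed empirically by recording which nodes are ‘on’ for many inputs and counting the distinct on/off patterns. The sketch below assumes a small random network with layers of the form $\hat{g}(W_l x + b_l)$ and a regular grid of inputs; since it only samples the domain, the count it reports is a lower bound on $m$. The names `init_theta` and `activation_pattern` are hypothetical.

```python
import random


def init_theta(sizes, seed=0):
    """Random (W_l, b_l) pairs for layer widths sizes = [n_0, ..., n_L]."""
    rng = random.Random(seed)
    return [([[rng.gauss(0.0, 1.0) for _ in range(n_prev)] for _ in range(n_l)],
             [rng.gauss(0.0, 1.0) for _ in range(n_l)])
            for n_prev, n_l in zip(sizes[:-1], sizes[1:])]


def activation_pattern(theta, x):
    """On/off state of every node for input x, as a tuple of booleans."""
    pattern = []
    for W, b in theta:
        pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        pattern.extend(p > 0.0 for p in pre)  # a node is 'on' if its pre-activation is positive
        x = [max(p, 0.0) for p in pre]
    return tuple(pattern)


sizes = [2, 3, 2]
theta = init_theta(sizes)
N = sum(sizes[1:])  # total number of nodes, N = n_1 + ... + n_L

# Regular grid over [-3, 3]^2 as a stand-in for the domain X.
grid = [(-3.0 + 0.1 * i, -3.0 + 0.1 * j) for i in range(61) for j in range(61)]
patterns = {activation_pattern(theta, list(x)) for x in grid}
m_observed = len(patterns)
print(N, m_observed, 2 ** N)
```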

## $F_\theta = \cup_{i=1}^m f_i$ partitions the domain of $F_\theta$:

It’s useful to note that if the domain of $F_\theta$ is $X \subset \mathbb{R}^{n_0}$, the domains $X_i$ of the associated functions $f_i$ partition $X$ into pair-wise disjoint sets:

$$X_i \cap X_j = \emptyset \quad \text{for } i \neq j$$

and $X=\cup_{i=1}^m X_i$.
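One way to see the partition in practice is to group sampled inputs by their activation pattern and then check that the network behaves as an affine map inside each group. The sketch below does this with a midpoint test: for two points $x, y$ in the same group, an affine map satisfies $F_\theta((x+y)/2) = (F_\theta(x)+F_\theta(y))/2$. The network, the sampling box and the helper names are assumptions made for illustration.

```python
import random
from collections import defaultdict


def init_theta(sizes, seed=0):
    """Random (W_l, b_l) pairs for layer widths sizes = [n_0, ..., n_L]."""
    rng = random.Random(seed)
    return [([[rng.gauss(0.0, 1.0) for _ in range(n_prev)] for _ in range(n_l)],
             [rng.gauss(0.0, 1.0) for _ in range(n_l)])
            for n_prev, n_l in zip(sizes[:-1], sizes[1:])]


def run(theta, x):
    """Return (F_theta(x), activation pattern of x)."""
    pattern = []
    for W, b in theta:
        pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        pattern.extend(p > 0.0 for p in pre)
        x = [max(p, 0.0) for p in pre]
    return x, tuple(pattern)


theta = init_theta([2, 4, 1], seed=1)
rng = random.Random(2)
samples = [[rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)] for _ in range(2000)]

# Each sample falls into exactly one region X_i, indexed by its pattern.
regions = defaultdict(list)
for x in samples:
    regions[run(theta, x)[1]].append(x)

# Midpoint test for affineness within each region.
max_dev = 0.0
for pts in regions.values():
    for x, y in zip(pts, pts[1:]):
        mid = [(a + b) / 2.0 for a, b in zip(x, y)]
        f_mid = run(theta, mid)[0][0]
        f_avg = (run(theta, x)[0][0] + run(theta, y)[0][0]) / 2.0
        max_dev = max(max_dev, abs(f_mid - f_avg))

print(len(regions), max_dev)
```

The deviation should be at the level of floating-point rounding, which is consistent with each region carrying a single affine map.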

## An orthogonal function space:

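Since the $\phi_i$ in the latent space of $F_\theta$ have pair-wise disjoint domains, we may extend each $\phi_i$ by zero outside $X_i$:

$$\phi_i(X \setminus X_i) = 0$$

With this convention the product $\phi_i \phi_j$ vanishes everywhere on $X$ for $i \neq j$, so under the usual inner product on the latent space: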

$$\forall \phi_i, \phi_{j \neq i} , \langle \phi_i, \phi_j \rangle = \int_{x \in X} \phi_i(x)\phi_j(x) dx = 0$$

so the latent space of $F_\theta$ may be represented as follows:

$$\phi(x) = \sum_{i=1}^m \phi_i(x)$$

From this we may deduce that a deep rectifier network finds structure in $X$ by de-correlating the signals $x \sim X$.
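The orthogonality relation and the decomposition $\phi(x) = \sum_i \phi_i(x)$ can be checked numerically on a toy example. The sketch below assumes a network with a one-dimensional input and a scalar output, takes $\phi_i$ to be the network output restricted to the region with the $i$-th activation pattern (zero elsewhere), and replaces the integral by a Riemann sum on a grid. These choices, and the helper names, are illustrative assumptions.

```python
import random


def init_theta(sizes, seed=0):
    """Random (W_l, b_l) pairs for layer widths sizes = [n_0, ..., n_L]."""
    rng = random.Random(seed)
    return [([[rng.gauss(0.0, 1.0) for _ in range(n_prev)] for _ in range(n_l)],
             [rng.gauss(0.0, 1.0) for _ in range(n_l)])
            for n_prev, n_l in zip(sizes[:-1], sizes[1:])]


def run(theta, x):
    """Return (F_theta(x), activation pattern of x)."""
    pattern = []
    for W, b in theta:
        pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        pattern.extend(p > 0.0 for p in pre)
        x = [max(p, 0.0) for p in pre]
    return x, tuple(pattern)


theta = init_theta([1, 3, 1], seed=3)
h = 0.01
grid = [-3.0 + h * k for k in range(601)]
evals = [run(theta, [x]) for x in grid]  # (output, pattern) at every grid point
patterns = sorted({s for _, s in evals})


def phi(s, out, pattern):
    """phi_s at a grid point: the network output on region s, zero elsewhere."""
    return out[0] if pattern == s else 0.0


def inner(s, t):
    """Riemann-sum approximation of the inner product <phi_s, phi_t>."""
    return h * sum(phi(s, out, p) * phi(t, out, p) for out, p in evals)


off_diagonal = [inner(s, t) for i, s in enumerate(patterns) for t in patterns[i + 1:]]
orthogonal = all(v == 0.0 for v in off_diagonal)

# phi(x) = sum_i phi_i(x) recovers the network output at every grid point.
sums_match = all(sum(phi(s, out, p) for s in patterns) == out[0] for out, p in evals)
print(len(patterns), orthogonal, sums_match)
```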

# From independent feature maps to independently controllable features:

1. We can think of the data $x \sim X$ as transforming the latent space of $F_\theta$ in such a way that $m-1$ pair-wise disjoint subsets of $X$ leave a particular $\phi_i$ invariant.
2. We may make this notion explicit by denoting the union of these subsets by $\big[ X_{j\neq i}\big]$ so we might say that $\phi_i$ is invariant to the transformation induced by $\big[ X_{j\neq i}\big]$.
3. Naturally, these invariants correspond to symmetries and if we think of sequences of input signals then graph traversals on $\big[ X_{j\neq i}\big]$, represented as a $K_{m-1}$ graph where each node corresponds to a particular $X_{j \neq i}$, form a natural generating set for this group of transformations.

# Conjectures:

1. Bernoulli dropout with probability half maximises the effective size of the latent space.
2. $m$ is an approximate measure of the intrinsic dimension of $X$.
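One possible reading of the first conjecture, offered here only as an assumption, is to measure the "effective size" of the latent space by the entropy of the dropout mask distribution: with $N$ independent Bernoulli($p$) masks, the entropy is $N \cdot H(p)$ bits. The sketch below shows that this quantity peaks at $p = 1/2$. It does not prove the conjecture, and `mask_entropy_bits` is a hypothetical helper.

```python
import math


def mask_entropy_bits(p: float, n_nodes: int) -> float:
    """Entropy in bits of n_nodes independent Bernoulli(p) dropout masks."""
    if p in (0.0, 1.0):
        return 0.0  # a deterministic mask carries no entropy
    h = -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
    return n_nodes * h


n_nodes = 10
ps = [k / 20 for k in range(21)]  # candidate probabilities 0.0, 0.05, ..., 1.0
best = max(ps, key=lambda p: mask_entropy_bits(p, n_nodes))
print(best, mask_entropy_bits(best, n_nodes))  # 0.5 10.0
```

At $p = 1/2$ all $2^N$ masks are equally likely, which matches the upper bound $m \leq 2^N$ used throughout.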

# References:

1. Nair, V. & Hinton, G. Rectified Linear Units Improve Restricted Boltzmann Machines. 2010.
2. Montufar, G. et al. On the Number of Linear Regions of Deep Neural Networks. 2014.
3. Srivastava, N. et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. 2014.
4. Srivastava, R. et al. Understanding Locally Competitive Networks. 2014.
5. Sharpee, T., Rust, N. & Bialek, W. Analyzing Neural Responses to Natural Signals: Maximally Informative Dimensions. 2004.
6. Peyré, G. Manifold Models for Signals and Images. 2009.
7. Schwartz, O. & Simoncelli, E. Natural Signal Statistics and Sensory Gain Control. 2001.