Introduction:

The typical deep neural network tutorial introduces deep networks as compositions of nonlinearities and affine transforms. In fact, a deep network with relu activation simplifies to a linear combination of affine transformations with compact support. But why would affine transformations be useful for nonlinear regression?

After some reflection it occurred to me that the reason they work is that they are actually first-order Taylor approximations, and by this logic partial derivatives, i.e. Jacobians, are computational primitives for both inference and learning. Of course this also depends on whether machine learning researchers are doing physics or stamp collecting.

Approximating continuous functions with piece-wise linear functions:

A piece-wise linear function is defined as follows:

\begin{equation} f(x) = \sum_i \lambda_i(x) \cdot 1_{X_i}(x) \end{equation}

where $1_{X_i}$ is the indicator function of $X_i$ and the $\lambda_i$ are affine functions. Clearly, every polynomial may be approximated using piece-wise linear functions since polynomials are locally Lipschitz.
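To make (1) concrete, here is a minimal NumPy sketch of a one-dimensional piece-wise linear function; the intervals $X_i$ and the coefficients of the affine pieces $\lambda_i$ below are arbitrary choices for illustration:

```python
# A 1-D sketch of equation (1): a piece-wise linear function built from
# affine pieces lambda_i supported on disjoint intervals X_i.
import numpy as np

# Disjoint intervals X_i = [l_i, u_i) covering [0, 3)
intervals = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
# Affine pieces lambda_i(x) = a_i * x + b_i
coeffs = [(1.0, 0.0), (-0.5, 1.5), (2.0, -3.5)]

def f(x):
    """f(x) = sum_i lambda_i(x) * 1_{X_i}(x)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for (lo, hi), (a, b) in zip(intervals, coeffs):
        mask = (x >= lo) & (x < hi)           # indicator 1_{X_i}(x)
        out = np.where(mask, a * x + b, out)  # affine piece on its support
    return out

print(f([0.5, 1.5, 2.5]))  # one value from each piece
```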

Now, let’s consider a nonlinear continuous function with compact domain $X$ and compact co-domain $Y$:

\begin{equation} F: X \rightarrow Y \end{equation}

Since the polynomials are dense in the space of continuous functions, by the Stone-Weierstrass theorem we may substitute $F$ with a polynomial $F_n$ of degree $n$, such that if we define the functional:

\begin{equation} \mathcal{L}[F_n] = \int_X \left\lVert F(x)-F_n(x) \right\rVert^2 dx \end{equation}

the sequence of minimisers $\hat{F}_n$ converges uniformly to $F$:

\begin{equation} \hat{F}_n = \underset{F_n \in P_n}{\arg\min} \mathcal{L}[F_n] \end{equation}

\begin{equation} \lim_{n \to \infty} \mathcal{L}[\hat{F}_n] = 0 \end{equation}
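As a sanity check on (3)–(5), the following sketch fits least-squares polynomials of increasing degree to an arbitrary continuous target on $X = [-1,1]$ and evaluates a discretised version of the loss functional; the grid-based fit stands in for the argmin in (4), and the target function is just an illustrative choice:

```python
# Discretised illustration of (3)-(5): least-squares polynomial fits of
# increasing degree drive the loss functional towards zero.
import numpy as np

F = lambda x: np.exp(np.sin(3.0 * x))     # an arbitrary continuous target
x = np.linspace(-1.0, 1.0, 2001)          # fine grid over X = [-1, 1]
dx = x[1] - x[0]

for n in [1, 3, 5, 9, 15]:
    Fn = np.polynomial.Polynomial.fit(x, F(x), deg=n)  # approximate argmin of (3)
    loss = np.sum((F(x) - Fn(x)) ** 2) * dx            # Riemann sum for the integral
    print(f"degree {n:2d}   loss {loss:.2e}")
```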

Moreover, due to the Lipschitz condition, (3) also allows us to define the maximum local error with respect to a ball $B(x_0,r) \subset X$:

\begin{equation} \exists C > 0, E_n(r)= \underset{x_0 \in X}{\max} \int_{B(x_0,r)} \left\lVert F(x)-\hat{F}_n(x) \right\rVert^2 dx \leq C \frac{4}{3}\pi r^5 \end{equation}

which then allows us to introduce the first-order Taylor approximation:

\begin{equation} \forall x \in B(x_0,r), \phi(x) = \frac{\partial \hat{F}_n}{\partial x} \Big\rvert _{x=x_0}(x-x_0) + \hat{F}_n(x_0) \end{equation}

where $\frac{\partial \hat{F}_n}{\partial x} \Big\rvert _{x=x_0}$ is a Jacobian. This will allow us to construct a linear basis of piece-wise linear functions for $F$.
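Here is a one-dimensional sketch of (7): fit a polynomial $\hat{F}_n$, differentiate it, and compare it with its first-order Taylor approximation on a small ball around $x_0$. In one dimension the Jacobian is just a derivative, and the target, degree, $x_0$ and $r$ are arbitrary choices:

```python
# A 1-D sketch of the first-order Taylor approximation (7):
# phi(x) = F_hat'(x0) * (x - x0) + F_hat(x0) on the ball B(x0, r).
import numpy as np

F = lambda x: np.exp(np.sin(3.0 * x))
x = np.linspace(-1.0, 1.0, 2001)
F_hat = np.polynomial.Polynomial.fit(x, F(x), deg=9)  # polynomial approximation of F
dF_hat = F_hat.deriv()                                # its exact derivative (the "Jacobian")

x0, r = 0.3, 0.05
phi = lambda x: dF_hat(x0) * (x - x0) + F_hat(x0)     # local affine (Taylor) map

xs = np.linspace(x0 - r, x0 + r, 201)
print("max |F_hat - phi| on B(x0, r):", np.max(np.abs(F_hat(xs) - phi(xs))))
```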

Constructing a linear basis for continuous functions:

We may construct a linear basis of piece-wise linear functions $\phi_i$ for $F$ using (7), where the number $N(r)$ of piece-wise linear functions that allows us to approximate $F$ is given by:

\begin{equation} N(r) = \frac{0.74 \cdot \text{Vol}(X)}{\frac{4}{3}\pi r^3} \approx 0.18 \cdot \text{Vol}(X) \cdot r^{-3} \end{equation}

where $0.74$ is the density of closely-packed spheres in $\mathbb{R}^3$. Furthermore, the $\phi_i$ have compact support and pairwise disjoint domains by construction so we have:

\begin{equation} \forall i,j\neq i, \langle \phi_i,\phi_j \rangle = \int_X \phi_i(x) \cdot \phi_j(x) dx = 0 \end{equation}

and therefore we may approximate $F$ as follows:

\begin{equation} F(x) \approx \hat{F_r}(x) = \sum_{i=1}^{N(r)} \phi_i(x) \end{equation}
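The orthogonality in (9), and hence the decomposition in (10), rests purely on the disjoint supports, as a quick numerical check with two hypothetical scalar pieces makes plain:

```python
# Numerical check of (9): affine pieces supported on disjoint cells have a
# vanishing L2 inner product simply because their supports never overlap.
import numpy as np

x = np.linspace(0.0, 3.0, 3001)
dx = x[1] - x[0]

def piece(lo, hi, a, b):
    """phi(x) = a*x + b on [lo, hi), zero elsewhere."""
    return np.where((x >= lo) & (x < hi), a * x + b, 0.0)

phi_1 = piece(0.0, 1.0, 1.0, 0.0)
phi_2 = piece(1.0, 2.0, -0.5, 1.5)

print("<phi_1, phi_2> =", np.sum(phi_1 * phi_2) * dx)  # zero up to round-off
```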

We also find the following useful upper bound using (6) and (8):

\begin{equation} \mathcal{L}[\hat{F_r}] = \int_X \left\lVert F(x)-\hat{F_r}(x) \right\rVert^2 dx \leq C \cdot \text{Vol}(X) \cdot r^2 \end{equation}
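Putting (8)–(11) together in one dimension: partition $X = [-1,1]$ into cells of radius $r$, take the first-order Taylor map of a polynomial approximation at each cell centre, and measure the loss of the assembled piece-wise linear approximation as $r$ shrinks. The cells here are intervals rather than packed spheres, and for a smooth target the measured decay is in fact faster than the $r^2$ rate, which (11) only guarantees as an upper bound:

```python
# Assembling the piece-wise linear approximation (10) from local first-order
# Taylor maps on a uniform partition of X = [-1, 1] into cells of radius r,
# and checking that the loss (11) shrinks with r.
import numpy as np

F = lambda x: np.exp(np.sin(3.0 * x))
x = np.linspace(-1.0, 1.0, 4001)
dx = x[1] - x[0]
F_hat = np.polynomial.Polynomial.fit(x, F(x), deg=15)  # polynomial approximation of F
dF_hat = F_hat.deriv()

for r in [0.2, 0.1, 0.05, 0.025]:
    n_cells = int(round(2.0 / (2.0 * r)))                   # N(r) = Vol(X) / (2r) in 1-D
    centers = -1.0 + (2.0 * np.arange(n_cells) + 1.0) * r   # cell centres x_i
    idx = np.minimum(((x + 1.0) / (2.0 * r)).astype(int), n_cells - 1)  # cell of each x
    x0 = centers[idx]
    F_r = dF_hat(x0) * (x - x0) + F_hat(x0)                 # local Taylor map phi_i(x)
    loss = np.sum((F(x) - F_r) ** 2) * dx
    print(f"r = {r:5.3f}   N(r) = {n_cells:3d}   loss = {loss:.2e}")
```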

Now, the challenge is essentially to find the first-order Taylor approximations $\phi_i$, and I will try to show that this is basically what is being done by a deep neural network.

Fully-connected deep rectifier networks:

Let’s consider a fully-connected deep network with fixed width, relu activation, compact domain $X$ and compact co-domain $Y$:

\begin{equation} F_\theta: X \rightarrow Y \end{equation}

\begin{equation} F(x;\theta) = f \circ h_L \circ \dots \circ h_1(x) = f \circ \phi(x) \end{equation}

where $h_l(x) = \text{relu}(W_l x + b_l)$ and the parameters of $F_\theta$ are given by:

\begin{equation} \theta = \{W_l \in \mathbb{R}^{n_l \times n_{l-1}}, b_l \in \mathbb{R}^{n_l}: l \in [L]\} \end{equation}
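A minimal NumPy sketch of (13) and (14): the parameters $\theta$ are the per-layer pairs $(W_l, b_l)$, the hidden layers $h_l$ apply relu, and the output map $f$ is taken to be affine here, which is an assumption since the post leaves $f$ unspecified; the widths and random weights are arbitrary:

```python
# A fully-connected deep rectifier network F(x; theta) = f o h_L o ... o h_1(x).
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [3, 16, 16, 3]                  # n_0 (input), hidden widths, n_L (output)
theta = [(rng.standard_normal((m, n)) / np.sqrt(n), rng.standard_normal(m))
         for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]   # {(W_l, b_l) : l in [L]}

relu = lambda z: np.maximum(z, 0.0)

def F_theta(x):
    """Forward pass: hidden layers h_l(h) = relu(W_l h + b_l), affine output map f."""
    h = x
    for W, b in theta[:-1]:
        h = relu(W @ h + b)
    W, b = theta[-1]                          # output map f: affine, no relu
    return W @ h + b

print(F_theta(rng.standard_normal(3)))
```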

Now, let’s note that if $F_\theta$ has $N$ layers and $n$ nodes per layer there are:

\begin{equation} (n!)^N \end{equation}

layer-wise permutations that result in functions equivalent to $F_\theta$, so it’s more useful to think in terms of function spaces than particular parameterisations of a network.
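The permutation symmetry behind (15) is easy to verify directly: permuting the units of a hidden layer, and un-permuting the columns of the next weight matrix, leaves the network function unchanged. A small one-hidden-layer sketch with random parameters:

```python
# Permuting hidden units (rows of W1, entries of b1) and the matching columns
# of W2 produces a different parameterisation of the same function.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 3)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((2, 8)), rng.standard_normal(2)
relu = lambda z: np.maximum(z, 0.0)

F = lambda x, W1, b1, W2, b2: W2 @ relu(W1 @ x + b1) + b2

perm = rng.permutation(8)               # reorder the 8 hidden units
W1_p, b1_p = W1[perm], b1[perm]         # permute rows of W1 and entries of b1
W2_p = W2[:, perm]                      # permute columns of W2 to match

x = rng.standard_normal(3)
print(np.allclose(F(x, W1, b1, W2, b2), F(x, W1_p, b1_p, W2_p, b2)))  # True
```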

We may also observe that, similar to (10), $F_\theta$ is a linear combination of functions $\phi_i$ that are affine on $X_i$ and zero on $X \setminus X_i$. Since each $X_i$ is characterised by an activation pattern of $F_\theta$ and $F_\theta$ is continuous on $X$, each $X_i$ must be compact and the $X_i$ must be pair-wise disjoint so:

\begin{equation} \forall i, j\neq i, X_i \cap X_j = \emptyset \end{equation}

\begin{equation} \forall i,j\neq i, \langle \phi_i,\phi_j \rangle = \int_X \phi_i(x) \cdot \phi_j(x) dx = 0 \end{equation}

\begin{equation} \forall i, \exists A_i \in \mathbb{R}^{n \times n}, B_i \in \mathbb{R}^{n}: \forall x \in X_i, \phi_i(x) = A_i x + B_i \end{equation}

Now, if the reader is wondering about (16), let’s suppose that there exists $x \in X$ such that $x \in X_i$ and $x \in X_j$. Since $X_i$ and $X_j$ correspond to different activation patterns in $F_\theta$ there must be a node in $F_\theta$ with value $v$ such that $v > 0$ and $v = 0$ at the same input, which is a contradiction.
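The affine map in (18) can be read off from the activation pattern: replacing each relu by the 0/1 mask it takes at an input $x_0$ and composing the resulting affine layers yields $A$ and $B$ with $F_\theta(x) = Ax + B$ on the whole activation region containing $x_0$. A NumPy sketch with an arbitrary random network:

```python
# Extracting the local affine map (18) of a ReLU network at an input x0.
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 16, 16, 3]
theta = [(rng.standard_normal((m, n)) / np.sqrt(n), rng.standard_normal(m))
         for n, m in zip(sizes[:-1], sizes[1:])]
relu = lambda z: np.maximum(z, 0.0)

def forward(x):
    h = x
    for W, b in theta[:-1]:
        h = relu(W @ h + b)
    W, b = theta[-1]
    return W @ h + b

def local_affine(x0):
    """Compose the layers with relu replaced by the 0/1 mask it takes at x0."""
    A, B = np.eye(len(x0)), np.zeros(len(x0))
    h = x0
    for W, b in theta[:-1]:
        z = W @ h + b
        D = np.diag((z > 0).astype(float))  # activation pattern of this layer at x0
        A, B = D @ W @ A, D @ (W @ B + b)
        h = relu(z)
    W, b = theta[-1]
    return W @ A, W @ B + b

x0 = rng.standard_normal(3)
A, B = local_affine(x0)
print(np.allclose(forward(x0), A @ x0 + B))   # exact at x0
x1 = x0 + 1e-4 * rng.standard_normal(3)       # a nearby point, very likely same region
print(np.allclose(forward(x1), A @ x1 + B))   # exact as long as the pattern is unchanged
```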

Function Approximation with Neural Networks:

If we are trying to approximate a continuous function $F$ by means of $F_\theta$ we may use the results developed earlier and define:

\begin{equation} \mathcal{L}[F_{\theta}] = \int_X \left\lVert F(x)-F_{\theta}(x) \right\rVert^2 dx \end{equation}

and given a suitable vector-valued polynomial approximation $\tilde{F}_n$ such that:

\begin{equation} \mathcal{L}[\tilde{F}_n] \approx 0 \end{equation}

it is sufficient to choose the $\phi_i$ in $F_\theta$ such that:

\begin{equation} \forall x \in B(x_0,r), \phi_i(x) = \frac{\partial \tilde{F}_n}{\partial x} \Big\rvert _{x=x_0}(x-x_0) + \tilde{F}_n(x_0) \end{equation}

We may also note that for a deep rectifier network with $N$ nodes, each node is either on or off so it has at most $2^N$ possible activation patterns, which correspond to distinct $\phi_i$ and a partition of $X$ into pair-wise disjoint and compact subsets $X_i$. In principle, this means that for networks with $N$ nodes and at least one hidden layer we should empirically observe:

\begin{equation} \mathcal{L}[F_\theta] \sim \frac{1}{2^N} \end{equation}
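Whether the loss really scales like $2^{-N}$ is an empirical question; what is easy to probe is how many activation patterns a network actually realises. The sketch below samples inputs from a cube, records each input's on/off pattern over the hidden nodes of a random network, and compares the number of distinct patterns with the $2^N$ upper bound:

```python
# Counting the distinct activation patterns realised by a random ReLU network.
import numpy as np

rng = np.random.default_rng(2)
sizes = [3, 8, 8, 3]                                    # N = 16 hidden nodes
theta = [(rng.standard_normal((m, n)) / np.sqrt(n), rng.standard_normal(m))
         for n, m in zip(sizes[:-1], sizes[1:])]
relu = lambda z: np.maximum(z, 0.0)

def pattern(x):
    """Concatenated 0/1 activation pattern over the hidden layers."""
    bits, h = [], x
    for W, b in theta[:-1]:
        z = W @ h + b
        bits.extend((z > 0).astype(int))
        h = relu(z)
    return tuple(bits)

samples = rng.uniform(-1.0, 1.0, size=(20000, 3))       # X = [-1, 1]^3
patterns = {pattern(x) for x in samples}
N = sum(sizes[1:-1])
print(f"distinct patterns: {len(patterns)}   upper bound 2^{N} = {2 ** N}")
```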

If there are any doubts concerning the existence of suitable polynomials, note that for finitely many points there are infinitely many interpolating polynomials. The interesting question is which polynomial generalises best, but that is beyond the scope of the present article.
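A quick NumPy illustration of the interpolation remark: given any interpolant $p$ through finitely many points, adding $q(x)\prod_i(x-x_i)$ for an arbitrary polynomial $q$ yields another interpolant, so there are infinitely many. The sample points and the choice of $q$ are arbitrary:

```python
# Infinitely many polynomials interpolate finitely many points.
import numpy as np
from numpy.polynomial import Polynomial

xs = np.array([-1.0, -0.3, 0.4, 1.0])
ys = np.exp(np.sin(3.0 * xs))                 # four samples of a continuous function

p = Polynomial.fit(xs, ys, deg=len(xs) - 1)   # a degree-3 interpolant
vanish = Polynomial.fromroots(xs)             # vanishes at every sample point
q = Polynomial([2.0, -1.0, 0.5])              # any polynomial at all

p2 = p.convert() + q * vanish                 # another interpolant of the same points
print(np.allclose(p(xs), ys), np.allclose(p2(xs), ys))  # True True
```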

Discussion:

At this point this idea still represents a rough sketch that must be developed further. But I think it also provides some insight into why neural networks are so good at function approximation: they approximate the intrinsic geometry of a mapping using piece-wise linear approximations, which works because we can find a suitable polynomial approximation, and all polynomials are locally Lipschitz.

Thoughts and comments are welcome!
