## Introduction:

The typical deep neural network tutorial introduces deep networks as compositions of nonlinearities and affine transforms. In fact, a deep network with relu activation simplifies to a linear combination of affine transformations with compact support. But why would affine transformations be useful for nonlinear regression?

After some reflection, it occurred to me that the reason they work is that they are actually first-order Taylor approximations, and by this logic partial derivatives, i.e. Jacobians, are computational primitives for both inference and learning. Of course, this also depends on whether machine learning researchers are doing physics or stamp collecting.

## Approximating continuous functions with piece-wise linear functions:

A piece-wise linear function $f:\mathbb{R} \rightarrow \mathbb{R}$ is defined as follows:

$$f(x) = \sum_i \lambda_i(x) \cdot 1_{X_i}(x)$$

where $1_{X_i}$ is the indicator function and the $\lambda_i$ are affine functions. Clearly, every polynomial $p: \mathbb{R} \rightarrow \mathbb{R}$ may be approximated on a compact interval using piece-wise linear functions, since polynomials are locally Lipschitz.
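A minimal 1D sketch of this definition (the helper `piecewise_linear` and the example $p(x)=x^3$ are illustrative choices of my own, not from any library): each affine piece $\lambda_i$ is the tangent line of $p$ at the midpoint of its interval, and the maximum error shrinks as the grid is refined.

```python
# A minimal 1D sketch: approximate p(x) = x^3 on [-1, 1] by a piece-wise
# linear function whose pieces are tangent lines at interval midpoints.

def piecewise_linear(p, dp, knots):
    """f(x) = sum_i lambda_i(x) * 1_{X_i}(x), where lambda_i is the tangent
    line of p at the midpoint of the i-th interval [knots[i], knots[i+1]]."""
    def f(x):
        for a, b in zip(knots, knots[1:]):
            if a <= x <= b:                          # indicator 1_{X_i}(x)
                x0 = 0.5 * (a + b)
                return p(x0) + dp(x0) * (x - x0)     # affine piece lambda_i(x)
        raise ValueError("x is outside the domain")
    return f

p = lambda x: x ** 3          # a polynomial to approximate
dp = lambda x: 3 * x ** 2     # its derivative

for n in (4, 16, 64):
    knots = [-1 + 2 * i / n for i in range(n + 1)]
    f = piecewise_linear(p, dp, knots)
    xs = [-1 + 2 * i / 1000 for i in range(1001)]
    err = max(abs(p(x) - f(x)) for x in xs)
    print(n, err)  # the max error shrinks roughly like 1/n^2
```

The $1/n^2$ rate is exactly the local second-order Taylor remainder, which foreshadows the error bounds below.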

Now, let’s consider a nonlinear continuous function with compact domain and compact co-domain $X,Y \subset \mathbb{R}^3$:

$$F: X \rightarrow Y$$

Since the polynomials are dense in the space of continuous functions, due to the Stone-Weierstrass theorem, we may approximate $F$ with a polynomial of degree $n$, $F_n \in P_n$. If we define the functional:

$$\mathcal{L}[F_n] = \int_X \left\lVert F(x)-F_n(x) \right\rVert^2 dx$$

the sequence $\{\hat{F}_n\}_{n=1}^\infty$ of minimisers converges to $F$ in the $L^2$ sense:

$$\hat{F}_n = \underset{F_n \in P_n}{\arg\min} \mathcal{L}[F_n]$$

$$\lim_{n \to \infty} \mathcal{L}[\hat{F}_n] = 0$$

Moreover, the Lipschitz condition also allows us to define and bound the maximum local error with respect to a ball $B(x_0,r)$:

$$\exists C > 0, E_n(r)= \underset{x_0 \in X}{\max} \int_{B(x_0,r)} \left\lVert F(x)-\hat{F}_n(x) \right\rVert^2 dx \leq C \frac{4}{3}\pi r^5$$

which then allows us to introduce the first-order Taylor approximation:

$$\forall x \in B(x_0,r), \phi(x) = \frac{\partial \hat{F}_n}{\partial x} \Big\rvert _{x=x_0}(x-x_0) + \hat{F}_n(x_0)$$

where $\frac{\partial \hat{F}_n}{\partial x}$ is a Jacobian. This will allow us to construct a linear basis of $\phi_i$ for $F$.
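A numerical sketch of this construction (the map `F` and the finite-difference `jacobian` helper below are illustrative choices of mine): the first-order Taylor approximation $\phi(x) = \hat{F}(x_0) + J(x_0)(x-x_0)$ has error $O(r^2)$ inside $B(x_0,r)$.

```python
import numpy as np

# Sketch: first-order Taylor approximation of a smooth map F: R^3 -> R^3,
# with the Jacobian J estimated by central finite differences.

def F(x):
    return np.array([np.sin(x[0]) * x[1], x[1] * x[2], np.cos(x[2])])

def jacobian(F, x0, eps=1e-5):
    """Estimate the Jacobian of F at x0 by central differences."""
    n = len(x0)
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (F(x0 + e) - F(x0 - e)) / (2 * eps)
    return J

x0 = np.array([0.1, 0.2, 0.3])
J = jacobian(F, x0)
phi = lambda x: F(x0) + J @ (x - x0)

# Inside a ball B(x0, r) the approximation error is O(r^2):
direction = np.array([1.0, 0.0, 0.0])
for r in (0.1, 0.01):
    x = x0 + r * direction
    print(r, np.linalg.norm(F(x) - phi(x)))
```

Shrinking $r$ by a factor of 10 shrinks the error by roughly a factor of 100, consistent with the second-order remainder.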

## Constructing a linear basis for continuous functions:

We may construct a linear basis of $\phi_i$ for $F$ using (7) where the number of piece-wise linear $\phi_i$ that allows us to approximate $F$ is given by:

$$N(r) = \frac{0.74 \cdot \text{Vol}(X)}{\frac{4}{3}\pi r^3} \approx 0.18 \cdot \text{Vol}(X) \cdot r^{-3}$$

where $0.74$ is the density of closely-packed spheres in $\mathbb{R}^3$. Furthermore, the $\phi_i$ have compact support and pair-wise disjoint domains by construction so we have:

$$\forall i,j\neq i, \langle \phi_i,\phi_j \rangle = \int_X \phi_i(x) \cdot \phi_j(x) dx = 0$$

and therefore we may approximate $F$ as follows:

$$F(x) \approx \hat{F_r}(x) = \sum_{i=1}^{N(r)} \phi_i(x)$$

We also find the following useful upper-bound using (6) and (8):

$$\mathcal{L}[\hat{F_r}] = \int_X \left\lVert F(x)-\hat{F_r}(x) \right\rVert^2 dx \leq C \cdot \text{Vol}(X) \cdot r^2$$
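The ball count $N(r)$ is easy to sanity-check numerically (an illustrative sketch; $\text{Vol}(X)=1$ is an assumed value and the density $0.74$ comes from the text above):

```python
import math

# Sketch: the number of affine pieces N(r) needed to cover a unit-volume
# X in R^3 at sphere-packing density ~0.74.

def N(r, vol_X=1.0, density=0.74):
    return density * vol_X / ((4 / 3) * math.pi * r ** 3)

print(N(0.1))            # about 177 affine pieces at r = 0.1
print(N(0.05) / N(0.1))  # halving r multiplies the count by 8, i.e. r^-3 scaling
```

So the number of pieces grows like $r^{-3}$ while the bound on $\mathcal{L}$ shrinks like $r^2$: more pieces buy a better approximation at a cubically growing cost.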

Now, the challenge is essentially to find these first-order Taylor approximations, and I will try to show that this is basically what a deep neural network does.

## Fully-connected deep rectifier networks:

Let’s consider a fully-connected deep network with fixed width, relu activation and compact domain $X \subset \mathbb{R}^n$ and co-domain $Y \subset \mathbb{R}^n$:

$$F_\theta: X \rightarrow Y$$

$$F(x;\theta) = f \circ h_L \circ … \circ h_1(x) = f \circ \phi(x)$$

where $h_l(x) = \text{relu}(W_l x + b_l)$, $\text{relu}(x)=\max(0,x)$ is applied element-wise, $f$ is the output layer, and the parameters of $F_\theta$ are given by:

$$\theta = \{W_l \in \mathbb{R}^{n_l \times n_{l-1}}, b_l \in \mathbb{R}^{n_l}: l \in [L]\}$$
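A minimal forward pass for $F(x;\theta)$ as defined above might look as follows (my own sketch; the width, depth, random parameters, and the name `forward` are arbitrary illustrative choices, with an affine output layer assumed for $f$):

```python
import numpy as np

# Sketch: a fully-connected rectifier network F(x; theta) with fixed width.

relu = lambda z: np.maximum(0.0, z)

def forward(x, weights, biases):
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(W @ x + b)              # hidden layers h_1, ..., h_L
    return weights[-1] @ x + biases[-1]  # affine output layer f

rng = np.random.default_rng(0)
n, L = 4, 3                              # fixed width n, L hidden layers
weights = [rng.standard_normal((n, n)) for _ in range(L + 1)]
biases = [rng.standard_normal(n) for _ in range(L + 1)]

x = rng.standard_normal(n)
y = forward(x, weights, biases)          # a point in R^n
```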

Now, let’s note that if $F_\theta$ has $L$ hidden layers and $n$ nodes per layer, there are:

$$(n!)^L$$

layer-wise permutations that result in functions equivalent to $F_\theta$, so it’s more useful to think in terms of function spaces than particular parameterisations of a network.
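This permutation equivalence is easy to verify numerically for a single hidden layer (an illustrative sketch; permuting the rows of $W_1, b_1$ and the matching columns of $W_2$ leaves the function unchanged):

```python
import numpy as np

# Sketch: permuting a hidden layer's units gives an equivalent parameterisation.

relu = lambda z: np.maximum(0.0, z)
rng = np.random.default_rng(1)
n = 5
W1, b1 = rng.standard_normal((n, n)), rng.standard_normal(n)
W2, b2 = rng.standard_normal((n, n)), rng.standard_normal(n)

def F(x, W1, b1, W2, b2):
    return W2 @ relu(W1 @ x + b1) + b2

P = np.eye(n)[rng.permutation(n)]            # a random permutation matrix
x = rng.standard_normal(n)

y = F(x, W1, b1, W2, b2)
y_perm = F(x, P @ W1, P @ b1, W2 @ P.T, b2)  # permuted parameterisation
print(np.allclose(y, y_perm))                # True
```

This works because a permutation commutes with the element-wise relu: $\text{relu}(Pz) = P\,\text{relu}(z)$.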

We may also observe that similar to (11) $F_\theta$ is a linear combination of functions $\phi_i$ that are affine on $X_i$ and zero on $X \setminus X_i$. Since $\phi_i$ is characterised by the activation patterns of $F_\theta$ and $\phi_i$ is continuous on $X_i$, $X_i$ must be compact and $X_i$ and $X_j$ must be pair-wise disjoint so:

$$\forall i, j \neq i, \; X_i \cap X_j = \emptyset$$

$$\forall i,j\neq i, \langle \phi_i,\phi_j \rangle = \int_X \phi_i(x) \cdot \phi_j(x) dx = 0$$

$$\exists A_i \in \mathbb{R}^{n \times n}, B_i \in \mathbb{R}^{n}, \forall x \in X_i, \ \phi_i(x) = A_i x + B_i$$

Now, if the reader is wondering about (16), let’s suppose that there exists $x \in X$ such that $x \in X_i$ and $x \in X_j$. Since $\phi_i$ and $\phi_j$ correspond to different activation patterns in $F_\theta$, there must be a node in $F_\theta(x)$ with value $\alpha \in \mathbb{R}$ such that both $\alpha > 0$ and $\alpha \leq 0$, which is a contradiction.
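The claim that $F_\theta$ is exactly affine on each activation region can also be checked numerically (a sketch with a random one-hidden-layer network; all names are illustrative): along a segment small enough to stay inside one region, the second difference of the network vanishes up to float rounding.

```python
import numpy as np

# Sketch: within one activation region X_i, a rectifier network is affine,
# so its second difference along a small segment is ~0.

relu = lambda z: np.maximum(0.0, z)
rng = np.random.default_rng(2)
n = 4
W1, b1 = rng.standard_normal((n, n)), rng.standard_normal(n)
W2, b2 = rng.standard_normal((n, n)), rng.standard_normal(n)
F = lambda x: W2 @ relu(W1 @ x + b1) + b2

x0 = rng.standard_normal(n)
d = 1e-6 * rng.standard_normal(n)   # small step, generically inside one region

# confirm all three points share one activation pattern
pattern = lambda x: W1 @ x + b1 > 0
same_region = (np.array_equal(pattern(x0), pattern(x0 + d))
               and np.array_equal(pattern(x0), pattern(x0 - d)))

second_diff = F(x0 + d) - 2 * F(x0) + F(x0 - d)
print(same_region, np.max(np.abs(second_diff)))
```

If the activation pattern were allowed to change between the three points, the second difference would pick up the kink of the relu and would no longer vanish.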

## Function Approximation with Neural Networks:

If we are trying to approximate a continuous function $F:X \rightarrow Y$ by means of $F_\theta$ we may use the results developed earlier and define:

$$\mathcal{L}[F_{\theta}] = \int_X \left\lVert F(x)-F_{\theta}(x) \right\rVert^2 dx$$

and given a suitable vector-valued polynomial approximation $\tilde{F}$ such that:

$$\mathcal{L}[\tilde{F}] \approx 0$$

it is sufficient to choose the $\phi_i$ in $F_\theta$ such that:

$$\forall x \in B(x_0,r), \phi_i(x) = \frac{\partial \tilde{F}}{\partial x} \Big\rvert _{x=x_0}(x-x_0) + \tilde{F}(x_0)$$

We may also note that for a deep rectifier network with $N$ nodes, each node is either on or off, so there are at most $2^N$ possible activation patterns, which correspond to distinct $\phi_i$ and a partition of $X$ into at most $2^N$ pair-wise disjoint and compact subsets. In principle, this means that for networks $F_\theta$ with $N$ nodes and at least one hidden layer we should empirically observe:

$$\mathcal{L}[F_\theta] \sim \frac{1}{2^N}$$
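The count of realised activation patterns can be probed empirically (an illustrative sketch of mine; the tiny architecture and sample count are arbitrary choices): the number of distinct patterns found by sampling is bounded by $2^N$.

```python
import numpy as np

# Sketch: enumerate the activation patterns a small rectifier network
# realises on random inputs from X = [-1, 1]^2; the count is at most 2^N.

relu = lambda z: np.maximum(0.0, z)
rng = np.random.default_rng(3)
n = 3                                   # two hidden layers of width 3: N = 6 nodes
W1, b1 = rng.standard_normal((n, 2)), rng.standard_normal(n)
W2, b2 = rng.standard_normal((n, n)), rng.standard_normal(n)

patterns = set()
for _ in range(20000):
    x = rng.uniform(-1.0, 1.0, size=2)
    z1 = W1 @ x + b1
    z2 = W2 @ relu(z1) + b2
    patterns.add(tuple(np.concatenate([z1, z2]) > 0))

print(len(patterns), "of", 2 ** (2 * n), "possible patterns realised")
```

In practice the realised count is often well below $2^N$, since not every on/off combination corresponds to a non-empty region of $X$.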

If there are any doubts concerning the existence of suitable polynomials, note that for finitely many points $(p,q) \in \mathbb{R}^n \times \mathbb{R}^m$ there are infinitely many interpolating polynomials. The interesting question is which polynomial generalises best, but that is beyond the scope of the present article.

## Discussion:

At this point the idea is still a rough sketch that must be developed further. But I think it also provides some insight into why neural networks are so good at function approximation: they approximate the intrinsic geometry of a mapping using piece-wise linear approximations, which works because we can find a suitable polynomial approximation, and all polynomials are locally Lipschitz.