# Introduction:

About a year ago I read ‘The Physical Systems behind Optimisation Algorithms’ [1] and ‘Towards an Integration of Deep Learning and Neuroscience’ [2]. The first paper showed a direct link between gradient-based optimisers and the dynamics of damped harmonic oscillators, while the second suggested that the brain might be doing gradient-based optimisation similar to the backpropagation algorithm that has been so useful for deep learning. It occurred to me at the time that there might be a connection between the two, but it only became clear to me a couple of days ago, when I printed out the ‘Physical Systems’ paper and thought it through.

Given my particular interest in deep reinforcement learning, aka deep RL, I make the following observations:

1. A continuous approximation of gradient-based optimisation schemes corresponds to a family of damped harmonic oscillators.
2. It follows that during training a deep RL system dissipates useful energy at an exponential rate while minimising entropy with respect to the training data distribution.
3. In this physical setting it’s clear that gradient-based optimisers have an exponential rate of convergence to local minima.
4. The exponential convergence to thermodynamic equilibrium behaviour of gradient-based optimisers means that the training and test regimes of deep RL systems have very different thermodynamic profiles. In fact, the system converges to zero neural plasticity which isn’t a problem provided that the ‘test’ environment is stationary and identical to the ‘training’ environment.
5. For biological systems on the other hand there can be no clean separation between ‘training’ and ‘test’ regimes. They must learn in a continuous manner as their natural environment is neither ergodic nor stationary, which precludes convergence to thermodynamic equilibrium in biological brains.

As a result of this analysis I propose that we look beyond gradient-based optimisation for deep RL and abandon the training and test paradigm for reinforcement learning in favour of frameworks and heuristics for continual learning. The relevance of this analysis to neuroscience researchers is that backpropagation is unlikely to be a ‘general’ optimisation scheme in the brain, as suggested in one of the sections of [2].

# Continuous approximation of Gradient Descent algorithms:

We shall proceed as done in [1] and express Vanilla Gradient Descent and Nesterov’s Accelerated Gradient in the following form:

$$x_{k+1}-x_k -\alpha(x_k-x_{k-1}) + \eta \nabla F(x_k +\alpha(x_k-x_{k-1})) = 0$$

where we assume that the momentum parameter $\alpha \in [0,1)$ (Vanilla Gradient Descent corresponds to $\alpha = 0$) and $\eta \in (0,1)$ corresponds to the learning rate.
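To make update (1) concrete, here is a minimal sketch of the iteration. The quadratic potential and the parameter values are my own toy choices for illustration, not taken from [1]:

```python
import numpy as np

def accelerated_gd_step(x_k, x_km1, grad_F, alpha, eta):
    """One step of update (1):
    x_{k+1} = x_k + alpha*(x_k - x_{k-1}) - eta * grad_F(x_k + alpha*(x_k - x_{k-1}))."""
    momentum = alpha * (x_k - x_km1)
    return x_k + momentum - eta * grad_F(x_k + momentum)

# Toy quadratic potential F(x) = 0.5 * ||x||^2, so grad_F(x) = x.
grad_F = lambda x: x

x_prev = x = np.array([1.0, -2.0])   # start at rest: x_{-1} = x_0
for _ in range(200):
    x, x_prev = accelerated_gd_step(x, x_prev, grad_F, alpha=0.9, eta=0.1), x

print(np.linalg.norm(x))  # the iterates have essentially reached the minimum at 0
```

Setting `alpha=0.0` recovers the vanilla update $x_{k+1} = x_k - \eta \nabla F(x_k)$ as a special case.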

## Taylor expansions in the continuous setting:

If we define a continuous time variable $X(t) = x_{\lceil t/h \rceil} = x_k$ in terms of a time-scaling factor $h$, a Taylor expansion gives us:

$$x_{k+1}-x_k = \dot{X}(t)h + \frac{1}{2}\ddot{X}(t)h^2 + \mathcal{O}(h^3), \qquad x_k-x_{k-1} = \dot{X}(t)h - \frac{1}{2}\ddot{X}(t)h^2 + \mathcal{O}(h^3)$$

and given that $x_k - x_{k-1} = \mathcal{O}(h)$ vanishes in the continuous limit, we have:

$$\eta \nabla F(x_k +\alpha(x_k-x_{k-1})) = \eta \nabla F(X(t)) + \mathcal{O}(\eta h)$$

## Damped Harmonic Oscillators:

Using (2) and (3), we may re-formulate (1) as:

$$\frac{(1+\alpha)h^2}{2\eta}\ddot{X}(t)+ \frac{(1-\alpha)h}{\eta}\dot{X}(t)+ \nabla F(X(t)) + \mathcal{O}(h) = 0$$

and in the limit as $h \rightarrow 0$ we recover a damped oscillator system:

$$m\ddot{X}(t) + B\dot{X}(t)+ \nabla F(X(t)) = 0$$

where $m:=\frac{(1+\alpha)h^2}{2\eta}$ refers to the particle mass, $B:=\frac{(1-\alpha)h}{\eta}$ refers to the damping coefficient and the function $F$ describes the potential field.
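We can sanity-check this correspondence numerically. The sketch below runs the discrete update (1) (one iterate per time step $h$) next to a fine-grained integration of the oscillator ODE with $m$ and $B$ as defined above; the 1-D quadratic potential and the parameter values are assumptions for the purpose of illustration:

```python
import numpy as np

# Assumed toy setting: 1-D potential F(x) = x^2 / 2, so grad F(x) = x.
alpha, eta, h = 0.9, 0.001, 0.01
m = (1 + alpha) * h**2 / (2 * eta)   # particle mass
B = (1 - alpha) * h / eta            # damping coefficient

# Discrete iterates of update (1); iterate k corresponds to time t = k * h.
x_prev = x = 1.0
for _ in range(200):                 # total time T = 200 * h = 2.0
    d = x - x_prev
    x, x_prev = x + alpha * d - eta * (x + alpha * d), x

# Fine-grained (semi-implicit Euler) integration of m*X'' + B*X' + X = 0 up to T = 2.0.
X, V, dt = 1.0, 0.0, 1e-4
for _ in range(int(2.0 / dt)):
    V += dt * (-(B * V + X) / m)
    X += dt * V

print(x, X)  # the discrete iterate and the ODE solution end close together
```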

# Exponential rate of convergence of damped oscillators:

## Linear approximation:

Given the expression (4) we note that if $h$ is sufficiently small then an Euler integration scheme is valid for simulating the damped oscillator system and this corresponds to using a linear approximation of $F$ in a neighborhood of $X(t)$ so we have:

$$F(X(t)) = A_tX(t) + C_t$$

and (5) now simplifies to:

$$m\ddot{X}(t) + B\dot{X}(t)+ A_t = 0$$

where we assume that $\lim_{t \to \infty} \lVert A_t \rVert = 0$ which corresponds to attaining a local minimum.

## Exponential closed form:

Now, if we solve (7), treating $A_t$ as locally constant, we obtain the following closed form expression:

$$X(t) = -\frac{A_t t}{B} + \frac{C_1 m e^{-Bt/m}}{B} + C_2$$

so the transient term decays at an exponential rate: $\dot{X}(t) \rightarrow -\frac{A_t}{B}$ exponentially fast, and since $\lVert A_t \rVert \rightarrow 0$ the system converges to a local minimum.
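This exponential rate is easy to observe empirically. In the sketch below (toy overdamped parameters on a quadratic, assumed for illustration), $\log \lvert x_k \rvert$ becomes linear in $k$ once the fast transient dies out, i.e. the distance to the minimum decays exponentially:

```python
import numpy as np

# Assumed toy setting: overdamped parameters on F(x) = x^2 / 2.
alpha, eta = 0.5, 0.1
x_prev = x = 1.0
log_dist = []
for k in range(100):
    d = x - x_prev
    x, x_prev = x + alpha * d - eta * (x + alpha * d), x
    log_dist.append(np.log(abs(x)))

# Fit a line to log|x_k| after the transient: its slope is the
# constant per-step log-decay rate of the distance to the minimum.
slope = np.polyfit(np.arange(50, 100), log_dist[50:], 1)[0]
print(slope)  # negative and essentially constant: exponential convergence
```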

# The thermodynamics of deep learning:

It’s important to note that due to (8), during the training of a deep RL system the kinetic energy of the associated damped oscillator decays at an exponential rate:

$$K.E. = \frac{1}{2}m\dot{X}(t)^2 \sim \frac{1}{2}m C_1^2 e^{-2Bt/m}$$

This is what we would expect of a dissipative system. In some sense the deep learning system is crystallizing during training as gradient-based optimisation corresponds to entropy minimisation with respect to the training data distribution. Furthermore, this means that the training and test regimes of a deep learning system have very different thermodynamic profiles. Is this a reasonable framework for reinforcement learning?
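The dissipation is visible directly in the discrete iterates. Below, a finite-difference velocity gives a kinetic energy whose logarithm falls off linearly in $k$, the discrete analogue of (9); the exponent $2B/m$ applies in the continuous limit, and $h = 1$ and the other parameters here are assumptions for illustration:

```python
import numpy as np

# Assumed toy setting: h = 1, overdamped parameters, F(x) = x^2 / 2.
alpha, eta, h = 0.5, 0.1, 1.0
m = (1 + alpha) * h**2 / (2 * eta)   # particle mass, here 7.5

x_prev = x = 1.0
log_ke = []
for k in range(100):
    d = x - x_prev
    x, x_prev = x + alpha * d - eta * (x + alpha * d), x
    v = (x - x_prev) / h             # finite-difference velocity
    log_ke.append(np.log(0.5 * m * v**2))

# Slope of log K.E. after the transient: constant and negative,
# i.e. kinetic energy is dissipated at an exponential rate.
slope = np.polyfit(np.arange(50, 100), log_ke[50:], 1)[0]
print(slope)
```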

For biological systems that learn there is no clean separation between ‘training’ and ‘test’ regimes. They must learn in a continuous manner as their natural environment is neither ergodic nor stationary.

# Proposed research directions:

Given the insights revealed by the duality of gradient-based optimisers and damped harmonic oscillators, I propose the following:

1. Reinforcement Learning researchers should look for alternatives to gradient-based adaptation schemes for reinforcement learning agents that encourage lifelong neural plasticity.
2. We should abandon the training and test paradigm in favour of frameworks and heuristics for continuous and open-ended learning. One example is the learning progress measure that has been proposed by Adrien Baranès and Pierre-Yves Oudeyer [3].
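To give a flavour of the second direction, here is a heavily simplified sketch of a learning-progress signal. This is only inspired by the idea behind [3], not their actual R-IAC algorithm, and the window size, error model and thresholds are my own toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def learning_progress(errors, window=10):
    """Learning progress as the recent decrease in prediction error:
    difference between the older and newer windowed mean error.
    Positive while the learner is still improving, ~0 at a plateau."""
    if len(errors) < 2 * window:
        return 0.0
    older = np.mean(errors[-2 * window:-window])
    newer = np.mean(errors[-window:])
    return older - newer

# Toy error curve: exponential improvement followed by a plateau,
# plus a little observation noise.
errors = [np.exp(-t / 20) + 0.01 * rng.standard_normal() for t in range(200)]

early = learning_progress(errors[:40])   # while the model is improving
late = learning_progress(errors)         # once learning has plateaued
print(early, late)
```

An agent maximising such a signal keeps seeking out regions where it is still improving, rather than converging to zero plasticity once a fixed training distribution has been fit.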

That said, this problem appears to have no easy solution and I believe the only way forward is for the reinforcement learning community to think carefully about it.

# References:

1. L. F. Yang, R. Arora, V. Braverman & T. Zhao. The Physical Systems Behind Optimization Algorithms. 2016.
2. A. Marblestone, G. Wayne & K. Kording. Towards an Integration of Deep Learning and Neuroscience. 2016.
3. A. Baranès & P.-Y. Oudeyer. R-IAC: Robust Intrinsically Motivated Exploration and Active Learning. 2009.