Nesterov Accelerated Gradient (NAG) is gradient descent with a momentum term, in which the gradient is evaluated at a look-ahead point; the update may be expressed as follows:

\begin{equation} (x_{k+1}-x_k) - \alpha \cdot (x_k -x_{k-1}) + \eta \cdot \nabla f(x_k + \alpha \cdot (x_k -x_{k-1})) = 0 \end{equation}

where \(\alpha\) is the momentum (damping) parameter and \(\eta\) is the learning rate.
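
As a concrete illustration, here is a minimal Python sketch of the update rule (1) on a toy quadratic; the objective, the values of \(\alpha\) and \(\eta\), and the iteration count are illustrative choices rather than anything prescribed above.

```python
import numpy as np

def nag(grad_f, x0, alpha=0.9, eta=0.01, n_steps=500):
    """NAG update (1): x_{k+1} = y_k - eta * grad_f(y_k), where
    y_k = x_k + alpha * (x_k - x_{k-1}) is the look-ahead point."""
    x_prev, x = x0.copy(), x0.copy()        # start at rest: x_{-1} = x_0
    for _ in range(n_steps):
        y = x + alpha * (x - x_prev)        # look-ahead (momentum) point
        x_prev, x = x, y - eta * grad_f(y)  # gradient step from the look-ahead point
    return x

# Illustrative quadratic f(x) = 0.5 * x^T A x, so grad f(x) = A x.
A = np.diag([1.0, 10.0])
print(nag(lambda x: A @ x, x0=np.array([5.0, 5.0])))  # tends to the minimiser (the origin)
```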

In order to perform a continuum-limit approximation of (1), we may introduce a step size \(h > 0\) and define:

\begin{equation} t = k \cdot h \end{equation}

\begin{equation} X(t) := x_{\lfloor t/h \rfloor} = x_k \end{equation}

so that \(x_k = X(kh)\); assuming \(X\) extends to a smooth curve in \(t\), a Taylor expansion about \(t = kh\) gives:

\begin{equation} X(t+h) = X(t) + \dot{X}(t) \cdot h + \frac{1}{2} \ddot{X}(t) \cdot h^2 + \mathcal{O}(h^3) \end{equation}

\begin{equation} X(t-h) = X(t) - \dot{X}(t) \cdot h + \frac{1}{2} \ddot{X}(t) \cdot h^2 + \mathcal{O}(h^3) \end{equation}

This allows us to derive the following continuous-time approximations:

\begin{equation} x_{k+1}-x_k = \dot{X}(t) \cdot h + \frac{1}{2} \ddot{X}(t) \cdot h^2 + \mathcal{O}(h^3) \end{equation}

\begin{equation} x_{k}-x_{k-1} = \dot{X}(t) \cdot h - \frac{1}{2} \ddot{X}(t) \cdot h^2 + \mathcal{O}(h^3) \end{equation}

\begin{equation} \eta \cdot \nabla f(x_k + \alpha \cdot (x_k -x_{k-1})) = \eta \cdot \nabla f(X(t)) + \mathcal{O}(\eta h) \end{equation}

where the last step uses \(x_k - x_{k-1} = \mathcal{O}(h)\) and the smoothness of \(\nabla f\).
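
Substituting these approximations into (1) and collecting terms (dropping the higher-order remainders) gives:

\begin{equation} \frac{(1+\alpha) \cdot h^2}{2} \cdot \ddot{X}(t) + (1-\alpha) \cdot h \cdot \dot{X}(t) + \eta \cdot \nabla f(X(t)) = 0 \end{equation}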

Dividing through by \(\eta\), in the continuum limit we therefore obtain the differential equation of a Damped Harmonic Oscillator:

\begin{equation} m \cdot \ddot{X}(t) + c \cdot \dot{X}(t) + \nabla f(X(t)) = 0 \end{equation}

where \(m:= \frac{(1+\alpha) \cdot h^2}{2 \eta}\) is the particle mass, \(c:=\frac{(1-\alpha) \cdot h}{\eta}\) is the damping coefficient and \(f(\cdot)\) is the potential field.
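
For instance, for a quadratic potential \(f(X) = \frac{1}{2} X^\top A X\) with \(A \succ 0\) (an illustrative special case), the gradient is \(\nabla f(X) = A X\) and the equation becomes the classical linear damped harmonic oscillator \(m \cdot \ddot{X}(t) + c \cdot \dot{X}(t) + A \cdot X(t) = 0\).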

Therefore, from an optimisation perspective, the equilibria of this dissipative system are the stationary points of the potential; for a convex potential \(f\), the equilibrium is precisely the minimiser.
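
The correspondence can also be checked numerically. The sketch below runs the discrete recursion (1) alongside the damped-oscillator equation on an illustrative one-dimensional quadratic potential; the values of \(\alpha\) and \(\eta\), the scaling \(h = \sqrt{\eta}\), the semi-implicit Euler integrator and the horizon are all assumptions made for the experiment rather than part of the derivation.

```python
import numpy as np

# Illustrative 1-D quadratic f(x) = 0.5 * a * x^2, so grad f(x) = a * x.
a, alpha, eta = 2.0, 0.9, 1e-3
grad_f = lambda x: a * x

# Time step: choosing h ~ sqrt(eta) keeps the mass m of order one (a modelling choice).
h = np.sqrt(eta)

# Discrete NAG iterates of (1), started at rest.
K = 2000
x = np.zeros(K)
x[0] = x[1] = 1.0
for k in range(1, K - 1):
    y = x[k] + alpha * (x[k] - x[k - 1])   # look-ahead point
    x[k + 1] = y - eta * grad_f(y)

# Damped-oscillator ODE m*X'' + c*X' + grad_f(X) = 0,
# integrated with semi-implicit Euler at the same time step h.
m = (1 + alpha) * h**2 / (2 * eta)
c = (1 - alpha) * h / eta
X, V = np.zeros(K), np.zeros(K)
X[0] = 1.0
for k in range(K - 1):
    V[k + 1] = V[k] - h * (c * V[k] + grad_f(X[k])) / m
    X[k + 1] = X[k] + h * V[k + 1]

# Maximum gap between the two trajectories (small relative to the unit initial amplitude).
print(np.max(np.abs(x - X)))
```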

References:

  1. Yang, L. F., et al. (2018). The Physical Systems Behind Optimization Algorithms. Advances in Neural Information Processing Systems (NeurIPS).

  2. Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR (translated as Soviet Math. Dokl.), vol. 269, pp. 543–547.