# Bellman’s Principle of Optimality and the Hamilton-Jacobi-Bellman equation

Any dynamical system that has a variational formulation admits an optimal controller within the framework of Hamilton-Jacobi-Bellman theory.

Aidan Rocke https://github.com/AidanRocke
11-08-2022

## Motivation:

Any dynamical system that has a variational formulation admits an optimal controller within the framework of Hamilton-Jacobi-Bellman theory. In particular, the Hamilton-Jacobi-Bellman equation applies to deep reinforcement learning systems, since stochastic gradient descent implicitly performs variational inference [3].

An important consequence of this theory is that the scalar field property of the Bellman equation in Reinforcement Learning doesn’t depend upon an anthropic principle.

## Optimal Non-linear control:

Given a non-linear dynamical system:

$\begin{equation} \dot{x}(t) = f(x(t),u(t),t) \tag{1} \end{equation}$

we may design a controller $$u(t)$$ for $$x(t)$$ that minimizes a cost $$J$$:

$\begin{equation} J(x(t),u(t),t_0,t_f)=\int_{t_0}^{t_f} \mathcal{L}(x(\tau),u(\tau)) d\tau \tag{2} \end{equation}$

where the optimal controller $$\hat{u}(t)$$ is defined relative to a value function:

$\begin{equation} V(x(t_0),t_0,t_f) = \min_{u(t)} J(x(t),u(t),t_0,t_f) \tag{3} \end{equation}$

that depends only upon the initial state $$x(t_0)$$ and the time horizon $$[t_0, t_f]$$.
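As an illustration of the cost functional (2) and the value function (3), consider the hypothetical scalar system $$\dot{x} = u$$ with running cost $$\mathcal{L}(x,u) = x^2 + u^2$$ (not taken from the text above). The sketch below integrates the cost of linear feedback controllers $$u = -kx$$ and shows that the gain $$k = 1$$, the known optimum for this problem, attains the smallest cost:

```python
import numpy as np

def simulate_cost(k, x0=1.0, t_f=10.0, dt=1e-3):
    """Euler-integrate xdot = u under the feedback u = -k*x and
    accumulate the running cost L(x, u) = x^2 + u^2 as in eq. (2)."""
    x, J = x0, 0.0
    for _ in range(int(t_f / dt)):
        u = -k * x
        J += (x**2 + u**2) * dt   # Riemann sum for the integral cost
        x += u * dt               # Euler step of the dynamics
    return J

# For this problem the optimal gain is k = 1, giving V(x0) = x0^2 = 1;
# suboptimal gains incur a strictly larger cost J.
print(simulate_cost(1.0))   # ~1.0
print(simulate_cost(2.0))   # ~1.25
print(simulate_cost(0.5))   # ~1.25
```

Here the value function $$V(x_0) = \min_k J$$ is recovered by comparing candidate controllers, which is exactly the minimization in (3) restricted to a one-parameter family.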

## Hamilton-Jacobi-Bellman equation:

From a dynamical systems perspective, we may derive the following evolution equation relative to the optimal controller $$\hat{u}(t)$$:

$\begin{equation} -\frac{\partial V}{\partial t} = \min_{u(t)} \big(\big(\frac{\partial V}{\partial x}\big)^T f(x(t),u(t)) + \mathcal{L}(x(t),u(t)) \big) \tag{4} \end{equation}$

where the additive decomposition of $$V$$ is given by Bellman’s Principle of Optimality:

$\begin{equation} V(x(t_0),t_0,t_f) = V(x(t_0),t_0,t) + V(x(t),t,t_f) \tag{5} \end{equation}$

## Lemma: Bellman’s Principle of Optimality:

Starting from the Bellman equation for Markov Decision Processes,

$\begin{equation} V^{\pi}(s_0) = \mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty} \gamma^k r_k \lvert s = s_0 \big] = \mathbb{E}_{\pi}\big[\sum_{k=0}^{t-1} \gamma^k r_k + \sum_{k=t}^{\infty} \gamma^k r_k \lvert s=s_0\big] \tag{6} \end{equation}$

which yields the Bellman principle of optimality:

$\begin{equation} V^{\pi}(s_0) = V^{\pi}(s_0 \rightarrow s_t) + V^{\pi}(s_t \rightarrow s_{\infty}) \tag{7} \end{equation}$
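The decomposition in (7) can be checked numerically. The sketch below uses a hypothetical three-state deterministic MDP under a fixed policy (an assumption for illustration, not drawn from the text) and verifies that the discounted return from $$s_0$$ splits into the rewards collected before time $$t$$ plus the discounted tail from the state reached at time $$t$$:

```python
import numpy as np

# A tiny deterministic MDP under a fixed policy: state s -> (s+1) % 3,
# with per-state rewards r(s); gamma is the discount factor.
rewards = np.array([1.0, 0.0, 2.0])
gamma = 0.9

def value(s0, horizon=500):
    """V^pi(s0) = sum_k gamma^k r_k, truncated at a long horizon."""
    s, V = s0, 0.0
    for k in range(horizon):
        V += gamma**k * rewards[s]
        s = (s + 1) % 3
    return V

# Bellman's principle (eq. 7): the return splits at any time t into the
# discounted rewards before t plus gamma^t times the value of s_t.
t, s0 = 4, 0
head = sum(gamma**k * rewards[(s0 + k) % 3] for k in range(t))
tail = gamma**t * value((s0 + t) % 3)
print(value(s0), head + tail)   # the two numbers agree
```

The same identity with $$t = 1$$ is the familiar one-step Bellman equation $$V^{\pi}(s_0) = r(s_0) + \gamma V^{\pi}(s_1)$$.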

## Deriving the Hamilton-Jacobi-Bellman equation:

Using the total derivative and setting $$t:= t_0$$, we have:

$\begin{equation} \frac{d}{dt} V(x(t),t,t_f) = \frac{\partial V}{\partial t} + \big(\frac{\partial V}{\partial x}\big)^T\frac{dx}{dt} = \min_{u(t)} \frac{d}{dt}\big(\int_t^{t_f} \mathcal{L}(x(\tau),u(\tau))d\tau\big) = \min_{u(t)}\big(-\mathcal{L}(x(t),u(t))\big) \tag{8} \end{equation}$

which yields the evolution equation:

$\begin{equation} -\frac{\partial V}{\partial t} = \min_{u(t)} \big(\big(\frac{\partial V}{\partial x}\big)^T f(x(t),u(t)) + \mathcal{L}(x(t),u(t)) \big) \tag{9} \end{equation}$

## Discussion:

In principle, this theory may be generalised to stochastic dynamical systems. However, for information-processing systems the appropriate theory is variational inference.

From the vantage point of Algorithmic Probability theory, variational inference is sound as the Shannon Entropy of a random variable corresponds to the Minimum Description Length of its associated probability distribution.

## References:

1. Rocke (2022, Oct. 6). Kepler Lounge: Kolmogorov’s theory of Algorithmic Probability. Retrieved from keplerlounge.com
2. Rocke (2022, Nov. 4). Kepler Lounge: Money as a scalar field via Reinforcement Learning. Retrieved from keplerlounge.com
3. Chaudhari, P. & Soatto, S. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv. 2018.