Any dynamical system that has a variational formulation admits an optimal controller within the framework of Hamilton-Jacobi-Bellman theory.

In particular, the Hamilton-Jacobi-Bellman equation applies to deep reinforcement learning systems, as stochastic gradient descent implicitly performs variational inference [3].

An important consequence of this theory is that the scalar-field property of the Bellman equation in Reinforcement Learning does not depend upon an anthropic principle.

Given a non-linear dynamical system:

\[\begin{equation} \dot{x} = f(x(t),u(t),t) \tag{1} \end{equation}\]

we may design a controller \(u(t)\) for \(x(t)\) that minimizes a cost \(J\):

\[\begin{equation} J(x(t),u(t),t_0,t_f)=\int_{t_0}^{t_f} \mathcal{L}(x(\tau),u(\tau)) d\tau \tag{2} \end{equation}\]

where the optimal controller \(\hat{u}(t)\) is defined relative to a value function:

\[\begin{equation} V(x(t_0),t_0,t_f) = \min_{u(t)} J(x(t),u(t),t_0,t_f) \tag{3} \end{equation}\]

that depends only upon the initial state \(x(t_0)\) and the time horizon \([t_0, t_f]\).
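The cost functional (2) can be evaluated numerically for any candidate controller. A minimal sketch, assuming the hypothetical scalar dynamics \(f(x,u) = -x + u\) and running cost \(\mathcal{L}(x,u) = x^2 + u^2\) (illustrative choices, not fixed by the text):

```python
# Hypothetical scalar system: f(x, u) = -x + u, running cost L = x^2 + u^2.
# These choices are illustrative; equation (2) accepts any f and L.

def cost(u, x0=1.0, t0=0.0, tf=5.0, dt=1e-3):
    """Forward-Euler integration of x' = f(x, u) while accumulating
    J = int_{t0}^{tf} L(x(tau), u(tau)) dtau by the rectangle rule."""
    x, t, J = x0, t0, 0.0
    for _ in range(int((tf - t0) / dt)):
        ut = u(t)
        J += (x**2 + ut**2) * dt     # running cost L(x, u)
        x += (-x + ut) * dt          # Euler step of x' = -x + u
        t += dt
    return J

# With u = 0 the state decays as x(t) = e^{-t}, so J -> x0^2 / 2 as tf grows.
J0 = cost(lambda t: 0.0)
```

Designing the optimal controller \(\hat{u}(t)\) then amounts to minimizing `cost` over a family of candidate controllers, which is what the value function in (3) encodes.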

From a dynamical systems perspective, we may derive the following evolution equation relative to the optimal controller \(\hat{u}(t)\):

\[\begin{equation} -\frac{\partial V}{\partial t} = \min_{u(t)} \big(\big(\frac{\partial V}{\partial x}\big)^T f(x(t),u(t)) + \mathcal{L}(x(t),u(t)) \big) \tag{4} \end{equation}\]

whose derivation rests upon Bellman's Principle of Optimality, the additivity of \(V\) along optimal trajectories:

\[\begin{equation} V(x(t_0),t_0,t_f) = V(x(t_0),t_0,t) + V(x(t),t,t_f) \tag{5} \end{equation}\]

Starting from the Bellman equation for Markov Decision Processes,

\[\begin{equation} V^{\pi}(s_0) = \mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty} \gamma^k r_k \lvert s = s_0 \big] = \mathbb{E}_{\pi}\big[\sum_{k=0}^{t-1} \gamma^k r_k + \sum_{k=t}^{\infty} \gamma^k r_k \lvert s=s_0\big] \tag{6} \end{equation}\]

we recover the Bellman Principle of Optimality:

\[\begin{equation} V^{\pi}(s_0) = V^{\pi}(s_0 \rightarrow s_t) + V^{\pi}(s_t \rightarrow s_{\infty}) \tag{7} \end{equation}\]
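The decomposition in (6)-(7) can be checked numerically on a small example. The sketch below uses a hypothetical deterministic four-state chain under a fixed policy (states, transitions, and rewards are my own illustrative choices): splitting the discounted return at step \(t\) recovers the value at the initial state exactly.

```python
gamma = 0.9
next_state = {0: 1, 1: 2, 2: 3, 3: 3}      # deterministic transitions under pi
reward = {0: 1.0, 1: 0.5, 2: 2.0, 3: 0.0}  # reward r_k collected in state s_k

def V(s, horizon=200):
    """Discounted return sum_k gamma^k r_k starting from state s."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        total += disc * reward[s]
        s = next_state[s]
        disc *= gamma
    return total

# Split the return at step t: the tail sum_{k>=t} gamma^k r_k in (6) equals
# gamma^t * V(s_t), so V(s_0) = prefix + gamma^t * V(s_t), matching (7).
t, s, prefix = 2, 0, 0.0
for k in range(t):
    prefix += gamma**k * reward[s]
    s = next_state[s]
assert abs(V(0) - (prefix + gamma**t * V(s))) < 1e-12
```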

Using the total derivative and setting \(t := t_0\), we have:

\[\begin{equation} \frac{d}{dt} V(x(t),t,t_f) = \frac{\partial V}{\partial t} + \big(\frac{\partial V}{\partial x}\big)^T\frac{dx}{dt} = \min_{u(t)} \frac{d}{dt}\big(\int_t^{t_f} \mathcal{L}(x(\tau),u(\tau))d\tau\big) = \min_{u(t)}\big(-\mathcal{L}(x(t),u(t))\big) \tag{8} \end{equation}\]

which yields the evolution equation:

\[\begin{equation} -\frac{\partial V}{\partial t} = \min_{u(t)} \big(\big(\frac{\partial V}{\partial x}\big)^T f(x(t),u(t)) + \mathcal{L}(x(t),u(t)) \big) \tag{9} \end{equation}\]
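Equation (9) can be solved in closed form for linear dynamics with quadratic cost. As a sketch, take the hypothetical infinite-horizon scalar LQR problem \(\dot{x} = ax + bu\), \(\mathcal{L} = qx^2 + ru^2\) (my example, not from the text): \(\partial V/\partial t = 0\), and the ansatz \(V(x) = px^2\) reduces the HJB equation to an algebraic Riccati equation.

```python
import math

a, b, q, r = -1.0, 1.0, 1.0, 1.0   # illustrative LQR coefficients

# Substituting V(x) = p x^2 into 0 = min_u [V'(x)(ax + bu) + qx^2 + ru^2]
# and minimizing over u gives u* = -(b p / r) x and the Riccati equation
# b^2 p^2 / r - 2 a p - q = 0, whose positive root is:
p = r * (a + math.sqrt(a**2 + b**2 * q / r)) / b**2

def hjb_residual(x):
    """Right-hand side of the stationary HJB equation at the minimizing u."""
    u = -(b * p / r) * x                      # optimal feedback law
    return 2 * p * x * (a * x + b * u) + q * x**2 + r * u**2

# The residual vanishes for every x, so V(x) = p x^2 solves equation (9).
```

The same substitution in higher dimensions yields the matrix Riccati equation familiar from optimal control.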

In principle, this theory may be generalised to stochastic dynamical systems. However, in the case of information-processing systems the appropriate theory is variational inference.

From the vantage point of Algorithmic Probability theory, variational inference is sound, as the Shannon Entropy of a random variable corresponds to the Minimum Description Length of its associated probability distribution.
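This correspondence is exact for dyadic distributions, where a prefix-free code assigning each outcome \(-\log_2 p_i\) bits exists. The distribution and code below are my own illustrative choices:

```python
import math

# Hypothetical dyadic distribution and a matching prefix-free code whose
# codeword lengths are exactly -log2(p_i) bits.
p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

entropy = -sum(pi * math.log2(pi) for pi in p.values())
mean_code_length = sum(p[s] * len(code[s]) for s in p)

# Both quantities equal 1.75 bits: the Shannon Entropy coincides with the
# expected description length of the optimal code.
```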

- Rocke, A. (2022, Oct. 6). Kepler Lounge: Kolmogorov's theory of Algorithmic Probability. Retrieved from keplerlounge.com
- Rocke, A. (2022, Nov. 4). Kepler Lounge: Money as a scalar field via Reinforcement Learning. Retrieved from keplerlounge.com
- Chaudhari, P. & Soatto, S. (2018). Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv.

For attribution, please cite this work as

Rocke (2022, Nov. 8). Kepler Lounge: Bellman's Principle of Optimality and the Hamilton-Jacobi-Bellman equation. Retrieved from keplerlounge.com

BibTeX citation

@misc{rocke2022bellman, author = {Rocke, Aidan}, title = {Kepler Lounge: Bellman's Principle of Optimality and the Hamilton-Jacobi-Bellman equation}, url = {keplerlounge.com}, year = {2022} }