Kepler Lounge: Money as a scalar field via Reinforcement Learning

Aidan Rocke

Introduction:

The fundamental objective of any intelligent system is to reliably and perceptibly control its environment. As we shall demonstrate, such a control problem may be solved by optimising an objective function (also known as a reward function) via the Bellman equation. Moreover, in order to generalize this to an arbitrarily large number of agents, it is necessary and sufficient to distribute this reward using the Bellman equation.

Thus, we show that money naturally emerges as a scalar field, serving as an instrument for large-scale collaboration within the setting of multi-agent reinforcement learning, which subsumes all problems in applied game theory.

Defining meta-policies:

Considering the case of multi-agent systems, we may define the meta-policy \(\pi\) over the state-space \(S\) and action space \(\mathcal{A}\):

\[\begin{equation} \pi: S \times \mathcal{A} \rightarrow \mathcal{A} \tag{1} \end{equation}\]

and we may define agents that are members of the collective by partitioning the state-action space \(S \times \mathcal{A}\) into subspaces \(H_i = S_i \times \mathcal{A}_i\) such that:

\[\begin{equation} \bigcup_{i=1}^N H_i = S \times \mathcal{A} \tag{2} \end{equation}\]

\[\begin{equation} H_i \cap H_j = \emptyset \quad \text{for all } i \neq j \tag{3} \end{equation}\]

\[\begin{equation} \pi_i: H_i \rightarrow \mathcal{A}_i \tag{4} \end{equation}\]

Thus, it’s fair to represent \(\pi\) as a sum over sub-policies:

\[\begin{equation} \pi = \sum_{i=1}^N \pi_i \tag{5} \end{equation}\]

where each sub-policy solves a distinct problem, or sub-game.
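One way to read the sum in (5) is as a piecewise definition: the meta-policy applies whichever sub-policy owns the block \(H_i\) containing the current state-action pair. The following is a minimal sketch of that reading, not the author's implementation. The toy spaces `S` and `A`, the two blocks in `H`, and the sub-policies `pi_1` and `pi_2` are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy spaces (assumptions, not taken from the article).
S = [0, 1, 2, 3]
A = ["left", "right"]

# Two blocks H_i = S_i x A_i that partition S x A, as in (2) and (3).
H = [
    set(product([0, 1], A)),
    set(product([2, 3], A)),
]

# Check the partition conditions explicitly.
assert set().union(*H) == set(product(S, A))  # (2): the blocks cover S x A
assert all(H[i].isdisjoint(H[j]) for i in range(len(H)) for j in range(i))  # (3)

# Hypothetical sub-policies pi_i : H_i -> A_i, as in (4).
def pi_1(s, a):
    return "right"  # agent 1 always moves right

def pi_2(s, a):
    return a  # agent 2 repeats the action it was given

sub_policies = [pi_1, pi_2]

def meta_policy(s, a):
    """Piecewise reading of (5): dispatch to the sub-policy whose block contains (s, a)."""
    for H_i, pi_i in zip(H, sub_policies):
        if (s, a) in H_i:
            return pi_i(s, a)
    raise ValueError("(s, a) is outside S x A")
```

Because the blocks are pairwise disjoint, exactly one sub-policy responds to any given pair, so the dispatch loop is unambiguous.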

In general, this phenomenon is emergent within the setting of Deep Reinforcement Learning, where a Pareto-optimal solution emerges from locally competitive networks [3]. In fact, by the Universal Approximation theorem for neural networks, such a function \(\pi\) may be approximated by a deep neural network with ReLU activation. If this network \(\pi_{\theta}\) has \(n\) hidden nodes, then each node is either active or inactive at a given input, so its domain splits into \(m \leq 2^n\) pairwise disjoint components, on each of which \(\pi_{\theta}\) is affine:

\[\begin{equation} \pi_{\theta} = \sum_{i=1}^m \pi_{\theta}^i \tag{6} \end{equation}\]
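The bound \(m \leq 2^n\) can be checked empirically by counting distinct activation patterns. The sketch below assumes a single hidden layer with randomly drawn weights and two-dimensional inputs, which is a simplification of the deep network discussed above. The function names and all parameter values are hypothetical.

```python
import random

def activation_pattern(x, weights, biases):
    """Return a tuple of booleans: which hidden ReLU units are active at input x."""
    return tuple(
        sum(w_k * x_k for w_k, x_k in zip(w, x)) + b > 0
        for w, b in zip(weights, biases)
    )

def count_patterns(n_hidden, n_inputs=2, n_samples=5000, seed=0):
    """Sample inputs and count the distinct activation patterns that are observed."""
    rng = random.Random(seed)
    # Random weights and biases for a hypothetical one-hidden-layer network.
    weights = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    biases = [rng.gauss(0, 1) for _ in range(n_hidden)]
    patterns = set()
    for _ in range(n_samples):
        x = [rng.uniform(-3, 3) for _ in range(n_inputs)]
        patterns.add(activation_pattern(x, weights, biases))
    return len(patterns)

n = 4
m = count_patterns(n)  # number of components found by sampling; never exceeds 2**n
```

Each observed pattern corresponds to one component of the domain on which the network is affine. Sampling can only undercount the components, so `m` is a lower estimate that always respects the upper bound.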

Now, what remains to be done is to explicitly define a reinforcement learning environment and its associated reward function. This may be accomplished using the formalism of Markov Decision Processes.

Markov Decision Processes:

An MDP is a stochastic process satisfying the Markov Property. While the probability distribution of future states is generally a function of all previous states:

\[\begin{equation} P(s_{t+1}=s, r_{t+1}=r \lvert \cdot) = P(s_{t+1}=s, r_{t+1}=r \lvert s_t, a_t, r_t, s_{t-1},a_{t-1},r_{t-1},...,s_0,a_0) \tag{7} \end{equation}\]

in a process satisfying the Markov Property it is only a function of the current state and action:

\[\begin{equation} P(s_{t+1}=s, r_{t+1}=r \lvert s_t,a_t) = P(s_{t+1}=s, r_{t+1}=r \lvert \cdot) \tag{8} \end{equation}\]

for all possible histories.

Using the MDP formalism, we may define a finite MDP as \(\mathcal{M}=(S, \mathcal{A},P,R,\gamma)\) where \(S=\{s_i\}_{i=1}^n\), \(\mathcal{A}=\{a_i\}_{i=1}^m\), and \(\gamma \in (0,1)\) denotes a discount factor.

Thus, we may define the state-transition probability distribution \(P\) as:

\[\begin{equation} P_a(s',s) = P(s_{t+1}=s' \lvert s_t = s,a_t = a) \tag{9} \end{equation}\]

Likewise, we may define the reward function \(R\) as:

\[\begin{equation} R_a(s',s) = \sum_{r} r \cdot P(r_{t+1} = r \lvert s_t = s,a_t = a, s_{t+1}=s') \tag{10} \end{equation}\]

so for a given state-action pair the expected reward is:

\[\begin{equation} R(s,a) = \sum_{s' \in S} P_a(s',s) \cdot R_a(s',s) \tag{11} \end{equation}\]
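As a concrete illustration of (9), (10) and (11), the following sketch stores a small finite MDP as nested lists and computes the expected reward of a state-action pair. The two-state, two-action MDP and all of its numbers are hypothetical. Note that the tables are indexed as `P[a][s][s_next]`, which is an assumption of this sketch rather than the argument order used in the equations.

```python
# Hypothetical two-state, two-action MDP (all numbers are illustrative assumptions).
# P[a][s][s_next] plays the role of P_a(s', s) in (9).
P = {
    "stay": [[0.9, 0.1], [0.2, 0.8]],
    "move": [[0.3, 0.7], [0.6, 0.4]],
}

# R[a][s][s_next] plays the role of R_a(s', s) in (10).
R = {
    "stay": [[0.0, 1.0], [0.0, 2.0]],
    "move": [[-1.0, 1.0], [0.0, 3.0]],
}

# Every row of P must be a probability distribution over next states.
assert all(abs(sum(row) - 1.0) < 1e-12 for table in P.values() for row in table)

def expected_reward(s, a):
    """Equation (11): average the transition reward over the next state."""
    return sum(p * r for p, r in zip(P[a][s], R[a][s]))
```

For example, `expected_reward(0, "stay")` weights the reward 0.0 by 0.9 and the reward 1.0 by 0.1.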

Value functions:

A value function \(V^{\pi}\) represents the expected future reward for a current state, given the policy \(\pi\):

\[\begin{equation} V^{\pi}(s) = \mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \lvert s_t = s\big] \tag{12} \end{equation}\]

Likewise, we may define the action-value function that describes the expected reward for taking action \(a\) in state \(s\) and following policy \(\pi\):

\[\begin{equation} Q^{\pi}(s,a) = \mathbb{E}_{\pi} \big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \lvert s_t = s,a_t = a\big] \tag{13} \end{equation}\]

which allows effective policy evaluation.

The connection between (12) and (13) is given by:

\[\begin{equation} V^{\pi}(s) = \sum_{a} \pi(s,a) \cdot Q^{\pi}(s,a) \tag{14} \end{equation}\]
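Equation (14) is simply an average of action values weighted by the policy. A minimal sketch, assuming a stochastic policy stored as `pi[s][a]` and action values stored as `Q[s][a]` (both tables are hypothetical):

```python
# Hypothetical policy probabilities pi[s][a] and action values Q[s][a].
pi = [
    {"stay": 0.5, "move": 0.5},
    {"stay": 0.1, "move": 0.9},
]
Q = [
    {"stay": 1.0, "move": 3.0},
    {"stay": 2.0, "move": 4.0},
]

def state_value(s):
    """Equation (14): V(s) is the policy-weighted average of Q(s, a)."""
    return sum(pi[s][a] * Q[s][a] for a in pi[s])
```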

Moreover, using the Markov Property we may derive the Bellman equation:

\[\begin{equation} V^{\pi}(s) = \mathbb{E}_{\pi}\big[r_t + \gamma \cdot V^{\pi}(s_{t+1}) \lvert s_t = s \big] \tag{15} \end{equation}\]

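which, for a finite MDP, is a linear system in \(V^{\pi}\) and may be solved directly, by fixed-point iteration, or by backwards induction when the horizon is finite.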
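To make (15) concrete, the sketch below evaluates a fixed policy by repeatedly applying the right-hand side of the Bellman equation until the values stop changing. This is iterative policy evaluation, used here as one possible solution method. The MDP, the uniform policy, and the discount factor are hypothetical, and the reward is attached to the transition \((s, a, s')\), which is a convention assumed by this sketch.

```python
GAMMA = 0.9  # hypothetical discount factor in (0, 1)
STATES = [0, 1]
ACTIONS = ["stay", "move"]

# Hypothetical transition probabilities P[a][s][s_next] and rewards R[a][s][s_next].
P = {
    "stay": [[0.9, 0.1], [0.2, 0.8]],
    "move": [[0.3, 0.7], [0.6, 0.4]],
}
R = {
    "stay": [[0.0, 1.0], [0.0, 2.0]],
    "move": [[-1.0, 1.0], [0.0, 3.0]],
}

# Hypothetical policy: choose each action with equal probability in every state.
pi = [{"stay": 0.5, "move": 0.5} for _ in STATES]

def bellman_backup(V):
    """Apply the right-hand side of (15) to a candidate value function V."""
    return [
        sum(
            pi[s][a]
            * sum(P[a][s][s2] * (R[a][s][s2] + GAMMA * V[s2]) for s2 in STATES)
            for a in ACTIONS
        )
        for s in STATES
    ]

def evaluate_policy(tol=1e-10, max_iters=10_000):
    """Iterate the backup until successive value functions differ by less than tol."""
    V = [0.0 for _ in STATES]
    for _ in range(max_iters):
        V_new = bellman_backup(V)
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V

V = evaluate_policy()
```

Since the discount factor is below one, the backup is a contraction, so the iteration converges to the unique solution of (15) from any starting point.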

Discussion:

The Bellman equation is the discrete-time counterpart of the Hamilton-Jacobi-Bellman equation, which gives necessary and sufficient conditions for optimality of control with respect to an objective function. Thus, we may effectively distribute value to the sub-policies in (5), and a currency naturally emerges as an instrument for large-scale collaboration.

Finally, since the value function in the Bellman equation (15) assigns a single real number to every state, it is a scalar field over \(S\). The currency inherits this property, which provides a theoretically sound justification for treating money as a scalar field.

References:

[1] Sutton, R. S.; Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
[2] Bellman, R.; Dreyfus, S. (1959). “An Application of Dynamic Programming to the Determination of Optimal Satellite Trajectories”. J. Br. Interplanet. Soc. 17: 78–83.
[3] Srivastava, R. K.; Masci, J.; Gomez, F.; Schmidhuber, J. (2015). “Understanding Locally Competitive Networks”. arXiv.
