A few weeks ago I spent some time reflecting on McGeer's passive dynamic walkers, which walk smoothly down an inclined plane without any digital computation, and wondered whether a reinforcement learning algorithm could discover similar gaits for flat terrain [1]. From a reinforcement learning perspective, Policy Gradients are perhaps the simplest approach that can address bipedal dynamics, and these methods have demonstrated great effectiveness on continuous control tasks [2]. For this reason, I decided to start investigating this problem with Vanilla Policy Gradients.

A secondary motivation came from The Policy of Truth, an article in which Ben Recht, a professor of optimisation and machine learning at Berkeley, presented a harsh critique of Policy Gradient methods and of stochastic policies in particular. While Policy Gradient methods definitely have important limitations, I demonstrate that the points raised by Ben Recht are non-issues. On the other hand, it must be noted that bipedal walking can't be reduced to reinforcement learning. If anything, considering that McGeer's cleverly-designed walkers don't do any computation, the embodiment of the bipedal walker must be taken seriously.

In order to get started on a simple problem1, I decided to apply Policy Gradients to unicycles, which, like passive dynamic walkers, are dynamically similar to inverted pendulums.

The Policy Gradients formalism:

Trajectories, Policies and Rewards:

A trajectory (aka rollout) is a sequence of states \(s_t\) and actions \(u_t\) generated by a dynamical system:

\begin{equation} \tau_t = (u_0,…,u_t,s_0,…,s_t) \end{equation}

and a policy \(\pi_{\vartheta} (\cdot \lvert s)\) is a conditional distribution over an agent's actions \(u_t \in A\) given states \(s_t \in S\). This conditional distribution is typically parametrised by a function approximator (e.g. a neural network) with parameters \(\vartheta\) and serves as a stochastic policy from which we can sample actions: \(u_t \sim \pi_{\vartheta} (\cdot \lvert s_{t})\).

Now, the objective of Policy Gradients is to find a policy \(\pi_{\vartheta}\) that maximises the total reward after \(H\) time steps:

\begin{equation} R(\tau) = \sum_{t=0}^H R(s_t,u_t) \end{equation}
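As a concrete illustration of this definition, the total reward of a rollout is just a sum over its state-action pairs. The reward function and the short rollout below are made up for the example:

```python
import numpy as np

# Illustrative sketch of R(tau) = sum_t R(s_t, u_t) for a rollout;
# the quadratic penalty and the rollout values are hypothetical.
def total_reward(states, actions, R):
    return sum(R(s, u) for s, u in zip(states, actions))

R = lambda s, u: -(s**2 + 0.1 * u**2)      # a hypothetical quadratic penalty
states = np.array([0.0, 0.1, 0.05])        # s_0, s_1, s_2
actions = np.array([0.2, -0.1, 0.0])       # u_0, u_1, u_2
print(total_reward(states, actions, R))    # approximately -0.0175
```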

Derivation of Policy Gradients:

Given the reward function \((2)\), our objective function is:

\begin{equation} \begin{split} U(\vartheta) & = \mathbb{E}\big[\sum_{t=0}^H R(s_t,u_t);\pi_{\vartheta}\big] \\
& = \sum_{\tau} P(\tau;\vartheta) R(\tau) \end{split} \end{equation}

where \(P(\tau;\vartheta)\) is the probability distribution over trajectories induced by \(\pi_{\vartheta}\):

\begin{equation} P(\tau;\vartheta) = \prod_{t=0}^H P(s_{t+1} \lvert s_t,u_t) \pi_{\vartheta}(u_t \lvert s_t) \end{equation}

Having defined \((3)\), the Policy Gradients update may be derived as follows:

\begin{equation} \begin{split} \nabla_{\vartheta} U(\vartheta) & = \nabla_{\vartheta} \sum_{\tau} P(\tau;\vartheta) R(\tau) \\
& = \sum_{\tau} \nabla_{\vartheta} P(\tau;\vartheta) R(\tau) \\
& = \sum_{\tau} P(\tau;\vartheta) \frac{\nabla_{\vartheta} P(\tau;\vartheta)}{P(\tau;\vartheta)}R(\tau) \\
& = \sum_{\tau} P(\tau;\vartheta) \nabla_{\vartheta} \ln P(\tau;\vartheta) R(\tau) \end{split} \end{equation}

A reasonable Monte Carlo approximation to \((5)\) would be:

\begin{equation} \nabla_{\vartheta} U(\vartheta) \approx \widehat{g} = \frac{1}{N}\sum_{i=1}^N \nabla_{\vartheta} \ln P(\tau^i;\vartheta) R(\tau^i) \end{equation}

where each \(\tau^i\) denotes a distinct trajectory generated by running a simulator with policy \(\pi_{\vartheta}\). Crucially, this estimator doesn't require a model of the environment, since the transition probabilities drop out of the gradient:

\begin{equation} \begin{split} \nabla_{\vartheta} \ln P(\tau^i;\vartheta) & = \nabla_{\vartheta} \ln \big[\prod_{t=0}^H P(s_{t+1}^i \lvert s_t^i,u_t^i) \pi_{\vartheta}(u_t^i \lvert s_t^i)\big] \\
& = \nabla_{\vartheta} \big[\sum_{t=0}^H \ln P(s_{t+1}^i \lvert s_t^i,u_t^i) + \sum_{t=0}^H \ln \pi_{\vartheta}(u_t^i \lvert s_t^i)\big] \\
& = \nabla_{\vartheta} \sum_{t=0}^H \ln \pi_{\vartheta}(u_t^i \lvert s_t^i) \end{split} \end{equation}

What is Policy Gradients doing?

  1. Increasing the probability of paths with positive reward.
  2. Decreasing the probability of paths with negative reward.

But, what if rewards are strictly non-negative?

Reducing the variance of \(\widehat{g}\) with baselines:

In environments where there are no negative rewards, we may extract more signal from our data by using baselines. For concreteness, we may use a constant baseline:

\begin{equation} \begin{split} \nabla_{\vartheta} U(\vartheta) \approx \widehat{g} & = \frac{1}{N}\sum_{i=1}^N \nabla_{\vartheta} \ln P(\tau^i;\vartheta) (R(\tau^i)-b) \\
b & = \frac{1}{N} \sum_{i=1}^N R(\tau^i) \end{split} \end{equation}

or, even better, we may use a state-dependent baseline \(b(s_t)\) which estimates the expected return from the current state by minimising \(\lVert b(s_t)-R_t \rVert^2\) over all trajectories during training. In this case \(b\) is a value estimator, more often than not parametrised by a neural network, that allows the Policy Gradients model to do temporal difference learning and therefore exploit the sequential structure of the decision problem. Moreover, using the advantage estimate \(\widehat{A_t} = R_t-b(s_t)\) rather than the raw return reduces the variance of the gradient estimate: it extracts more signal from the observations by telling the model how much better the current action is than what is normally done in state \(s_t\).

Furthermore, it's useful to note that subtracting a constant baseline doesn't introduce bias into the expectation:

\begin{equation} \begin{split} \mathbb{E}[\nabla_{\vartheta} \ln P(\tau;\vartheta) b] & = \int P(\tau;\vartheta) \nabla_{\vartheta} \ln P(\tau;\vartheta) b \, d\tau \\
& = \int \nabla_{\vartheta} P(\tau;\vartheta) b \, d\tau \\
& = b \nabla_{\vartheta} \int P(\tau;\vartheta) \, d\tau \\
& = b \nabla_{\vartheta} 1 = 0 \end{split} \end{equation}
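A quick numerical sketch of these two facts (toy setup, not the author's code): on a one-step Gaussian-policy problem with rewards shifted upwards, subtracting the batch-mean baseline leaves the gradient estimate essentially unchanged while cutting its variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy one-step problem: u ~ N(theta, sigma^2), rewards shifted upwards so
# they are (almost always) positive; all values here are illustrative.
theta, sigma, N = 0.0, 0.5, 5_000
us = theta + sigma * rng.standard_normal(N)
returns = 20.0 - (us - 2.0)**2            # shifted rewards
scores = (us - theta) / sigma**2          # grad_theta ln pi(u)

b = returns.mean()                        # constant baseline
g_plain = np.mean(scores * returns)
g_base = np.mean(scores * (returns - b))

# Both estimates target the same gradient (here 4), but the baselined
# estimator has much lower per-sample variance.
var_plain = np.var(scores * returns)
var_base = np.var(scores * (returns - b))
print(g_plain, g_base, var_plain > var_base)
```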

Vanilla Policy Gradients:

By putting the above ideas together, we end up with a variant of the Vanilla Policy Gradients algorithm:

  1. Initialise the policy parameter \(\vartheta\) and baseline \(b\)
  2. For \(\text{iter}=1,..,\text{maxiter}\) do:
    a. Collect a set of trajectories \(\{\tau^i\}_{i=1}^N\) by executing the current policy \(\pi_{\vartheta}\)
    b. At each time step in the trajectory \(\tau^i\) compute \(R_t = \sum_{t'=t}^{T-1} \gamma^{t'-t}r_{t'}\) and \(\widehat{A_t} = R_t-b(s_t)\).
    c. Re-fit the baseline, by minimising \(\lVert b(s_t)-R_t \rVert^2\) summed over all trajectories and time steps.
    d. Update the policy \(\pi_{\vartheta}\) using a policy gradient estimate \(\widehat{g}\) which is a sum of terms \(\nabla_{\vartheta} \ln \pi_{\vartheta}(u_t \lvert s_t) \widehat{A_t}\)
  3. end for
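The loop above can be sketched end-to-end on a one-step toy problem in plain NumPy (illustrative hyperparameters, Gaussian policy with fixed variance, constant baseline; the actual unicycle controller is the TensorFlow model described later):

```python
import numpy as np

rng = np.random.default_rng(2)

# One-step toy version of the algorithm above: policy u ~ N(theta, sigma^2),
# reward R(u) = -(u - 2)^2. All hyperparameters are illustrative.
theta, sigma, lr, N = 0.0, 0.5, 0.05, 200
for _ in range(100):
    us = theta + sigma * rng.standard_normal(N)       # (a) collect rollouts
    returns = -(us - 2.0)**2                          # (b) compute returns
    adv = returns - returns.mean()                    # (b)+(c) baseline -> advantages
    g_hat = np.mean((us - theta) / sigma**2 * adv)    # (d) policy gradient estimate
    theta += lr * g_hat                               # gradient ascent step
print(theta)   # converges towards the optimal action u* = 2
```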

This is the algorithm I used to train the unicycle controller. But, before describing my implementation I’d like to address Ben Recht’s claim that continuous control researchers have no good reason to use stochastic policies.

Why do we use stochastic policies?:

In The Policy of Truth, Ben Recht is rather dismissive of Policy Gradient methods and argues that stochastic policies are a modelling choice that is never better than deterministic policies and optimal control methods in general. In particular, Ben argues that if the correct policy is deterministic then the probabilistic model class must satisfy several show-stopping constraints; it must be:

  1. rich enough to approximate delta functions
  2. easy to search by gradient methods
  3. easy to sample from

Why then do reinforcement learning researchers use Policy Gradient methods? There are several good reasons:

  1. In high-dimensional state-spaces (e.g. images), where we might use convolutional neural networks to learn a low-dimensional state representation \(\widehat{s_t} \approx s_t\), state-aliasing is practically inevitable, so it makes sense to use stochastic policies to reflect the model's uncertainty.
  2. Sampling from a stochastic policy is a convenient method of exploring the state-space.
  3. Stochastic policies effectively smooth out rough/discontinuous reward landscapes and allow the agent to obtain reward signals it couldn’t obtain otherwise.
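The third point above is easy to demonstrate numerically (toy example, not from the original experiment): a step reward has zero gradient almost everywhere, yet its expectation under a Gaussian policy is a smooth function of the policy mean:

```python
import numpy as np

rng = np.random.default_rng(3)

# A step reward R(u) = 1 if u > 0 else 0 gives no gradient signal to a
# deterministic policy, but under a stochastic policy u ~ N(mu, 1) the
# expected reward is the smooth function Phi(mu) (the Gaussian CDF),
# which gradient methods can climb. Setup is illustrative.
def expected_reward(mu, n=100_000):
    us = mu + rng.standard_normal(n)
    return np.mean(us > 0.0)

vals = [expected_reward(mu) for mu in (-1.0, 0.0, 1.0)]
print(vals)   # close to [0.159, 0.5, 0.841]
```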

Moreover, Ben’s three points are actually non-issues:

  1. Neural networks, by universal approximation results such as Cybenko's theorem, may be used to parametrise a very rich class of probability distributions. Empirically, they are also easy to search by gradient methods within the space of expressible functions. This is why Variational Inference has found many practical applications and why Edward has gained a lot of traction among statistical machine learning researchers and engineers.
  2. Policies parametrised by neural networks are easy to sample from. In fact, inference with neural networks is computationally cheap.
  3. A corollary of my first point is that we may easily approximate delta functions. If your policy is a conditional Gaussian, which is what I used for controlling the unicycle, then your model class definitely contains good approximations of delta functions.

To be honest, I am surprised that Ben Recht didn't bring up real issues such as curriculum learning2 and transfer learning, where a lot of active research is currently ongoing [5]. Moreover, it's important to note that Policy Gradient methods were developed for problems where optimal control theory, Ben Recht's proposed alternative, doesn't work at all.

Controlling a unicycle with Policy Gradients:

Modelling assumptions:

As shown in [4], if we suppose that our unicyclist is riding on a level surface without turning, motion in the wheel plane may be modelled by a planar inverted pendulum of length \(l\) with a horizontally moving support. After cancelling out the mass term, an analysis of the force diagrams gives the following Newtonian equations of motion:

\begin{equation} \begin{split} l \ddot{\theta} & = g \sin(\theta)-\ddot{z}\cos(\theta) \\
\ddot{z} & = gh(\theta,\dot{\theta}) \end{split} \end{equation}

where \(h\) may be identified with a unicycle controller \(\pi_{\vartheta}\) and represents the rider's acceleration of the wheel in reaction to the instantaneous angle and angular velocity. By defining \(\alpha = \frac{g}{l}\) we have the simpler equation:

\begin{equation} \ddot{\theta} = \alpha(\sin(\theta)-h\cos(\theta)) \end{equation}

Now, assuming that the rider starts approximately upright (i.e. \(\theta \approx 0\)), we may linearise \((11)\) to obtain:

\begin{equation} \ddot{\theta} = \alpha(\theta-h) \end{equation}

and there exist relatively simple stabilising solutions of this linearised equation, such as:

\begin{equation} h(\theta) = a\theta, a > 1 \end{equation}

but a more sophisticated unicyclist would anticipate the future consequences of the rate \(\dot{\theta}\) as well as react to the angle of fall \(\theta\). This way, he or she may be able to overcome the effects of a finite reaction time. For this reason, I defined the state of the unicycle system to be:

\begin{equation} s_t = \begin{bmatrix} \theta_t \\ \dot{\theta_t} \end{bmatrix} \end{equation}
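Before moving on, a quick numerical check (with illustrative parameters, not the author's simulator) that the hand-designed linear feedback \(h = a\theta\), \(a > 1\) keeps the linearised system from falling over: the closed-loop dynamics reduce to \(\ddot{\theta} = \alpha(1-a)\theta\), which oscillates about the upright position with bounded amplitude:

```python
# Simulate theta'' = alpha*(1 - a)*theta with a velocity-Verlet update;
# alpha, a, dt and the initial angle are illustrative choices.
alpha, a, dt = 9.81, 2.0, 0.001
theta, d_theta = 0.1, 0.0
max_abs = 0.0
for _ in range(20_000):                        # 20 seconds of simulated time
    dd_theta = alpha * (1.0 - a) * theta
    theta += d_theta * dt + 0.5 * dd_theta * dt**2
    dd_theta_next = alpha * (1.0 - a) * theta
    d_theta += 0.5 * (dd_theta + dd_theta_next) * dt
    max_abs = max(max_abs, abs(theta))
print(max_abs)   # stays bounded near the initial amplitude of 0.1
```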

The simulator:

In order to generate rollouts with a policy \(\pi_{\vartheta}\) we must choose a method for numerically integrating \((11)\), and after some reflection I opted for Velocity Verlet due to its simplicity and numerical stability properties. Assuming that we have evaluated \(\ddot{\theta}\) and fixed a reasonable value for \(\Delta t\), we have:

\begin{equation} \begin{split} \theta_{i+1} & = \theta_i + \dot{\theta_i}\Delta t + \frac{1}{2}\ddot{\theta_i} \Delta t^2 \\
\dot{\theta_{i+1}} & = \dot{\theta_i} + \frac{1}{2}(\ddot{\theta_i} + \ddot{\theta_{i+1}}) \Delta t \end{split} \end{equation}

Doing this for both \(\theta\) and \(z\) in TensorFlow, I defined the following Velocity Verlet function:

def velocity_verlet(self):
    """Update the unicycle system using Velocity Verlet integration."""
    ## update rules for theta:
    dd_theta = self.alpha*(tf.sin(self.theta)-self.model.action*tf.cos(self.theta))
    theta = self.theta + self.d_theta*self.time + 0.5*self.dd_theta*tf.square(self.time)
    d_theta = self.d_theta + 0.5*(dd_theta+self.dd_theta)*self.time
    ## update rules for z:
    dd_z = self.g*self.model.action
    z = self.z + self.d_z*self.time + 0.5*self.dd_z*tf.square(self.time)
    d_z = self.d_z + 0.5*(dd_z+self.dd_z)*self.time
    ## group the state updates into a single op; the original list of assign
    ## ops was garbled, so it's reconstructed from the variables updated above:
    step =,
                    self.dd_theta.assign(dd_theta), self.z.assign(z),
                    self.d_z.assign(d_z), self.dd_z.assign(dd_z))
    return step

Now, in order to encourage the policy network to discover controllers in the linear domain of the unicycle system, i.e. \((12)\), I initialised the variables at the beginning of each rollout in the following manner:

def restart(self):
    """A simple method for restarting the system where the angle is taken to be a slight
    deviation from the ideal value of theta = 0.0 and the system has an important
    initial horizontal acceleration."""
    ## theta_0 ~ N(0, 0.1); the remaining assignments in this op (including the
    ## initial horizontal acceleration mentioned above) were garbled in the
    ## original snippet and are omitted here:
    step =, 1), stddev=0.1)))
    return step

The key part of this snippet is that \(\theta_0\) is drawn from a Gaussian with mean \(0\) and standard deviation \(0.1\).

The Policy Gradients model:

For the unicycle controller (i.e. the policy \(\pi_{\vartheta}\)) I defined a conditional Gaussian as follows:

def two_layer_net(self, X, w_h, w_h2, w_o, bias_1, bias_2):
    """A generic method for creating two-layer networks
    input: weights
    output: neural network"""
    h = tf.nn.elu(tf.add(tf.matmul(X, w_h), bias_1))
    h2 = tf.nn.elu(tf.add(tf.matmul(h, w_h2), bias_2))
    return tf.matmul(h2, w_o)

def controller(self):
    """The policy gradient model is a neural network that
    parametrises a conditional Gaussian.
    input: state (i.e. the angle and angular velocity)
    output: action to be taken, i.e. the appropriate horizontal acceleration"""
    with tf.variable_scope("policy_net"):
        W_h = self.init_weights([2,100],"W_h")
        W_h2 = self.init_weights([100,50],"W_h2")
        W_o = self.init_weights([50,10],"W_o")
        # define bias terms:
        bias_1 = self.init_weights([100],"bias_1")
        bias_2 = self.init_weights([50],"bias_2")
        eta_net = self.two_layer_net(self.pv, W_h, W_h2, W_o, bias_1, bias_2)
        W_mu = self.init_weights([10,1],"W_mu")
        W_sigma = self.init_weights([10,1],"W_sigma") = tf.multiply(tf.nn.tanh(tf.matmul(eta_net,W_mu)),self.action_bound)
        self.log_sigma = tf.multiply(tf.nn.tanh(tf.matmul(eta_net,W_sigma)),self.variance_bound)
    return, self.log_sigma

and sampling from this Gaussian is as simple as doing:

def sample_action(self):
    """Samples an action from the stochastic controller which happens
    to be a conditional Gaussian."""
    dist = tf.contrib.distributions.Normal(, tf.exp(self.log_sigma))
    return dist.sample()

Likewise, you may look at my Policy Gradients class to see how I calculated the baseline.

Defining rewards:

Defining the reward might be the trickiest part of this experiment: it's not obvious how the unicycle controller should be rewarded, and from a dynamical systems perspective a reward doesn't really make sense; either the unicycle has a stable controller or it doesn't. But, in order to adhere to the Policy Gradients formalism, I opted for the instantaneous height of the unicycle as the reward. This is as simple as:

\begin{equation} \text{height} = l\cos(\theta) \end{equation}

and as a result the REINFORCE loss (i.e. no baseline subtracted) is simply:

def reinforce_loss(self):
    """The REINFORCE loss without subtracting a baseline."""
    dist = tf.contrib.distributions.Normal(, tf.exp(self.log_sigma))
    return dist.log_prob( * self.height


After defining reasonable hyperparameters (\(\Delta t = 0.01\), rollouts per batch = 10, horizon = 30, total epochs = 100, …) I ran a few tests in the following notebook and found that the learned controller was very good at bringing the unicycle to maximum height but terrible at keeping it there. In a word, it was very unstable.

In fact, if we evaluate the model \(\pi_{\vartheta}\) with \(\dot{\theta}=0\), we find that the unicycle exhibits highly nonlinear behaviour in the neighbourhood of \(\theta = -\frac{\pi}{2}\).

So the force is negative for angles greater than \(-\frac{\pi}{2}\) and positive for angles less than \(-\frac{\pi}{2}\). Given that this is many standard deviations away from \(\theta = 0\), this is most likely due to my choice of tanh activation for the final layer of the policy network.

Furthermore, if we analyse the learned controller behaviour as a function of the full state(i.e. \(\theta\) and \(\dot{\theta}\)) we observe the following:

The combination of the short-sighted policy and the tanh nonlinearity makes me wonder whether there are small tweaks to my TensorFlow functions which may lead to much better results.


While I think the results are interesting and may be improved from a technical perspective, I don't think any particular reward function makes sense for learning re-usable locomotion behaviours. Ideally, the agent would be able to propose and learn curricula of locomotion behaviours in an unsupervised manner in order to learn models of its affordances and its intrinsic locomotory options. One promising approach was articulated by Sébastien Forestier, Yoan Mollard and Pierre-Yves Oudeyer in [6], but I'll have to think about how Intrinsically Motivated Goal Exploration can be assimilated within the Policy Gradients framework.


  1. Passive Dynamic Walking. T. McGeer. 1990.
  2. Emergence of Locomotion Behaviours in Rich Environments. Nicolas Heess et al. 2017.
  3. Policy Gradient Methods for Reinforcement Learning with Function Approximation. Richard S. Sutton, David McAllester, Satinder Singh & Yishay Mansour. 1999.
  4. Unicycles and Bifurcations. R. C. Johnson. 2002.
  5. Modular Multitask Reinforcement Learning with Policy Sketches. Jacob Andreas, Dan Klein, and Sergey Levine. 2017.
  6. Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning. Sébastien Forestier, Yoan Mollard, and Pierre-Yves Oudeyer. 2017.


  1. In my experience it always helps to work on tractable and conceptually interesting variants of complex problems (e.g. bipedal locomotion), as these simpler problems often provide deep insights into the complex problem of interest.

  2. I’d be interested to see how optimal control theory may be used for curriculum learning.