The idea

Problems that exhibit high variation in their initial state produce high-variance policy-gradient estimates and are hard to solve via direct policy or value-function optimization. This paper provides a novel algorithm that partitions the initial state space into “slices” and optimizes an ensemble of policies, one per slice, which are then unified into a single central policy.

The method

Initial states are sampled and then clustered using k-means into contexts \(\omega_i\), each associated with a local policy \(\pi_i(s,a) = \pi((\omega_i,s),a)\). A central policy is defined as the mixture \(\pi_c(s,a)=\sum_{\omega\in\Omega} p(\omega \mid s)\, \pi_{\omega}(s,a)\). Each local policy should stay as close to the central policy as possible, which is encouraged by maximizing \(\eta(\pi_i) - \alpha\, \mathbb{E}[D_{KL}(\pi_i \,\|\, \pi_c)]\) for every \(i\). They also want to keep the divergence between the local policies themselves small on the states each of them visits, so that the ensemble can later be collapsed into a single policy.
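Putting the two pressures together, the overall objective should look roughly like this (a reconstruction from the definitions above, so the exact weighting may differ from the paper's):

\[
\max_{\pi_1,\dots,\pi_k}\; \sum_{i} \eta(\pi_i) \;-\; \alpha \sum_{i \neq j} \mathbb{E}_{s \sim \rho_{\pi_i}}\!\left[ D_{KL}\!\left(\pi_i(\cdot \mid s) \,\|\, \pi_j(\cdot \mid s)\right) \right]
\]

where \(\rho_{\pi_i}\) is the state distribution induced by \(\pi_i\) when started from context \(\omega_i\).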

To update the policies, the authors use the following loss, derived from TRPO:
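As a sketch, a TRPO-style surrogate with the KL penalty above bolted on would look like the following, where \(\pi_i'\) is the previous iterate and \(A^{\pi_i'}\) its advantage function (the paper's exact form may differ):

\[
L(\pi_i) = \mathbb{E}_{s,a \sim \pi_i'}\!\left[ \frac{\pi_i(a \mid s)}{\pi_i'(a \mid s)}\, A^{\pi_i'}(s,a) \right] \;-\; \alpha\, \mathbb{E}_{s \sim \rho_{\pi_i'}}\!\left[ D_{KL}\!\left(\pi_i(\cdot \mid s) \,\|\, \pi_c(\cdot \mid s)\right) \right],
\]

maximized subject to the usual TRPO trust-region constraint on how far \(\pi_i\) may move from \(\pi_i'\).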

Then, the central policy is updated (the authors call this the “distillation step”):
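Since the central policy is meant to imitate the ensemble, the distillation step is presumably a supervised fit of \(\pi_c\) to the local policies on the states they visit, something like:

\[
\pi_c \leftarrow \arg\min_{\pi} \sum_i \mathbb{E}_{s \sim \rho_{\pi_i}}\!\left[ D_{KL}\!\left(\pi_i(\cdot \mid s) \,\|\, \pi(\cdot \mid s)\right) \right].
\]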

Finally, here is the full algorithm:
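In code form, the loop might look like the minimal, self-contained toy sketch below. Plain policy gradient stands in for the TRPO step, a one-step task with a random initial state stands in for the real environments, and all names are illustrative rather than the authors':

```python
# Toy sketch of the DnC loop (not the authors' code): linear-softmax local
# policies trained with a REINFORCE-style update plus a KL penalty toward
# the central policy, followed by a supervised distillation fit.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N_ACTIONS, N_CONTEXTS, STATE_DIM = 3, 2, 1

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def act_probs(theta, s):
    return softmax(s @ theta)  # theta: (STATE_DIM, N_ACTIONS)

def reward(s, a):
    # Toy task: the best action depends on the sign of the initial state,
    # mimicking "high initial state variation".
    return 1.0 if (a == 0) == (s[0] > 0) else 0.0

# Cluster sampled initial states into contexts (the "slices").
init_states = rng.normal(size=(500, STATE_DIM))
km = KMeans(n_clusters=N_CONTEXTS, n_init=10).fit(init_states)

thetas = [np.zeros((STATE_DIM, N_ACTIONS)) for _ in range(N_CONTEXTS)]
theta_c = np.zeros((STATE_DIM, N_ACTIONS))
alpha, lr = 0.1, 0.5

for _ in range(200):
    batches = [[] for _ in range(N_CONTEXTS)]
    for _ in range(64):
        s = rng.normal(size=STATE_DIM)
        i = int(km.predict(s[None])[0])       # which context/slice
        p = act_probs(thetas[i], s)
        a = rng.choice(N_ACTIONS, p=p)
        batches[i].append((s, a, reward(s, a)))

    # Local step: policy gradient plus a KL(pi_i || pi_c) penalty
    # (a simplified stand-in for the TRPO update).
    for i, batch in enumerate(batches):
        grad = np.zeros_like(thetas[i])
        for s, a, r in batch:
            p = act_probs(thetas[i], s)
            grad += np.outer(s, np.eye(N_ACTIONS)[a] - p) * r  # REINFORCE
            pc = act_probs(theta_c, s)
            # gradient of KL(p || pc) w.r.t. the logits of p
            kl_grad = p * (np.log(p / pc) - (p * np.log(p / pc)).sum())
            grad -= alpha * np.outer(s, kl_grad)
        thetas[i] += lr * grad / max(len(batch), 1)

    # Distillation step: fit the central policy to the local policies
    # by cross-entropy on their visited states (supervised learning).
    grad_c, n = np.zeros_like(theta_c), 0
    for i, batch in enumerate(batches):
        for s, _, _ in batch:
            target = act_probs(thetas[i], s)
            grad_c += np.outer(s, target - act_probs(theta_c, s))
            n += 1
    theta_c += lr * grad_c / max(n, 1)

print("central policy at s=+1:", act_probs(theta_c, np.ones(STATE_DIM)))
print("central policy at s=-1:", act_probs(theta_c, -np.ones(STATE_DIM)))
```

After training, the central policy picks different actions on the two slices of initial states, which is exactly the behavior the ensemble-plus-distillation scheme is meant to recover.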

The results

  • TRPO is considered state of the art in reinforcement learning
  • Distral is another RL algorithm that splits the problem into contexts, but it learns the central policy through supervised learning
  • Unconstrained DnC means that DnC is executed without the KL constraints, which reduces to running TRPO independently on each local policy
  • Centralized DnC is like running Distral but still updating the central policy
  • Picking requires the robotic arm to pick up a randomly placed block and lift it as high as possible
  • Lobbing requires picking up a block and “lobbing” it into a randomly placed square
  • Catching requires the arm to catch a ball that is thrown at it at a random velocity and initial position
  • Ant requires controlling an ant to walk to a random flagged position
  • Stairs requires a bipedal robot to climb a set of stairs of varying heights and lengths

Finally, here’s a video of DnC versus TRPO.