The idea

Problems that exhibit high variation in their initial state produce high-variance policy-gradient estimates and are hard to solve via direct policy or value-function optimization. This paper provides a novel algorithm that partitions the initial state space into “slices” and optimizes an ensemble of policies, one per slice, which are then unified into a single central policy.

The method

Initial states are sampled and then clustered using k-means into contexts \(\omega_i\), each associated with a local policy \(\pi_i(s,a) = \pi((\omega_i,s),a)\). A central policy is defined as the mixture \(\pi_c(s,a)=\sum_{\omega\in\Omega} p(\omega \mid s)\, \pi_{\omega}(s,a)\). Each local policy should stay as close to the central policy as possible, which is encouraged by maximizing \(\eta(\pi_i) - \alpha\, \mathbb{E}[D_{KL}(\pi_i \,\|\, \pi_c)]\) for every \(i\). They also want to keep the divergence between the local policies themselves small on the states each of them visits, so that the ensemble can later be collapsed into a single policy.
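Putting the two pressures together, the overall objective should look roughly like this (a reconstruction from the definitions above, so the exact weighting may differ from the paper's):

\[
\max_{\pi_1,\dots,\pi_k}\; \sum_{i} \eta(\pi_i) \;-\; \alpha \sum_{i \neq j} \mathbb{E}_{s \sim \rho_{\pi_i}}\!\left[ D_{KL}\!\left(\pi_i(\cdot \mid s) \,\|\, \pi_j(\cdot \mid s)\right) \right]
\]

where \(\rho_{\pi_i}\) is the state distribution induced by \(\pi_i\) when started from context \(\omega_i\).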

To update the policies, the authors use the following loss, derived from TRPO:
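As a sketch, a TRPO-style surrogate with the KL penalty above bolted on would look like the following, where \(\pi_i'\) is the previous iterate and \(A^{\pi_i'}\) its advantage function (the paper's exact form may differ):

\[
L(\pi_i) = \mathbb{E}_{s,a \sim \pi_i'}\!\left[ \frac{\pi_i(a \mid s)}{\pi_i'(a \mid s)}\, A^{\pi_i'}(s,a) \right] \;-\; \alpha\, \mathbb{E}_{s \sim \rho_{\pi_i'}}\!\left[ D_{KL}\!\left(\pi_i(\cdot \mid s) \,\|\, \pi_c(\cdot \mid s)\right) \right],
\]

maximized subject to the usual TRPO trust-region constraint on how far \(\pi_i\) may move from \(\pi_i'\).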

Then, the central policy is updated (the authors call this the “distillation step”):
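Since the central policy is meant to imitate the ensemble, the distillation step is presumably a supervised fit of \(\pi_c\) to the local policies on the states they visit, something like:

\[
\pi_c \leftarrow \arg\min_{\pi} \sum_i \mathbb{E}_{s \sim \rho_{\pi_i}}\!\left[ D_{KL}\!\left(\pi_i(\cdot \mid s) \,\|\, \pi(\cdot \mid s)\right) \right].
\]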

Finally, here is the full algorithm:
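In code form, the loop might look like the minimal, self-contained toy sketch below. Plain policy gradient stands in for the TRPO step, a one-step task with a random initial state stands in for the real environments, and all names are illustrative rather than the authors':

```python
# Toy sketch of the DnC loop (not the authors' code): linear-softmax local
# policies trained with a REINFORCE-style update plus a KL penalty toward
# the central policy, followed by a supervised distillation fit.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N_ACTIONS, N_CONTEXTS, STATE_DIM = 3, 2, 1

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def act_probs(theta, s):
    return softmax(s @ theta)  # theta: (STATE_DIM, N_ACTIONS)

def reward(s, a):
    # Toy task: the best action depends on the sign of the initial state,
    # mimicking "high initial state variation".
    return 1.0 if (a == 0) == (s[0] > 0) else 0.0

# Cluster sampled initial states into contexts (the "slices").
init_states = rng.normal(size=(500, STATE_DIM))
km = KMeans(n_clusters=N_CONTEXTS, n_init=10).fit(init_states)

thetas = [np.zeros((STATE_DIM, N_ACTIONS)) for _ in range(N_CONTEXTS)]
theta_c = np.zeros((STATE_DIM, N_ACTIONS))
alpha, lr = 0.1, 0.5

for _ in range(200):
    batches = [[] for _ in range(N_CONTEXTS)]
    for _ in range(64):
        s = rng.normal(size=STATE_DIM)
        i = int(km.predict(s[None])[0])       # which context/slice
        p = act_probs(thetas[i], s)
        a = rng.choice(N_ACTIONS, p=p)
        batches[i].append((s, a, reward(s, a)))

    # Local step: policy gradient plus a KL(pi_i || pi_c) penalty
    # (a simplified stand-in for the TRPO update).
    for i, batch in enumerate(batches):
        grad = np.zeros_like(thetas[i])
        for s, a, r in batch:
            p = act_probs(thetas[i], s)
            grad += np.outer(s, np.eye(N_ACTIONS)[a] - p) * r  # REINFORCE
            pc = act_probs(theta_c, s)
            # gradient of KL(p || pc) w.r.t. the logits of p
            kl_grad = p * (np.log(p / pc) - (p * np.log(p / pc)).sum())
            grad -= alpha * np.outer(s, kl_grad)
        thetas[i] += lr * grad / max(len(batch), 1)

    # Distillation step: fit the central policy to the local policies
    # by cross-entropy on their visited states (supervised learning).
    grad_c, n = np.zeros_like(theta_c), 0
    for i, batch in enumerate(batches):
        for s, _, _ in batch:
            target = act_probs(thetas[i], s)
            grad_c += np.outer(s, target - act_probs(theta_c, s))
            n += 1
    theta_c += lr * grad_c / max(n, 1)

print("central policy at s=+1:", act_probs(theta_c, np.ones(STATE_DIM)))
print("central policy at s=-1:", act_probs(theta_c, -np.ones(STATE_DIM)))
```

After training, the central policy picks different actions on the two slices of initial states, which is exactly the behavior the ensemble-plus-distillation scheme is meant to recover.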

The results

  • TRPO is considered state of the art in reinforcement learning
  • Distral is another RL algorithm that splits the problem into contexts, but it learns the central policy through supervised learning
  • Unconstrained DnC means that DnC is executed without the KL constraints, which reduces to running TRPO independently on each local policy
  • Centralized DnC is like running Distral but still updating the central policy
  • Picking requires the robotic arm to pick up a randomly placed block and lift it as high as possible
  • Lobbing requires picking up a block and “lobbing” it into a randomly placed square
  • Catching requires the arm to catch a ball that is thrown at it at a random velocity and initial position
  • Ant requires controlling an ant to walk to a random flagged position
  • Stairs requires a bipedal robot to climb a set of stairs of varying heights and lengths

Finally, here’s a video of DnC versus TRPO.