Offline Adaptation of Quadruped Locomotion using Diffusion Models

Reece O'Mahoney, Alexander L. Mitchell, Wanming Yu,
Ingmar Posner, Ioannis Havoutis

Oxford Robotics Institute, University of Oxford

Abstract

We present a diffusion-based approach to quadrupedal locomotion that simultaneously addresses the limitations of learning and interpolating between multiple skills (modes) and of adapting offline to new locomotion behaviours after training. This is the first framework to apply classifier-free guided diffusion to quadruped locomotion and demonstrate its efficacy by extracting goal-conditioned behaviour from an originally unlabelled dataset. We show that these capabilities are compatible with a multi-skill policy and can be applied with little modification and minimal compute overhead, i.e., running entirely on the robot's onboard CPU. We verify the validity of our approach with hardware experiments on the ANYmal quadruped platform.

Approach

Method diagram

Method Overview: a) A reinforcement learning agent is pre-trained with a hand-crafted policy that generates reference trajectories. These trajectories are collected by rolling out the policy in an environment with randomised parameters for robustness. b) An embedding of the observation is concatenated with separate embeddings of the diffusion timestep, skill, and return; together these form the conditioning input. The multi-head transformer decoder initially takes a noise vector as input, applies causal self-attention, then cross-attention with the conditioning, and produces a partially denoised vector. This process is repeated N times to produce a complete action trajectory. c) The return value is randomly masked during training, allowing us to use classifier-free guidance at test time. This is done by taking a weighted sum of the unconditional and maximum-return trajectories at each denoising step.
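To make the guided sampling procedure concrete, the sketch below illustrates one way the classifier-free-guided denoising loop described above could be implemented. It is a minimal illustration under stated assumptions, not the authors' code: the denoiser interface `model(x, cond)`, the embedding shapes, the sinusoidal `timestep_embedding`, and the guidance weight `w` are all hypothetical, and the exact update rule would depend on the diffusion parameterisation used.

```python
import math

import torch


def timestep_embedding(t, dim=32):
    # Simple sinusoidal embedding of the diffusion timestep (one possible choice).
    freqs = torch.exp(-torch.arange(0, dim, 2).float() * (math.log(10000.0) / dim))
    angles = t * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)]).unsqueeze(0)


def guided_denoise(model, obs_emb, skill_emb, return_emb_max, return_emb_null,
                   horizon, action_dim, n_steps, w=1.5):
    # Start from a pure-noise action trajectory of shape (1, horizon, action_dim).
    x = torch.randn(1, horizon, action_dim)

    for t in reversed(range(n_steps)):
        t_emb = timestep_embedding(t)

        # Conditioning: observation embedding concatenated with the diffusion
        # timestep, skill, and return embeddings.
        cond_max = torch.cat([obs_emb, t_emb, skill_emb, return_emb_max], dim=-1)
        cond_null = torch.cat([obs_emb, t_emb, skill_emb, return_emb_null], dim=-1)

        # Two passes through the denoiser: one conditioned on the maximum
        # return, one with the return embedding masked out ("unconditional").
        x_max = model(x, cond_max)
        x_null = model(x, cond_null)

        # Weighted combination of the unconditional and maximum-return
        # predictions, in the usual classifier-free-guidance form.
        x = x_null + w * (x_max - x_null)

    # After N denoising steps, x is a complete action trajectory.
    return x
```

Setting w to zero recovers unconditional sampling, while larger values push the sampled trajectory further towards the maximum-return behaviour.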

Results

Results diagram

After training, the model's outputs are adjusted to recover different locomotion behaviours. The table above shows the velocity tracking error of policies trained with different target velocities. We compare an expert model with access to the ground-truth commands against our model, which has no access to these commands and instead aims to maximise a reward function via classifier-free guidance. Our results demonstrate that, using reward guidance alone, our method achieves velocity tracking comparable to a model with access to the ground-truth commands.
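As a point of reference for how such a comparison might be computed, the snippet below gives one plausible definition of a velocity tracking error over a rollout: the mean absolute difference between commanded and measured base velocities. The exact metric behind the table is not specified here, so treat this purely as an illustration; the function name and array layout are assumptions.

```python
import numpy as np


def velocity_tracking_error(commanded, measured):
    # commanded, measured: arrays of shape (T, d) holding the target and
    # actual base velocities over a rollout of T timesteps.
    commanded = np.asarray(commanded, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.mean(np.abs(commanded - measured)))
```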

Results diagram

We collect separate datasets generated by walking and crawling reinforcement learning policies, with no transitions between the two present in the data. Our model nonetheless learned interpolations between these two skills that remained remarkably stable over the full range of velocity commands in the dataset. The bottom row of the figure above shows snapshots from one of these transitions when deployed on real hardware.
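One simple way such an interpolation could be realised with a skill-conditioned model is to feed the denoiser a convex combination of the two skill embeddings. The sketch below is an assumption about how this might be done, not a description of the authors' implementation; the function and variable names are hypothetical.

```python
import torch


def blended_skill_embedding(walk_emb: torch.Tensor,
                            crawl_emb: torch.Tensor,
                            alpha: float) -> torch.Tensor:
    # alpha = 0 gives pure walking, alpha = 1 gives pure crawling; values in
    # between condition the denoiser on a mixture of the two skills.
    alpha = min(max(float(alpha), 0.0), 1.0)
    return (1.0 - alpha) * walk_emb + alpha * crawl_emb
```

Sweeping alpha from 0 to 1 over successive rollouts would then produce a gradual transition between the two gaits.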