PerAct2: Benchmarking and Learning for
Robotic Bimanual Manipulation Tasks

Abstract

Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses this gap by extending RLBench to bimanual manipulation. We open-source our code and benchmark, which comprises 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To kickstart the benchmark, we extend several state-of-the-art methods to bimanual manipulation and also present a language-conditioned behavioral cloning agent, PerAct2, an extension of the PerAct framework. This method enables the learning and execution of bimanual 6-DoF manipulation tasks. Our novel network architecture efficiently integrates language processing with action prediction, allowing robots to understand and perform complex bimanual tasks in response to user-specified goals.



Benchmark

Benchmarking Bimanual Robotic Manipulation Tasks

To benchmark bimanual manipulation, we extend RLBench with new functionality and tasks for the two-arm setting. RLBench is a robot learning benchmark suite consisting of more than 100 tasks and is widely used in the community. Beyond task diversity, its key properties include reproducibility and the ability to accommodate different learning strategies. We extend RLBench to bimanual manipulation while preserving its functionality and these key properties, which allows us to quantify the success of our method and to compare it against other baselines. Compared to unimanual manipulation, bimanual manipulation is more challenging because it requires different kinds of coordination and orchestration of the two arms. On the implementation side, this adds considerable complexity, since both arms must be synchronized when they are controlled at the same time.
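As an illustration of what this synchronization entails, the following is a minimal sketch of commanding two arms in lock-step; ArmTrajectory and lockstep are hypothetical names and do not reflect the actual API of our RLBench extension.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class ArmTrajectory:
    """A sequence of joint (or end-effector) targets for one arm (hypothetical helper)."""
    waypoints: List[Sequence[float]]


def lockstep(right: ArmTrajectory, left: ArmTrajectory
             ) -> List[Tuple[Sequence[float], Sequence[float]]]:
    """Pair per-step targets so both arms are commanded in the same control tick.

    The shorter trajectory is padded by repeating its final configuration, so one
    arm holds its pose while the other finishes its motion.
    """
    n = max(len(right.waypoints), len(left.waypoints))

    def pad(w: List[Sequence[float]]) -> List[Sequence[float]]:
        return list(w) + [w[-1]] * (n - len(w))

    return list(zip(pad(right.waypoints), pad(left.waypoints)))
```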
The following table classifies the tasks according to the bimanual taxonomy of Krebs et al. Here, the key distinguishing factors are the coupling between the two arms as well as the required coordination. We extend the classification by additionally considering physical coupling, i.e., whether one arm exerts a force that could be measured by the other arm. A schematic sketch of this classification schema follows the table.

Classification of the bimanual tasks
Task | Coupled: temporal, spatial, physical | Coordination: symmetric, synchronous
(a) push box
(b) lift a ball
(c) push two buttons
(d) pick up a plate
(e) put item in drawer
(f) put bottle in fridge
(g) handover an item
(h) pick up notebook
(i) straighten rope
(j) sweep dust pan
(k) lift tray
(l) handover item (easy)
(m) take tray out of oven
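
As a reading aid for the table above, the following is a minimal, hypothetical schema for the five classification attributes; it is not part of the benchmark code and asserts no values for any task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskClassification:
    """Five attributes mirroring the table's columns (illustrative names only)."""
    name: str
    # Coupling between the two arms
    temporal: bool    # the arms must act in a particular temporal order
    spatial: bool     # one arm's motion is spatially constrained by the other's
    physical: bool    # one arm exerts a force that could be measured by the other
    # Coordination between the two arms
    symmetric: bool   # both arms perform (mirrored) versions of the same motion
    synchronous: bool # both arms have to move at the same time
```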

PerAct2

A Perceiver-Actor Framework for Bimanual Robotic Manipulation Tasks

The system architecture. PerAct2 takes proprioception, RGB-D camera images, and a task description as input. A voxel grid is constructed by merging the data from multiple RGB-D cameras. A PerceiverIO transformer learns features at both the voxel and language levels. The output for each robot arm is a discretized action comprising a six-dimensional end-effector pose, the gripper state, and an additional indicator for collision-aware motion planning.
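
To make the output format concrete, below is a hedged sketch of one arm's discretized action and one way it could be decoded into a continuous command. The class name, bin count, and decoding are illustrative assumptions rather than PerAct2's exact hyperparameters or code; PerAct2 predicts one such action per arm from the fused voxel-and-language features.

```python
from dataclasses import dataclass
import numpy as np

ROTATION_BINS = 72      # assumption: 5-degree Euler-angle bins per axis


@dataclass
class DiscretizedArmAction:
    """One arm's discretized output, as described above (names are illustrative)."""
    translation_idx: np.ndarray   # (x, y, z) index into the voxel grid
    rotation_idx: np.ndarray      # per-axis Euler-angle bin indices
    gripper_open: bool            # desired gripper state
    ignore_collisions: bool       # whether the motion planner may ignore collisions

    def to_pose(self, workspace_min: np.ndarray, voxel_size: float):
        """Decode the voxel / bin indices into a continuous 6-DoF target."""
        position = workspace_min + (self.translation_idx + 0.5) * voxel_size
        euler = (self.rotation_idx + 0.5) * (2.0 * np.pi / ROTATION_BINS)
        return position, euler
```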




Results

Real-world experiments

The system architecture is robot agnostic since it predicts a 6-DoF end-effector pose per arm. The method also works in the real world and can easily be transferred to other robots, e.g., a humanoid robot.
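
The sketch below illustrates this separation of concerns: the policy only emits per-arm end-effector poses, and a robot-specific planner consumes them. The planner object and its move_to_pose / set_gripper calls are hypothetical placeholders, not any particular robot's API.

```python
import numpy as np


def execute_predicted_action(planner, pose: np.ndarray, gripper_open: bool) -> None:
    """Hand one arm's predicted pose to a robot-specific planner.

    `planner` is a stand-in for whatever IK / motion-planning stack the target
    robot provides; only a 6-DoF pose interface is assumed.
    """
    # pose: 4x4 homogeneous transform of the end-effector in the workspace frame
    planner.move_to_pose(pose)              # hypothetical planner call
    planner.set_gripper(opened=gripper_open)  # hypothetical gripper call
```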

Simulation Results

Single task results


Task success rates for single-task training (⭐ marks the best-performing method for each task)
Method (a) box (b) ball (c) buttons (d) plate (e) drawer (f) fridge (g) handover
ACT 0% 36% 4% 0% 13% 0% 0%
RVT-LF 52% 17% 39% 3% 10% 0% 0%
PerAct-LF ⭐ 57% 40% 10% 2% ⭐ 27% 0% 0%
PerAct2 (ours) 6% ⭐ 50% ⭐ 47% ⭐ 4% 10% ⭐ 3% ⭐ 11%
Method (h) laptop (i) rope (j) dust (k) tray (l) handover easy (m) oven
ACT 0% 16% 0% 6% 0% 2%
RVT-LF 3% 3% 0% 6% 0% 3%
PerAct-LF 11% 21% ⭐ 28% ⭐ 14% 9% 8%
PerAct2 (ours) ⭐ 12% ⭐ 24% 0% 1% ⭐ 41% ⭐ 9%
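
For reference, the reported percentages are success rates over evaluation episodes. A minimal sketch of this metric, assuming a generic environment and policy interface rather than the benchmark's actual API:

```python
def success_rate(env, policy, num_episodes: int = 100) -> float:
    """Fraction of evaluation episodes in which the task is completed.

    `env` and `policy` are hypothetical stand-ins: `env.reset()` returns an
    observation, `env.step(action)` returns (observation, done, success), and
    `policy.act(obs)` returns the next bimanual action.
    """
    successes = 0
    for _ in range(num_episodes):
        obs, done, success = env.reset(), False, False
        while not done:
            obs, done, success = env.step(policy.act(obs))
        successes += int(success)
    return successes / num_episodes
```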