RoboPianist: Dexterous Piano Playing
with Deep Reinforcement Learning

@ Conference on Robot Learning (CoRL) 2023
Kevin Zakka1
Philipp Wu1
Laura Smith1
Nimrod Gileadi2
Taylor Howell3
Xue Bin Peng4
Sumeet Singh2
Yuval Tassa2
Pete Florence2
Andy Zeng2
Pieter Abbeel1
1UC Berkeley
2Google DeepMind
3Stanford University
4Simon Fraser University

TLDR: We train anthropomorphic robot hands to play the piano using deep RL
and release a simulated benchmark and dataset to advance high-dimensional control.

Interactive Demo

This is a demo of our simulated piano playing agent trained with reinforcement learning. It runs MuJoCo natively in your browser thanks to WebAssembly. You can use your mouse to interact with it, for example by dragging down the piano keys to generate sound or pushing the hands to perturb them. The controls section in the top right corner can be used to change songs and the simulation section to pause or reset the agent. Make sure you click the demo at least once to enable sound.



We build our simulated piano-playing environment using the open-source MuJoCo physics engine. It consists of a full-size 88-key digital keyboard and two Shadow Dexterous Hands, each with 24 degrees of freedom.

Musical representation

We use the Musical Instrument Digital Interface (MIDI) standard to represent a musical piece as a sequence of time-stamped messages corresponding to "note-on" or "note-off" events. A message carries additional pieces of information such as the pitch of a note and its velocity.

We convert the MIDI file into a time-indexed note trajectory (also known as a piano roll), where each time step is represented as a binary vector of length 88 (the number of keys on a piano) marking which keys are active. This trajectory is used as the goal representation for our agent, informing it which keys to press at each time step.
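The conversion to a piano roll can be sketched as follows. This is a minimal illustration, not the environment's actual code: the `(key, onset, offset)` note format, the function name `notes_to_piano_roll`, and the choice to round times to the nearest step are all assumptions made for the example, and parsing the MIDI messages themselves is left out.

```python
from typing import List, Tuple

NUM_KEYS = 88  # number of keys on a full-size piano

# Hypothetical note format for this sketch: (key index 0-87, onset s, offset s).
Note = Tuple[int, float, float]


def notes_to_piano_roll(notes: List[Note], dt: float) -> List[List[int]]:
    """Rasterize notes into length-88 binary rows, one row per time step."""
    if not notes:
        return []
    end_time = max(offset for _, _, offset in notes)
    num_steps = int(round(end_time / dt))
    roll = [[0] * NUM_KEYS for _ in range(num_steps)]
    for key, onset, offset in notes:
        start = int(round(onset / dt))
        stop = int(round(offset / dt))
        # Mark the key as active for every step the note is held.
        for t in range(start, min(stop, num_steps)):
            roll[t][key] = 1
    return roll


# Two overlapping notes sampled every 0.05 s (20 Hz).
notes = [(39, 0.0, 0.10), (43, 0.05, 0.15)]
roll = notes_to_piano_roll(notes, dt=0.05)
```

In this example the middle row has two active entries, since both notes sound during that step.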

The interactive plot below shows the song Twinkle Twinkle Little Star encoded as a piano roll. The x-axis represents time in seconds, and the y-axis represents musical pitch as a number between 1 and 88. You can hover over each note to see what additional information it carries.

A synthesizer can be used to convert MIDI files to raw audio.

Musical evaluation

We use precision, recall and F1 scores to evaluate the proficiency of our agent. If at a given instant there are keys that should be "on" and keys that should be "off", precision measures how good the agent is at not hitting any of the keys that should be "off", while recall measures how good the agent is at hitting all the keys that should be "on". The F1 score combines precision and recall into a single metric, and ranges from 0 (if either precision or recall is 0) to 1 (perfect precision and recall).
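For a single time step, these metrics can be computed from the target and pressed key vectors as in the sketch below. The function name `key_press_metrics` and the convention of returning 0 when a denominator is empty are assumptions for illustration. How the scores are aggregated over a whole song is not specified here and is not covered.

```python
from typing import List, Tuple


def key_press_metrics(target: List[int], pressed: List[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one time step of binary key states."""
    pairs = list(zip(target, pressed))
    tp = sum(1 for t, p in pairs if t and p)      # should be on, was pressed
    fp = sum(1 for t, p in pairs if not t and p)  # should be off, was pressed
    fn = sum(1 for t, p in pairs if t and not p)  # should be on, was missed
    # Assumed convention: an empty denominator yields a score of 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One correct key, one wrong key, one missed key (4 keys for brevity).
precision, recall, f1 = key_press_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

F1 is the harmonic mean of precision and recall, which is why it drops to 0 as soon as either of them is 0.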

Piano fingering and dataset

Piano fingering refers to the assignment of fingers to notes in a piano piece (see figure below). Sheet music will typically provide sparse fingering labels for the tricky sections of a piece to help guide pianists, and pianists will often develop their own fingering preferences for a given piece.

In RoboPianist, we found that the agent struggled to learn to play the piano with a sparse reward signal due to the exploration challenge associated with the high-dimensional action space. To overcome this issue, we added human priors in the form of the fingering labels to the reward function to guide its exploration.

Since fingering labels aren't available in MIDI files by default, we used annotations from the Piano Fingering Dataset (PIG) to create 150 labeled MIDI files, which we call Repertoire-150 and release as part of our environment.

Finger numbers (1 to 5) annotated above each note. Source: PianoPlayer

MDP Formulation

We model piano-playing as a finite-horizon Markov Decision Process (MDP) defined by a tuple \( (\mathcal{S}, \mathcal{A}, \rho, p, r, \gamma, H) \), where \( \mathcal{S} \) is the state space, \( \mathcal{A} \) is the action space, \( \rho(\cdot) \) is the initial state distribution, \( p(\cdot | s, a) \) governs the dynamics, \( r(s, a) \) is the reward function, \( \gamma \) is the discount factor, and \( H \) is the horizon. The goal of the agent is to maximize its total expected discounted reward over the horizon \( \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r(s_t, a_t) \right] \).
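For one sampled episode, the quantity inside the expectation is a discounted sum of rewards. The sketch below evaluates it for a list of per-step rewards. The function name `discounted_return` and the example values are illustrative assumptions, not taken from our training code.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute sum_t gamma^t * r_t for rewards r_0, ..., r_H of one episode."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25.
episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The objective in the text is the expectation of this value over episodes generated by the policy and the dynamics.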

At every time step, the agent receives proprioceptive (i.e., hand joint angles), exteroceptive (i.e., piano key states), and goal (i.e., piano roll) observations and outputs 22 target joint angles for each hand. These are fed to proportional-position actuators, which convert them to torques at each joint. The agent then receives a weighted sum of reward terms, including a reward for hitting the correct keys, a reward for minimizing energy consumption, and a shaping reward for adhering to the fingering labels.
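The two mechanisms in this paragraph can be sketched in a few lines. A proportional position actuator produces a torque proportional to the gap between the target and current joint angle, and the reward is a weighted sum of named terms. The gain `kp`, the term names, and all numeric values below are hypothetical placeholders, not the values used in our environment.

```python
from typing import Dict, List, Sequence


def position_actuator_torques(
    targets: Sequence[float], angles: Sequence[float], kp: float
) -> List[float]:
    """Proportional control: torque = kp * (target angle - current angle)."""
    return [kp * (target - angle) for target, angle in zip(targets, angles)]


def total_reward(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of reward terms; every term needs a matching weight."""
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical two-joint example with an assumed gain of 2.0.
torques = position_actuator_torques([0.5, 0.0], [0.25, 0.5], kp=2.0)

# Hypothetical reward terms and weights for one time step.
terms = {"key_press": 1.0, "energy": -0.5, "fingering": 0.5}
weights = {"key_press": 1.0, "energy": 0.5, "fingering": 0.5}
reward = total_reward(terms, weights)
```

Keeping the terms in a dictionary makes it easy to switch a single term off by setting its weight to 0, which is one way to study the effect of each term in isolation.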

For our policy optimizer, we use DroQ, a state-of-the-art model-free RL algorithm, and train our agent for 5 million steps with a control frequency of 20 Hz.

Quantitative Results

With careful system design, we improve our agent's performance significantly. Specifically, adding an energy cost to the reward formulation, providing a few seconds' worth of future goals rather than just the current goal, and constraining the action space helped the agent learn faster and achieve a higher F1 score. The plot below shows the additive effect of each of these design choices on three different songs of increasing difficulty.
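Providing future goals amounts to giving the agent a window of the piano roll instead of a single row. A minimal sketch, assuming the window is flattened into one vector and padded with zeros past the end of the song (the function name `lookahead_goal` and both of those choices are assumptions for illustration):

```python
from typing import List


def lookahead_goal(roll: List[List[int]], t: int, num_lookahead: int) -> List[int]:
    """Concatenate the goal at step t with the next num_lookahead goals."""
    num_keys = len(roll[0])
    goal: List[int] = []
    for step in range(t, t + num_lookahead + 1):
        if step < len(roll):
            goal.extend(roll[step])
        else:
            # Past the end of the song: assume no keys should be pressed.
            goal.extend([0] * num_keys)
    return goal


# Tiny 2-key roll for readability; a real roll would have 88 columns.
roll = [[1, 0], [0, 1], [1, 1]]
goal = lookahead_goal(roll, t=1, num_lookahead=2)
```

At a control frequency of 20 Hz, one second of lookahead corresponds to 20 steps, so a window of a few seconds spans several tens of rows.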

When compared to a strong derivative-free model predictive control (MPC) baseline, Predictive Sampling, our agent achieves a much higher F1 score, averaging 0.79 across Etude-12 versus 0.43 for Predictive Sampling.

Qualitative Results

The videos below play in real time and show our agent playing every song in the Etude-12 subset. In each video frame, we display the fingering labels by coloring the keys according to the corresponding finger color. When a key is pressed, it is colored green.

Debug dataset

This dataset contains "entry-level" songs (e.g., scales) and is useful for sanity checking an agent's performance. Fingering labels in this dataset were manually annotated by the authors of this paper. It is not part of the Repertoire-150 dataset.

C Major Scale
D Major Scale
Twinkle Twinkle Little Star

Etude-12 subset

Etude-12 is a subset of the full 150-song dataset. It consists of 12 songs of varying difficulty and is reserved for more moderate compute budgets.

Piano Sonata D845 1st Mov (F1=0.72)
Partita No. 2 6th Mov (F1=0.73)
Bagatelle Op. 3 No. 4 (F1=0.75)
French Suite No. 5 Sarabande (F1=0.89)
Waltz Op. 64 No. 1 (F1=0.78)
French Suite No. 1 Allemande (F1=0.78)
Piano Sonata No. 2 1st Mov (F1=0.79)
Kreisleriana Op. 16 No. 8 (F1=0.84)
Golliwoggs Cakewalk (F1=0.85)
Piano Sonata No. 23 2nd Mov (F1=0.87)
French Suite No. 5 Gavotte (F1=0.77)
Piano Sonata K279 1st Mov (F1=0.78)

Common failure modes

Since the Shadow Hand forearms are thicker than a human's, the agent sometimes struggles to nail down notes that are really close together. Adding full rotational and translational degrees of freedom to the hands could give them the ability to overcome this limitation, but would pose additional challenges for learning.
The agent struggles with songs that require stretching the fingers over many notes, sometimes more than 1 octave.


This work is supported in part by ONR #N00014-22-1-2121 under the Science of Autonomy program.

This website was heavily inspired by Brent Yi's.