TLDR
We train anthropomorphic robot hands to play the piano using deep RL
and release a simulated benchmark and dataset to advance high-dimensional control.
This is a demo of our simulated piano playing agent trained with reinforcement learning. It runs MuJoCo natively in your browser thanks to WebAssembly. You can use your mouse to interact with it, for example by dragging down the piano keys to generate sound or pushing the hands to perturb them. The controls section in the top right corner can be used to change songs and the simulation section to pause or reset the agent. Make sure you click the demo at least once to enable sound.
We build our simulated piano-playing environment using the open-source MuJoCo physics engine. It consists of a full-size, 88-key digital keyboard and two Shadow Dexterous Hands, each with 24 degrees of freedom.
We use the Musical Instrument Digital Interface (MIDI) standard to represent a musical piece as a sequence of time-stamped messages corresponding to "note-on" or "note-off" events. A message carries additional pieces of information such as the pitch of a note and its velocity.
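As a concrete illustration, note events can be read from a MIDI file with the mido library. This is a minimal sketch with a placeholder file name, not necessarily the tooling behind our pipeline:

```python
import mido

# Read a MIDI file and print its note events ("song.mid" is a placeholder path).
midi = mido.MidiFile("song.mid")

time = 0.0
for msg in midi:  # iterating a MidiFile yields messages with delta times in seconds
    time += msg.time
    if msg.type == "note_on" and msg.velocity > 0:
        print(f"t={time:.3f}s  note-on  pitch={msg.note} velocity={msg.velocity}")
    elif msg.type == "note_off" or (msg.type == "note_on" and msg.velocity == 0):
        print(f"t={time:.3f}s  note-off pitch={msg.note}")
```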
We convert the MIDI file into a time-indexed note trajectory (also known as a piano roll), where each note is represented as a one-hot vector of length 88 (the number of keys on a piano). This trajectory is used as the goal representation for our agent, informing it which keys to press at each time step.
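A rough sketch of how such a piano roll could be built from note events, assuming a fixed control interval and MIDI pitches in the 88-key range (21 to 108); the helper below is illustrative and not part of the released environment:

```python
import numpy as np

def piano_roll(notes, dt=0.05, duration=None):
    """Build a (num_steps, 88) binary goal matrix from (onset, offset, pitch) tuples."""
    if duration is None:
        duration = max(offset for _, offset, _ in notes)
    num_steps = int(np.ceil(duration / dt))
    roll = np.zeros((num_steps, 88), dtype=np.float32)
    for onset, offset, pitch in notes:
        key = pitch - 21  # map MIDI pitch (21 = A0) to key index 0..87
        start, end = int(onset / dt), int(np.ceil(offset / dt))
        roll[start:end, key] = 1.0
    return roll
```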
The interactive plot below shows the song Twinkle Twinkle Little Star encoded as a piano roll. The x-axis represents time in seconds, and the y-axis represents musical pitch as a number between 1 and 88. You can hover over each note to see what additional information it carries.
A synthesizer can be used to convert MIDI files to raw audio:
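For example, a MIDI file can be rendered to a waveform offline with the pretty_midi package. The snippet below is a sketch with a placeholder file path, not the synthesizer used in the interactive demo:

```python
import numpy as np
import pretty_midi
from scipy.io import wavfile

# Render a MIDI file to a 44.1 kHz waveform ("song.mid" is a placeholder path).
midi = pretty_midi.PrettyMIDI("song.mid")
audio = midi.synthesize(fs=44100)  # simple additive synthesis
# midi.fluidsynth(fs=44100) gives higher-quality audio if a SoundFont is available.

# Normalize to 16-bit PCM and write to disk.
audio = (audio / np.max(np.abs(audio)) * 32767).astype(np.int16)
wavfile.write("song.wav", 44100, audio)
```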
We use precision, recall and F1 scores to evaluate the proficiency of our agent. If at a given instant there are keys that should be "on" and keys that should be "off", precision measures how good the agent is at not hitting any of the keys that should be "off", while recall measures how good the agent is at hitting all of the keys that should be "on". The F1 score combines precision and recall into a single metric, and ranges from 0 (if either precision or recall is 0) to 1 (perfect precision and recall).
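A sketch of how these metrics can be computed for a single time step from binary key-state vectors (a hypothetical helper, not the benchmark's exact evaluation code):

```python
import numpy as np

def key_press_metrics(pressed, should_press):
    """Precision, recall and F1 for one time step, given length-88 binary key states."""
    pressed = np.asarray(pressed, dtype=bool)
    should_press = np.asarray(should_press, dtype=bool)
    tp = np.sum(pressed & should_press)   # correct presses
    fp = np.sum(pressed & ~should_press)  # keys hit that should be off
    fn = np.sum(~pressed & should_press)  # keys missed that should be on
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```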
Piano fingering refers to the assignment of fingers to notes in a piano piece (see figure below). Sheet music will typically provide sparse fingering labels for the tricky sections of a piece to help guide pianists, and pianists will often develop their own fingering preferences for a given piece.
In RoboPianist, we found that the agent struggled to learn to play the piano with a sparse reward signal due to the exploration challenge associated with the high-dimensional action space. To overcome this issue, we added human priors in the form of the fingering labels to the reward function to guide its exploration.
Since fingering labels aren't available in MIDI files by default, we used annotations from the Piano Fingering Dataset (PIG) to create 150 labeled MIDI files, which we call Repertoire-150 and release as part of our environment.
We model piano-playing as a finite-horizon Markov Decision Process (MDP) defined by a tuple \( (\mathcal{S}, \mathcal{A}, \rho, p, r, \gamma, H) \), where \( \mathcal{S} \) is the state space, \( \mathcal{A} \) is the action space, \( \rho(\cdot) \) is the initial state distribution, \( p(\cdot \mid s, a) \) governs the dynamics, \( r(s, a) \) is the reward function, \( \gamma \) is the discount factor, and \( H \) is the horizon. The goal of the agent is to maximize its total expected discounted reward over the horizon: \( \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r(s_t, a_t) \right] \).
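For concreteness, the discounted return being maximized can be computed from a reward sequence as follows (a plain illustration of the formula above):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over a finite-horizon episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: three steps of reward 1.0 with gamma = 0.99.
print(discounted_return([1.0, 1.0, 1.0]))  # 1.0 + 0.99 + 0.9801 ≈ 2.9701
```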
At every time step, the agent receives proprioceptive (i.e., hand joint angles), exteroceptive (i.e., piano key states) and goal observations (i.e., the piano roll) and outputs 22 target joint angles for each hand. These are fed to proportional-position actuators, which convert them to torques at each joint. The agent then receives a weighted sum of reward terms, including a reward for hitting the correct keys, a reward for minimizing energy consumption, and a shaping reward for adhering to the fingering labels.
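A schematic of such a weighted reward, with term definitions and weights chosen purely for illustration (the released environment defines its own terms and coefficients):

```python
import numpy as np

def total_reward(correct_key_term, joint_torques, joint_velocities, fingering_term,
                 w_key=1.0, w_energy=0.005, w_finger=0.5):
    """Weighted sum of reward terms; names and weights here are placeholders.

    - correct_key_term: reward for pressing the right keys (e.g., in [0, 1]).
    - energy cost: penalizes actuator effort, here |torque| * |joint velocity|.
    - fingering_term: shaping reward for matching the fingering labels.
    """
    energy_cost = np.sum(np.abs(joint_torques) * np.abs(joint_velocities))
    return w_key * correct_key_term - w_energy * energy_cost + w_finger * fingering_term
```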
For our policy optimizer, we use DroQ, a state-of-the-art model-free RL algorithm, and train our agent for 5 million steps with a control frequency of 20 Hz.
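A 20 Hz control frequency means each action is held for 50 ms of simulated time. The sketch below assumes a hypothetical physics timestep of 5 ms, purely to illustrate the relationship; the actual timestep is set by the environment:

```python
CONTROL_HZ = 20
CONTROL_DT = 1.0 / CONTROL_HZ              # 0.05 s between agent decisions
PHYSICS_DT = 0.005                         # assumed MuJoCo timestep, for illustration
N_SUBSTEPS = int(CONTROL_DT / PHYSICS_DT)  # 10 physics steps per control step
```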
With careful system design, we improve our agent's performance significantly. Specifically, adding an energy cost to the reward formulation, providing a few seconds' worth of future goals rather than just the current goal, and constraining the action space helped the agent learn faster and achieve a higher F1 score. The plot below shows the additive effect of each of these design choices on three different songs of increasing difficulty.
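The future-goal observation can be thought of as stacking the next several frames of the piano roll into the goal vector. A rough sketch, with the lookahead length as an assumed parameter:

```python
import numpy as np

def goal_observation(roll, t, lookahead=10):
    """Stack the current and next `lookahead` piano-roll frames into one goal vector.

    `roll` has shape (num_steps, 88); frames past the end of the song are zero-padded.
    """
    frames = roll[t : t + lookahead + 1]
    pad = (lookahead + 1) - len(frames)
    if pad > 0:
        frames = np.concatenate([frames, np.zeros((pad, 88), dtype=roll.dtype)])
    return frames.reshape(-1)  # shape: ((lookahead + 1) * 88,)
```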
When compared to a strong derivative-free model predictive control (MPC) baseline, Predictive Sampling, our agent achieves a much higher F1 score, averaging 0.79 across Etude-12 versus 0.43 for Predictive Sampling.
Each video below plays in real time and shows our agent playing every song in the Etude-12 subset. In each video frame, we display the fingering labels by coloring the keys according to the corresponding finger color. When a key is pressed, it is colored green.
This dataset contains "entry-level" songs (e.g., scales) and is useful for sanity checking an agent's performance. Fingering labels in this dataset were manually annotated by the authors of this paper. It is not part of the Repertoire-150 dataset.
Etude-12 is a 12-song subset of Repertoire-150 spanning a range of difficulties, reserved for more moderate compute budgets.
This work is supported in part by ONR #N00014-22-1-2121 under the Science of Autonomy program.
This website was heavily inspired by Brent Yi's.