Overview

ProteinMPNN is a message passing neural network for protein sequence design developed in the Baker Lab at the University of Washington Institute for Protein Design. Given a fixed protein backbone, it predicts amino acid sequences that are likely to fold into that structure, which is the inverse folding problem at the core of rational protein engineering. Published in Science in September 2022, it quickly became one of the most widely used sequence design tools in the field and is now a standard component of many de novo protein design pipelines.

The model addresses a long-standing bottleneck: earlier energy-based methods such as Rosetta were slow and achieved modest sequence recovery, limiting throughput in design campaigns. ProteinMPNN reframes the problem as graph-based deep learning on backbone geometry, enabling fast inference and substantially higher accuracy. Its 52.4% sequence recovery on held-out native backbones is 19.5 percentage points above Rosetta's 32.9%, a roughly 60% relative improvement, and the sequences it produces tend to refold to the target backbone when re-predicted by AlphaFold2.
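The recovery figures above can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of sequence recovery (the fraction of positions where the designed residue matches the native one); the function names and the toy sequences are illustrative and are not part of ProteinMPNN.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


def relative_improvement(new: float, baseline: float) -> float:
    """Improvement of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Toy sequences (made up for illustration): 6 of 8 positions match.
print(sequence_recovery("MKTAYIAK", "MKSAYLAK"))   # 0.75

# Reported recoveries: 52.4% (ProteinMPNN) vs 32.9% (Rosetta).
print(round(relative_improvement(52.4, 32.9), 3))  # 0.593
```

The absolute gap is 19.5 percentage points; dividing by the 32.9% baseline gives the "roughly 60%" relative figure.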

ProteinMPNN was developed by Justas Dauparas and colleagues in the David Baker laboratory and is distributed as open-source software under the MIT license. A successor model, LigandMPNN, adds explicit support for small molecules, nucleotides, and metal ions; it was first described in a 2023 preprint and formally published in 2025.

Key Features

High Sequence Recovery: Achieves 52.4% sequence recovery on held-out native backbones, compared to 32.9% for Rosetta. This roughly 60% relative improvement is accompanied by higher reported experimental success rates.
Multi-Chain and Oligomer Support: Natively handles symmetric and hetero-oligomeric assemblies using chain identity encoding and relative positional features capped at ±32 residues, enabling design of complex protein architectures.
Backbone Noise Augmentation: Trained with Gaussian noise on backbone coordinates at multiple levels (0.02–0.30 Å), producing sequences that are robust to the imperfect backbones generated by diffusion-based backbone design tools.
Flexible Partial Design: Specific residues or chains can be fixed while others are sampled, enabling interface redesign, loop optimization, or conservation of catalytic residues in an otherwise redesigned scaffold.
CA-Only Mode: Accepts Cα-only backbone inputs for compatibility with diffusion-generated backbones that lack full-atom coordinates, with a modest accuracy tradeoff relative to the full-backbone mode.
Fast Inference: Generates diverse sequence candidates in seconds on a single GPU, making high-throughput design campaigns with thousands of backbone variants practical.
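The backbone noise augmentation listed above can be illustrated with a short sketch. It assumes independent Gaussian noise is added to every coordinate with a standard deviation equal to the stated noise level; the helper names and the three-atom toy backbone are hypothetical, and the actual training code may differ in detail.

```python
import math
import random


def add_backbone_noise(coords, sigma, rng):
    """Return a copy of coords with Gaussian noise (std sigma, in Å) on each x, y, z."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in atom) for atom in coords]


def rms_displacement(original, perturbed):
    """Root-mean-square distance between matching atoms of two coordinate sets."""
    total = sum(math.dist(a, b) ** 2 for a, b in zip(original, perturbed))
    return math.sqrt(total / len(original))


# Toy N, CA, C coordinates in Å (made up for illustration).
backbone = [(0.0, 0.0, 0.0), (1.458, 0.0, 0.0), (2.009, 1.420, 0.0)]

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for sigma in (0.02, 0.10, 0.20, 0.30):
    noisy = add_backbone_noise(backbone, sigma, rng)
    print(f"sigma={sigma:.2f} Å  RMS displacement={rms_displacement(backbone, noisy):.3f} Å")
```

Even the largest level perturbs atoms by only a fraction of an ångström, which is small next to a typical bond length but enough to stop the model from relying on fine geometric detail that a generated backbone would not reproduce.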

Technical Details

ProteinMPNN is an encoder-decoder message passing neural network that operates on protein backbone graphs. Each residue is a node, and edges connect each residue to its nearest neighbors by Cα distance in 3D space (32 to 48 neighbors; the released v_48 weights use 48). Edge features encode pairwise distances between backbone atoms (N, Cα, C, O, and a virtual Cβ) together with relative sequence-position encodings; in the published model these interatomic distances replaced the orientation and dihedral-angle features used by earlier graph-based models. Three encoder message-passing layers aggregate structural context into node and edge embeddings; three autoregressive decoder layers then generate the sequence one residue at a time, conditioning on structural embeddings and all previously placed amino acids. Decoding order is randomized during training rather than fixed from N- to C-terminus, which avoids positional bias and allows arbitrary positions to be held fixed at design time. The hidden dimensionality throughout is 128.
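Two of the ideas in this paragraph, the nearest-neighbor graph and the randomized decoding order, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's code: the function names are hypothetical, the coordinates are made up, and the residue itself is excluded from its own neighbor list as a simplification.

```python
import math
import random


def knn_neighbors(ca_coords, k):
    """For each residue, list the indices of its k nearest residues by Cα distance."""
    neighbors = []
    for i, ci in enumerate(ca_coords):
        # Self is excluded here for clarity; a real implementation may keep it.
        ranked = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(ca_coords) if j != i
        )
        neighbors.append([j for _, j in ranked[:k]])
    return neighbors


def random_decoding_order(n_residues, rng):
    """Return a random permutation of residue indices."""
    order = list(range(n_residues))
    rng.shuffle(order)
    return order


def visible_context(order):
    """Map each residue to the set of residues decoded before it in `order`."""
    return {residue: set(order[:step]) for step, residue in enumerate(order)}


# Toy Cα trace on a line (coordinates in Å, made up for illustration).
ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(knn_neighbors(ca, k=2))  # [[1, 2], [0, 2], [1, 3], [2, 1], [3, 2]]

order = random_decoding_order(len(ca), random.Random(0))
context = visible_context(order)
print("decoding order:", order)
print("first residue sees:", context[order[0]])        # empty set
print("last residue sees:", len(context[order[-1]]), "residues")
```

Because the order is a permutation rather than a fixed left-to-right sweep, a residue late in the sequence may be decoded early and condition on residues that follow it in the chain.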

The model was trained on a 16.5 GB curated set of PDB biological assemblies (August 2021 snapshot), filtered for structural redundancy across training, validation, and test splits. Four noise-level variants are released (v_48_002, v_48_010, v_48_020, v_48_030); the v_48_020 variant is typically recommended when designing sequences for diffusion-generated backbones. Pre-trained weights for all variants, including Cα-only models, are included in the repository. Experimental validation in the original paper included crystal and cryo-EM structures of 10 cyclic homo-oligomers (130–1800 amino acids) that matched design targets, with 88% of designed cyclic homo-oligomers confirmed soluble.
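The variant names can be decoded mechanically. The sketch below assumes the naming convention is `v_<neighbors>_<noise in hundredths of an ångström>`, which is consistent with the 0.02–0.30 Å noise range and the 48-neighbor graphs described above; `parse_variant` is a hypothetical helper, not a function from the repository.

```python
import re


def parse_variant(name):
    """Split a name such as 'v_48_020' into (neighbor count, noise level in Å)."""
    match = re.fullmatch(r"v_(\d+)_(\d{3})", name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    neighbors = int(match.group(1))
    noise_angstrom = int(match.group(2)) / 100  # assumed: hundredths of an Å
    return neighbors, noise_angstrom


for variant in ("v_48_002", "v_48_010", "v_48_020", "v_48_030"):
    neighbors, noise = parse_variant(variant)
    print(f"{variant}: {neighbors} neighbors, {noise:.2f} Å training noise")
```

Under this reading, the commonly recommended v_48_020 corresponds to 48 neighbors and 0.20 Å of training noise.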

Applications

ProteinMPNN is routinely paired with backbone generation methods, primarily RFdiffusion and Chroma, to form complete de novo protein design pipelines in which a generative model proposes a backbone and ProteinMPNN designs sequences for it. This combination has produced functional binding proteins, enzymes, and symmetric nanoparticle assemblies, with reported experimental success rates well above those of earlier physics-based workflows. In stability and expression optimization workflows, ProteinMPNN is applied to existing structures to generate variants with improved thermostability or solubility, as demonstrated for myoglobin and TEV protease in a 2024 JACS study. For target-binding applications, the partial design mode fixes scaffold residues and redesigns only interface positions, enabling rapid exploration of binding contacts. ProteinMPNN is also integrated into cloud inference platforms such as NVIDIA BioNeMo NIM and third-party web servers including Neurosnap, broadening access beyond users with local GPU resources.
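In a scripted pipeline, the sequence design step is typically a command-line call on each generated backbone. The sketch below only assembles such a command and does not run it. The script name and flags (`protein_mpnn_run.py`, `--pdb_path`, `--out_folder`, `--model_name`, `--num_seq_per_target`, `--sampling_temp`, `--ca_only`) are given as recalled from the public repository and should be checked against the script's `--help` output; the file paths and the `build_command` helper are hypothetical.

```python
def build_command(pdb_path, out_folder, model_name="v_48_020",
                  num_seqs=8, temperature=0.1, ca_only=False):
    """Assemble an argument list for one ProteinMPNN design run (not executed here)."""
    cmd = [
        "python", "protein_mpnn_run.py",        # assumed entry-point script
        "--pdb_path", pdb_path,                  # backbone to design a sequence for
        "--out_folder", out_folder,              # where designed sequences are written
        "--model_name", model_name,              # noise-level variant of the weights
        "--num_seq_per_target", str(num_seqs),   # number of sequences to sample
        "--sampling_temp", str(temperature),     # lower values give less diverse samples
    ]
    if ca_only:
        cmd.append("--ca_only")                  # use the Cα-only model
    return cmd


# Hypothetical paths for a backbone produced by an upstream generative model.
command = build_command("backbones/design_0001.pdb", "mpnn_out")
print(" ".join(command))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.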

Impact

ProteinMPNN became the de facto standard for sequence design in computational protein engineering within two years of its release and appears as a component in a large share of published de novo design studies. The Science paper has accumulated thousands of citations, and the GitHub repository has been broadly adopted across academic and industrial settings. Its release helped make diffusion-based backbone design practical by providing a fast, accurate way to turn generated backbones into sequences likely to fold to them. Notable limitations include the lack of small-molecule or cofactor awareness in the base model (addressed by the successor LigandMPNN), dependence on backbone quality as the primary determinant of experimental success, and the absence of evolutionary constraint information from multiple sequence alignments. Despite these constraints, ProteinMPNN's combination of accuracy, speed, flexibility, and open availability has made it a foundational tool in the modern protein design stack.

ProteinMPNN

Citation

Robust deep learning based protein sequence design using ProteinMPNN

