bio.rodeo

© 2026 bio.rodeo. All rights reserved.
Protein

LigandMPNN

Institute for Protein Design

Protein sequence design method that explicitly models small molecules, nucleotides, and metals at atomic resolution for ligand-aware design. More than 100 designs have been experimentally characterized.

Released: 2025
Parameters: 2,620,000

Overview

LigandMPNN is a deep learning-based protein sequence design method developed by Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, and colleagues at the University of Washington Institute for Protein Design, published in Nature Methods in March 2025. It extends ProteinMPNN to explicitly represent all non-protein atoms — small molecules, nucleotides, and metals — at atomic resolution during sequence design, closing a fundamental gap in earlier methods that treated proteins as isolated chains and ignored the chemical context of bound cofactors, substrates, and ions.

The central innovation is a multi-graph architecture that jointly encodes protein backbone geometry and the three-dimensional arrangement of non-protein atoms, allowing the model to predict amino acid sequences that are chemically compatible with a specific ligand environment. On benchmarks against native PDB structures, LigandMPNN achieves native sequence recovery of 63.3% for small-molecule binding residues, compared to 50.5% for ProteinMPNN and 50.4% for Rosetta. For metal-coordinating residues the gap is larger: 77.5% versus 40.6% for ProteinMPNN and 36.0% for Rosetta. More than 100 designs have been experimentally characterized, with results confirming correct folding and intended binding activity.
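
To make the recovery metric concrete, the sketch below computes the fraction of positions where a designed residue matches the native one, optionally restricted to ligand-contacting positions. It is an illustration only: the function name, the toy sequences, and the binding-site indices are hypothetical and are not taken from the LigandMPNN code or its benchmark.

```python
def sequence_recovery(native, designed, positions=None):
    """Fraction of scored positions where designed == native.

    positions: optional iterable of 0-based indices (for example, residues
    near a ligand). If None, every position is scored.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    idx = list(range(len(native))) if positions is None else list(positions)
    if not idx:
        raise ValueError("no positions to score")
    matches = sum(1 for i in idx if native[i] == designed[i])
    return matches / len(idx)


# Toy input (made-up sequences, not benchmark data).
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
binding_site = [2, 3, 4, 5]  # hypothetical ligand-contacting indices

overall = sequence_recovery(native, designed)            # all 10 positions
site = sequence_recovery(native, designed, binding_site)  # binding site only
```

Restricting the score to binding-site positions matters because a design can recover most of the sequence overall while still missing the residues that contact the ligand, as in the toy case here.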

LigandMPNN fills a practical void in structure-based protein design pipelines: virtually all biochemically interesting proteins interact with non-protein partners, yet previous sequence design tools lacked the machinery to exploit that chemical context. The model is the standard sequence design step in Baker Lab all-atom design workflows, paired with RFdiffusion All-Atom for backbone generation and AlphaFold 2 or ESMFold for computational verification.

Key Features

  • Explicit non-protein atom modeling: Represents small molecules, nucleotides, and metal ions as atomic graph nodes, providing the decoder with direct chemical context during sequence prediction rather than treating the ligand as invisible.
  • Three-graph architecture: Separate encoders for the protein backbone, intraligand chemistry, and protein-ligand interactions are fused before autoregressive decoding, allowing each signal to be captured independently then combined.
  • Integrated side-chain packing: A companion neural network predicts all four side-chain torsion angles given a designed sequence and backbone, enabling immediate evaluation of binding geometry without a separate packing step.
  • Broad ligand generality: Handles arbitrary small molecules, nucleotide cofactors (DNA, RNA, NAD, ATP, and others), and transition metals within a single unified model.
  • Lightweight and fast: At 2.62 million parameters, the model designs a 100-residue protein in approximately 0.9 seconds on a single CPU, making high-throughput design campaigns practical.
  • Open-source: Weights and inference code are freely available and accept standard PDB files as input, with no specialized hardware required.
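
Since the model takes standard PDB files as input, it can help to see how protein and non-protein atoms are distinguished in that format. The sketch below splits fixed-column coordinate records into ATOM and HETATM groups. It is not the LigandMPNN parser: `pdb_line` and `split_atoms` are hypothetical helpers, the toy coordinates are invented, and nucleic acids are normally stored as ATOM records, so a real pipeline would also need residue-name checks.

```python
def pdb_line(record, serial, name, res, chain, resseq, x, y, z, elem):
    """Format one fixed-column PDB coordinate record (toy helper for the demo)."""
    return (f"{record:<6}{serial:>5} {name:<4} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {elem:>2}")


def split_atoms(pdb_text):
    """Return (atom_records, hetatm_records) parsed by PDB column position."""
    protein, hetero = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue  # skip headers, TER, END, and other record types
        atom = {
            "name": line[12:16].strip(),     # columns 13-16
            "res": line[17:20].strip(),      # columns 18-20
            "chain": line[21],               # column 22
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            "element": line[76:78].strip(),  # columns 77-78
        }
        (protein if record == "ATOM" else hetero).append(atom)
    return protein, hetero


# Toy structure: two backbone atoms of a histidine plus a zinc ion.
toy = "\n".join([
    pdb_line("ATOM", 1, " N", "HIS", "A", 1, 1.0, 2.0, 3.0, "N"),
    pdb_line("ATOM", 2, " CA", "HIS", "A", 1, 2.0, 2.0, 3.0, "C"),
    pdb_line("HETATM", 3, "ZN", "ZN", "A", 101, 4.0, 2.0, 3.0, "ZN"),
    "END",
])
protein_atoms, context_atoms = split_atoms(toy)
```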

Technical Details

LigandMPNN operates on three coupled graphs. The protein backbone graph represents residues as nodes, with edges between spatially neighboring residues encoding the 25 pairwise distances among their N, Cα, C, O, and virtual Cβ atoms, identical in form to ProteinMPNN. The intraligand graph represents ligand atoms as nodes with edges encoding pairwise distances and chemical element types, capturing internal ligand geometry and chemical identity. The protein-ligand interaction graph is a bipartite graph connecting residue nodes to ligand atom nodes, with edges encoding distance and relative orientation; this graph is the key addition over ProteinMPNN. A shared backbone encoder and a protein-ligand encoder produce fused residue-level representations, which an autoregressive decoder uses to sample amino acid identities one residue at a time.
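
The geometric inputs to two of these graphs can be sketched in a few lines. The example below assumes five atoms per residue (N, CA, C, O, and a virtual CB placed with the idealized coefficients used in the ProteinMPNN code), which yields 5 x 5 = 25 distances per residue pair, and it links each residue to its k nearest ligand atoms. The function names, the value of k, and the toy coordinates are hypothetical. Element types, orientation features, and the learned encoders are omitted.

```python
import math

BACKBONE = ("N", "CA", "C", "O", "CB")  # CB is a virtual atom in this sketch


def virtual_cb(n, ca, c):
    """Place an idealized C-beta from N, CA, and C coordinates."""
    b = [ca[i] - n[i] for i in range(3)]
    c_vec = [c[i] - ca[i] for i in range(3)]
    a = [b[1] * c_vec[2] - b[2] * c_vec[1],   # cross product b x c_vec
         b[2] * c_vec[0] - b[0] * c_vec[2],
         b[0] * c_vec[1] - b[1] * c_vec[0]]
    return tuple(-0.58273431 * a[i] + 0.56802827 * b[i]
                 - 0.54067466 * c_vec[i] + ca[i] for i in range(3))


def residue_atoms(n, ca, c, o):
    return {"N": n, "CA": ca, "C": c, "O": o, "CB": virtual_cb(n, ca, c)}


def backbone_edge_features(res_i, res_j):
    """All 25 inter-atomic distances between two residues."""
    return [math.dist(res_i[p], res_j[q]) for p in BACKBONE for q in BACKBONE]


def protein_ligand_edges(residues, ligand_atoms, k=2):
    """Bipartite edges (residue index, ligand atom index, distance)."""
    edges = []
    for r, res in enumerate(residues):
        ranked = sorted(range(len(ligand_atoms)),
                        key=lambda m: math.dist(res["CB"], ligand_atoms[m]["xyz"]))
        for m in ranked[:k]:
            edges.append((r, m, math.dist(res["CB"], ligand_atoms[m]["xyz"])))
    return edges


def shift(p, dx):
    return (p[0] + dx, p[1], p[2])


# Toy geometry: one idealized residue and a copy translated 3.8 A along x.
n, ca, c, o = (-0.525, 1.363, 0.0), (0.0, 0.0, 0.0), (1.526, 0.0, 0.0), (2.153, -1.062, 0.0)
res0 = residue_atoms(n, ca, c, o)
res1 = residue_atoms(*(shift(p, 3.8) for p in (n, ca, c, o)))
ligand = [{"element": "ZN", "xyz": (0.0, -3.0, -2.0)},
          {"element": "O", "xyz": (4.0, -3.0, -2.0)},
          {"element": "C", "xyz": (20.0, 0.0, 0.0)}]

feats = backbone_edge_features(res0, res1)
edges = protein_ligand_edges([res0, res1], ligand, k=2)
```

Ranking ligand atoms per residue, rather than using one global cutoff, keeps the number of protein-ligand edges fixed for every residue, which is convenient for batched tensors. Whether the released model does exactly this is not stated above, so treat the choice as an assumption of the sketch.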

The model was trained on protein assemblies from a December 2022 PDB snapshot filtered to X-ray and cryo-EM structures at 3.5 Å resolution or better, with fewer than 6,000 residues, and sequences clustered at 30% identity using MMseqs2. As data augmentation, the side-chain atoms of a randomly selected 2–4% of protein residues were supplied as additional ligand context during training, which increases the number and variety of atomic-context examples the model sees. Benchmark test sets comprised 317 protein-small molecule structures, 74 protein-nucleic acid structures, and 83 metal-coordinating structures held out from training. LigandMPNN outperforms ProteinMPNN and Rosetta across all three ligand classes, with the largest gains for metal coordination, where precise geometry and element-specific contacts are critical.
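
The filtering and augmentation steps described above can be sketched as follows. This is a hedged illustration, not the authors' training code: the entry fields (`method`, `resolution`, `n_residues`), the method labels, and both function names are hypothetical. The 2–4% is interpreted here as a fraction of residues, which is an assumption of the sketch, and the MMseqs2 clustering step is omitted.

```python
import random

MAX_RESOLUTION_A = 3.5   # keep structures at 3.5 A or better
MAX_RESIDUES = 6000      # keep entries under 6,000 residues
ALLOWED_METHODS = {"X-RAY DIFFRACTION", "ELECTRON MICROSCOPY"}  # assumed labels


def keep_entry(entry):
    """Apply the experimental-method, resolution, and size filters."""
    return (entry["method"] in ALLOWED_METHODS
            and entry["resolution"] <= MAX_RESOLUTION_A
            and entry["n_residues"] < MAX_RESIDUES)


def sample_sidechain_context(n_residues, rng, low=0.02, high=0.04):
    """Pick a random 2-4% of residue indices whose side-chain atoms
    would be passed to the model as extra ligand-like context."""
    fraction = rng.uniform(low, high)
    count = max(1, round(fraction * n_residues))
    return sorted(rng.sample(range(n_residues), count))


# Toy metadata (invented entries, not real PDB records).
entries = [
    {"id": "toy1", "method": "X-RAY DIFFRACTION", "resolution": 2.1, "n_residues": 350},
    {"id": "toy2", "method": "SOLUTION NMR", "resolution": 0.0, "n_residues": 120},
    {"id": "toy3", "method": "ELECTRON MICROSCOPY", "resolution": 4.2, "n_residues": 900},
    {"id": "toy4", "method": "ELECTRON MICROSCOPY", "resolution": 3.0, "n_residues": 7200},
]
kept = [e["id"] for e in entries if keep_entry(e)]

rng = random.Random(0)  # seeded so the demo is repeatable
context_residues = sample_sidechain_context(200, rng)
```

For a 200-residue protein this selects between 4 and 8 residues per training example, so the model regularly sees protein side chains presented the same way as ligand atoms.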

Applications

LigandMPNN is the method of choice whenever a design objective involves a non-protein component. Enzyme designers use it to engineer active-site residues around substrate or transition-state analog scaffolds, and it has been applied extensively within the Baker Lab for de novo enzyme creation. Small-molecule binder and biosensor projects use it in combination with RFdiffusion All-Atom, where RFdiffusion generates the backbone geometry and LigandMPNN designs the sequence to match. Researchers engineering nucleotide-binding proteins — sequence-specific DNA or RNA binders, NAD/FAD-binding domains, and nucleotide-gated sensors — benefit from its explicit treatment of nucleotide atoms. Metal-coordinating protein design, whether for catalysis, structural stabilization, or metal-ion biosensing, is supported natively. In all these contexts, LigandMPNN serves as a drop-in replacement for ProteinMPNN wherever a non-protein partner is present in the input structure.

Impact

LigandMPNN represents a meaningful advance in computational protein design by making ligand-aware sequence design accessible to the broader community in a fast, open-source package. Its publication in Nature Methods and experimental validation across more than 100 designs give it strong credibility, and it has been adopted as a standard component of Baker Lab all-atom design pipelines alongside RFdiffusion All-Atom. Key limitations include the fixed-backbone assumption — backbone coordinates must be supplied by a separate tool — and the fixed-ligand-pose assumption, which does not model induced-fit or conformational flexibility. The model also does not directly predict binding affinity; high sequence recovery near a ligand is a proxy for chemical compatibility rather than a thermodynamic quantity. Performance may be reduced for highly unusual chemistries underrepresented in PDB training data, and covalent ligand attachments are not explicitly modeled. Within these bounds, LigandMPNN has lowered the barrier to designing proteins that interact with the small-molecule and nucleotide partners that drive most of biology.

Citation

Atomic context-conditioned protein sequence design using LigandMPNN

Dauparas, J., et al. (2025) Atomic context-conditioned protein sequence design using LigandMPNN. Nature Methods.

DOI: 10.1038/s41592-025-02626-1

Metrics

GitHub

Stars: 567
Forks: 133
Open Issues: 45
Contributors: 3
Last Push: 1y ago
Language: Python
License: MIT

Citations

Total Citations: 183
Influential: 18
References: 50

Tags

enzyme design, ligand binding, protein design, sequence design, graph neural network

Resources

GitHub Repository
Research Paper