LigandMPNN

Protein sequence design model that represents small molecules, nucleotides, and metals at atomic resolution, enabling ligand-aware enzyme design.

Released: March 2025

Parameters: 2.6 Million

LigandMPNN is a deep learning-based protein sequence design method developed by Justas Dauparas, Gyu Rie Lee, Robert Pecoraro, and colleagues at the University of Washington Institute for Protein Design, published in Nature Methods in March 2025. It extends ProteinMPNN to explicitly represent all non-protein atoms — small molecules, nucleotides, and metals — at atomic resolution during sequence design, closing a fundamental gap in earlier methods that treated proteins as isolated chains and ignored the chemical context of bound cofactors, substrates, and ions.

The central innovation is a multi-graph architecture that jointly encodes protein backbone geometry and the three-dimensional arrangement of non-protein atoms, allowing the model to predict amino acid sequences that are chemically compatible with a specific ligand environment. On benchmarks against native PDB structures, LigandMPNN achieves native sequence recovery of 63.3% for small-molecule binding residues, compared to 50.5% for ProteinMPNN and 50.4% for Rosetta. For metal-coordinating residues the gap is larger: 77.5% versus 40.6% for ProteinMPNN and 36.0% for Rosetta. More than 100 designs have been experimentally characterized, with results confirming correct folding and intended binding activity.
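Native sequence recovery, the metric quoted above, is the fraction of positions at which the designed residue matches the native one, optionally restricted to a subset such as ligand-contacting residues. A minimal Python sketch follows; the function name, the toy sequences, and the position list are illustrative assumptions and are not taken from the LigandMPNN code base.

```python
def sequence_recovery(native, designed, positions=None):
    """Fraction of positions where the designed residue equals the native one.

    positions: optional list of 0-based indices (for example, residues near
    a ligand). If omitted, every position is scored.
    """
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    indices = list(range(len(native))) if positions is None else list(positions)
    if not indices:
        raise ValueError("no positions to score")
    matches = sum(1 for i in indices if native[i] == designed[i])
    return matches / len(indices)


# Toy example: hypothetical sequences, with indices 2-4 treated as the
# ligand-contacting residues.
native = "MKHLE"
designed = "MKALD"
overall = sequence_recovery(native, designed)
binding_site = sequence_recovery(native, designed, positions=[2, 3, 4])
```

Scoring only the binding-site subset is what makes the metric sensitive to ligand context: residues far from the ligand are recovered about equally well with or without ligand information.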

LigandMPNN fills a practical void in structure-based protein design pipelines: virtually all biochemically interesting proteins interact with non-protein partners, yet previous sequence design tools lacked the machinery to exploit that chemical context. The model is the standard sequence design step in Baker Lab all-atom design workflows, paired with RFdiffusion All-Atom for backbone generation and AlphaFold 2 or ESMFold for computational verification.

Key Features

Explicit non-protein atom modeling: Represents small molecules, nucleotides, and metal ions as atomic graph nodes, providing the decoder with direct chemical context during sequence prediction rather than treating the ligand as invisible.
Three-graph architecture: Separate encoders for the protein backbone, intraligand chemistry, and protein-ligand interactions are fused before autoregressive decoding, allowing each signal to be captured independently then combined.
Integrated side-chain packing: A companion neural network predicts all four side-chain torsion angles given a designed sequence and backbone, enabling immediate evaluation of binding geometry without a separate packing step.
Broad ligand generality: Handles arbitrary small molecules, nucleic acids and nucleotide cofactors (DNA, RNA, NAD, ATP, and others), and metal ions within a single unified model.
Lightweight and fast: At 2.62 million parameters, the model designs a 100-residue protein in approximately 0.9 seconds on a single CPU, making high-throughput design campaigns practical.
Open-source: Weights and inference code are freely available and accept standard PDB files as input, with no specialized hardware required.
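
The side-chain torsion (chi) angles mentioned in the feature list above are dihedral angles over four consecutive atoms, for example N, CA, CB, CG for chi1. The sketch below only shows how such an angle is measured from coordinates. It is not the packing network itself, and the function name and toy coordinates are assumptions made for illustration.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not from a real structure): a trans and a gauche arrangement.
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
gauche = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Comparing predicted chi angles with those of a reference structure in this way is one simple check on whether a packed side chain points toward the ligand as intended.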

Technical Details

LigandMPNN operates on three coupled graphs. The protein backbone graph represents residues as nodes, with edges to each residue's nearest neighbors encoding the 25 pairwise distances between the two residues' backbone atoms (N, Cα, C, O, and a virtual Cβ), identical in form to ProteinMPNN. The intraligand graph represents ligand atoms as nodes with edges encoding pairwise distances and chemical element types, capturing internal ligand geometry and chemical identity. The protein-ligand interaction graph is a bipartite graph connecting residue nodes to ligand atom nodes, with edges encoding distance and relative orientation; this graph is the key addition over ProteinMPNN. A shared backbone encoder and a protein-ligand encoder produce fused residue-level representations, which an autoregressive decoder uses to sample amino acid identities one residue at a time.
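
The edge features described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the model's featurization code: it assumes the 25 distances come from five atoms per residue (N, CA, C, O, and a CB position), measures protein-ligand distances from CA, and uses a hypothetical neighbor count `k`. All function names and coordinates are made up for the example.

```python
import math

# Assumption: five atoms per residue, so each residue pair has 5 x 5 = 25 distances.
BACKBONE_ATOMS = ("N", "CA", "C", "O", "CB")


def backbone_edge_distances(res_i, res_j):
    """Return the 25 inter-atomic distances between two residues."""
    return [
        math.dist(res_i[a], res_j[b])
        for a in BACKBONE_ATOMS
        for b in BACKBONE_ATOMS
    ]


def protein_ligand_edges(residues, ligand_xyz, k=2):
    """Connect each residue to its k closest ligand atoms (bipartite edges).

    Returns (residue_index, ligand_index, distance) tuples. Measuring from CA
    and the value of k are illustrative assumptions.
    """
    edges = []
    for r, res in enumerate(residues):
        ranked = sorted(
            (math.dist(res["CA"], xyz), l) for l, xyz in enumerate(ligand_xyz)
        )
        edges.extend((r, l, d) for d, l in ranked[:k])
    return edges


def shifted(residue, dx):
    """Copy a residue, translated by dx along the x axis."""
    return {name: (x + dx, y, z) for name, (x, y, z) in residue.items()}


# Toy geometry: two identical residues 4 Å apart and three ligand atoms.
res_a = {"N": (0.0, 1.4, 0.0), "CA": (0.0, 0.0, 0.0), "C": (1.5, 0.0, 0.0),
         "O": (2.1, 1.0, 0.0), "CB": (-0.5, -0.8, 1.2)}
res_b = shifted(res_a, 4.0)
ligand_xyz = [(0.0, 0.0, 3.0), (4.0, 0.0, 2.0), (10.0, 0.0, 0.0)]

features = backbone_edge_distances(res_a, res_b)
edges = protein_ligand_edges([res_a, res_b], ligand_xyz, k=1)
```

A real featurizer would typically expand each distance into a smooth encoding and attach element types to ligand nodes; the sketch stops at raw distances to keep the graph structure visible.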

The model was trained on protein assemblies from a December 2022 PDB snapshot filtered to X-ray and cryo-EM structures at 3.5 Å resolution or better, with chains under 6,000 residues, and sequences clustered at 30% identity using MMseqs2. A data augmentation strategy randomly provided 2–4% of protein side-chain atoms as additional ligand context during training, improving generalization to novel chemistries. Benchmark test sets comprised 317 protein-small molecule structures, 74 protein-nucleic acid structures, and 83 metal-coordinating structures held out from training. LigandMPNN outperforms ProteinMPNN and Rosetta across all three ligand classes, with the largest gains for metal coordination where precise geometry and element-specific contacts are critical.
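
The side-chain augmentation described above can be pictured as follows. This is a hedged sketch rather than the training code: it assumes the 2–4% fraction is drawn per example and applied at the residue level, and the function names, the `sidechains` layout, and the atom labels are hypothetical.

```python
import random


def pick_context_residues(n_residues, rng, low=0.02, high=0.04):
    """Choose a random 2-4% of residues whose side chains become context."""
    fraction = rng.uniform(low, high)
    n_pick = round(fraction * n_residues)
    return sorted(rng.sample(range(n_residues), n_pick))


def gather_context_atoms(sidechains, context_residues):
    """Collect the side-chain atoms of the chosen residues.

    sidechains maps residue index -> list of atom records. The returned list
    would be appended to the ligand atoms shown to the model.
    """
    return [atom for i in context_residues for atom in sidechains.get(i, [])]


rng = random.Random(0)  # fixed seed so the example is repeatable
sidechains = {i: [f"res{i}:CB"] for i in range(500)}  # toy one-atom side chains
picked = pick_context_residues(500, rng)
context_atoms = gather_context_atoms(sidechains, picked)
```

Because the revealed side chains are presented in the same way as ligand atoms, the model sees a wider range of local chemistry during training than the ligands in the PDB alone would provide.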

Applications

LigandMPNN is well suited to any design objective that involves a non-protein component. Enzyme designers use it to engineer active-site residues around substrates or transition-state analogs, and it has been applied extensively within the Baker Lab for de novo enzyme creation. Small-molecule binder and biosensor projects combine it with RFdiffusion All-Atom, which generates the backbone geometry while LigandMPNN designs a matching sequence. Researchers engineering nucleotide-binding proteins (sequence-specific DNA or RNA binders, NAD/FAD-binding domains, and nucleotide-gated sensors) benefit from its explicit treatment of nucleotide atoms. Metal-coordinating protein design, whether for catalysis, structural stabilization, or metal-ion biosensing, is supported natively. In all these contexts, LigandMPNN serves as a drop-in replacement for ProteinMPNN wherever a non-protein partner is present in the input structure.

Impact

LigandMPNN represents a meaningful advance in computational protein design by making ligand-aware sequence design accessible to the broader community in a fast, open-source package. Its publication in Nature Methods and experimental validation across more than 100 designs give it strong credibility, and it has been adopted as a standard component of Baker Lab all-atom design pipelines alongside RFdiffusion All-Atom. Key limitations include the fixed-backbone assumption — backbone coordinates must be supplied by a separate tool — and the fixed-ligand-pose assumption, which does not model induced-fit or conformational flexibility. The model also does not directly predict binding affinity; high sequence recovery near a ligand is a proxy for chemical compatibility rather than a thermodynamic quantity. Performance may be reduced for highly unusual chemistries underrepresented in PDB training data, and covalent ligand attachments are not explicitly modeled. Within these bounds, LigandMPNN has lowered the barrier to designing proteins that interact with the small-molecule and nucleotide partners that drive most of biology.

Citation

Dauparas, J., et al. (2025) Atomic context-conditioned protein sequence design using LigandMPNN. Nature Methods.

DOI: 10.1038/s41592-025-02626-1

Citations

Total citations: 232
Influential citations: 20
References: 49

GitHub

Stars: 608
Forks: 145
Open issues: 48
Contributors: 3
Last push: 1y ago
Language: Python
License: MIT

Fields of citing research

Biology: 75%
Computer Science: 68%
Medicine: 62%
Chemistry: 45%
Engineering: 20%
Materials Science: 6%
Environmental Science: 6%
Agricultural and Food Sciences: 2%

Share of papers citing this model.

Openness

bio.rodeo openness: 66 (Partial); open weights, closed recipe
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 22
Model Openness Framework: Class III (Open Model)
