EnzyGen2

Protein foundation model for de novo enzyme design that co-designs sequence and 3D structure under small-molecule ligand guidance, at 730M parameters.

Released: March 2026

Parameters: 730 Million

EnzyGen2 is a generative protein foundation model that addresses one of the central challenges in computational enzyme engineering: designing both the amino acid sequence and the three-dimensional structure of a functional enzyme in a single, coordinated step, conditioned on the small-molecule ligand the enzyme is meant to act on. Most de novo design pipelines split this problem in two — a structure-generation model (e.g., RFdiffusion) proposes a backbone, and a separate inverse-folding model (e.g., ProteinMPNN or LigandMPNN) assigns a sequence to it. EnzyGen2 instead jointly models sequence and structure, allowing the two to be optimized together under explicit ligand-guided functional targeting.

Developed by Zhenqiao Song, Huichong Liu, Yunlong Zhao, Yang Yang, and Lei Li at Carnegie Mellon University and released as a bioRxiv preprint in March 2026, EnzyGen2 is the successor to EnzyGen (ICML 2024), which introduced unified sequence-and-structure enzyme generation guided by functionally important sites and substrates. EnzyGen2 scales this idea into a 730M-parameter foundation model with a multi-task training regime and demonstrates that its designs are not just plausible in silico but catalytically active in the wet lab.

The model sits at the intersection of protein design and small-molecule recognition, making it relevant to enzyme engineering, biocatalysis, and synthetic biology workflows where a target reaction or substrate is known but a suitable catalyst is not.

Key Features

Joint sequence-structure co-design: Rather than chaining a backbone generator to a separate inverse-folding step, EnzyGen2 generates amino acid sequence and 3D coordinates simultaneously, so the two are mutually consistent and jointly conditioned on function.
Ligand-guided functional targeting: Designs are conditioned on the small-molecule substrate/ligand, steering generation toward enzymes capable of acting on a specified chemical target.
Multi-task training objectives: Training combines sequence prediction, structure reconstruction, and protein-ligand interaction prediction, encouraging representations that capture both fold and binding.
Large-scale speedup: The authors report generating samples roughly 400x faster than prior diffusion-plus-inverse-folding pipelines, lowering the cost of large design campaigns.
Experimentally validated activity: Family-specific EnzyGen2 designs were synthesized and assayed, with measured catalytic activity comparable to or exceeding natural enzymes while remaining sequence-divergent (identities as low as 51.6%).
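The sequence-divergence figure above (identities as low as 51.6%) is a percent identity between a designed enzyme and a natural one. The source does not say which alignment tool or denominator convention the authors used, so the following is only a minimal sketch of one common convention: count identical residues over the aligned columns where neither sequence has a gap. The function name `percent_identity` and the two toy sequences are hypothetical and are not taken from the paper or the EnzyGen2 codebase.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    Assumes the two strings are already aligned (equal length, gaps as '-').
    The denominator convention is an assumption; other conventions divide by
    the full alignment length or by the shorter sequence length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy, made-up aligned sequences (not real enzymes).
design = "MKT-LLVAGG"
natural = "MKSALLIA-G"
identity = percent_identity(design, natural)
print(f"{identity:.1f}% identity")
```

In this toy pair, 8 columns are gap-free and 6 of them match, giving 75.0%. Under the same convention, a reported 51.6% would mean roughly half of the compared positions differ from the natural enzyme.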

Technical Details

EnzyGen2 is a 730M-parameter model trained on 720,993 protein-ligand pairs using a three-stage pretraining curriculum: masked sequence/structure modeling (20% of residues masked), motif-conditioned training, and finally a full objective that adds protein-ligand interaction prediction on top of sequence and structure losses. After pretraining, the model is fine-tuned on specific enzyme families for targeted design.

In silico, the authors report that EnzyGen2 outperforms three generations of the dominant two-stage baseline (RFdiffusion/ProteinMPNN, RFdiffusion2/LigandMPNN, and RFdiffusion3/LigandMPNN) while being substantially faster.

The strongest evidence is experimental: designs for chloramphenicol acetyltransferase (ChlR), aminoglycoside adenylyltransferase (AadA), and thiopurine S-methyltransferase (TPMT) were expressed and shown to be catalytically active, with sequence identities to natural enzymes as low as 51.6%, indicating genuine novelty rather than memorization of training proteins.
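To make the curriculum described above more concrete, here is a rough sketch of two of its ingredients: sampling a 20% residue mask, and combining per-task losses by stage. Only the 20% mask fraction, the three stage descriptions, and the fact that the final stage adds an interaction loss to the sequence and structure losses come from the text. Everything else is an assumption for illustration: the names `sample_mask`, `STAGES`, and `stage_loss`, the uniform random masking, the equal default loss weights, and the choice of losses in the motif-conditioned stage. This is not the authors' implementation.

```python
import random

MASK_FRACTION = 0.20  # from the text: 20% of residues are masked


def sample_mask(length, fraction=MASK_FRACTION, rng=None):
    """Return sorted residue indices to mask (uniform sampling is an assumption)."""
    rng = rng or random.Random()
    n_masked = max(1, round(length * fraction))
    return sorted(rng.sample(range(length), n_masked))


# Hypothetical mapping from curriculum stage to the loss terms it uses.
# The text only states that the final stage adds interaction prediction on top
# of sequence and structure losses; the middle entry is an assumption.
STAGES = {
    "masked_modeling": ("sequence", "structure"),
    "motif_conditioned": ("sequence", "structure"),
    "full": ("sequence", "structure", "interaction"),
}


def stage_loss(stage, losses, weights=None):
    """Weighted sum of the loss terms active in a stage (weights default to 1.0)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * losses[name] for name in STAGES[stage])


# Usage with made-up loss values.
mask = sample_mask(100, rng=random.Random(0))
example_losses = {"sequence": 1.0, "structure": 2.0, "interaction": 0.5}
print(len(mask), stage_loss("masked_modeling", example_losses),
      stage_loss("full", example_losses))
```

With these made-up values, a 100-residue protein gets 20 masked positions, the first stage sums to 3.0, and the full objective sums to 3.5 because the interaction term is included.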

Applications

EnzyGen2 is aimed at enzyme engineers, biocatalysis researchers, and synthetic biologists who need a catalyst for a defined substrate or reaction. Because generation is conditioned on a target ligand and produces both sequence and structure, it can propose candidate enzymes for downstream expression and assay without requiring a pre-existing natural scaffold. The released codebase supports both pretrained, family-agnostic generation and family-specific fine-tuning, making it usable for exploratory design across enzyme classes as well as focused optimization within a single family such as acetyltransferases or methyltransferases.

Impact

EnzyGen2 is notable for closing the loop from generative model to demonstrated wet-lab catalytic activity — a bar that many de novo enzyme design methods have not cleared — while arguing that a single jointly trained model can match or beat the widely used diffusion-plus-inverse-folding stack at a fraction of the computational cost. As a preprint, its claims await peer review and independent reproduction, and validation so far covers a small number of enzyme families. The authors release training, fine-tuning, and evaluation data on Zenodo and code with pretrained and fine-tuned checkpoints (distributed via Google Drive) under an MIT license; the repository README documents usage but does not constitute a formal model card, and no Hugging Face deployment was available at release.

Citation

Co-designing sequence and structure of functional de novo enzymes with EnzyGen2

Song, Z., et al. (2026) Co-designing sequence and structure of functional de novo enzymes with EnzyGen2. bioRxiv.

DOI: 10.64898/2026.03.02.709205

Citations

Total Citations: 0

Influential: 0

References: 80

GitHub

Stars: 30

Forks: 5

Open Issues: 0

Contributors: 1

Last Push: 3mo ago

Language: Python

License: MIT

Openness

bio.rodeo openness: Fully open · usable and reproducible

Overall: 89 (Open)

Usability (can I run it?): 95

Reproducibility (can I retrain it?): 87

Model Openness Framework

Unclassified

Missing required components

Resources

GitHub Repository, Research Paper, Dataset
