A 730M-parameter protein foundation model that co-designs enzyme sequence and 3D structure under small-molecule ligand guidance for de novo enzyme design.
EnzyGen2 is a generative protein foundation model that addresses one of the central challenges in computational enzyme engineering: designing both the amino acid sequence and the three-dimensional structure of a functional enzyme in a single, coordinated step, conditioned on the small-molecule ligand the enzyme is meant to act on. Most de novo design pipelines split this problem in two — a structure-generation model (e.g., RFdiffusion) proposes a backbone, and a separate inverse-folding model (e.g., ProteinMPNN or LigandMPNN) assigns a sequence to it. EnzyGen2 instead jointly models sequence and structure, allowing the two to be optimized together under explicit ligand-guided functional targeting.
Developed by Zhenqiao Song, Huichong Liu, Yunlong Zhao, Yang Yang, and Lei Li at Carnegie Mellon University and released as a bioRxiv preprint in March 2026, EnzyGen2 is the successor to EnzyGen (ICML 2024), which introduced unified sequence-and-structure enzyme generation guided by functionally important sites and substrates. EnzyGen2 scales this idea into a 730M-parameter foundation model with a multi-task training regime and demonstrates that its designs are not just plausible in silico but catalytically active in the wet lab.
The model sits at the intersection of protein design and small-molecule recognition, making it relevant to enzyme engineering, biocatalysis, and synthetic biology workflows where a target reaction or substrate is known but a suitable catalyst is not.
EnzyGen2 is trained on 720,993 protein-ligand pairs using a three-stage pretraining curriculum: (1) masked sequence/structure modeling, with 20% of residues masked; (2) motif-conditioned training; and (3) a full objective that adds protein-ligand interaction prediction on top of the sequence and structure losses. After pretraining, the model is fine-tuned on specific enzyme families for targeted design. In silico, the authors report that EnzyGen2 outperforms three generations of the dominant two-stage baseline (RFdiffusion/ProteinMPNN, RFdiffusion2/LigandMPNN, and RFdiffusion3/LigandMPNN) while being substantially faster. The strongest evidence is experimental: designs for chloramphenicol acetyltransferase (ChlR), aminoglycoside adenylyltransferase (AadA), and thiopurine S-methyltransferase (TPMT) were expressed and shown to be catalytically active, with sequence identities to natural enzymes as low as 51.6%, which suggests that the model generates novel proteins rather than reproducing training examples.
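To make the first pretraining stage concrete, the following is a minimal sketch of masking 20% of residues in both the sequence and the coordinates. It is an illustration only, not the authors' implementation: the function name `mask_residues`, the `MASK_TOKEN` placeholder, uniform random selection of positions, and the use of `None` for hidden coordinates are all assumptions made for this example.

```python
import random

# Hypothetical placeholder; the token actually used by EnzyGen2 is not stated here.
MASK_TOKEN = "<mask>"


def mask_residues(sequence, coords, mask_fraction=0.2, seed=None):
    """Hide a fraction of residues in both the sequence and the coordinates.

    sequence: string of one-letter amino acid codes
    coords:   list of (x, y, z) tuples, one per residue
    Returns (masked_sequence, masked_coords, masked_positions).
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")

    rng = random.Random(seed)
    n_masked = round(len(sequence) * mask_fraction)
    # Assumption: positions are drawn uniformly at random without replacement.
    masked_positions = sorted(rng.sample(range(len(sequence)), n_masked))
    hidden = set(masked_positions)

    # The same positions are hidden in both modalities, so a model would have
    # to recover residue identity and location together.
    masked_sequence = [MASK_TOKEN if i in hidden else aa
                       for i, aa in enumerate(sequence)]
    masked_coords = [None if i in hidden else xyz
                     for i, xyz in enumerate(coords)]
    return masked_sequence, masked_coords, masked_positions


# Toy input: a 20-residue sequence with made-up coordinates along the x axis.
sequence = "MKTAYIAKQRQISFVKSHFS"
coords = [(float(i), 0.0, 0.0) for i in range(len(sequence))]

masked_seq, masked_xyz, positions = mask_residues(sequence, coords, seed=0)
print(len(positions), "of", len(sequence), "residues masked")
```

In a training loop, the masked positions would serve as the prediction targets for the sequence and structure losses. How the actual model selects positions and encodes hidden coordinates may differ from this sketch.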
EnzyGen2 is aimed at enzyme engineers, biocatalysis researchers, and synthetic biologists who need a catalyst for a defined substrate or reaction. Because generation is conditioned on a target ligand and produces both sequence and structure, it can propose candidate enzymes for downstream expression and assay without requiring a pre-existing natural scaffold. The released codebase supports both pretrained, family-agnostic generation and family-specific fine-tuning, making it usable for exploratory design across enzyme classes as well as focused optimization within a single family such as acetyltransferases or methyltransferases.
EnzyGen2 is notable for closing the loop from generative model to demonstrated wet-lab catalytic activity, a bar that many de novo enzyme design methods have not cleared. The authors also argue that a single jointly trained model can match or beat the widely used diffusion-plus-inverse-folding stack at a fraction of the computational cost. As a preprint, its claims await peer review and independent reproduction, and validation so far covers a small number of enzyme families. The authors release training, fine-tuning, and evaluation data on Zenodo and code with pretrained and fine-tuned checkpoints (distributed via Google Drive) under an MIT license; the repository README documents usage but does not constitute a formal model card, and no Hugging Face deployment was available at release.