Family of autoregressive protein language models (151M–6.4B parameters) trained on over a billion sequences for protein generation and zero-shot fitness prediction.
ProGen2 is a family of autoregressive protein language models developed by Salesforce Research and published in Cell Systems in October 2023. The work systematically scales transformer-based protein language models from 151 million to 6.4 billion parameters, training each variant on different combinations of sequence databases drawn from over one billion proteins sourced from genomic, metagenomic, and immune repertoire repositories. The central aim is to characterize how model scale and training data composition jointly affect the two most practically important capabilities of a protein language model: generating novel, plausible sequences and predicting the fitness effects of mutations without task-specific fine-tuning.
ProGen2 is the successor to the original ProGen model, which demonstrated that autoregressive language models could generate functional lysozyme sequences with catalytic efficiencies comparable to natural proteins, even at sequence identities as low as 31.4% relative to known proteins. ProGen2 expands on that foundation by conducting a rigorous scaling study across five model sizes, enabling direct comparisons with masked language models such as the ESM family and revealing a non-monotonic relationship between model scale and downstream task performance.
The models and code are released openly on GitHub, making the full ProGen2 suite accessible to the research community for protein engineering, fitness landscape exploration, and generative protein design.
ProGen2 models are decoder-only transformers trained with a causal (next-token prediction) language modeling objective over amino acid sequences. The smallest variant, ProGen2-small (151M parameters, 12 layers), and the largest, ProGen2-xlarge (6.4B parameters, 32 layers), span the range explored in the study. Context lengths range from 1,024 to 2,048 tokens, accommodating the majority of known protein sequences. The training corpus combines UniRef90 — representative sequences from UniProtKB clustered at 90% sequence identity — with BFD30, a metagenomic database clustered at 30% identity that adds substantial structural and functional diversity beyond what is captured by curated sequence databases alone.
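The causal objective amounts to scoring each residue against the model's prediction from the residues preceding it; the per-token perplexity tracked in the scaling analysis follows directly from that score. A minimal sketch of the computation, using a toy stand-in for the transformer (the function `toy_next_token_probs` and the helper names are illustrative, not the ProGen2 API):

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter alphabet; the real model adds special tokens

def toy_next_token_probs(prefix):
    """Stand-in for a trained model: a probability distribution over the
    amino-acid alphabet given the prefix. A real ProGen2 forward pass would
    produce these probabilities from transformer logits."""
    rng = random.Random(hash(prefix) % (2**32))
    weights = [rng.random() + 0.05 for _ in AMINO_ACIDS]
    total = sum(weights)
    return {aa: w / total for aa, w in zip(AMINO_ACIDS, weights)}

def sequence_log_likelihood(seq):
    """Causal (left-to-right) log-likelihood: sum over t of log p(x_t | x_<t)."""
    ll = 0.0
    for t, aa in enumerate(seq):
        probs = toy_next_token_probs(seq[:t])
        ll += math.log(probs[aa])
    return ll

def perplexity(seq):
    """Per-token perplexity: exp of the average negative log-likelihood."""
    return math.exp(-sequence_log_likelihood(seq) / len(seq))

seq = "MKTAYIAKQR"
print(f"log-likelihood: {sequence_log_likelihood(seq):.3f}")
print(f"perplexity:     {perplexity(seq):.3f}")
```

The same per-sequence log-likelihood doubles as the zero-shot fitness score discussed below, which is why improvements in perplexity and in fitness prediction might be expected to move together.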
A key finding from the scaling analysis is the dissociation between perplexity and fitness prediction accuracy. Perplexity on held-out sequence sets decreases steadily as parameter count grows from 151M to 6.4B, consistent with standard language model scaling behavior. However, zero-shot fitness prediction performance (measured as the Spearman correlation between model log-likelihood scores and experimentally measured fitness values) peaks at 764M parameters and declines for the 2.7B and 6.4B variants. This suggests that very large autoregressive models may overfit to the frequency distribution of evolutionary sequences in ways that impair their ability to discriminate between closely related functional variants. The authors also benchmarked ProGen2 against masked language models, finding that autoregressive models tend to perform favorably in zero-shot fitness prediction, consistent with results from the ProteinGym benchmark.
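The evaluation protocol described above — rank-correlating model scores against assay measurements — can be sketched in a few lines. The variant scores and fitness values below are invented for illustration; a real evaluation would pair ProGen2 log-likelihoods with deep mutational scanning data:

```python
def ranks(values):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical variant log-likelihoods vs. measured fitness (illustrative values)
model_scores     = [-42.1, -44.8, -39.5, -47.2, -41.0]
measured_fitness = [0.8, 0.4, 1.1, 0.1, 0.9]
print(f"Spearman rho = {spearman(model_scores, measured_fitness):.3f}")  # → 1.000
```

Because Spearman correlation depends only on ranks, the score need not be calibrated to the assay's units — the model only has to order variants correctly, which is exactly the property the scaling analysis found does not improve monotonically with size.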
ProGen2 is applicable across the full spectrum of protein engineering workflows. Researchers can use the autoregressive generation capability to sample novel sequences from any protein family of interest, either unconditionally or conditioned on partial sequences representing known functional motifs. The zero-shot fitness scoring capability allows rapid in silico screening of variant libraries — for instance, prioritizing mutations for experimental testing without requiring labeled training data. The structural plausibility of ProGen2-generated sequences, validated via AlphaFold2 predictions, makes the model suitable as a first-pass generative tool for enzyme design, antibody engineering, and the exploration of sequence space in understudied protein families. The availability of multiple model sizes also makes ProGen2 practical in resource-constrained environments where the smallest variants can run on a single GPU.
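The conditional generation workflow — seeding the model with a partial sequence and sampling residues left to right — can be illustrated with a small sketch. The toy next-token distribution stands in for a real ProGen2 forward pass, and all names here are illustrative rather than the released API:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_next_token_probs(prefix):
    """Stand-in for a model forward pass: a deterministic toy distribution
    over the next residue given the prefix."""
    seed = len(prefix) * 31 + sum(map(ord, prefix[-3:] or "M"))
    rng = random.Random(seed)
    weights = [rng.random() + 0.05 for _ in AMINO_ACIDS]
    total = sum(weights)
    return [w / total for w in weights]

def sample_sequence(prompt="", length=30, temperature=0.8, seed=0):
    """Autoregressive sampling: repeatedly draw the next residue from the
    (temperature-scaled) next-token distribution, optionally conditioning
    on a prompt such as a known functional motif."""
    rng = random.Random(seed)
    seq = prompt
    while len(seq) < length:
        probs = toy_next_token_probs(seq)
        # Temperature < 1 sharpens the distribution toward high-probability residues
        scaled = [p ** (1.0 / temperature) for p in probs]
        total = sum(scaled)
        seq += rng.choices(AMINO_ACIDS, weights=[p / total for p in scaled], k=1)[0]
    return seq

# Unconditional sample vs. a sample conditioned on a hypothetical motif "MKT"
print(sample_sequence(length=25, seed=1))
print(sample_sequence(prompt="MKT", length=25, seed=1))
```

Lowering the temperature trades diversity for plausibility, a standard knob when generating candidate libraries for downstream filtering by the zero-shot fitness score.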
ProGen2 is one of the most systematic scaling studies conducted for protein language models to date and established several results that have influenced subsequent model development. The counterintuitive finding that fitness prediction performance peaks below the largest model size has been widely cited in discussions of how to select and evaluate protein foundation models. The work helped define the experimental framework that benchmarks such as ProteinGym subsequently operationalized at larger scale. The open release of all five model weights through the Salesforce GitHub repository has made ProGen2 a widely used baseline and generative backbone in the protein design community. A practical limitation of the model family is that, as pure sequence models, ProGen2 variants operate without explicit structural information and cannot directly incorporate three-dimensional constraints into generation or scoring — a gap addressed by subsequent hybrid sequence-structure models.