Family of autoregressive protein language models (151M–6.4B parameters) trained on over a billion sequences for protein generation and zero-shot fitness prediction.
ProGen2 is a family of autoregressive protein language models developed by Salesforce Research and published in Cell Systems in October 2023. The work systematically scales transformer-based protein language models from 151 million to 6.4 billion parameters, training each variant on different combinations of sequence databases drawn from over one billion proteins sourced from genomic, metagenomic, and immune repertoire repositories. The central aim is to characterize how model scale and training data composition jointly affect the two most practically important capabilities of a protein language model: generating novel, plausible sequences and predicting the fitness effects of mutations without task-specific fine-tuning.
ProGen2 is the successor to the original ProGen model, which demonstrated that autoregressive language models could generate functional lysozyme sequences with catalytic efficiencies comparable to natural proteins, even at sequence identities as low as 31.4% relative to known proteins. ProGen2 expands on that foundation by conducting a rigorous scaling study across five model sizes, enabling direct comparisons with masked language models such as the ESM family and revealing a non-monotonic relationship between model scale and downstream task performance.
The models and code are released openly on GitHub, making the full ProGen2 suite accessible to the research community for protein engineering, fitness landscape exploration, and generative protein design.
ProGen2 models are decoder-only transformers trained with a causal (next-token prediction) language modeling objective over amino acid sequences. The smallest variant, ProGen2-small (151M parameters, 12 layers), and the largest, ProGen2-xlarge (6.4B parameters, 32 layers), span the range explored in the study. Context lengths range from 1,024 to 2,048 tokens, accommodating the majority of known protein sequences. The training corpus combines UniRef90 — representative sequences from UniProtKB clustered at 90% sequence identity — with BFD30, a metagenomic database clustered at 30% identity that adds substantial structural and functional diversity beyond what is captured by curated sequence databases alone.
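The causal objective amounts to scoring each residue against the model's prediction from the residues preceding it; the per-token perplexity tracked in the scaling analysis follows directly from that score. A minimal sketch of the computation, using a toy stand-in for the transformer (the function `toy_next_token_probs` and the helper names are illustrative, not the ProGen2 API):

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter alphabet; the real model adds special tokens

def toy_next_token_probs(prefix):
    """Stand-in for a trained model: a probability distribution over the
    amino-acid alphabet given the prefix. A real ProGen2 forward pass would
    produce these probabilities from transformer logits."""
    rng = random.Random(hash(prefix) % (2**32))
    weights = [rng.random() + 0.05 for _ in AMINO_ACIDS]
    total = sum(weights)
    return {aa: w / total for aa, w in zip(AMINO_ACIDS, weights)}

def sequence_log_likelihood(seq):
    """Causal (left-to-right) log-likelihood: sum over t of log p(x_t | x_<t)."""
    ll = 0.0
    for t, aa in enumerate(seq):
        probs = toy_next_token_probs(seq[:t])
        ll += math.log(probs[aa])
    return ll

def perplexity(seq):
    """Per-token perplexity: exp of the average negative log-likelihood."""
    return math.exp(-sequence_log_likelihood(seq) / len(seq))

seq = "MKTAYIAKQR"
print(f"log-likelihood: {sequence_log_likelihood(seq):.3f}")
print(f"perplexity:     {perplexity(seq):.3f}")
```

The same per-sequence log-likelihood doubles as the zero-shot fitness score discussed below, which is why improvements in perplexity and in fitness prediction might be expected to move together.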
A key finding from the scaling analysis is the dissociation between perplexity and fitness prediction accuracy. Perplexity on held-out sequence sets decreases steadily as parameter count grows from 151M to 6.4B, consistent with standard language model scaling behavior. However, zero-shot fitness prediction performance (measured as the Spearman correlation between model log-likelihood scores and experimentally measured fitness values) peaks at 764M parameters and declines for the 2.7B and 6.4B variants. This suggests that very large autoregressive models may overfit to the frequency distribution of evolutionary sequences in ways that impair their ability to discriminate between closely related functional variants. The authors also benchmarked ProGen2 against masked language models, finding that autoregressive models tend to perform favorably in zero-shot fitness prediction, consistent with results from the ProteinGym benchmark.
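The evaluation protocol described above — rank-correlating model scores against assay measurements — can be sketched in a few lines. The variant scores and fitness values below are invented for illustration; a real evaluation would pair ProGen2 log-likelihoods with deep mutational scanning data:

```python
def ranks(values):
    """1-based average ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical variant log-likelihoods vs. measured fitness (illustrative values)
model_scores     = [-42.1, -44.8, -39.5, -47.2, -41.0]
measured_fitness = [0.8, 0.4, 1.1, 0.1, 0.9]
print(f"Spearman rho = {spearman(model_scores, measured_fitness):.3f}")  # → 1.000
```

Because Spearman correlation depends only on ranks, the score need not be calibrated to the assay's units — the model only has to order variants correctly, which is exactly the property the scaling analysis found does not improve monotonically with size.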
ProGen2 is applicable across the full spectrum of protein engineering workflows. Researchers can use the autoregressive generation capability to sample novel sequences from any protein family of interest, either unconditionally or conditioned on partial sequences representing known functional motifs. The zero-shot fitness scoring capability allows rapid in silico screening of variant libraries — for instance, prioritizing mutations for experimental testing without requiring labeled training data. The structural plausibility of ProGen2-generated sequences, validated via AlphaFold2 predictions, makes the model suitable as a first-pass generative tool for enzyme design, antibody engineering, and the exploration of sequence space in understudied protein families. The availability of multiple model sizes also makes ProGen2 practical in resource-constrained environments where the smallest variants can run on a single GPU.
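The conditional generation workflow — seeding the model with a partial sequence and sampling residues left to right — can be illustrated with a small sketch. The toy next-token distribution stands in for a real ProGen2 forward pass, and all names here are illustrative rather than the released API:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_next_token_probs(prefix):
    """Stand-in for a model forward pass: a deterministic toy distribution
    over the next residue given the prefix."""
    seed = len(prefix) * 31 + sum(map(ord, prefix[-3:] or "M"))
    rng = random.Random(seed)
    weights = [rng.random() + 0.05 for _ in AMINO_ACIDS]
    total = sum(weights)
    return [w / total for w in weights]

def sample_sequence(prompt="", length=30, temperature=0.8, seed=0):
    """Autoregressive sampling: repeatedly draw the next residue from the
    (temperature-scaled) next-token distribution, optionally conditioning
    on a prompt such as a known functional motif."""
    rng = random.Random(seed)
    seq = prompt
    while len(seq) < length:
        probs = toy_next_token_probs(seq)
        # Temperature < 1 sharpens the distribution toward high-probability residues
        scaled = [p ** (1.0 / temperature) for p in probs]
        total = sum(scaled)
        seq += rng.choices(AMINO_ACIDS, weights=[p / total for p in scaled], k=1)[0]
    return seq

# Unconditional sample vs. a sample conditioned on a hypothetical motif "MKT"
print(sample_sequence(length=25, seed=1))
print(sample_sequence(prompt="MKT", length=25, seed=1))
```

Lowering the temperature trades diversity for plausibility, a standard knob when generating candidate libraries for downstream filtering by the zero-shot fitness score.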
ProGen2 is one of the most systematic scaling studies conducted for protein language models to date and established several results that have influenced subsequent model development. The counterintuitive finding that fitness prediction performance peaks below the largest model size has been widely cited in discussions of how to select and evaluate protein foundation models. The work helped define the experimental framework that benchmarks such as ProteinGym subsequently operationalized at larger scale. The open release of all five model weights through the Salesforce GitHub repository has made ProGen2 a widely used baseline and generative backbone in the protein design community. A practical limitation of the model family is that, as pure sequence models, ProGen2 variants operate without explicit structural information and cannot directly incorporate three-dimensional constraints into generation or scoring — a gap addressed by subsequent hybrid sequence-structure models.