Profluent
Sparse mixture-of-experts autoregressive protein language model family pretrained on 1.5 trillion amino acid tokens with compute-optimal scaling.
ProGen3 is a family of generative protein language models developed by Profluent, representing the third generation of the ProGen model series that began with the original ProGen (2022) and ProGen2 (2023). Released in April 2025 as a bioRxiv preprint, ProGen3 scales autoregressive protein sequence generation to an unprecedented 46 billion total parameters through the use of a sparse mixture-of-experts (MoE) architecture that activates only 27% of its parameters per forward pass. The model family spans eight sizes from 112 million to 46 billion parameters, all pretrained on 1.5 trillion amino acid tokens from the Profluent Protein Atlas v1 — a curated collection of 3.4 billion full-length protein sequences — enabling the first systematic characterization of compute-optimal scaling laws for sparse generative protein language models.
ProGen3's development is motivated by a central question in protein language model research: does more compute lead to better protein generators, and if so, how does performance scale with parameter count, training tokens, and model architecture? The ProGen2 scaling study (2023) revealed a non-obvious dissociation between perplexity and fitness prediction performance in dense autoregressive models, with zero-shot fitness prediction peaking at 764 million parameters and declining for larger dense models. ProGen3 extends this analysis to the sparse MoE regime, where models can achieve much larger total parameter counts without proportional increases in per-token compute, and characterizes whether biological sequence generation benefits from the same MoE scaling dynamics that have proven transformative in natural language processing.
The results demonstrate clear scaling benefits for protein generation quality in the MoE regime: larger ProGen3 models generate sequences that cover a substantially broader diversity of protein families, with ProGen3-46B generating 59% more unique sequences (at a 30% sequence identity cutoff) than ProGen3-3B and 198% more than ProGen3-339M. Critically, larger models also respond better to alignment with laboratory fitness data, a property called "alignability": larger models can be fine-tuned more effectively on experimental measurements to generate proteins with improved predicted fitness. Model weights for multiple sizes are publicly available on Hugging Face, and inference code is released through the Profluent-AI GitHub organization under a non-commercial research license.
ProGen3 models are autoregressive transformers with a decoder-only architecture trained on next-token prediction over amino acid sequences — the same fundamental design as ProGen2, but extended with sparse MoE layers in place of the dense feed-forward networks used in standard transformers. In the MoE design, each transformer block contains a mixture-of-experts feed-forward layer where multiple expert networks operate in parallel and a routing mechanism selects which experts process each token. ProGen3 activates 27% of total model parameters per forward pass, meaning the computational cost of processing a single protein sequence is substantially less than the total parameter count would suggest. All models operate with a context length of 8,192 tokens, accommodating full-length proteins up to 8,192 amino acids — sufficient for the vast majority of known proteins including many large multidomain proteins.
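The routing mechanism described above can be illustrated with a toy sketch of a sparse MoE feed-forward layer with top-k routing. All dimensions, the expert count, and k below are invented for illustration and are far smaller than anything ProGen3 would use; this is a sketch of the general technique, not Profluent's implementation.

```python
import math
import random

random.seed(0)

# Hypothetical toy dimensions, purely illustrative.
D_MODEL, N_EXPERTS, TOP_K = 8, 4, 2

def linear(x, w):
    """Matrix-vector product; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

router_w = rand_matrix(N_EXPERTS, D_MODEL)                 # one score per expert
experts = [rand_matrix(D_MODEL, D_MODEL) for _ in range(N_EXPERTS)]

def moe_forward(x):
    # 1. The router produces a score for each expert.
    scores = linear(x, router_w)
    # 2. Only the top-k experts are kept for this token (sparse activation).
    top = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    # 3. A softmax over the selected scores gives mixing weights.
    exps = [math.exp(scores[i]) for i in top]
    weights = [e / sum(exps) for e in exps]
    # 4. The output is the weighted sum of the chosen experts' outputs;
    #    the unselected experts are never evaluated for this token.
    out = [0.0] * D_MODEL
    for w, i in zip(weights, top):
        for d, v in enumerate(linear(x, experts[i])):
            out[d] += w * v
    return out, top

token = [random.gauss(0, 1) for _ in range(D_MODEL)]
y, chosen = moe_forward(token)
print(f"experts evaluated: {len(chosen)} of {N_EXPERTS}")
```

Because only k of the experts run per token, per-token compute grows with the active parameter count rather than the total, which is how a 46B-parameter model can activate roughly 27% of its parameters per forward pass.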
The training corpus, the Profluent Protein Atlas v1, comprises 3.4 billion full-length protein sequences assembled and curated by Profluent from public sequence databases, with additional filtering to remove redundancy, spurious sequences, and low-quality assemblies. Training used 1.5 trillion amino acid tokens, which, according to the compute-optimal scaling analysis, is an approximately Chinchilla-optimal allocation for the largest model sizes. This analysis — which determines the ratio of model parameters to training tokens that maximizes model quality for a given training budget — is the first of its kind for sparse protein language models and provides principled guidance for future large-scale protein model development.
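The logic of a compute-optimal allocation can be sketched with the common C ≈ 6·N·D approximation (C: training FLOPs, N: active parameters, D: training tokens). The tokens-per-parameter ratio below is the generic ~20:1 Chinchilla heuristic from NLP, not the protein-specific ratio fitted in the ProGen3 analysis, and the FLOP budget is a made-up example.

```python
import math

def compute_optimal(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget between parameters and tokens.

    With D = r*N and C = 6*N*D = 6*r*N**2, solving for N gives
    N = sqrt(C / (6*r)). The ratio r here is the generic NLP
    heuristic, NOT ProGen3's fitted protein-specific value.
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical 1e23-FLOP budget, for illustration only.
n, d = compute_optimal(1e23)
print(f"active params ~ {n / 1e9:.1f}B, tokens ~ {d / 1e9:.0f}B")
```

The ProGen3 paper fits this parameter/token trade-off empirically on protein data rather than borrowing the NLP ratio, which is precisely what makes its scaling analysis useful to the field.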
The ProGen3 family includes models of 112M, ~200M, 339M, ~800M, 1B, 3B, and 46B total parameters, with the MoE models activating roughly 27% of their parameters per token. The diversity improvement from scale is quantified using a clustering-based metric: the number of unique sequences at 30% sequence identity (30% ID), which measures how many of the generated sequences are meaningfully distinct from each other at a threshold that approximates protein family boundaries. The 59% diversity improvement of 46B over 3B and the 198% improvement over 339M represent substantial generalization gains: larger models are not just generating better sequences within protein families the model has seen, but exploring a much wider range of protein architectures and regions of sequence space.
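The unique-sequences-at-30%-ID metric can be illustrated with a naive greedy clustering sketch, in the spirit of tools like MMseqs2 or CD-HIT that would be used at scale. The identity function below is a crude positional match fraction standing in for a real alignment-based identity, and the sequences are invented toy examples.

```python
def identity(a, b):
    """Fraction of matching positions over the longer sequence.

    A crude stand-in for alignment-based sequence identity,
    good enough to illustrate the clustering idea.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def unique_at_threshold(seqs, threshold=0.30):
    """Greedy clustering: a sequence founds a new cluster only if it
    is below the identity threshold against every representative."""
    reps = []
    for s in seqs:
        if all(identity(s, r) < threshold for r in reps):
            reps.append(s)
    return len(reps)

seqs = [
    "MKTAYIAKQR",   # founds cluster 1
    "MKTAYIAKQQ",   # 90% identical to cluster 1 -> merged
    "GGGPLWNNHE",   # unrelated -> founds cluster 2
]
print(unique_at_threshold(seqs))  # -> 2
```

At the 30% threshold, two sequences falling into the same cluster roughly means they belong to the same protein family, so the cluster count measures how many distinct families a model's samples reach.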
Alignability — the capacity of the model to benefit from fine-tuning on experimental fitness data — was assessed by fine-tuning models of different sizes on labeled measurements from deep mutational scanning experiments and evaluating both fitness prediction accuracy (zero-shot scoring performance) and sequence generation quality (fraction of generated sequences with high predicted fitness). Larger models show steeper improvement from alignment on both metrics, suggesting that the representational capacity of larger models allows them to incorporate experimental feedback more effectively — a finding with direct implications for active learning and iterative design workflows in protein engineering. The model weights for multiple ProGen3 sizes are available through the Profluent-Bio collection on Hugging Face, with inference and generation code released in the Profluent-AI/progen3 GitHub repository under a non-commercial research license.
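Zero-shot fitness scoring with an autoregressive model is typically done by ranking variants by their sequence log-likelihood under the model. Below is a minimal sketch of that scoring scheme; `toy_next_token_probs` is a fake stand-in for a real model's conditional distribution (the actual ProGen3 inference API lives in Profluent-AI/progen3 and is not reproduced here), and the wild-type/variant sequences are invented.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_next_token_probs(prefix):
    """Fake conditional distribution that mildly prefers repeating
    the previous residue. Illustrative only, not a real model."""
    probs = {aa: 1.0 for aa in AMINO_ACIDS}
    if prefix:
        probs[prefix[-1]] += 5.0
    total = sum(probs.values())
    return {aa: p / total for aa, p in probs.items()}

def log_likelihood(seq):
    """Sum of per-position conditional log-probabilities: the standard
    autoregressive score used for zero-shot fitness ranking."""
    ll = 0.0
    for i, aa in enumerate(seq):
        ll += math.log(toy_next_token_probs(seq[:i])[aa])
    return ll

# A variant is ranked by its log-likelihood change relative to wild type.
wild_type = "AACC"
variant = "AGCC"
delta = log_likelihood(variant) - log_likelihood(wild_type)
print(f"delta log-likelihood (variant - wild type) = {delta:.3f}")
```

Alignment then amounts to fine-tuning the model on labeled measurements so that these likelihoods correlate better with measured fitness; the paper's finding is that larger models gain more from that step.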
ProGen3 is designed as a frontier-scale protein sequence generator for researchers who need to explore protein sequence space broadly and generate diverse candidates for experimental testing. Protein engineers engaged in directed evolution, enzyme design, and therapeutic protein development can use ProGen3 to generate large libraries of novel sequences spanning a desired protein family, conditioning generation on partial sequences or functional motifs to focus exploration on relevant regions of sequence space. The model supports both full-length protein generation and domain-level infilling — generating novel sequences for a specific domain within an existing protein context — enabling applications such as CDR-H3 loop design for antibodies, active site engineering for enzymes, and linker design for fusion proteins. The diversity scaling result has particular practical value for exploratory protein design campaigns where the goal is to identify functional sequences in novel or sparsely characterized protein families: larger ProGen3 models are more likely to generate viable sequences for protein families that are distant from those well-represented in the training data. The demonstrated alignability of larger models also positions ProGen3 as a suitable backbone for closed-loop protein engineering workflows, where each round of experimental testing provides labeled data that is used to fine-tune the model for the next generation of sequence proposals. Researchers in academic settings can access model weights for the smaller ProGen3 sizes under the non-commercial license without requiring access to Profluent's internal infrastructure.
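Conditioning generation on a partial sequence reduces, at the sampling level, to seeding the autoregressive loop with the prompt and extending it token by token. The sketch below shows that loop with temperature sampling; `toy_next_token_probs` is again a fake stand-in for a real ProGen3 forward pass, and the motif prompt is invented, so only the sampling loop itself is the point.

```python
import math
import random

random.seed(1)

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_next_token_probs(prefix):
    """Fake conditional distribution standing in for a model call."""
    probs = {aa: 1.0 for aa in AMINO_ACIDS}
    if prefix:
        probs[prefix[-1]] += 5.0   # invented preference, illustrative only
    total = sum(probs.values())
    return {aa: p / total for aa, p in probs.items()}

def sample(probs, temperature):
    # Sharpen (T < 1) or flatten (T > 1) the distribution, then draw.
    logits = {aa: math.log(p) / temperature for aa, p in probs.items()}
    m = max(logits.values())
    exps = {aa: math.exp(l - m) for aa, l in logits.items()}
    z = sum(exps.values())
    r, acc = random.random(), 0.0
    for aa, e in exps.items():
        acc += e / z
        if r <= acc:
            return aa
    return aa  # guard against floating-point shortfall

def generate(prompt, n_new, temperature=0.8):
    """Extend a partial sequence autoregressively."""
    seq = prompt
    for _ in range(n_new):
        seq += sample(toy_next_token_probs(seq), temperature)
    return seq

out = generate("MKTAYIA", 13)   # hypothetical 7-residue motif prompt
print(out)
```

Domain-level infilling follows the same pattern but conditions on sequence context on both sides of the region being redesigned, which requires an infilling-capable training objective rather than a change to this sampling loop.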
ProGen3 makes two major contributions to the protein language model field that extend beyond the practical utility of the model weights themselves. First, it demonstrates empirically that the scaling dynamics observed in NLP — where larger models trained on more data produce qualitatively better generators — also hold for autoregressive protein language models in the sparse MoE regime. The 198% diversity improvement from 339M to 46B parameters is not a marginal gain; it suggests that current protein engineering workflows built on billion-parameter models may be operating well below the capability ceiling of large-scale autoregressive generation. Second, the characterization of compute-optimal scaling laws for sparse protein models gives the community principled guidance for training resource allocation — the equivalent of the Chinchilla scaling analysis for protein language models. Prior protein model training decisions were often made ad hoc or based on intuitions transferred from NLP without systematic validation on protein data; ProGen3's analysis provides an empirical foundation for these decisions. As a product of Profluent — the same organization that produced OpenCRISPR-1 using ProGen2 as the generative backbone — ProGen3 is positioned as the foundation for next-generation AI-designed proteins, gene editors, and other biological molecules. The non-commercial research license restricts direct commercial use of the released weights, a different access model from fully open releases such as those in the AIDO platform. Current limitations include the absence of structural conditioning at generation time (ProGen3 is a pure sequence model) and the non-commercial license, which constrains some downstream applications.