Protein

Compute-Optimal PLM

BioMap

Scaling law study for protein language models that identifies compute-optimal training regimes for CLM and MLM architectures using 939M protein sequences.

Released: 2024

Overview

Compute-Optimal PLM is a systematic study from BioMap Research and Tsinghua University that applies Chinchilla-style scaling law analysis to protein language models. Published in June 2024 and accepted as a spotlight paper at NeurIPS 2024, the work addresses a practical gap in the field: while protein language models had grown rapidly in scale — from BERT-sized models to 10-billion-parameter architectures — there was little principled guidance on how to allocate a fixed compute budget between model size and training data volume.

The central question the authors pose is whether existing protein language models, including ESM-2 and ProGen2, are actually trained at their compute-optimal operating point. Their empirical answer is that most are not. By training over 300 models ranging from 3.5 million to 10.7 billion parameters on between 5 and 200 billion unique tokens, the team derives scaling laws for both causal language model (CLM) and masked language model (MLM) objectives that are specific to the statistical structure of protein sequence data.

The study incorporates 939 million protein sequences, drawing on metagenomic sequences in addition to the standard UniRef database. This deliberate expansion of data diversity directly addresses two failure modes the authors identified: diminishing returns from repeated tokens under CLM training, and overfitting to UniRef under MLM training. The resulting framework allows practitioners to determine the optimal model size for any given compute budget — a practically useful contribution given the cost of training large protein models from scratch.

Key Features

  • Empirical scaling laws for protein sequences: Derives CLM and MLM scaling laws tuned to the statistical properties of protein sequences rather than natural language, accounting for the shorter average length and constrained vocabulary of amino acid data.
  • Compute-optimal model recommendations: Identifies that the 7B CLM and 10B MLM checkpoints are optimal for compute budgets matching ProGen2-xlarge and ESM-2 (3B) respectively, providing a concrete lookup for practitioners.
  • Transfer scaling between CLM and MLM: Demonstrates that training dynamics transfer between the two objectives via an Effectively Transferred Tokens metric, enabling cross-architecture compute comparisons (see the sketch after this list).
  • Metagenomic data augmentation: Shows that incorporating metagenomic sequences into training data substantially reduces overfitting in MLM and delays the onset of diminishing returns in CLM, improving both training efficiency and generalization.
  • 300+ controlled experiments: The empirical foundation rests on a disciplined sweep of model scale and token budget combinations, making the derived laws robust to confounds that affect smaller ablation studies.
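
The paper's exact definition of Effectively Transferred Tokens may differ, but one way to operationalize such a comparison is to invert the fitted from-scratch scaling law at the loss the transferred model actually achieves: the difference between the implied from-scratch token count and the tokens spent after transfer is the "effectively transferred" amount. The sketch below assumes the L(N, D) = A/N^alpha + B/D^beta + E form described under Technical Details, with invented constants.

```python
# Sketch: one way to operationalize "Effectively Transferred Tokens" (ETT).
# Given a fitted from-scratch law L(N, D) = A/N^alpha + B/D^beta + E (see
# Technical Details), invert it for the token count D_eff that would reach an
# observed loss, then subtract the tokens actually spent after transfer.
# All constants and losses here are ILLUSTRATIVE, not the paper's fits.

def effective_tokens(loss_obs, N, A=420.0, alpha=0.34, B=380.0, beta=0.28, E=1.69):
    """Invert L(N, D) for D at a fixed model size N."""
    residual = loss_obs - E - A * N ** -alpha
    if residual <= 0:
        raise ValueError("observed loss is at or below the fitted floor for this N")
    return (B / residual) ** (1.0 / beta)

# Example: a 7B model initialized from CLM pretraining reaches MLM loss 2.20
# after fine-tuning on 5e9 tokens (numbers invented for illustration).
D_ft = 5e9
D_eff = effective_tokens(loss_obs=2.20, N=7e9)
print(f"effectively transferred tokens: ~{D_eff - D_ft:.2e}")
```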

Technical Details

The study trains encoder-only (MLM) and decoder-only (CLM) transformer architectures across a grid of sizes from 3.5M to 10.7B parameters, with token budgets ranging from 5B to 200B unique protein sequence tokens. The training corpus combines UniRef entries with metagenomic sequences for a total of 939 million sequences. Following the Chinchilla methodology from natural language processing, the authors fit parametric loss curves of the form L(N, D) = A/N^alpha + B/D^beta + E, where N is parameter count and D is token count, and identify the optimal (N, D) frontier for a given compute budget C = 6ND.
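
To make the fitting step concrete, the sketch below recovers the constants of this parametric form from a small synthetic sweep using scipy's curve_fit. The training runs and constants are invented for illustration; they are not measurements from the paper.

```python
# Sketch: fitting the Chinchilla-style surface L(N, D) = A/N^alpha + B/D^beta + E.
# The training runs below are SYNTHETIC, generated from illustrative constants;
# they are not values from the paper.
import numpy as np
from scipy.optimize import curve_fit

def loss_surface(ND, logA, alpha, logB, beta, E):
    """Parametric loss L = A * N^-alpha + B * D^-beta + E (A, B fitted in log space)."""
    N, D = ND
    return np.exp(logA) * N ** -alpha + np.exp(logB) * D ** -beta + E

# Synthetic sweep: (parameter count, unique tokens, final validation loss).
rng = np.random.default_rng(0)
N = np.array([3.5e6, 3.0e7, 1.5e8, 7.0e8, 3.0e9, 1.07e10])
D = np.array([5e9, 1e10, 2e10, 5e10, 1e11, 2e11])
true_A, true_alpha, true_B, true_beta, true_E = 420.0, 0.34, 380.0, 0.28, 1.69
L = true_A * N ** -true_alpha + true_B * D ** -true_beta + true_E
L = L + rng.normal(0.0, 0.003, L.shape)  # small observation noise

popt, _ = curve_fit(
    loss_surface, (N, D), L,
    p0=[np.log(100.0), 0.3, np.log(100.0), 0.3, 1.5],
    maxfev=50000,
)
logA, alpha, logB, beta, E = popt
print(f"A={np.exp(logA):.1f}, alpha={alpha:.3f}, B={np.exp(logB):.1f}, "
      f"beta={beta:.3f}, E={E:.3f}")
```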

Key findings from the fitted laws include that protein CLMs exhibit stronger diminishing returns on repeated data than NLP models, suggesting that data diversity matters more than data repetition for this modality. The optimal compute allocation for protein MLMs also favors larger datasets relative to model size compared with equivalent NLP budgets. The compute-optimal 7B CLM and 10B MLM models were released via the xTrimoPGLM HuggingFace hub. Downstream evaluations of protein generation quality and of structure- and function-prediction benchmarks confirm that these compute-optimal models match or exceed ESM-2 and ProGen2 at equivalent or lower pretraining compute cost.
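
For readers who want to try released checkpoints, loading typically follows the standard Hugging Face transformers pattern sketched below. The repository id is a placeholder, not a verified hub path; consult the project's HuggingFace page for the actual model ids.

```python
# Sketch: loading a released checkpoint with Hugging Face transformers.
# The repo id below is a PLACEHOLDER -- substitute the actual path from the
# xTrimoPGLM hub page. Models that ship custom architecture code generally
# require trust_remote_code=True.
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "biomap-research/<compute-optimal-10b-mlm>"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example amino-acid sequence
inputs = tokenizer(seq, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, vocab_size)
```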

Applications

The primary audience for this work is research teams planning to train protein language models from scratch or continue pretraining existing ones. The scaling laws provide actionable guidance: given a known compute budget (expressed as total FLOPs), a team can determine in advance the optimal tradeoff between model size and training tokens, avoiding the common failure modes of over-parameterized models trained on too little data and under-sized models trained on more data than they can exploit. Beyond model training decisions, the framework is relevant for model selection — practitioners choosing between off-the-shelf models of different sizes can use the compute-optimal frontier to identify which checkpoint is most likely to generalize well relative to its computational cost. The released 7B CLM and 10B MLM checkpoints are directly usable for downstream protein sequence tasks including fitness prediction, structure-guided generation, and functional annotation.
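
To make that lookup concrete: substituting D = C/(6N) into the fitted loss and setting the derivative to zero gives a closed-form optimum, sketched below with the same illustrative constants as above. Real use would plug in the paper's fitted values.

```python
# Sketch: closed-form compute-optimal allocation under
# L(N, D) = A/N^alpha + B/D^beta + E with budget constraint C = 6 * N * D.
# Constants are ILLUSTRATIVE, not the paper's fitted values.

def optimal_allocation(C, A=420.0, alpha=0.34, B=380.0, beta=0.28):
    """Return (N_opt, D_opt) minimizing the fitted loss for a FLOPs budget C.

    Substituting D = C / (6N) and setting dL/dN = 0 yields
        N_opt = ((alpha * A) / (beta * B)) ** (1 / (alpha + beta))
                * (C / 6) ** (beta / (alpha + beta))
    """
    G = ((alpha * A) / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / 6.0) ** (beta / (alpha + beta))
    D_opt = C / (6.0 * N_opt)
    return N_opt, D_opt

# Example: a budget of 1e22 FLOPs.
N_opt, D_opt = optimal_allocation(1e22)
print(f"optimal parameters ~{N_opt:.2e}, optimal tokens ~{D_opt:.2e}")
```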

Impact

The acceptance of this work as a spotlight at NeurIPS 2024 signals recognition by the broader machine learning community of the maturity and importance of protein language model research. By transporting Chinchilla-style reasoning into the protein domain, the study establishes a methodological precedent for compute-optimal training of biological sequence models that extends naturally to RNA language models, genomic models, and other biological sequence modalities. A key limitation acknowledged by the authors is that the scaling laws are derived from transformer-based CLM and MLM architectures trained on sequence data alone; they do not extend directly to structure-conditioned models or multimodal architectures that jointly process sequence and structure. The finding that existing large protein language models are often not compute-optimal also implies that the field may have underinvested in data diversity relative to model scale — a practical correction that could lower the barrier for groups with modest compute resources to train competitive models.

Citation

Training Compute-Optimal Protein Language Models

Preprint

Cheng, X., et al. (2024). Training Compute-Optimal Protein Language Models. bioRxiv.

DOI: 10.1101/2024.06.06.597716

Metrics

GitHub

Stars: 11
Forks: 1
Open Issues: 0
Contributors: 1
Last Push: 1y ago
Language: Jupyter Notebook

Citations

Total Citations: 34
Influential: 3
References: 96

Tags

foundation model · language model · scaling laws

Resources

  • GitHub Repository
  • Research Paper
  • HuggingFace Model