Multimodal generative protein language model that jointly reasons over sequence, structure, and function. Trained at up to 98 billion parameters on 2.78 billion proteins.
ESM-3 is a frontier multimodal generative protein language model developed by EvolutionaryScale and released in June 2024. Unlike prior protein language models that reason exclusively over amino acid sequences, ESM-3 jointly models the sequence, three-dimensional structure, and functional annotations of proteins within a unified framework. This design allows the model to be prompted with any combination of the three modalities — for example, specifying a partial structure while leaving the sequence unconstrained — and then generate coherent completions across all tracks simultaneously.
The model represents a significant departure from the ESM lineage developed at Meta AI Research. Where ESM-2 was a sequence-only masked language model with up to 15 billion parameters, ESM-3 was built from the ground up as a generative, multimodal system. At its largest scale, ESM-3 contains 98 billion parameters and was trained with over 10^24 floating-point operations — more compute than any previously known biology foundation model — on a dataset of 2.78 billion proteins spanning the full diversity of life on Earth.
The model's most striking demonstration was the de novo design of esmGFP, a green fluorescent protein sharing only 58% sequence identity with the nearest known natural fluorescent protein. The authors estimate this degree of sequence divergence corresponds to roughly 500 million years of natural evolution, underscoring the model's capacity to explore protein space far beyond the boundaries of observed biology.
ESM-3's architecture is a multi-track transformer in which sequence, structure, and function tokens are processed through a shared stack of transformer blocks. A geometric attention module in the first block enables direct conditioning on atomic coordinates. Structure tokens are derived from backbone (N, Cα, C, O) atomic positions by a separately trained VQ-VAE (vector-quantized variational autoencoder). Function annotations are derived from InterPro domain assignments, themselves produced with libraries of profile hidden Markov models. All three token tracks are fused within a single latent space and trained jointly using a masked language modeling objective: for each protein, a mix of sequence, structure, and function positions is masked, and the model learns to predict the masked values.
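The joint objective can be pictured with a small, self-contained sketch. The PyTorch code below is illustrative only and not the released ESM-3 implementation: vocabulary sizes, layer counts, and masking rates are placeholders, and it omits the geometric attention module and the VQ-VAE structure tokenizer. It shows the core idea of summing per-track embeddings into one shared transformer trunk and applying per-track heads to masked positions.

```python
# Illustrative sketch (not the released ESM-3 code) of a joint
# masked-language-modeling objective over three token tracks.
# All sizes below are arbitrary placeholders.
import torch
import torch.nn as nn

SEQ_VOCAB, STRUCT_VOCAB, FUNC_VOCAB = 33, 4096, 260      # assumed vocab sizes
MASK_SEQ, MASK_STRUCT, MASK_FUNC = 32, 4095, 259         # per-track mask ids
D_MODEL, N_LAYERS, N_HEADS = 256, 4, 8

class MultiTrackMLM(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding table per track; embeddings are summed so all
        # tracks share a single residual stream through the transformer.
        self.seq_emb = nn.Embedding(SEQ_VOCAB, D_MODEL)
        self.struct_emb = nn.Embedding(STRUCT_VOCAB, D_MODEL)
        self.func_emb = nn.Embedding(FUNC_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEADS, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        self.trunk = nn.TransformerEncoder(layer, N_LAYERS)
        # One output head per track predicts that track's masked tokens.
        self.seq_head = nn.Linear(D_MODEL, SEQ_VOCAB)
        self.struct_head = nn.Linear(D_MODEL, STRUCT_VOCAB)
        self.func_head = nn.Linear(D_MODEL, FUNC_VOCAB)

    def forward(self, seq, struct, func):
        h = self.seq_emb(seq) + self.struct_emb(struct) + self.func_emb(func)
        h = self.trunk(h)
        return self.seq_head(h), self.struct_head(h), self.func_head(h)

def mask_track(tokens, mask_id, rate=0.15):
    """Replace a random subset of positions with the mask token."""
    masked = tokens.clone()
    is_masked = torch.rand_like(tokens, dtype=torch.float) < rate
    masked[is_masked] = mask_id
    return masked, is_masked

def mlm_loss(model, seq, struct, func):
    ce = nn.CrossEntropyLoss()
    seq_in, seq_m = mask_track(seq, MASK_SEQ)
    struct_in, struct_m = mask_track(struct, MASK_STRUCT)
    func_in, func_m = mask_track(func, MASK_FUNC)
    seq_logits, struct_logits, func_logits = model(seq_in, struct_in, func_in)
    # Cross-entropy only on the positions that were masked in each track.
    return (ce(seq_logits[seq_m], seq[seq_m])
            + ce(struct_logits[struct_m], struct[struct_m])
            + ce(func_logits[func_m], func[func_m]))

# Toy batch: 2 proteins of length 64 with random tokens on each track.
seq = torch.randint(0, SEQ_VOCAB - 1, (2, 64))
struct = torch.randint(0, STRUCT_VOCAB - 1, (2, 64))
func = torch.randint(0, FUNC_VOCAB - 1, (2, 64))
print(mlm_loss(MultiTrackMLM(), seq, struct, func).item())
```

The design point visible in the sketch is that a single residual stream carries all three tracks, so an unmasked structure token at a given position can inform the prediction of a masked sequence or function token at the same site.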
Training data was assembled from UniRef, MGnify, and other public databases for sequences; from the Protein Data Bank supplemented with AlphaFold2 and ESMFold predictions for structures; and from profile-HMM function predictions for annotations. The three model scales (1.4B, 7B, and 98B parameters) were trained at progressively higher compute budgets up to 1.07 × 10^24 FLOPs; the 98B model uses approximately 25 times more compute and 60 times more data than ESM-2. On structure prediction benchmarks, ESM-3 outperforms ESMFold but still falls short of AlphaFold2. In generative evaluations, ESM-3 achieves greater than 50% functional success rates when designing variants in distant sequence families.
ESM-3 is suited to a broad range of protein engineering and discovery tasks. Researchers can use the model for inverse folding — generating sequences that fold into a specified backbone — as well as for exploring sequence space around a functional protein to find distant homologs with altered properties. The functional conditioning mechanism supports targeted design toward proteins bearing specific InterPro domains. In drug discovery, the model can be used to generate novel antibody or enzyme scaffolds. In basic research, its multimodal representations provide rich embeddings for downstream tasks including variant effect prediction, contact prediction, and protein family classification. The open 1.4B checkpoint is accessible via the EvolutionaryScale Python SDK and HuggingFace, while the larger proprietary models are available through the EvolutionaryScale API.
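As a concrete starting point, the open 1.4B checkpoint can be prompted through the EvolutionaryScale esm package. The snippet below is a minimal sketch assuming the ESM3.from_pretrained entry point, the ESMProtein and GenerationConfig classes, and the checkpoint name esm3_sm_open_v1 documented at release; the prompt sequence itself is an arbitrary example, and underscores mark masked positions for the model to fill in. Consult the SDK documentation for current identifiers and options.

```python
# Minimal sketch of prompting the open ESM-3 checkpoint with a partially
# masked sequence, then decoding a structure for the completed sequence.
# Assumes the EvolutionaryScale `esm` package; names and defaults may
# differ across SDK versions.
import torch
from esm.models.esm3 import ESM3
from esm.sdk.api import ESMProtein, GenerationConfig

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ESM3.from_pretrained("esm3_sm_open_v1").to(device)

# Arbitrary example prompt: underscores are masked residues to generate.
prompt = ESMProtein(sequence="MKTAYIAKQR" + "_" * 40 + "GSHHHHHH")

# Iteratively decode the sequence track, then decode the structure track
# conditioned on the completed sequence.
protein = model.generate(
    prompt, GenerationConfig(track="sequence", num_steps=8, temperature=0.7))
protein = model.generate(
    protein, GenerationConfig(track="structure", num_steps=8))

# Write the predicted backbone coordinates to a PDB file.
protein.to_pdb("./designed_protein.pdb")
```

The same prompting pattern extends to the other tracks: supplying structure tokens while leaving the sequence masked yields inverse folding, and adding function annotations steers generation toward proteins bearing specific InterPro domains.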
ESM-3 represents a conceptual shift in protein foundation models from discriminative sequence encoders to generative multimodal systems. Its publication in Science in January 2025 (Hayes et al.) brought broad attention to the capacity of language models to explore protein space at evolutionary timescales, with the esmGFP result serving as a concrete proof of capability. EvolutionaryScale launched alongside the model with a $142 million seed round, signaling significant commercial interest in generative biology. A key limitation is access: only the 1.4B parameter model is fully open-weight, while the more capable 7B and 98B variants require API access through EvolutionaryScale's platform. The model also inherits the standard caveats of language model approaches — generated sequences require experimental validation, and performance on proteins with little training data coverage may be reduced.
Hayes, T., et al. (2025). Simulating 500 million years of evolution with a language model. Science.
DOI: 10.1126/science.ads0018