ESM-2 and ESMFold represent a landmark contribution from Meta AI to the field of protein science, demonstrating that scaling protein language models to evolutionary scale unlocks an emergent ability to predict three-dimensional protein structure. Published in Science in 2023 by Lin et al., the work showed that as transformer models trained on protein sequences are scaled from 8 million up to 15 billion parameters, an increasingly accurate atomic-resolution picture of protein structure emerges directly from the learned sequence representations — without ever seeing a structure during pretraining.
The central insight is that the statistical patterns encoded across tens of millions of evolutionary sequences contain sufficient geometric constraints to reconstruct how proteins fold. ESMFold exploits this by pairing the 3B-parameter ESM-2 encoder with a folding trunk and structure module that convert language model embeddings directly into all-atom coordinates. Because no multiple sequence alignment (MSA) is required at inference time, ESMFold runs 10- to 60-fold faster than AlphaFold 2 on equivalent hardware, enabling structural characterization at a scale that MSA-dependent methods cannot match.
The practical consequence of this speed advantage was demonstrated immediately: Meta AI used ESMFold to predict structures for more than 617 million metagenomic protein sequences, releasing the results as the ESM Metagenomic Atlas — the first large-scale structural view of the so-called "dark matter" of the protein universe, the vast majority of protein diversity that has never been experimentally characterized.
ESM-2 is a transformer encoder trained with a masked language modeling (MLM) objective: following the BERT recipe, 15% of amino acid positions are corrupted and the model is trained to predict the original residues from the surrounding context. The architecture scales from 6 layers with an embedding dimension of 320 (8M parameters) up to 48 layers with an embedding dimension of 5,120 (15B parameters). Training samples sequences uniformly across approximately 43 million UniRef50 clusters, ensuring broad coverage of protein sequence space. The 15B-parameter variant achieves a TM-score of 0.713 on CAMEO and 0.539 on CASP14, compared to 0.649 and 0.475 for the 150M-parameter model.
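The MLM objective can be illustrated with a toy sketch. This is a hedged, self-contained illustration, not ESM-2's actual training code: the `mask_sequence` and `mlm_loss` helpers, the `<mask>` token string, and the uniform "model" are all hypothetical stand-ins used to show how masked positions are scored by cross-entropy.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide ~mask_rate of the residues; return (masked_seq, masked_positions)."""
    rng = rng or random.Random(0)  # fixed seed so the demo is deterministic
    masked = list(seq)
    positions = [i for i in range(len(seq)) if rng.random() < mask_rate]
    for i in positions:
        masked[i] = "<mask>"
    return masked, positions

def mlm_loss(true_seq, predicted_probs, positions):
    """Mean cross-entropy over masked positions only.

    predicted_probs: one dict per masked position mapping residue ->
    probability (a stand-in for the model's softmax output)."""
    nll = 0.0
    for probs, i in zip(predicted_probs, positions):
        nll -= math.log(probs[true_seq[i]])
    return nll / len(positions)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
masked, positions = mask_sequence(seq)
# A clueless "model" that predicts uniformly over the 20 amino acids
# scores exactly log(20) nats per masked residue:
uniform = [{aa: 1 / 20 for aa in AMINO_ACIDS} for _ in positions]
print(round(mlm_loss(seq, uniform, positions), 3))  # log(20) ≈ 2.996
```

Training drives this loss well below the log(20) uniform baseline; the paper's key observation is that the representations a model learns to do so also encode structural constraints.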
ESMFold couples the 3B-parameter ESM-2 backbone with a folding trunk composed of 48 blocks that process per-residue and pairwise representations, followed by a structure module (adapted from AlphaFold 2's design) that outputs backbone frames and side-chain torsion angles. On sequences with sufficient evolutionary context (low language model perplexity), ESMFold approaches AlphaFold 2 accuracy while running orders of magnitude faster. On CASP14, ESMFold achieves a TM-score of 0.68 versus 0.85 for AlphaFold 2 with full MSAs, but substantially outperforms AlphaFold 2 run in single-sequence mode (0.68 vs. 0.37), confirming that the language model embeddings carry genuinely independent structural signal.
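The perplexity mentioned above is simply the exponential of the mean negative log-likelihood the language model assigns to each residue: a value near 1 means the model finds the sequence highly predictable, while high values flag sequences it understands poorly and is likely to fold less accurately. A minimal sketch (the example probability below is illustrative, not taken from the paper):

```python
import math

def perplexity(per_residue_logprobs):
    """exp(mean negative log-likelihood) over a sequence.

    per_residue_logprobs: the log-probability the language model
    assigned to the true residue at each position."""
    nll = -sum(per_residue_logprobs) / len(per_residue_logprobs)
    return math.exp(nll)

# A model assigning probability 0.5 to every residue has perplexity 2,
# i.e. it is as uncertain as a coin flip between two residues per site:
print(round(perplexity([math.log(0.5)] * 8), 6))  # → 2.0
```

In the paper this quantity serves as a cheap pre-filter: it can be computed from a single forward pass of ESM-2, before any folding, to triage which sequences are worth folding at scale.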
ESM-2 and ESMFold serve a wide range of researchers in computational and experimental biology. As a pretrained sequence encoder, ESM-2 provides state-of-the-art representations for downstream tasks including function annotation, variant effect prediction, protein-protein interaction prediction, and fitness landscape modeling — areas where the 650M and 3B models offer a practical balance of accuracy and resource requirements. ESMFold is particularly valuable for high-throughput structural characterization: metagenomic and environmental sequencing projects routinely generate millions of sequences from unknown organisms where MSA construction is infeasible, and ESMFold enables rapid structural triage of these datasets. The model also supports drug discovery workflows by enabling fast virtual screening, domain boundary identification, and structural clustering across large protein families.
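For the downstream tasks above, a common pattern is to pool ESM-2's per-residue embeddings into a single fixed-size protein-level vector that feeds a classifier or regressor. The sketch below shows mean pooling in self-contained pure Python; in a real pipeline the inputs would be the per-residue vectors from the model's final layer (dimension 1,280 for the 650M model), and `mean_pool` is a hypothetical helper name, not part of any ESM library.

```python
def mean_pool(per_residue_embeddings):
    """Average a list of equal-length per-residue vectors into one
    protein-level vector (one value per embedding dimension)."""
    length = len(per_residue_embeddings)
    dim = len(per_residue_embeddings[0])
    return [sum(vec[d] for vec in per_residue_embeddings) / length
            for d in range(dim)]

# Toy protein of three residues with embedding dimension 4:
toy = [[1.0, 0.0, 2.0, 4.0],
       [3.0, 0.0, 2.0, 0.0],
       [2.0, 3.0, 2.0, 2.0]]
print(mean_pool(toy))  # [2.0, 1.0, 2.0, 2.0]
```

Mean pooling discards per-position detail, which is why residue-level tasks such as variant effect prediction instead read out the embedding (or masked-token logits) at the position of interest.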
The ESM-2 and ESMFold paper provided the field with strong evidence that protein structure is a learnable consequence of sequence statistics at sufficient scale, shifting the theoretical framing of why language models work for proteins. The ESM Metagenomic Atlas made structural biology accessible at metagenomic scale for the first time, and the dataset has since been used in studies ranging from enzyme discovery to viral protein characterization. The ESM-2 encoders became some of the most widely adopted protein representations in the literature, forming the backbone of numerous downstream models. A meaningful limitation is that ESMFold's accuracy trails AlphaFold 2 on hard targets, particularly sequences on which the language model's perplexity is high (typically those with few close evolutionary relatives). The subsequent release of ESM-3 by EvolutionaryScale extended this line of work to a multimodal model that jointly reasons over sequence, structure, and function.
Lin, Z., et al. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637): 1123–1130. DOI: 10.1126/science.ade2574