ESM-2 and ESMFold represent a landmark contribution from Meta AI to the field of protein science, demonstrating that scaling protein language models to evolutionary scale unlocks an emergent ability to predict three-dimensional protein structure. Published in Science in 2023 by Lin et al., the work showed that as transformer models trained on protein sequences are scaled from 8 million up to 15 billion parameters, an increasingly accurate atomic-resolution picture of protein structure emerges directly from the learned sequence representations — without ever seeing a structure during pretraining.
The central insight is that the statistical patterns encoded across tens of millions of evolutionary sequences contain sufficient geometric constraints to reconstruct how proteins fold. ESMFold exploits this by pairing the 3B-parameter ESM-2 encoder with a folding trunk and structure module that convert language model embeddings directly into all-atom coordinates. Because no multiple sequence alignment (MSA) is required at inference time, ESMFold runs 10- to 60-fold faster than AlphaFold 2 on equivalent hardware, enabling structural characterization at a scale that MSA-dependent methods cannot match.
The practical consequence of this speed advantage was demonstrated immediately: Meta AI used ESMFold to predict structures for more than 617 million metagenomic protein sequences, releasing the results as the ESM Metagenomic Atlas — the first large-scale structural view of the so-called "dark matter" of the protein universe, the vast majority of protein diversity that has never been experimentally characterized.
ESM-2 is a transformer encoder trained with a masked language modeling (MLM) objective: following the BERT recipe, 15% of amino acid positions are corrupted and the model is trained to predict the original residues from the surrounding context. The architecture scales from 6 layers with an embedding dimension of 320 (8M parameters) up to 48 layers with an embedding dimension of 5,120 (15B parameters). Training samples sequences uniformly across approximately 43 million UniRef50 clusters, ensuring broad coverage of protein sequence space. The 15B-parameter variant achieves a TM-score of 0.713 on CAMEO and 0.539 on CASP14, compared to 0.649 and 0.475 for the 150M-parameter model.
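The MLM objective can be illustrated with a toy sketch. This is a hedged, self-contained illustration, not ESM-2's actual training code: the `mask_sequence` and `mlm_loss` helpers, the `<mask>` token string, and the uniform "model" are all hypothetical stand-ins used to show how masked positions are scored by cross-entropy.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide ~mask_rate of the residues; return (masked_seq, masked_positions)."""
    rng = rng or random.Random(0)  # fixed seed so the demo is deterministic
    masked = list(seq)
    positions = [i for i in range(len(seq)) if rng.random() < mask_rate]
    for i in positions:
        masked[i] = "<mask>"
    return masked, positions

def mlm_loss(true_seq, predicted_probs, positions):
    """Mean cross-entropy over masked positions only.

    predicted_probs: one dict per masked position mapping residue ->
    probability (a stand-in for the model's softmax output)."""
    nll = 0.0
    for probs, i in zip(predicted_probs, positions):
        nll -= math.log(probs[true_seq[i]])
    return nll / len(positions)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
masked, positions = mask_sequence(seq)
# A clueless "model" that predicts uniformly over the 20 amino acids
# scores exactly log(20) nats per masked residue:
uniform = [{aa: 1 / 20 for aa in AMINO_ACIDS} for _ in positions]
print(round(mlm_loss(seq, uniform, positions), 3))  # log(20) ≈ 2.996
```

Training drives this loss well below the log(20) uniform baseline; the paper's key observation is that the representations a model learns to do so also encode structural constraints.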
ESMFold couples the 3B-parameter ESM-2 backbone with a folding trunk composed of 48 blocks that process per-residue and pairwise representations, followed by a structure module (adapted from AlphaFold 2's design) that outputs backbone frames and side-chain torsion angles. On sequences with sufficient evolutionary context (low language model perplexity), ESMFold approaches AlphaFold 2 accuracy while running orders of magnitude faster. On CASP14, ESMFold achieves a TM-score of 0.68 versus 0.85 for AlphaFold 2 with full MSAs, but substantially outperforms AlphaFold 2 run in single-sequence mode (0.68 vs. 0.37), confirming that the language model embeddings carry genuinely independent structural signal.
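The perplexity mentioned above is simply the exponential of the mean negative log-likelihood the language model assigns to each residue: a value near 1 means the model finds the sequence highly predictable, while high values flag sequences it understands poorly and is likely to fold less accurately. A minimal sketch (the example probability below is illustrative, not taken from the paper):

```python
import math

def perplexity(per_residue_logprobs):
    """exp(mean negative log-likelihood) over a sequence.

    per_residue_logprobs: the log-probability the language model
    assigned to the true residue at each position."""
    nll = -sum(per_residue_logprobs) / len(per_residue_logprobs)
    return math.exp(nll)

# A model assigning probability 0.5 to every residue has perplexity 2,
# i.e. it is as uncertain as a coin flip between two residues per site:
print(round(perplexity([math.log(0.5)] * 8), 6))  # → 2.0
```

In the paper this quantity serves as a cheap pre-filter: it can be computed from a single forward pass of ESM-2, before any folding, to triage which sequences are worth folding at scale.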
ESM-2 and ESMFold serve a wide range of researchers in computational and experimental biology. As a pretrained sequence encoder, ESM-2 provides state-of-the-art representations for downstream tasks including function annotation, variant effect prediction, protein-protein interaction prediction, and fitness landscape modeling — areas where the 650M and 3B models offer a practical balance of accuracy and resource requirements. ESMFold is particularly valuable for high-throughput structural characterization: metagenomic and environmental sequencing projects routinely generate millions of sequences from unknown organisms where MSA construction is infeasible, and ESMFold enables rapid structural triage of these datasets. The model also supports drug discovery workflows by enabling fast virtual screening, domain boundary identification, and structural clustering across large protein families.
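For the downstream tasks above, a common pattern is to pool ESM-2's per-residue embeddings into a single fixed-size protein-level vector that feeds a classifier or regressor. The sketch below shows mean pooling in self-contained pure Python; in a real pipeline the inputs would be the per-residue vectors from the model's final layer (dimension 1,280 for the 650M model), and `mean_pool` is a hypothetical helper name, not part of any ESM library.

```python
def mean_pool(per_residue_embeddings):
    """Average a list of equal-length per-residue vectors into one
    protein-level vector (one value per embedding dimension)."""
    length = len(per_residue_embeddings)
    dim = len(per_residue_embeddings[0])
    return [sum(vec[d] for vec in per_residue_embeddings) / length
            for d in range(dim)]

# Toy protein of three residues with embedding dimension 4:
toy = [[1.0, 0.0, 2.0, 4.0],
       [3.0, 0.0, 2.0, 0.0],
       [2.0, 3.0, 2.0, 2.0]]
print(mean_pool(toy))  # [2.0, 1.0, 2.0, 2.0]
```

Mean pooling discards per-position detail, which is why residue-level tasks such as variant effect prediction instead read out the embedding (or masked-token logits) at the position of interest.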
The ESM-2 and ESMFold paper provided the field with strong evidence that protein structure is a learnable consequence of sequence statistics at sufficient scale, shifting the theoretical framing of why language models work for proteins. The ESM Metagenomic Atlas made structural biology accessible at metagenomic scale for the first time, and the dataset has since been used in studies ranging from enzyme discovery to viral protein characterization. The ESM-2 encoders became some of the most widely adopted protein representations in the literature, forming the backbone of numerous downstream models. A meaningful limitation is that ESMFold's accuracy trails AlphaFold 2 on hard targets, particularly sequences on which the language model's perplexity is high (typically those with few close evolutionary relatives). The subsequent release of ESM-3 by EvolutionaryScale extended this line of work to a multimodal model that jointly reasons over sequence, structure, and function.
Lin, Z., et al. (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637): 1123–1130. DOI: 10.1126/science.ade2574