ESM-1b (Evolutionary Scale Modeling) is a protein language model developed by Facebook AI Research (FAIR) that was among the first to demonstrate that large-scale unsupervised learning on protein sequences alone is sufficient to encode biologically meaningful structural and functional information. Published in PNAS in April 2021, the work by Rives et al. showed that training a transformer encoder on 250 million diverse protein sequences causes biological properties, from local secondary structure to remote evolutionary relationships, to emerge as learnable features in the model's representations, without any explicit structural supervision.
The central insight of ESM-1b is that the statistical regularities encoded in evolutionary sequence data are rich enough to reconstruct biology at multiple scales. When the model is trained to predict masked amino acid residues from sequence context, it implicitly learns the co-evolutionary constraints that underpin protein folding. Linear probes applied to the resulting representations can recover secondary structure, contact maps, and remote homology with performance competitive with dedicated computational methods that use explicit multiple sequence alignments (MSAs).
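To make the probing setup concrete, below is a minimal sketch of a linear probe trained on frozen per-residue ESM-1b embeddings. The Hugging Face checkpoint name, the three-class labels, and the toy data are illustrative assumptions; the paper's actual probing protocol and datasets differ in detail.

```python
# A minimal linear-probe sketch over frozen ESM-1b per-residue embeddings.
# Assumptions: the Hugging Face checkpoint name, the 3-class labels, and the
# toy sequence are illustrative, not the paper's exact probing protocol.
import torch
from transformers import AutoTokenizer, EsmModel

model_name = "facebook/esm1b_t33_650M_UR50S"   # assumed Hub checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = EsmModel.from_pretrained(model_name)
backbone.eval()                                 # the language model stays frozen

probe = torch.nn.Linear(backbone.config.hidden_size, 3)   # 3-state (Q3) probe

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"             # toy protein
labels = torch.randint(0, 3, (len(sequence),))             # toy per-residue labels

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():                                      # no gradients into the backbone
    hidden = backbone(**inputs).last_hidden_state[0]       # (tokens, 1280)

# Drop the CLS token at position 0 and the EOS token at the end so that
# embeddings line up one-to-one with residues.
residue_embeddings = hidden[1 : len(sequence) + 1]

logits = probe(residue_embeddings)                         # (residues, 3)
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()                                            # updates only the probe's parameters
print(f"probe loss on the toy example: {loss.item():.3f}")
```

Because the forward pass runs under `torch.no_grad()`, gradients flow only into the probe, which is what keeps the evaluation a test of the frozen representations rather than of fine-tuning.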
ESM-1b established the proof of concept for the modern generation of protein language models. It belongs to the broader ESM family from Meta AI, which subsequently grew to include ESM-1v (zero-shot variant effect prediction), ESM-MSA-1b (MSA Transformer), ESM-2 (up to 15B parameters), and ESMFold (structure prediction from a single sequence), making it one of the most consequential foundational works in computational protein science.
ESM-1b is a 33-layer transformer encoder with a hidden dimension of 1,280, 20 attention heads per layer, and a feed-forward intermediate dimension of 5,120, totaling approximately 650 million parameters. It was trained using a masked language modeling (MLM) objective, identical in spirit to BERT, in which 15% of input amino acid tokens are masked and the model is trained to predict the original residue from the remaining sequence context. The training corpus was UR50/S, a high-diversity dataset built by clustering, at roughly 50% sequence identity, an underlying corpus of about 250 million protein sequences (86 billion amino acids) sampled from across the tree of life.
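As a sketch of what the objective looks like in practice, the snippet below applies BERT-style masking to an amino-acid token sequence and computes the masked-token cross-entropy. The 80/10/10 corruption split follows Devlin et al., and the tiny encoder merely stands in for the 33-layer transformer, so treat the details as assumptions rather than the exact ESM-1b recipe.

```python
# Sketch of the masked language modeling objective over amino-acid tokens.
# The 80/10/10 corruption split mirrors BERT (Devlin et al.); the tiny encoder
# below only stands in for the 33-layer transformer to keep the example runnable.
import torch

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(vocab)                     # extra <mask> token appended to the vocabulary
VOCAB_SIZE = len(vocab) + 1

def mask_tokens(token_ids: torch.Tensor, mask_prob: float = 0.15):
    """Return (corrupted_ids, labels); labels are -100 wherever no prediction is required."""
    selected = torch.rand(token_ids.shape) < mask_prob      # choose ~15% of positions
    if not selected.any():                                   # guarantee at least one target
        selected[torch.randint(0, token_ids.numel(), (1,))] = True
    labels = token_ids.clone()
    labels[~selected] = -100                                 # ignored by the loss

    corrupted = token_ids.clone()
    r = torch.rand(token_ids.shape)
    corrupted[selected & (r < 0.8)] = MASK_ID                # 80% of targets -> <mask>
    replace_random = selected & (r >= 0.8) & (r < 0.9)       # 10% -> random residue
    corrupted[replace_random] = torch.randint(0, len(vocab), token_ids.shape)[replace_random]
    # remaining 10% of targets keep the original residue
    return corrupted, labels

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
token_ids = torch.tensor([vocab[aa] for aa in sequence])
corrupted, labels = mask_tokens(token_ids)

# Any encoder mapping token ids to per-position logits could sit here.
embed = torch.nn.Embedding(VOCAB_SIZE, 64)
head = torch.nn.Linear(64, VOCAB_SIZE)
logits = head(embed(corrupted))
loss = torch.nn.functional.cross_entropy(logits, labels, ignore_index=-100)
print(f"masked-LM loss on the toy example: {loss.item():.3f}")
```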
Benchmark evaluation showed ESM-1b achieving secondary structure Q3 accuracy of 71.6%, matching the performance of HMM-profile-based methods (71.2%) and exceeding published RaptorX results (70.6%) on the same CB513 benchmark, all from a linear probe with no fine-tuning of the backbone. Contact prediction from attention maps reached a precision at L/5 (P@L/5, where L is the protein's length) of approximately 0.61 on the ProteinNet test set. Remote homology detection evaluated on the SCOP dataset confirmed that the model's representations encode fold-level similarity without explicit training on structural labels.
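For readers unfamiliar with the contact metric, the following is a minimal sketch of how precision at L/5 is commonly computed: score all long-range residue pairs, keep the L/5 highest-scoring ones, and measure the fraction that are true contacts. The sequence-separation threshold of 24 residues is a common convention assumed here, not a detail taken from the paper.

```python
# Sketch of the P@L/5 contact metric: rank residue pairs by predicted score,
# keep the top L/5 long-range pairs, and report the fraction of true contacts.
# The sequence-separation cutoff of 24 residues is a common convention.
import numpy as np

def precision_at_l5(scores: np.ndarray, contacts: np.ndarray, min_sep: int = 24) -> float:
    L = scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)          # candidate pairs with j - i >= min_sep
    order = np.argsort(scores[i, j])[::-1]        # highest predicted scores first
    top = order[: max(1, L // 5)]                 # keep the top L/5 pairs
    return float(contacts[i[top], j[top]].mean())

# Toy example with random scores and a random symmetric contact map.
rng = np.random.default_rng(0)
L = 120
scores = rng.random((L, L))
contacts = np.triu(rng.random((L, L)) < 0.05, k=1)
contacts = contacts | contacts.T                  # symmetrize the ground truth
print(f"P@L/5 on random toy data: {precision_at_l5(scores, contacts):.3f}")
```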
ESM-1b representations serve as general-purpose protein encoders for a wide range of downstream tasks. Researchers use the frozen embeddings as input features to lightweight prediction heads for secondary structure, solubility, stability, subcellular localization, and post-translational modification classification. In drug discovery and protein engineering workflows, ESM-1b embeddings provide a fast baseline for sequence-based screening of large libraries. The model's attention maps have been used for unsupervised contact prediction, aiding structure modeling of proteins lacking close homologs. ESM-1v, which shares ESM-1b's architecture, demonstrated that log-likelihood scoring from the language model is directly predictive of the effect of point mutations, enabling zero-shot variant effect prediction for clinical and engineering applications. The model is widely accessible through the Hugging Face Transformers library, lowering the barrier for wet-lab biologists to apply it without deep ML expertise.
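As an illustration of the zero-shot scoring idea, the sketch below masks a single position and compares the model's log-probability of the mutant residue against the wild type, in the style of masked-marginal scoring. The checkpoint name is the ESM-1b entry on the Hugging Face Hub; the helper function and toy sequence are illustrative assumptions.

```python
# Sketch of zero-shot variant scoring in the masked-marginal style: mask the
# mutated position and compare the model's log-probabilities of the mutant and
# wild-type residues. Checkpoint name, helper, and toy sequence are illustrative.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

model_name = "facebook/esm1b_t33_650M_UR50S"      # ESM-1b checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)
model.eval()

def mutation_score(sequence: str, index: int, wild_type: str, mutant: str) -> float:
    """log P(mutant) - log P(wild type) at a masked position (0-indexed into the sequence)."""
    assert sequence[index] == wild_type, "wild-type residue does not match the sequence"
    inputs = tokenizer(sequence, return_tensors="pt")
    token_index = index + 1                        # shift by one for the CLS token
    inputs["input_ids"][0, token_index] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**inputs).logits[0, token_index]
    log_probs = torch.log_softmax(logits, dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wild_type)
    mut_id = tokenizer.convert_tokens_to_ids(mutant)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# Score the substitution A4P: alanine at residue 4 (1-indexed), i.e. index 3.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(f"A4P score: {mutation_score(seq, 3, 'A', 'P'):.3f}")
```

Substitutions scored below zero are ones the model considers less likely than the wild-type residue, which is the signal that ESM-1v-style zero-shot variant effect prediction builds on.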
ESM-1b was a defining paper in establishing protein language models as a serious paradigm in computational biology. It catalyzed a wave of subsequent work, including ProtTrans, ESM-2, and ProGen2, and directly influenced how the field thinks about representation learning from sequence data alone. The ESM GitHub repository has accumulated thousands of stars, and the model family has been cited extensively in the structural biology and protein engineering literature. A notable limitation of ESM-1b is its learned absolute positional embeddings, which cap inputs at 1,024 tokens and prevent generalization to sequences longer than those seen during training; ESM-2 addressed this with rotary position embeddings. Despite being superseded in raw performance by later ESM generations and competing models, ESM-1b remains a widely used baseline and a pedagogically important example of emergent representation learning in protein science.