ESMC

Protein language model trained on roughly 2.8 billion sequences, forming the representation core of Biohub's world model of protein biology.

Released: May 2026

ESMC (Evolutionary Scale Modeling Cambrian) is a protein language model released by Biohub on May 27, 2026 as the representation core of its "world model of protein biology." Trained on approximately 2.8 billion protein sequences spanning the tree of life — from bacteria and extremophiles to more than 20,000 types of human proteins — ESMC learns the rules that govern how proteins fold, interact, and function from sequence alone, without multiple sequence alignments or structural inputs.

This entry covers the 2026 Biohub release, which is distinct from the December 2024 ESM Cambrian family (300M / 600M / 6B) published by EvolutionaryScale and documented as a separate ESM Cambrian (2024) entry on bio.rodeo. ESMC shares that lineage but is a retrained, re-released model issued under the Biohub umbrella — the organization formed by combining CZI Science, CZ Biohub, and the acquired EvolutionaryScale, with Alex Rives as Head of Science. Where the 2024 release positioned ESM-C as a standalone representation model, the 2026 version is published as one component of an integrated system.

ESMC anchors a three-part release alongside ESMFold2, a looped-transformer structure prediction and design engine, and the ESM Atlas, a navigable map of 6.8 billion protein sequences and 1.1 billion predicted structures. The work is described in a preprint, "Language Modeling Materializes a World Model of Protein Biology" (Candido et al., 2026), and all components are distributed under the permissive MIT license for commercial and non-commercial use.

Key Features

Sequence-only foundation model: ESMC learns structure- and function-aware representations purely from amino acid sequence, capturing evolutionary signal without requiring MSAs or experimental structures.
Tree-of-life training scale: Trained on ~2.8 billion sequences sampled broadly across biology, including extremophiles and a deep catalog of human proteins, giving wide taxonomic coverage.
Hierarchical internal organization: Analyses report that the model organizes its representations hierarchically — from amino acid chemistry to local structure to higher-level functional concepts — making its internal features interpretable.
MIT-licensed and openly distributed: Released under the MIT license through the Biohub platform, GitHub, and HuggingFace, with sparse autoencoder (SAE) feature sets published for its hidden states.
Core of an integrated system: ESMC supplies the representations consumed by ESMFold2 for structure prediction and binder design and by the ESM Atlas for large-scale navigation of protein space.

Technical Details

ESMC is a transformer-based protein language model trained with a masked language modeling objective on roughly 2.8 billion sequences. The release continues the Cambrian model family, whose largest reported variant is a 6-billion-parameter model (ESMC 6B) with a 2048-token context window; the 2024 lineage used modern transformer design choices including rotary positional embeddings, SwiGLU activations, and pre-LayerNorm. Biohub additionally publishes sparse autoencoders trained on ESMC hidden states across all layers, enabling mechanistic interpretation of the learned features.
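The masked language modeling objective mentioned above can be illustrated with a minimal sketch. It hides a fraction of residues and scores a model by the mean negative log-likelihood it assigns to the hidden residues. The 15% masking rate, the `<mask>` token string, the helper names, and the uniform stand-in "model" are all assumptions made for illustration; the source does not state ESMC's actual masking scheme.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"  # hypothetical mask token string


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Return (corrupted tokens, {position: original residue})."""
    rng = random.Random(seed)
    tokens = list(sequence)
    # Assumed masking rate; at least one position is always masked.
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, targets


def mlm_loss(predicted_probs, targets):
    """Mean negative log-likelihood of the true residues at masked positions.

    predicted_probs maps position -> {residue: probability}.
    """
    nll = [-math.log(predicted_probs[i][aa]) for i, aa in targets.items()]
    return sum(nll) / len(nll)


sequence = "MKTAYIAKQRQISFVKSHFSRQ"
tokens, targets = mask_sequence(sequence)

# Stand-in for a trained model: a uniform guess over the 20 amino acids.
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
predicted = {i: uniform for i in targets}
loss = mlm_loss(predicted, targets)
```

A uniform guess gives a loss of ln(20), which is the baseline a trained model has to beat; training lowers the loss by assigning more probability to the true residue given its sequence context.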

The representations feed ESMFold2, which uses a looped transformer that scales compute at inference time to predict atomic-resolution structures of proteins and biomolecular complexes. The accompanying preprint benchmarks ESMFold2 against Chai-1, Boltz-1, and AlphaFold 3 and reports laboratory validation of binder design: hit rates of 36–88% for compact mini-binders and 15–29% for antibody-derived formats, with some designs reaching nanomolar affinity. The preprint, from Chan Zuckerberg Biohub, was posted to bioRxiv on June 4, 2026 (DOI 10.64898/2026.06.03.729735) under a CC-BY license.
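The idea of a looped transformer that scales compute at inference time can be sketched generically: one block with one set of weights is applied repeatedly, so more loops mean more compute without more parameters. The sketch below is not ESMFold2's architecture, which the source does not detail. The single-head attention block, the function names, and the toy weights are assumptions, and normalization and feed-forward layers are omitted for brevity.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_block(X, Wq, Wk, Wv):
    """One single-head self-attention layer with a residual connection."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    scale = math.sqrt(len(X[0]))
    out = []
    for q, x in zip(Q, X):
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        mixed = [sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(x))]
        out.append([xi + mi for xi, mi in zip(x, mixed)])
    return out


def looped_forward(X, params, n_loops):
    """Apply the same block n_loops times; weights are shared across loops."""
    for _ in range(n_loops):
        X = attention_block(X, *params)
    return X


# Toy 2-dimensional weights and three token vectors (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, identity, [[0.1, 0.0], [0.0, 0.1]])
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

shallow = looped_forward(tokens, params, n_loops=1)
deep = looped_forward(tokens, params, n_loops=4)
```

Here `n_loops` is the inference-time knob: `deep` uses four times the compute of `shallow` while `params` stays the same size.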

Applications

ESMC serves as a general-purpose protein sequence encoder for researchers needing high-quality embeddings for variant effect and fitness prediction, function and property annotation, and similarity search across large sequence collections. As the representation layer of Biohub's world model, it underpins structure prediction and de novo binder design workflows via ESMFold2 and powers exploration of the 6.8-billion-protein ESM Atlas. Distribution partners including AWS Bio Discovery, Benchling, Tamarind Bio, Modal, and SandboxAQ make the model accessible across cloud, lab, and drug-discovery platforms, benefiting protein engineers, computational biologists, and therapeutic discovery teams.
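As an illustration of the similarity-search use case, the sketch below mean-pools per-residue vectors into one embedding per protein and ranks a small database by cosine similarity to a query. The vectors are hand-written toy values rather than ESMC outputs, and the function names, protein names, and mean-pooling choice are assumptions for illustration.

```python
import math


def mean_pool(per_residue):
    """Average per-residue vectors into one fixed-length sequence embedding."""
    n = len(per_residue)
    return [sum(column) / n for column in zip(*per_residue)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query, database, top_k=2):
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query, emb)) for name, emb in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy per-residue embeddings (two residues, three dimensions each).
database = {
    "protein_a": mean_pool([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]),
    "protein_b": mean_pool([[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]),
    "protein_c": mean_pool([[0.5, 0.5, 0.0], [0.7, 0.3, 0.1]]),
}
query = mean_pool([[1.0, 0.1, 0.0], [0.9, 0.0, 0.0]])

hits = search(query, database)
```

In this toy data `protein_a` ranks first and `protein_c` second. With real embeddings the same ranking logic applies, although collections at the scale described here would normally call for an approximate nearest-neighbor index rather than a linear scan.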

Impact

The 2026 ESMC release reframes Biohub's protein language modeling as part of a unified, openly licensed system spanning representation, structure, and large-scale annotation. By shipping ESMC, ESMFold2, and the ESM Atlas together under the MIT license, Biohub lowers the barrier to commercial and academic use of frontier protein models and positions the ESM Atlas — covering 6.8 billion sequences and 1.1 billion structures — as one of the largest applications of AI to protein biology to date. Key caveats follow from its status: the supporting paper is a preprint that has not been peer reviewed, the model remains sequence-only with a bounded context window, and lab-validated binder design hit rates, while strong, vary considerably by format.

Citation

Language Modeling Materializes a World Model of Protein Biology

Candido, S., et al. (2026) Language Modeling Materializes a World Model of Protein Biology. bioRxiv.

DOI: 10.64898/2026.06.03.729735

Recent citations

Papers that recently cited this model.

TheBioCollection: Unified Pre-Training Scale LLM Corpus for Biology
Hyunjin Seo, Hyeon Hwang, Gyubok Lee, et al.
Jul 2026
0
AbICL: In-Context Learning for Antigen-Specific Antibody Affinity Ranking
Zhiyuan Chen, Jing Hu, Junzhe Wang, et al.
Jul 2026
0
Benchmarking AlphaFold and related deep learning approaches for modeling antibody and TCR antigen recognition
Rui Yin, S. Saravanakumar, Shu Yuan Shi, et al.
bioRxiv · Jul 2026
0

Top citations

The most-cited papers that cite this model.

TheBioCollection: Unified Pre-Training Scale LLM Corpus for Biology
Hyunjin Seo, Hyeon Hwang, Gyubok Lee, et al.
Jul 2026
0
Overestimating zero-shot fitness prediction: Broad benchmarks mask local failures and practical limitations
Phillip R. Woolley, Aaron L. Feller, Andrew D. Ellington, et al.
bioRxiv · Jun 2026
0
AbICL: In-Context Learning for Antigen-Specific Antibody Affinity Ranking
Zhiyuan Chen, Jing Hu, Junzhe Wang, et al.
Jul 2026
0
Benchmarking AlphaFold and related deep learning approaches for modeling antibody and TCR antigen recognition
Rui Yin, S. Saravanakumar, Shu Yuan Shi, et al.
bioRxiv · Jul 2026
0
Folding, Reasoning, and Scaling with Open-source Drug Discovery Engine
Aureka AI OpenDDE project
Jul 2026
0 · Influential

Citations

Total Citations: 8

Influential: 2

References: 0

GitHub

Stars: 2.9K

Forks: 365

Open Issues: 83

Contributors: 22

Last Push: 3d ago

Language: Jupyter Notebook

HuggingFace

Downloads: 2.1M

Likes: 20

Last Modified: 1mo ago

Pipeline: fill-mask

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 50%
Chemistry: 13%
Environmental Science: 13%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights · open weights, closed recipe

63 · Partial

Usability (can I run it?): 100

Reproducibility (can I retrain it?): 12

open weights, closed recipe

Model Openness Framework

Unclassified

Missing required components

Resources

GitHub Repository · Research Paper · Official Website · HuggingFace Model


