bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein

ESMC

Biohub

Biohub's 2026 protein language model trained on ~2.8 billion sequences, forming the representation core of its world model of protein biology.

Released: May 2026

ESMC (Evolutionary Scale Modeling Cambrian) is a protein language model released by Biohub on May 27, 2026 as the representation core of its "world model of protein biology." Trained on approximately 2.8 billion protein sequences spanning the tree of life — from bacteria and extremophiles to more than 20,000 types of human proteins — ESMC learns the rules that govern how proteins fold, interact, and function from sequence alone, without multiple sequence alignments or structural inputs.

This entry covers the 2026 Biohub release, which is distinct from the December 2024 ESM Cambrian family (300M / 600M / 6B) published by EvolutionaryScale. ESMC shares that lineage but is a retrained, re-released model issued under the Biohub umbrella — the organization formed by combining CZI Science, CZ Biohub, and the acquired EvolutionaryScale, with Alex Rives as Head of Science. Where the 2024 release positioned ESM-C as a standalone representation model, the 2026 version is published as one component of an integrated system.

ESMC anchors a three-part release alongside ESMFold2, a looped-transformer structure prediction and design engine, and the ESM Atlas, a navigable map of 6.8 billion protein sequences and 1.1 billion predicted structures. The work is described in a preprint, "Language Modeling Materializes a World Model of Protein Biology" (Candido et al., 2026), and all components are distributed under the permissive MIT license for commercial and non-commercial use.

#Key Features

  • Sequence-only foundation model: ESMC learns structure- and function-aware representations purely from amino acid sequence, capturing evolutionary signal without requiring MSAs or experimental structures.
  • Tree-of-life training scale: Trained on ~2.8 billion sequences sampled broadly across biology, including extremophiles and a deep catalog of human proteins, giving wide taxonomic coverage.
  • Hierarchical internal organization: Analyses report that the model organizes its representations hierarchically — from amino acid chemistry to local structure to higher-level functional concepts — making its internal features interpretable.
  • MIT-licensed and openly distributed: Released under the MIT license through the Biohub platform, GitHub, and HuggingFace, with sparse autoencoder (SAE) feature sets published for its hidden states.
  • Core of an integrated system: ESMC supplies the representations consumed by ESMFold2 for structure prediction and binder design and by the ESM Atlas for large-scale navigation of protein space.
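
The sparse autoencoder feature sets mentioned in the list above can be pictured with a small sketch. The entry does not describe the SAE architecture Biohub used, so the code below assumes the common formulation (a ReLU encoder, a linear decoder, and an L1 sparsity penalty on the feature activations). The class name, dimensions, and coefficient are hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy SAE (an assumption, not Biohub's implementation).

    encode: f = ReLU(W_enc (h - b_dec) + b_enc)
    decode: h_hat = W_dec f + b_dec
    """

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, hidden):
        centered = [x - b for x, b in zip(hidden, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        # ReLU keeps only positively activated features.
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, hidden, l1_coeff=1e-3):
        features = self.encode(hidden)
        recon = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(hidden, recon)) / len(hidden)
        # Features are non-negative after ReLU, so their sum is the L1 norm.
        return mse + l1_coeff * sum(features)


# A made-up 4-dimensional "hidden state" expanded into 16 candidate features.
sae = SparseAutoencoder(d_model=4, d_features=16)
hidden_state = [0.5, -1.0, 0.25, 2.0]
active = [i for i, f in enumerate(sae.encode(hidden_state)) if f > 0.0]
```

In a trained SAE of this kind, each index in `active` would correspond to a learned feature that can be inspected across many proteins, which is the sense in which such feature sets support interpretation.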

#Technical Details

ESMC is a transformer-based protein language model trained with a masked language modeling objective on roughly 2.8 billion sequences. The release continues the Cambrian model family, whose largest reported variant is a 6-billion-parameter model (ESMC 6B) with a 2048-token context window; the 2024 lineage used modern transformer design choices including rotary positional embeddings, SwiGLU activations, and pre-LayerNorm. Biohub additionally publishes sparse autoencoders trained on ESMC hidden states across all layers, enabling mechanistic interpretation of the learned features.
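
As a rough illustration of the masked language modeling objective named above, the sketch below hides a fraction of residues in a sequence and scores a predictor by the mean negative log-likelihood of the true residue at each hidden position. The 15% mask rate, the `<mask>` token string, and the function names are assumptions borrowed from common practice; the entry does not state ESMC's actual masking scheme, and the usual refinements (random or unchanged replacements) are omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"                        # hypothetical mask token


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Replace a random subset of residues with MASK.

    Returns the masked token list and a {position: true_residue} dict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


def mlm_loss(predicted_probs, targets):
    """Mean negative log-likelihood of the true residue at masked positions.

    predicted_probs maps position -> {residue: probability}.
    """
    nll = [-math.log(predicted_probs[pos][aa]) for pos, aa in targets.items()]
    return sum(nll) / len(nll)


# Usage with an arbitrary example sequence and a uniform "model".
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
uniform = {pos: {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
           for pos in targets}
baseline = mlm_loss(uniform, targets)
```

A predictor that guesses uniformly over the 20 residues scores ln(20) on this loss, which is the baseline a trained model has to beat.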

The representations feed ESMFold2, which uses a looped transformer that scales compute at inference time to predict atomic-resolution structures of proteins and biomolecular complexes. In the accompanying preprint, ESMFold2 is benchmarked against Chai-1, Boltz-1, and AlphaFold 3, and the system was validated in the lab for binder design, reporting hit rates of 36–88% for compact mini-binders and 15–29% for antibody-derived formats, with some designs reaching nanomolar affinity. Because the preprint had not received a DOI or a bioRxiv/arXiv identifier at the time of writing, it is cited here by its hosted PDF.
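
The idea of a looped transformer, reusing one set of weights for a variable number of passes so that extra inference-time compute refines the output, can be shown with a deliberately tiny toy. This is not ESMFold2's architecture: `shared_block` is a hypothetical stand-in that simply blends the running state with the input, chosen so the effect of more loops is easy to see.

```python
def shared_block(state, inputs, weight=0.5):
    """Hypothetical stand-in for one transformer block.

    The same parameters (here a single blend weight) are used on every pass.
    """
    return [weight * s + (1.0 - weight) * x for s, x in zip(state, inputs)]


def looped_forward(inputs, n_loops):
    """Apply the shared block n_loops times, starting from a zero state."""
    state = [0.0] * len(inputs)
    for _ in range(n_loops):
        state = shared_block(state, inputs)
    return state


def max_error(inputs, n_loops):
    """Largest gap between the looped output and the block's fixed point.

    For this toy block the fixed point is the input itself.
    """
    out = looped_forward(inputs, n_loops)
    return max(abs(o - x) for o, x in zip(out, inputs))


example = [1.0, -2.0, 0.5]
errors = [max_error(example, n) for n in (1, 4, 16)]
```

In this toy the error shrinks as the loop count grows while the parameter count stays fixed. That is the trade the paragraph above describes in general terms: more compute at inference time without a larger model.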

#Applications

ESMC serves as a general-purpose protein sequence encoder for researchers needing high-quality embeddings for variant effect and fitness prediction, function and property annotation, and similarity search across large sequence collections. As the representation layer of Biohub's world model, it underpins structure prediction and de novo binder design workflows via ESMFold2 and powers exploration of the 6.8-billion-protein ESM Atlas. Distribution partners including AWS Bio Discovery, Benchling, Tamarind Bio, Modal, and SandboxAQ make the model accessible across cloud, lab, and drug-discovery platforms, benefiting protein engineers, computational biologists, and therapeutic discovery teams.
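
For the similarity-search use case, a minimal sketch is to mean-pool per-residue embeddings into one vector per protein and rank candidates by cosine similarity. The code assumes the embeddings have already been computed and are available as plain lists of floats. It does not call any ESMC API, and the protein names and vectors are made up for illustration.

```python
import math


def mean_pool(per_residue):
    """Average per-residue vectors into a single per-protein vector."""
    n = len(per_residue)
    dim = len(per_residue[0])
    return [sum(vec[i] for vec in per_residue) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical 2-dimensional embeddings; real ones would be far wider.
index = {
    "protein_a": [1.0, 0.0],
    "protein_b": [0.0, 1.0],
    "protein_c": [1.0, 1.0],
}
query = mean_pool([[1.0, 0.0], [1.0, 0.2]])
hits = top_k(query, index, k=2)
```

A linear scan like this is only practical for small collections. Searching something on the scale of the ESM Atlas would need an approximate nearest-neighbor index, which is outside the scope of this sketch.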

#Impact

The 2026 ESMC release reframes Biohub's protein language modeling as part of a unified, openly licensed system spanning representation, structure, and large-scale annotation. By shipping ESMC, ESMFold2, and the ESM Atlas together under the MIT license, Biohub lowers the barrier to commercial and academic use of frontier protein models and positions the ESM Atlas — covering 6.8 billion sequences and 1.1 billion structures — as one of the largest applications of AI to protein biology to date. Key caveats follow from its status: the supporting paper is a preprint that has not been peer reviewed, the model remains sequence-only with a bounded context window, and lab-validated binder design hit rates, while strong, vary considerably by format.

GitHub

Stars: 2.6K
Forks: 316

HuggingFace

Downloads: 164.6K
Likes: 6

Openness

Unclassified
Restrictive license on core components

Tags

foundation_model, masked_language_modeling, protein_design, protein_language_model, proteomics, representation_learning, structure_prediction, transformer, variant_effect_prediction

Resources

GitHub Repository, Research Paper, Official Website, HuggingFace Model