A family of protein language models (300M, 600M, 6B parameters) for representation learning that substantially outperforms ESM-2 at equivalent or smaller scale.
ESM Cambrian (ESM-C) is a next-generation family of protein language models released by EvolutionaryScale in December 2024. It is a sibling family to the ESM3 generative models: where ESM3 focuses on controllable protein generation, ESM-C is purpose-built for learning high-quality representations of protein biology. Trained at three scales (300M, 600M, and 6B parameters), ESM-C delivers dramatic efficiency gains over its predecessor ESM-2, with the 6B model substantially outperforming ESM-2 15B while being considerably cheaper to run.
The model series addresses a gap that had emerged in the protein language model landscape: the need for a dedicated representation model optimized for downstream tasks rather than generation. Prior ESM models (ESM-1, ESM-2) served this role but were trained with older architectural choices. ESM-C incorporates modern design patterns — rotary positional embeddings, SwiGLU activations, and pre-LayerNorm — that have since become standard in large language model development, resulting in more efficient use of model capacity.
ESM Cambrian was released with open weights for the 300M and 600M variants on HuggingFace, with the 6B model accessible through EvolutionaryScale's Forge platform and AWS SageMaker. A blog post from EvolutionaryScale accompanied the release; a peer-reviewed paper specifically describing ESM-C was not available at the time of publication, though architectural context is provided by the closely related ESM3 paper (Hayes et al., 2025, Science).
ESM-C uses a standard transformer encoder architecture with several modern design choices: Pre-LayerNorm (Pre-LN) normalization for improved training stability, Rotary Position Embeddings (RoPE) instead of absolute positional encodings, SwiGLU activations in feed-forward layers, and no biases in linear layers or layer norms. The training objective is masked language modeling (MLM), predicting masked amino acid tokens from surrounding sequence context. The three model sizes (300M, 600M, 6B) share this architectural template and differ in depth and hidden dimensionality.
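To make these design choices concrete, the following is a minimal PyTorch sketch of one such encoder block. The dimensions, the RoPE variant (rotating halves rather than interleaved pairs), and all names are illustrative assumptions, not the published ESM-C hyperparameters or implementation.

```python
# Minimal sketch of a Pre-LN encoder block with RoPE, SwiGLU, and
# bias-free linear/norm layers. Dimensions are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Rotary position embedding (half-rotation variant).

    x: (batch, heads, seq_len, head_dim) with even head_dim.
    """
    _, _, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, device=x.device) / half))
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (t, half), broadcasts
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: (SiLU(x W_gate) * x W_up) W_down, bias-free."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class EncoderBlock(nn.Module):
    """Pre-LN: normalize before each sublayer, residual connection around it."""

    def __init__(self, d_model: int = 960, n_heads: int = 15, d_ff: int = 2560):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        # bias-free LayerNorm requires PyTorch >= 2.1
        self.ln_attn = nn.LayerNorm(d_model, bias=False)
        self.ln_ffn = nn.LayerNorm(d_model, bias=False)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        h = self.ln_attn(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.head_dim)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        q, k = apply_rope(q), apply_rope(k)  # positions enter via Q/K rotation
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.out(attn.transpose(1, 2).reshape(b, t, -1))
        return x + self.ffn(self.ln_ffn(x))  # second Pre-LN residual


block = EncoderBlock()
tokens = torch.randn(2, 128, 960)   # (batch, residues, hidden)
print(block(tokens).shape)          # torch.Size([2, 128, 960])
```

A full model stacks such blocks over an amino acid token embedding and trains with the MLM objective described above: mask a fraction of residue tokens and predict them from the surrounding context.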
Training data is drawn from three sources: UniRef (clustered at 70% sequence identity, approximately 83 million representative sequences), MGnify (approximately 372 million sequence clusters from EMBL-EBI metagenomics), and JGI (approximately 2 billion clusters from large-scale environmental sequencing). Training proceeds in two stages: stage one at 512-residue context length for approximately 1 million steps, with metagenomic sequences comprising 64% of each batch, and stage two at 2048-residue context length for approximately 500,000 steps with a reduced metagenomic weighting (37.5%). On contact prediction benchmarks evaluated on CASP15 targets, the 6B model substantially outperforms ESM-2 15B on contact precision (P@L, the fraction of the top L predicted contacts that are correct, where L is the sequence length) despite having fewer parameters.
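As an illustration of the reported metric, the sketch below computes P@L from a predicted contact map and a binary ground-truth map. The minimum sequence separation and the contact definition are common evaluation conventions assumed here, not details taken from the ESM-C release.

```python
# Illustrative computation of contact precision at L (P@L): the fraction
# of the top-L scoring residue pairs that are true contacts, where L is
# the sequence length.
import numpy as np


def precision_at_L(pred: np.ndarray, truth: np.ndarray, min_sep: int = 6) -> float:
    """pred: (L, L) predicted contact probabilities; truth: (L, L) binary map.

    Only pairs i < j with j - i >= min_sep are scored, a common convention
    that excludes trivial short-range contacts.
    """
    L = pred.shape[0]
    iu, ju = np.triu_indices(L, k=min_sep)       # upper-triangle pairs, j - i >= min_sep
    order = np.argsort(pred[iu, ju])[::-1][:L]   # indices of the top-L scoring pairs
    return float(truth[iu[order], ju[order]].mean())


# Toy usage with random maps; real evaluation derives the truth map from
# experimental structures (e.g., a C-beta distance cutoff of 8 angstroms).
rng = np.random.default_rng(0)
L = 100
pred = rng.random((L, L))
truth = (rng.random((L, L)) < 0.05).astype(int)
print(f"P@L = {precision_at_L(pred, truth):.3f}")
```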
ESM-C is designed as a general-purpose protein sequence encoder for researchers who require high-quality embeddings without the overhead of a generative model. Core use cases include fitness and variant effect prediction (embedding wild-type and mutant sequences to predict the consequences of amino acid substitutions), protein function classification (using pooled embeddings as input features for GO term, EC number, or subcellular localization classifiers), and retrieval and clustering (identifying functionally similar proteins across large databases). Per-residue representations support contact and secondary structure prediction tasks. The 300M and 600M open-weight models are particularly well-suited for labs that need to embed large sequence datasets in resource-constrained environments or integrate ESM-C as a pre-trained backbone in custom fine-tuning pipelines.
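As a sketch of the embedding workflow for the open-weight checkpoints, the snippet below follows the client pattern documented in EvolutionaryScale's esm package at release (ESMC, ESMProtein, LogitsConfig); the exact interface may have evolved, and mean-pooling over residues is one common, not mandated, way to reduce per-residue representations to a sequence-level vector.

```python
# Sketch of embedding extraction with the open ESM-C 300M weights,
# following the client pattern documented in EvolutionaryScale's esm
# package at release; verify names against the current package docs.
from esm.models.esmc import ESMC
from esm.sdk.api import ESMProtein, LogitsConfig

client = ESMC.from_pretrained("esmc_300m").to("cpu")  # or "cuda"

protein = ESMProtein(sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
protein_tensor = client.encode(protein)
output = client.logits(
    protein_tensor,
    LogitsConfig(sequence=True, return_embeddings=True),
)

# output.embeddings holds per-residue representations (including any
# special tokens the tokenizer adds, which downstream code may want to
# strip). Mean-pooling yields a fixed-size sequence embedding suitable
# as input to classifiers or retrieval systems.
per_residue = output.embeddings              # (1, tokens, hidden_dim)
sequence_embedding = per_residue.mean(dim=1)  # (1, hidden_dim)
print(sequence_embedding.shape)
```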
ESM-C represents a significant step in making powerful protein language model embeddings accessible at multiple scales and price points. By separating representation learning from generative modeling, EvolutionaryScale provides practitioners with a cleaner tool for the large class of tasks that require embeddings rather than sequences. The efficiency improvements over ESM-2 lower the barrier for groups without access to large GPU clusters. Notable limitations include the sequence-only input (no structural information, MSAs, or other modalities), a 2048-residue context length cap that requires truncation strategies for long proteins, and the non-commercial licensing restriction on the 600M model. ESM-C is not intended for protein design or conditional generation; the ESM3 model family serves those applications. The release of open weights for two of the three model sizes, including one with commercial permissions, sets a positive precedent for accessibility in the protein foundation model ecosystem.