bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein

Reverse Distillation (ESM-2)

Duke University

Reverse-distilled ESM-2 checkpoints (up to 15B) producing Matryoshka-style nested embeddings that scale consistently and reach state of the art on ProteinGym.

Released: March 2026
Parameters: 15 Billion

Protein language models (PLMs) such as ESM-2 do not enjoy the smooth, predictable scaling laws seen in natural language and vision. For many downstream tasks, larger models within the same family plateau or even regress relative to smaller ones, making it hard to know which checkpoint to use and undermining the usual "bigger is better" expectation. Reverse Distillation, introduced by Darius Catrina, Christian Bepler, Samuel Sledzieski, and Rohit Singh at Duke University (ICLR 2026), is a post-hoc method that restores monotonic scaling to ESM-2 representations.

The core idea is to decompose a large PLM's embedding into orthogonal subspaces guided by smaller models of the same family. Smaller models, constrained by capacity, preferentially encode broadly shared protein features; reverse distillation isolates these shared features and then orthogonally appends the additional information that only the larger model captures. The result is a Matryoshka-style nested embedding in which the first k dimensions of a larger model's representation are exactly the embedding of the next-smaller model, so a 15B representation strictly contains the 3B representation, which contains the 650M, and so on.
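
The decomposition described above can be illustrated with a toy sketch. The source does not spell out the fitting procedure, so this example swaps in plain Gram-Schmidt projection: it keeps the small model's features as a prefix and appends only the part of the large model's features that is orthogonal to them across a set of proteins. The function names (`orthonormal_basis`, `residualize`, `nested_embedding`) and the toy numbers are hypothetical, and this is not the authors' implementation.

```python
# Illustrative sketch only, not the released reverse_distillation code.
# Embeddings are stored column-wise: each feature is a list of values,
# one value per protein.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(columns, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the feature columns."""
    basis = []
    for col in columns:
        r = list(col)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # skip columns that add no new direction
            basis.append([ri / norm for ri in r])
    return basis

def residualize(columns, basis):
    """Remove from each column the component lying in the span of `basis`."""
    out = []
    for col in columns:
        r = list(col)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        out.append(r)
    return out

def nested_embedding(small_cols, large_cols):
    """Small-model features first, then the large model's orthogonal residual."""
    basis = orthonormal_basis(small_cols)
    return list(small_cols) + residualize(large_cols, basis)

# Toy data (made up): 4 proteins, 2 small-model and 2 large-model features.
small = [[1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0]]
large = [[2.0, 1.0, 2.0, 1.0],   # a linear mix of the small model's features
         [1.0, 0.0, -1.0, 0.0]]  # signal the small model does not carry

nested = nested_embedding(small, large)
```

In this toy case the first large-model feature collapses to (numerically) zero because the small model already explains it, while the second survives unchanged. A practical pipeline would presumably also compress the residual to a fixed width, which this sketch omits.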

Because the procedure operates on top of the pretrained ESM-2 family rather than retraining a new backbone, the released checkpoints behave as drop-in replacements for stock ESM-2 in embedding-generation workflows. This is explicitly an ESM-2-specific technique: the distillation is anchored to the ESM-2 8M → 35M → 150M → 650M → 3B → 15B ladder.

#Key Features

  • Consistent scaling: Reverse-distilled variants outperform their stock ESM-2 baselines at matched embedding dimensionality, and larger models reliably beat smaller ones rather than plateauing.
  • Matryoshka nested embeddings: The first k dimensions of a larger model's embedding are identical to the smaller model's representation, so truncating to a chosen budget yields the best model at that dimensionality without recomputation.
  • Orthogonal subspace decomposition: Shared, capacity-robust features are separated from the incremental signal that only larger models encode, clarifying what additional scale actually contributes.
  • Drop-in ESM-2 replacement: The checkpoints slot into existing ESM-2 embedding pipelines, and the package is pip-installable (pip install reverse_distillation).
  • Open weights and code: Pretrained reverse-distillation transforms for the ESM-2 sizes from 35M through 15B are released on HuggingFace under the MIT license, alongside the GitHub implementation.
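
The truncation property in the Matryoshka bullet above can be sketched as a simple prefix slice. The widths below are the hidden sizes of the stock ESM-2 checkpoints; that the nested prefixes end at exactly these widths is an assumption made for illustration, and `truncate_to_budget` is a hypothetical helper, not part of the released package.

```python
# Hidden sizes of the stock ESM-2 checkpoints. Assumption: each nested prefix
# of a reverse-distilled embedding ends at the matching stock width.
ESM2_DIMS = {"8M": 320, "35M": 480, "150M": 640,
             "650M": 1280, "3B": 2560, "15B": 5120}

def truncate_to_budget(embedding, budget_dim):
    """Return (rung_name, prefix) for the largest ESM-2 rung within the budget."""
    fitting = [(dim, name) for name, dim in ESM2_DIMS.items()
               if dim <= budget_dim and dim <= len(embedding)]
    if not fitting:
        raise ValueError(f"no ESM-2 rung fits a budget of {budget_dim} dims")
    dim, name = max(fitting)
    return name, embedding[:dim]

# Stand-in for one protein's 15B-scale nested embedding (placeholder values).
full = [float(i) for i in range(ESM2_DIMS["15B"])]

name, vec = truncate_to_budget(full, budget_dim=1500)
print(name, len(vec))  # 650M 1280
```

Because the selection is only a slice, one stored 15B-scale vector can serve every smaller budget without running a second model.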

#Technical Details

Reverse Distillation is applied across the full ESM-2 transformer ladder (8M, 35M, 150M, 650M, 3B, and 15B parameters), with the 15B reverse-distilled model achieving the strongest performance reported. Rather than training a new network from scratch, the method fits transforms that project each model's embeddings into a nested orthogonal basis defined by the smaller members of the family, preserving the lower-dimensional representation exactly as a prefix of the higher-dimensional one. On the ProteinGym deep mutational scanning benchmark, reverse-distilled ESM-2 variants outperform their respective baselines at the same embedding dimensionality, with the 15B variant reported as state of the art. The released artifacts on HuggingFace (singhlab/plm_reverse_distillation) cover esm2.rd/35M through esm2.rd/15B; each builds on the corresponding facebook/esm2_* base model, so the original ESM-2 weights remain part of the pipeline.

#Applications

The primary application is zero-shot and supervised variant effect prediction, where ProteinGym is the standard yardstick and improved, dimensionality-matched embeddings translate directly into better fitness and stability estimates. More broadly, the nested embeddings benefit any task that consumes ESM-2 features (function and property prediction, similarity search, and clustering), because practitioners can pick an embedding-size budget and obtain the best-performing representation at that size without retraining. Teams already standardized on ESM-2 can adopt the checkpoints with minimal code changes, gaining more reliable scaling behavior across compute budgets.
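
As a rough illustration of the similarity-search use case, the sketch below runs the same cosine search at two embedding budgets by slicing one stored vector per protein. The protein names, the 4-dimensional vectors, and the `nearest` helper are all made up for the example; real nested embeddings would be hundreds to thousands of dimensions wide.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def nearest(query, database, dim):
    """Name of the entry most similar to `query`, using only the first `dim` dims."""
    q = query[:dim]
    return max(database, key=lambda name: cosine(q, database[name][:dim]))

# Toy "nested" embeddings (made-up numbers, hypothetical protein names).
database = {
    "protA": [1.0, 0.0, 0.9, 0.1],
    "protB": [0.9, 0.1, -0.8, 0.2],
}
query = [1.0, 0.05, -0.7, 0.3]

coarse = nearest(query, database, dim=2)  # cheap search on the 2-dim prefix
fine = nearest(query, database, dim=4)    # search on the full vector
print(coarse, fine)  # protA protB
```

The two budgets disagree in this toy case, which is the trade-off in miniature: the short prefix is cheaper to store and compare, while the extra dimensions carry signal that can change the ranking.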

#Impact

Reverse Distillation targets a well-known frustration in the protein modeling community: that scaling up ESM-2 does not dependably improve downstream performance. By restoring monotonic scaling and packaging the result as nested, truncatable embeddings, the work gives practitioners a principled way to trade off compute against accuracy and to justify using larger checkpoints. As a lightweight, openly licensed post-processing layer over the widely adopted ESM-2 family, it is positioned for easy uptake in existing pipelines. Its main limitation is scope: the method is tied specifically to the ESM-2 family and inherits ESM-2's underlying representational ceiling, so it complements rather than replaces newer foundation-model backbones.

Tags

variant_effect_prediction, representation_learning, transformer, knowledge_distillation, embeddings, proteomics