Reverse Distillation (ESM-2)

Post-hoc method that restores monotonic scaling to ESM-2 embeddings, yielding Matryoshka-style nested representations for variant effect prediction.

Released: March 2026

Parameters: 15 Billion

Protein language models (PLMs) such as ESM-2 do not enjoy the smooth, predictable scaling laws seen in natural language and vision. For many downstream tasks, larger models within the same family plateau or even regress relative to smaller ones, making it hard to know which checkpoint to use and undermining the usual "bigger is better" expectation. Reverse Distillation, introduced by Darius Catrina, Christian Bepler, Samuel Sledzieski, and Rohit Singh at Duke University (ICLR 2026), is a post-hoc method that restores monotonic scaling to ESM-2 representations.

The core idea is to decompose a large PLM's embedding into orthogonal subspaces guided by smaller models of the same family. Smaller models, constrained by capacity, preferentially encode broadly shared protein features; reverse distillation isolates these shared features and then orthogonally appends the additional information that only the larger model captures. The result is a Matryoshka-style nested embedding in which the first k dimensions of a larger model's representation are exactly the embedding of the next-smaller model, so a 15B representation strictly contains the 3B representation, which contains the 650M, and so on.
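The decomposition described above can be illustrated with a small linear sketch. This is not the authors' implementation: the function name `nested_embedding`, the least-squares fit, and the use of SVD to pick residual directions are all assumptions chosen to show the idea of keeping the smaller model's embedding as an exact prefix and appending only the components it cannot linearly explain.

```python
import numpy as np

def nested_embedding(small, large, n_extra):
    """Append to `small` the part of `large` it cannot linearly explain.

    small:   (n, d_s) embeddings from the smaller model
    large:   (n, d_l) embeddings from the larger model (same proteins, same order)
    n_extra: number of extra dimensions to append
    Hypothetical helper for illustration only.
    """
    # Least-squares map from the small embedding to the large one.
    W, *_ = np.linalg.lstsq(small, large, rcond=None)
    # Residual: what the large model encodes beyond a linear readout of `small`.
    # By the normal equations, its columns are orthogonal to the columns of `small`.
    residual = large - small @ W
    # Principal directions of the residual; keep the top n_extra components.
    U, S, _ = np.linalg.svd(residual, full_matrices=False)
    extra = U[:, :n_extra] * S[:n_extra]
    # The small embedding is kept unchanged as the prefix.
    return np.hstack([small, extra])
```

Because the appended columns lie in the span of the residual, they are orthogonal (over the sample) to every column of the smaller embedding, and truncating the result to the first `d_s` columns returns the smaller embedding unchanged.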

Because the procedure operates on top of the pretrained ESM-2 family rather than retraining a new backbone, the released checkpoints behave as drop-in replacements for stock ESM-2 in embedding-generation workflows. This is explicitly an ESM-2-specific technique: the distillation is anchored to the ESM-2 8M → 35M → 150M → 650M → 3B → 15B ladder.

Key Features

Consistent scaling: Reverse-distilled variants outperform their stock ESM-2 baselines at matched embedding dimensionality, and larger models reliably beat smaller ones rather than plateauing.
Matryoshka nested embeddings: The first k dimensions of a larger model's embedding are identical to the smaller model's representation, so truncating to a chosen budget yields the best model at that dimensionality without recomputation.
Orthogonal subspace decomposition: Shared, capacity-robust features are separated from the incremental signal that only larger models encode, clarifying what additional scale actually contributes.
Drop-in ESM-2 replacement: The checkpoints slot into existing ESM-2 embedding pipelines, and the package is pip-installable (pip install reverse_distillation).
Open weights and code: Pretrained reverse-distillation transforms for every ESM-2 size are released on HuggingFace under the MIT license alongside the GitHub implementation.
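The truncation property in the list above can be sketched as a prefix slice. The widths below are the hidden sizes of the stock ESM-2 checkpoints; that the reverse-distilled embeddings use exactly these widths is an assumption of this sketch, and `truncate_to_budget` is a hypothetical helper, not part of the released package.

```python
import numpy as np

# Hidden sizes of the stock ESM-2 checkpoints. Assumed (not confirmed here)
# to be the prefix boundaries of the reverse-distilled embeddings.
ESM2_DIMS = {"8M": 320, "35M": 480, "150M": 640,
             "650M": 1280, "3B": 2560, "15B": 5120}

def truncate_to_budget(embedding, max_dim):
    """Keep the largest ladder prefix that fits within `max_dim` dimensions.

    embedding: array whose last axis is the embedding dimension
    Returns (model size label, truncated embedding).
    """
    limit = min(max_dim, embedding.shape[-1])
    fitting = [(dim, name) for name, dim in ESM2_DIMS.items() if dim <= limit]
    if not fitting:
        raise ValueError(f"no ESM-2 width fits within {limit} dimensions")
    dim, name = max(fitting)
    # Nested embeddings need no recomputation: slicing the prefix is enough.
    return name, embedding[..., :dim]
```

For example, a 15B-sized embedding truncated with a budget of 1500 dimensions would keep its first 1280 columns, corresponding to the 650M rung of the ladder.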

Technical Details

Reverse Distillation is applied across the full ESM-2 transformer ladder (8M, 35M, 150M, 650M, 3B, and 15B parameters). Rather than training a new network from scratch, the method fits transforms that project each model's embeddings into a nested orthogonal basis defined by the smaller members of the family, preserving the lower-dimensional representation exactly as a prefix of the higher-dimensional one. On the ProteinGym deep mutational scanning benchmark, reverse-distilled ESM-2 variants outperform their respective baselines at the same embedding dimensionality, and the 15B variant achieves the strongest performance reported, described as state of the art. The released artifacts on HuggingFace (singhlab/plm_reverse_distillation) cover esm2.rd/35M through esm2.rd/15B; each builds on the corresponding facebook/esm2_* base model, so the original ESM-2 weights remain part of the pipeline.
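Since each released artifact depends on a stock ESM-2 checkpoint, a simple lookup can make the pairing explicit. The base model IDs are the public facebook/esm2_* names on HuggingFace; the intermediate artifact names (150M, 650M, 3B) are inferred from the "esm2.rd/35M through esm2.rd/15B" pattern and should be treated as an assumption, as should the hypothetical helper `resolve`. How the package itself loads these artifacts is not covered here.

```python
# Stock ESM-2 checkpoints on the HuggingFace Hub, keyed by parameter count.
ESM2_BASE_MODELS = {
    "8M": "facebook/esm2_t6_8M_UR50D",
    "35M": "facebook/esm2_t12_35M_UR50D",
    "150M": "facebook/esm2_t30_150M_UR50D",
    "650M": "facebook/esm2_t33_650M_UR50D",
    "3B": "facebook/esm2_t36_3B_UR50D",
    "15B": "facebook/esm2_t48_15B_UR50D",
}

# Reverse-distilled sizes described as released (35M through 15B).
# The intermediate entries are inferred from that range, not listed verbatim.
RD_SIZES = ["35M", "150M", "650M", "3B", "15B"]

def resolve(size):
    """Return (reverse-distilled artifact name, base ESM-2 model ID)."""
    if size not in RD_SIZES:
        raise ValueError(f"no reverse-distilled artifact assumed for size {size!r}")
    return f"esm2.rd/{size}", ESM2_BASE_MODELS[size]
```

Keeping the base model ID next to the artifact name is a reminder that both sets of weights are needed at inference time.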

Applications

The primary application is zero-shot and supervised variant effect prediction, where ProteinGym is the standard yardstick and improved, dimensionality-matched embeddings translate directly into better fitness and stability estimates. More broadly, the nested embeddings benefit any task that consumes ESM-2 features (function and property prediction, similarity search, and clustering), because practitioners can pick an embedding-size budget and obtain the best-performing representation at that size without retraining. Teams already standardized on ESM-2 can adopt the checkpoints with minimal code changes, gaining more reliable scaling behavior across compute budgets.

Impact

Reverse Distillation targets a well-known frustration in the protein modeling community: that scaling up ESM-2 does not dependably improve downstream performance. By restoring monotonic scaling and packaging the result as nested, truncatable embeddings, the work gives practitioners a principled way to trade off compute against accuracy and to justify using larger checkpoints. As a lightweight, openly licensed post-processing layer over the widely adopted ESM-2 family, it is positioned for easy uptake in existing pipelines. Its main limitation is scope: the method is tied specifically to the ESM-2 family and inherits ESM-2's underlying representational ceiling, so it complements rather than replaces newer foundation-model backbones.

Citation

Reverse Distillation: Consistently Scaling Protein Language Model Representations

Preprint

Catrina, D., et al. (2026) Reverse Distillation: Consistently Scaling Protein Language Model Representations.

DOI: 10.48550/arXiv.2603.07710

Recent citations

Papers that recently cited this model.

Breaking the Synthesis Barrier for AI-Designed DNA Libraries
Scott Sussex, Ema Borevković, F. Lohmann, et al.
bioRxiv · Jul 2026 · 0 citations

Improving Variant Effect Prediction by Steering Sparse Mechanistic Features in Protein Language Models
Mingqing Wang, Meng Yuan, Athanasios V. Vasilakos, et al.
bioRxiv · May 2026 · 0 citations

Citations

Total Citations: 2

GitHub

Stars: 4

Forks: 1

HuggingFace

Downloads: 13

Likes: 0

Fields of citing research

Biology: 100%
Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: 58 (Partial) · open weights, closed recipe

Usability (can I run it?): 92

Reproducibility (can I retrain it?): 29

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model
