D2D

Vrije Universiteit Brussel / Université libre de Bruxelles

Variant effect predictor pairing a protein language model with family-specific evolutionary constraints to score stability, binding, and epistasis.

Released: May 2026

D2D is a framework for predicting the effects of amino-acid substitutions on protein function, developed by Konstantina Tzavella, Adrian Olsen, and Wim Vranken at the Vrije Universiteit Brussel (Tzavella is also affiliated with the Institut Jules Bordet / Université libre de Bruxelles). Introduced in a 2026 bioRxiv preprint, D2D pairs a self-supervised protein language model with protein-specific evolutionary information to score single-mutation effects on thermostability and binding regions, as well as epistatic interactions among multiple substitutions, largely without task-specific training.

The central idea is that a general-purpose language model trained on protein sequences captures broad biochemical regularities but misses constraints specific to a given protein family. D2D supplies that missing signal by fitting a Gaussian mixture model (GMM) over per-residue embeddings of a wild-type protein's multiple sequence alignment (MSA), producing "D2D features" that quantify how strongly each position is evolutionarily constrained. Combining these features with the language model yields predictions that remain accurate in disordered and surface-binding regions, where structure-based methods typically degrade.
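The idea of scoring a substitution against a mixture model fitted to alignment embeddings can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the 2-D vectors stand in for high-dimensional language-model embeddings of one alignment column, the two-component diagonal-covariance mixture and the log-likelihood-difference score (`constraint_score`) are hypothetical choices, and the source does not specify how D2D features are actually computed.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_likelihood(x, model):
    """Log-likelihood of x under the mixture (log-sum-exp over components)."""
    weights, means, variances = model
    logs = [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))


def fit_diag_gmm(points, n_components=2, n_iter=50, seed=0, var_floor=1e-3):
    """Fit a diagonal-covariance Gaussian mixture with plain EM."""
    rng = random.Random(seed)
    dim = len(points[0])
    means = [list(p) for p in rng.sample(points, n_components)]
    variances = [[1.0] * dim for _ in range(n_components)]
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for p in points:
            logs = [math.log(w) + log_gauss(p, m, v)
                    for w, m, v in zip(weights, means, variances)]
            top = max(logs)
            exps = [math.exp(l - top) for l in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means, and per-dimension variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(points)
            for d in range(dim):
                mu = sum(r[k] * p[d] for r, p in zip(resp, points)) / nk
                var = sum(r[k] * (p[d] - mu) ** 2
                          for r, p in zip(resp, points)) / nk
                means[k][d] = mu
                variances[k][d] = max(var, var_floor)
    return weights, means, variances


def constraint_score(variant, wild_type, model):
    """Hypothetical score: how much less likely the variant embedding is
    than the wild-type embedding under the column's mixture model."""
    return log_likelihood(variant, model) - log_likelihood(wild_type, model)


# Toy embeddings for one alignment column: two clusters of homologs.
rng = random.Random(1)
column = (
    [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(40)]
    + [[rng.gauss(3.0, 0.3), rng.gauss(3.0, 0.3)] for _ in range(20)]
)
model = fit_diag_gmm(column)

wild_type = [0.1, -0.1]     # sits inside the first cluster
tolerated = [2.9, 3.1]      # resembles residues seen in other homologs
disruptive = [10.0, -10.0]  # unlike anything in the alignment

print(constraint_score(tolerated, wild_type, model))
print(constraint_score(disruptive, wild_type, model))
```

A substitution whose embedding resembles what homologs already carry at that position keeps a score close to the wild type, while one far from every cluster is penalized heavily. In practice a library implementation of Gaussian mixtures would replace the hand-written EM loop.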

D2D is the zero-shot generalization of the lab's earlier supervised model D2Deep (Briefings in Bioinformatics, peer-reviewed). D2Deep is a single-task, supervised classifier trained for variant pathogenicity (cancer driver-mutation prediction); D2D lifts the same evolutionary-constraint feature core into a broad, mostly training-free framework spanning multiple biophysical tasks. When fine-tuned for cancer driver prediction, D2D reuses D2Deep's trained classifier and reports state-of-the-art performance on that task.

Key Features

Evolutionary constraints over LLM embeddings: A GMM fit to wild-type MSA embeddings (the "D2D features") injects protein-specific conservation signal that a generic language model alone does not capture.
Zero-shot across tasks: Stability, binding-region, and epistasis predictions require no task-specific training, making the method broadly applicable out of the box.
Robust in disordered and surface regions: Because it does not depend on a resolved 3D structure, D2D remains effective in intrinsically disordered and surface-binding regions where structure-based predictors struggle.
Higher-order epistasis: Captures both single-mutation effects and interactions among multiple substitutions, not just additive single-site scores.
Optional supervised cancer-driver mode: When fine-tuned, it reuses the peer-reviewed D2Deep classifier to predict driver mutations at state-of-the-art accuracy.
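To make the higher-order epistasis item concrete, the sketch below computes an interaction term as the deviation of a multi-mutant score from what its lower-order subsets predict, using the standard inclusion-exclusion definition. The source does not state which epistasis formula D2D uses, so this convention is an assumption; the mutation labels and score values are made up for illustration.

```python
from itertools import combinations


def epistasis(scores, mutations):
    """Interaction term for a set of mutations via inclusion-exclusion.

    `scores` maps a frozenset of mutation labels to a predicted effect;
    the empty frozenset is the wild type. For two mutations this reduces
    to s(AB) - s(A) - s(B) + s(wt).
    """
    muts = tuple(mutations)
    total = 0.0
    for size in range(len(muts) + 1):
        sign = (-1) ** (len(muts) - size)
        for subset in combinations(muts, size):
            total += sign * scores[frozenset(subset)]
    return total


# Hypothetical predicted effects (more negative = more damaging).
scores = {
    frozenset(): 0.0,
    frozenset({"A12G"}): -0.5,
    frozenset({"L45P"}): -1.0,
    frozenset({"A12G", "L45P"}): -2.5,  # worse than the additive -1.5
}

print(epistasis(scores, ["A12G", "L45P"]))  # negative: the pair is worse than additive
```

A value of zero means the mutations combine additively; a non-zero value for a pair, triple, or larger set is the part of the effect that single-site scores alone cannot explain.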

Technical Details

D2D consumes embeddings from ProtT5-XL, a self-supervised transformer protein language model. ProtT5-XL is an external dependency loaded from HuggingFace; D2D does not redistribute it as its own checkpoint. For a query protein, D2D builds an MSA, embeds the aligned wild-type sequences, and fits a per-position Gaussian mixture model whose statistics form the D2D features used to score substitutions. The supervised cancer-driver capability is inherited directly from D2Deep's trained weights, which are released on GitHub along with the code; the associated test data are on Zenodo (CC-BY-4.0). Because predictions are built on ProtT5-XL, D2D inherits that model's input ceiling of roughly 2,200 residues. The 2026 D2D manuscript is a preprint and has not yet been peer-reviewed; the D2Deep predecessor is published in Briefings in Bioinformatics.
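As a small illustration of what the ProtT5-XL dependency implies for inputs, the sketch below prepares a sequence the way ProtT5's published usage instructions describe (rare residues U, Z, O, B mapped to X, residues separated by spaces) and rejects sequences above the length ceiling. The helper names and the `A12G`-style mutation syntax are hypothetical, and `MAX_RESIDUES` simply encodes the approximate 2,200-residue figure quoted above rather than an exact limit; none of this is taken from the D2D code.

```python
import re

MAX_RESIDUES = 2200  # approximate ceiling quoted in the text, not an exact limit


def prepare_prott5_input(sequence, max_residues=MAX_RESIDUES):
    """Format a protein sequence for a ProtT5 tokenizer."""
    seq = sequence.strip().upper()
    if len(seq) > max_residues:
        raise ValueError(
            f"sequence has {len(seq)} residues; limit is about {max_residues}"
        )
    seq = re.sub(r"[UZOB]", "X", seq)  # map rare/ambiguous residues to X
    return " ".join(seq)               # ProtT5 expects space-separated residues


def apply_substitution(sequence, mutation):
    """Apply a substitution written like 'K2R' (1-based position)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild, pos, new = match.group(1), int(match.group(2)), match.group(3)
    if pos < 1 or pos > len(sequence) or sequence[pos - 1] != wild:
        raise ValueError(f"{mutation!r} does not match the sequence")
    return sequence[:pos - 1] + new + sequence[pos:]


variant = apply_substitution("MKTA", "K2R")
print(prepare_prott5_input(variant))
```

Checking the wild-type residue before substituting catches off-by-one numbering errors, which are a common source of silently wrong variant scores.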

Applications

D2D is aimed at researchers studying how mutations alter protein behavior: estimating stability changes, mapping binding regions, dissecting epistatic interactions in deep mutational scanning data, and, via the supervised D2Deep mode, prioritizing candidate cancer driver mutations. Because it does not require an experimental structure, it is especially useful for intrinsically disordered proteins and surface-binding interfaces that resist structure-based analysis. The D2Deep predecessor is accessible through a web server at tumorscope.be/d2deep; this server runs the supervised model only, as no dedicated D2D server exists yet.

Impact

D2D illustrates a practical recipe for protein variant-effect prediction: augment a general protein language model with cheaply computed, protein-specific evolutionary statistics instead of retraining large models per task. By generalizing the peer-reviewed D2Deep classifier into a zero-shot, multi-task framework, the work extends a validated pathogenicity tool toward stability, binding, and epistasis while highlighting performance in the disordered and surface regions that structure-based methods handle poorly. As a preprint that relies on an external ProtT5-XL dependency and D2Deep's supervised weights for its cancer-driver mode, its long-term influence will depend on peer review and broader benchmarking, but it offers a lightweight, openly coded path to evolution-aware mutation scoring.

Citations

Constrained protein Large Language Model illustrated in protein stability, function and epistasis

Tzavella, K., et al. (2026) Constrained protein Large Language Model illustrated in protein stability, function and epistasis. bioRxiv.

DOI: 10.64898/2026.05.22.726784

Combining evolution and protein language models for an interpretable cancer driver mutation prediction with D2Deep

Tzavella, K., et al. (2024) Combining evolution and protein language models for an interpretable cancer driver mutation prediction with D2Deep. Briefings in Bioinformatics.

DOI: 10.1093/bib/bbae664

Citations

Total Citations: 0

Influential: 0

References: 0

GitHub

Stars: 1

Forks: 0

Open Issues: 0

Contributors: 1

Last Push: 1mo ago

Language: Jupyter Notebook

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 29 (Closed)

Usability (can I run it?): 13

Reproducibility (can I retrain it?): 43

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

GitHub Repository, Demo, Dataset

