Vrije Universiteit Brussel / Université libre de Bruxelles
Combines the ProtT5-XL protein language model with protein-specific evolutionary constraints to predict mutational effects on stability, binding, and epistasis—largely zero-shot.
D2D is a framework for predicting the effects of amino-acid substitutions on protein function, developed by Konstantina Tzavella, Adrian Olsen, and Wim Vranken at the Vrije Universiteit Brussel (with Tzavella also affiliated with the Institut Jules Bordet / Université libre de Bruxelles). Introduced in a 2026 bioRxiv preprint, D2D pairs a self-supervised protein language model with protein-specific evolutionary information to score mutational effects on thermostability, binding regions, and both single- and higher-order epistasis, largely without task-specific training.
The central idea is that a general-purpose language model trained on protein sequences captures broad biochemical regularities but misses constraints specific to a given protein family. D2D supplies that missing signal by fitting a Gaussian mixture model (GMM) over per-residue embeddings of a wild-type protein's multiple sequence alignment (MSA), producing "D2D features" that quantify how strongly each position is evolutionarily constrained. Combining these features with the language model yields predictions that remain accurate in disordered and surface-binding regions, where structure-based methods typically degrade.
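The per-position mixture idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the random arrays stand in for real language-model embeddings of MSA sequences, and the diagonal covariance, the two components, the helper names (`fit_diag_gmm`, `log_likelihood`), and the use of a log-likelihood difference as the score are all assumptions made for the example. The source does not specify how the GMM statistics are turned into D2D features.

```python
import numpy as np

def _log_joint(x, weights, means, variances):
    """Log of weight * Gaussian density for every sample/component pair."""
    diff = x[:, None, :] - means[None, :, :]                      # (n, k, d)
    log_gauss = -0.5 * (np.log(2 * np.pi * variances)[None]
                        + diff ** 2 / variances[None]).sum(axis=2)
    return np.log(weights)[None, :] + log_gauss                   # (n, k)

def fit_diag_gmm(x, n_components=2, n_iter=50, seed=0, eps=1e-6):
    """Fit a diagonal-covariance Gaussian mixture to x (n_samples, dim) with EM."""
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    means = x[rng.choice(n, size=n_components, replace=False)].copy()
    variances = np.tile(x.var(axis=0) + eps, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample.
        log_p = _log_joint(x, weights, means, variances)
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1)[:, None])
        # M-step: re-estimate weights, means and per-dimension variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        variances = np.maximum((resp.T @ x ** 2) / nk[:, None] - means ** 2, eps)
    return weights, means, variances

def log_likelihood(x, gmm):
    """Mixture log-likelihood of one embedding (dim,) or a batch (n, dim)."""
    return np.logaddexp.reduce(_log_joint(np.atleast_2d(x), *gmm), axis=1)

# Synthetic stand-in for MSA embeddings: (n_sequences, n_positions, dim).
rng = np.random.default_rng(0)
n_seqs, length, dim = 200, 3, 8
msa_emb = rng.normal(0.0, 1.0, size=(n_seqs, length, dim))
msa_emb[:, 1, :] *= 0.1            # position 1 is made tightly conserved

# One mixture per alignment column.
position_gmms = [fit_diag_gmm(msa_emb[:, pos, :]) for pos in range(length)]

# Hypothetical wild-type and mutant embeddings for a single residue.
wild_type = np.zeros(dim)
mutant = np.full(dim, 1.0)
scores = [float(log_likelihood(mutant, g)[0] - log_likelihood(wild_type, g)[0])
          for g in position_gmms]
print(scores)  # more negative = mutant is less compatible with that column
```

In this toy setup the same embedding shift is penalized far more heavily at the conserved column than at the variable ones, which is the intuition behind using the mixture as a measure of evolutionary constraint.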
D2D is the zero-shot generalization of the lab's earlier supervised model D2Deep (Briefings in Bioinformatics, peer-reviewed). D2Deep is a single-task, supervised classifier trained for variant pathogenicity (cancer driver-mutation prediction); D2D lifts the same evolutionary-constraint feature core into a broad, mostly training-free framework spanning multiple biophysical tasks. When fine-tuned for cancer driver prediction, D2D reuses D2Deep's trained classifier and reports state-of-the-art performance on that task.
D2D consumes embeddings from ProtT5-XL, a self-supervised transformer protein language model, as an external dependency loaded from HuggingFace rather than redistributed as its own checkpoint. For a query protein, it builds an MSA, embeds the aligned wild-type sequences, and fits a per-position Gaussian mixture model whose statistics form the D2D features used to score substitutions. The supervised cancer-driver capability is inherited directly from D2Deep's trained weights, which—along with code—are released on GitHub, with associated test data on Zenodo (CC-BY-4.0). Because predictions are built on ProtT5-XL, D2D inherits that model's input ceiling of roughly 2,200 residues. The 2026 D2D manuscript is a preprint and has not yet been peer-reviewed; the D2Deep predecessor is published in Briefings in Bioinformatics.
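A minimal sketch of the embedding step follows, assuming the publicly documented ProtT5 usage pattern with the `transformers` library. The checkpoint name `Rostlab/prot_t5_xl_half_uniref50-enc` is an assumption, since the text does not say which ProtT5-XL checkpoint D2D loads, and the helper names `prepare_for_prott5` and `embed_with_prott5` are hypothetical. The 2,200-residue limit is the approximate ceiling quoted above, used here as a simple guard rather than an exact model constraint.

```python
import re

MAX_RESIDUES = 2200  # approximate ceiling quoted in the text, not an exact limit

def prepare_for_prott5(sequence, max_residues=MAX_RESIDUES):
    """Format one protein sequence the way ProtT5 tokenizers expect.

    Rare or ambiguous residues (U, Z, O, B) are mapped to X and residues
    are separated by single spaces.
    """
    seq = sequence.strip().upper()
    if len(seq) > max_residues:
        raise ValueError(
            f"sequence has {len(seq)} residues; limit assumed here is {max_residues}"
        )
    return " ".join(re.sub(r"[UZOB]", "X", seq))

def embed_with_prott5(sequences, model_name="Rostlab/prot_t5_xl_half_uniref50-enc"):
    """Return one (length, hidden_dim) array of per-residue embeddings per sequence.

    Heavy imports are kept local so the formatting helper above can be used
    without downloading the model. Assumes the checkpoint name given above.
    """
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).float().eval()
    batch = tokenizer(
        [prepare_for_prott5(s) for s in sequences],
        add_special_tokens=True,
        padding="longest",
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).last_hidden_state
    # Keep one vector per residue; drop the end-of-sequence token and padding.
    return [hidden[i, : len(s.strip())].numpy() for i, s in enumerate(sequences)]

print(prepare_for_prott5("MKTUZAB"))
```

Calling `embed_with_prott5` downloads several gigabytes of weights on first use, so the formatting helper is kept separate and can be checked on its own.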
D2D is aimed at researchers studying how mutations alter protein behavior: estimating stability changes, mapping binding regions, dissecting epistatic interactions in deep mutational scanning data, and—via the supervised D2Deep mode—prioritizing candidate cancer driver mutations. Its independence from experimental structure makes it especially useful for intrinsically disordered proteins and surface-binding interfaces that resist structure-based analysis. The D2Deep predecessor is accessible through a web server at tumorscope.be/d2deep; note that this serves the supervised variant, as no dedicated D2D server exists yet.
D2D illustrates a practical recipe for protein variant-effect prediction: augment a general protein language model with cheaply computed, protein-specific evolutionary statistics instead of retraining large models per task. By generalizing the peer-reviewed D2Deep classifier into a zero-shot, multi-task framework, the work extends a validated pathogenicity tool toward stability, binding, and epistasis while highlighting performance in the disordered and surface regions that structure-based methods handle poorly. As a preprint that relies on an external ProtT5-XL dependency and D2Deep's supervised weights for its cancer-driver mode, its long-term influence will depend on peer review and broader benchmarking, but it offers a lightweight, openly coded path to evolution-aware mutation scoring.
Tzavella, K., et al. (2026) Constrained protein Large Language Model illustrated in protein stability, function and epistasis. bioRxiv.
DOI: 10.64898/2026.05.22.726784
Tzavella, K., et al. (2024) Combining evolution and protein language models for an interpretable cancer driver mutation prediction with D2Deep. Briefings in Bioinformatics.
DOI: 10.1093/bib/bbae664