The Hong Kong Polytechnic University / genbio.ai / Mohamed bin Zayed University of Artificial Intelligence
A hybrid-diffusion protein language model that adds a continuous-token diffusion head to a discrete pLM for joint sequence-structure modeling.
HD-Prot is a multimodal protein language model (pLM) that jointly models protein sequence and structure within a single architecture. Proteins have an inherent sequence-structure duality: sequence data is abundant and naturally expressed as discrete tokens, whereas structure is continuous and three-dimensional. Most multimodal protein language models reconcile this mismatch by discretizing structure with a vector-quantized codebook (as in ESM3 and DPLM-2), which loses fine-grained geometric detail. HD-Prot's central argument is that this loss is avoidable: a sequence-based pLM can be extended to the structure modality using continuous tokens, that is, high-fidelity structure latents that skip quantization entirely.
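To make the quantization loss concrete, the following is a minimal sketch of nearest-neighbor vector quantization on random vectors. The latent dimension, codebook size, and data are arbitrary assumptions chosen for illustration; they are not the tokenizers used by ESM3, DPLM-2, or HD-Prot.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (Euclidean distance)."""
    # Pairwise squared distances, shape (num_latents, codebook_size).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
latents = rng.normal(size=(128, 8))   # hypothetical per-residue structure latents
codebook = rng.normal(size=(64, 8))   # hypothetical VQ codebook

indices, reconstructed = quantize(latents, codebook)

# Mean squared distance between each latent and its discrete stand-in.
# A continuous token keeps the latent itself, so it incurs no such error.
quantization_error = float(np.mean(np.sum((latents - reconstructed) ** 2, axis=-1)))
```

Whatever the codebook, a discrete token can only represent one of a finite set of vectors, so the reconstruction generally differs from the original latent; a continuous token sidesteps that step.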
To do this, HD-Prot places a continuous-valued diffusion "head" on top of a discrete protein language model. The model operates over a mixed stream of discrete sequence tokens and continuous structure tokens, tying them together through a single absorbing diffusion process. At each token, it either predicts a categorical distribution (for amino-acid identity) or runs a small continuous diffusion sampler (for the structure latent), so both modalities are estimated inside one unified language-model backbone.
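The mixed-token idea can be sketched with a toy example. Everything below is an illustrative assumption rather than HD-Prot's implementation: the weights are random, the backbone is replaced by random hidden states, masked structure latents are zeroed, sequence and structure tokens are masked independently, and the "diffusion head" is a simple iterative refinement loop standing in for a real diffusion sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_AA, LATENT_DIM, HIDDEN_DIM = 20, 8, 16
MASK_ID = NUM_AA  # absorbing state for discrete sequence tokens

# Random stand-ins for learned parameters (hypothetical).
W_cat = rng.normal(size=(HIDDEN_DIM, NUM_AA))
W_den = rng.normal(size=(HIDDEN_DIM + LATENT_DIM, LATENT_DIM)) * 0.1

def absorb(seq, latents, t):
    """Forward process: mask each token independently with probability t."""
    seq_mask = rng.random(seq.shape[0]) < t
    struct_mask = rng.random(latents.shape[0]) < t
    noisy_seq = np.where(seq_mask, MASK_ID, seq)
    noisy_latents = np.where(struct_mask[:, None], 0.0, latents)
    return noisy_seq, noisy_latents, seq_mask, struct_mask

def categorical_head(hidden):
    """Softmax over amino-acid identities for each position."""
    logits = hidden @ W_cat
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def diffusion_head(hidden, num_steps=10):
    """Toy per-token sampler: start from noise, refine conditioned on hidden."""
    z = rng.normal(size=(hidden.shape[0], LATENT_DIM))
    for step in range(num_steps):
        pred = np.tanh(np.concatenate([hidden, z], axis=-1) @ W_den)
        alpha = (step + 1) / num_steps
        z = (1.0 - alpha) * z + alpha * pred  # move toward the current estimate
    return z

L = 12
seq = rng.integers(0, NUM_AA, size=L)
latents = rng.normal(size=(L, LATENT_DIM))
noisy_seq, noisy_latents, seq_mask, struct_mask = absorb(seq, latents, t=0.5)

hidden = rng.normal(size=(L, HIDDEN_DIM))  # stand-in for the backbone output
probs = categorical_head(hidden)

# Fill only the masked positions; unmasked tokens pass through unchanged.
sampled_seq = np.where(seq_mask, probs.argmax(axis=-1), noisy_seq)
sampled_latents = np.where(struct_mask[:, None], diffusion_head(hidden), noisy_latents)
```

The point of the sketch is the control flow: one shared hidden state per position feeds either a categorical output or a small continuous sampler, depending on which kind of token is being filled in.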
The work comes from researchers at The Hong Kong Polytechnic University, GenBio AI, and the Mohamed bin Zayed University of Artificial Intelligence. It was posted to arXiv in December 2025 and accepted to KDD 2026. The authors emphasize efficiency: they report matching state-of-the-art multimodal pLMs while using less than one-tenth of the compute budget for the modality-extension fine-tuning stage.
HD-Prot extends a sequence-pretrained discrete pLM and is evaluated at two scales, roughly 155M and 670M parameters. Structure latents come from the salad autoencoder (Jendrusch & Korbel, 2025); the modality-extension stage was trained on approximately 210K filtered protein structures, following the data setup used by DPLM-2. On reported benchmarks for the 670M model, unconditional co-generation at 300 residues reaches pLDDT 81.1, self-consistency RMSD (scRMSD) 4.9 Å, and self-consistency TM-score (scTM) 0.878; motif-scaffolding solves 19.4 of 24 tasks, with a reported success rate of 24.1%; structure prediction on CAMEO reaches 7.47 Å RMSD and 0.769 TM-score; and inverse folding yields scRMSD 4.7 Å with scTM 0.866. These results place HD-Prot on par with state-of-the-art multimodal pLMs despite the reduced training budget.
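RMSD and TM-score both compare two sets of backbone coordinates. As a rough illustration of what these metrics measure (not the paper's evaluation pipeline), the sketch below superposes two coordinate sets with the Kabsch algorithm and then computes RMSD and a TM-score-style value using the standard length-dependent scale d0 = 1.24 (L - 15)^(1/3) - 1.8. The function names are hypothetical, and the TM value is computed for this single superposition only, whereas the full TM-score searches for the superposition that maximizes it.

```python
import numpy as np

def kabsch_superpose(mobile, target):
    """Center both (N, 3) coordinate sets and rotate `mobile` onto `target`."""
    mobile_c = mobile - mobile.mean(axis=0)
    target_c = target - target.mean(axis=0)
    H = mobile_c.T @ target_c                      # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))         # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T        # optimal proper rotation
    return mobile_c @ R.T, target_c

def rmsd(a, b):
    """Root-mean-square deviation between paired coordinates."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def tm_score_fixed(aligned, ref):
    """TM-score-style value for one fixed superposition (a lower bound)."""
    n = ref.shape[0]
    d0 = max(1.24 * np.cbrt(n - 15) - 1.8, 0.5)    # floor keeps d0 positive
    dist = np.linalg.norm(aligned - ref, axis=1)
    return float(np.mean(1.0 / (1.0 + (dist / d0) ** 2)))

# Usage: a rotated and translated copy should score as identical.
rng = np.random.default_rng(0)
target = rng.normal(scale=10.0, size=(100, 3))     # synthetic coordinates
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0,            0.0,           1.0]])
mobile = target @ rot.T + np.array([5.0, -3.0, 2.0])

aligned, ref = kabsch_superpose(mobile, target)
identical_rmsd = rmsd(aligned, ref)
identical_tm = tm_score_fixed(aligned, ref)
```

Lower RMSD and higher TM-score indicate closer agreement; TM-score is bounded by 1 and is less sensitive to protein length than RMSD.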
HD-Prot targets protein design and analysis workflows that benefit from joint reasoning over sequence and structure. Unconditional co-generation produces novel sequence-structure pairs for de novo design; motif-scaffolding builds new proteins around a fixed functional motif; structure prediction folds a given sequence; and inverse folding designs sequences for a target backbone. Researchers in computational protein engineering and generative biology can use a single model across these tasks rather than maintaining separate specialized pipelines.
HD-Prot offers evidence that multimodal protein language models can incorporate structure through continuous tokens instead of quantized codebooks, preserving geometric detail while remaining compatible with the language-modeling framework. By demonstrating that categorical and continuous distributions can be estimated together in one architecture, and doing so under a small compute budget, it points to a practical alternative direction for multimodal pLM design. Limitations include the modest training-set size (~210K structures) and the fact that, at the time of the preprint, pretrained weights had not yet been publicly released. Code and preprocessed data are on GitHub; the checkpoints (hdprot_155m and hdprot_670m) were stated to be released upon paper acceptance, with availability on request in the interim.