HD-Prot

The Hong Kong Polytechnic University / genbio.ai / Mohamed bin Zayed University of Artificial Intelligence

Multimodal protein language model that adds a continuous-token diffusion head to a discrete pLM, modeling structure without vector quantization.

Released: December 2025

HD-Prot is a multimodal protein language model that jointly models protein sequence and structure within a single architecture. Proteins have an inherent sequence-structure duality, and while sequence data is abundant and naturally expressed as discrete tokens, structure is continuous and three-dimensional. Most multimodal protein language models reconcile this mismatch by discretizing structure with a vector-quantized codebook (as in ESM3 and DPLM-2), which loses fine-grained geometric detail. HD-Prot's central argument is that this loss is avoidable: a sequence-based pLM can be extended to the structure modality using continuous tokens — high-fidelity structure latents that skip quantization entirely.
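The quantization loss described above can be illustrated with a toy example. This is a minimal sketch, not the tokenizer of ESM3, DPLM-2, or HD-Prot: the 4-dimensional latent, the 16-entry random codebook, and the helper name `nearest_code` are all assumptions chosen for illustration.

```python
import math
import random

def nearest_code(latent, codebook):
    """Return the codebook vector closest to `latent` (Euclidean distance)."""
    return min(codebook, key=lambda code: math.dist(latent, code))

rng = random.Random(0)
dim = 4  # toy latent dimension, illustrative only

# Hypothetical codebook of 16 random vectors and one random structure latent.
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
latent = [rng.gauss(0, 1) for _ in range(dim)]

# Vector quantization snaps the latent to its nearest code, so detail is lost.
quantized = nearest_code(latent, codebook)
vq_error = math.dist(latent, quantized)

# A continuous token carries the latent unchanged, so nothing is lost here.
continuous_error = math.dist(latent, latent)

print(f"VQ reconstruction error: {vq_error:.3f}")
print(f"continuous-token error:  {continuous_error:.3f}")
```

A larger codebook shrinks the quantization error but never removes it, which is the gap continuous tokens are meant to close.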

To do this, HD-Prot places a continuous-valued diffusion "head" on top of a discrete protein language model. The model operates over a mixed stream of discrete sequence tokens and continuous structure tokens, tying them together through a single absorbing diffusion process. At each token, it either predicts a categorical distribution (for amino-acid identity) or runs a small continuous diffusion sampler (for the structure latent), so both modalities are estimated inside one unified language-model backbone.
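The per-token dispatch described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: `categorical_head`, `diffusion_head`, and `predict_token` are hypothetical names, the "weights" are fixed placeholders, and the structure branch uses a simple iterative interpolation from noise toward a placeholder target as a stand-in for a learned denoiser. Only the latent dimension of 20 comes from the source.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_head(hidden):
    """Toy stand-in: map a hidden vector to a distribution over 20 amino acids."""
    logits = [sum(hidden) * 0.01 * (i + 1) for i in range(len(AMINO_ACIDS))]
    return softmax(logits)

def diffusion_head(hidden, rng, latent_dim=20, steps=10):
    """Toy stand-in for a small continuous sampler conditioned on `hidden`.

    Starts from Gaussian noise and moves step by step toward a placeholder
    target derived from `hidden`; a real head would call a learned denoiser.
    """
    target = [math.tanh(sum(hidden) + i) for i in range(latent_dim)]
    x = [rng.gauss(0, 1) for _ in range(latent_dim)]
    for step in range(steps):
        alpha = 1.0 / (steps - step)  # step size grows to 1 on the final step
        x = [xi + alpha * (ti - xi) for xi, ti in zip(x, target)]
    return x

def predict_token(hidden, modality, rng):
    """Route one backbone hidden state to the head matching its modality."""
    if modality == "seq":
        return categorical_head(hidden)
    if modality == "struct":
        return diffusion_head(hidden, rng)
    raise ValueError(f"unknown modality: {modality}")

rng = random.Random(0)
hidden = [0.1, -0.2, 0.3]  # hypothetical backbone output for one position
seq_probs = predict_token(hidden, "seq", rng)
struct_latent = predict_token(hidden, "struct", rng)
print(len(seq_probs), len(struct_latent))
```

The point of the sketch is the routing: one shared hidden state feeds either a categorical distribution or a continuous sampling loop, depending on the token's modality.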

The work comes from researchers at The Hong Kong Polytechnic University, GenBio AI, and the Mohamed bin Zayed University of Artificial Intelligence. It was posted to arXiv in December 2025 and accepted to KDD 2026. The authors emphasize efficiency: they report matching state-of-the-art multimodal pLMs while using less than one-tenth of the compute budget in the modality-extension fine-tuning stage.

Key Features

Continuous structure tokens: Structures are encoded as non-quantized latents via the salad protein autoencoder (a sparse invariant point attention design, latent dimension 20), avoiding the information loss of VQ-VAE codebooks used by prior multimodal pLMs.
Hybrid diffusion head: A continuous diffusion module is mounted on a discrete pLM so the same model emits categorical predictions for sequence and continuous diffusion samples for structure.
Unified absorbing diffusion: A single absorbing-state diffusion process captures inter-token dependencies across both modalities rather than training separate sequence and structure models.
Multi-task capability: One trained model handles unconditional sequence-structure co-generation, motif-scaffolding, structure prediction, and inverse folding.
Compute-efficient adaptation: The modality-extension fine-tuning reportedly uses under one-tenth the budget of comparable SOTA multimodal pLMs.
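The unified absorbing diffusion listed above can be pictured as a masking process applied to a mixed token stream. The following is a minimal sketch under assumptions: the `<mask>` symbol, the `absorb` helper, the 3-dimensional toy latents, and the choice to concatenate sequence and structure tokens into one list are all illustrative and are not taken from the paper.

```python
import random

MASK = "<mask>"  # absorbing state shared by both modalities (naming is an assumption)

def absorb(tokens, t, rng):
    """Forward process: replace each token by MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)

# Discrete sequence tokens and toy continuous structure latents (3-d for brevity).
sequence = list("MKTAYIAK")
structure = [[round(rng.gauss(0, 1), 2) for _ in range(3)] for _ in sequence]

# One mixed stream; the same absorbing process corrupts both modalities.
stream = sequence + structure

clean = absorb(stream, 0.0, rng)          # t = 0: nothing is masked
noisy = absorb(stream, 0.5, rng)          # intermediate noise level
fully_masked = absorb(stream, 1.0, rng)   # t = 1: every token is absorbed

print(noisy)
```

In this picture, training asks the model to recover the masked entries from the visible ones, predicting a category where a sequence token was masked and sampling a continuous latent where a structure token was masked.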

Technical Details

HD-Prot extends a sequence-pretrained discrete pLM and is evaluated at two scales, roughly 155M and 670M parameters. Structure latents come from the salad autoencoder (Jendrusch & Korbel, 2025). The modality-extension stage was trained on approximately 210K filtered protein structures, following the data setup used by DPLM-2. Reported results for the 670M model are: unconditional co-generation at 300 residues reaches pLDDT 81.1, self-consistency RMSD (scRMSD) 4.9 Å, and self-consistency TM-score (scTM) 0.878; motif-scaffolding solves 19.4 of 24 tasks, with a reported success rate of 24.1%; structure prediction on CAMEO reaches 7.47 Å RMSD and 0.769 TM-score; and inverse folding yields scRMSD 4.7 Å with scTM 0.866. These results place HD-Prot on par with state-of-the-art multimodal pLMs despite the reduced training budget.
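For readers unfamiliar with the TM-score figures quoted above, the standard scoring formula can be sketched in a few lines. This is a simplified illustration, not the evaluation code behind the reported numbers: it assumes the two structures are already aligned and superimposed and takes the per-residue distances as input, whereas real TM-score tools also search for the best superposition. The function names and the fallback `d0 = 0.5` for very short chains (a convention used by common implementations) are assumptions of this sketch.

```python
def tm_d0(length):
    """Length-dependent distance scale d0 (in Å) used by the TM-score."""
    if length <= 21:
        return 0.5  # short-chain fallback; avoids a cube root of a negative number
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8

def tm_score(distances, target_length=None):
    """TM-score from distances (Å) between aligned, already superimposed residues.

    `target_length` is the length used for normalization; it defaults to the
    number of aligned residues.
    """
    length = target_length if target_length is not None else len(distances)
    d0 = tm_d0(length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length

# A perfect match scores 1; residues displaced by exactly d0 each contribute 0.5.
perfect = tm_score([0.0] * 300)
at_d0 = tm_score([tm_d0(300)] * 300)
print(perfect, at_d0)
```

Because each residue's contribution is bounded and normalized by length, the score stays in (0, 1] and is less sensitive to a few badly placed residues than RMSD is.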

Applications

HD-Prot targets protein design and analysis workflows that benefit from joint reasoning over sequence and structure. Unconditional co-generation produces novel sequence-structure pairs for de novo design; motif-scaffolding builds new proteins around a fixed functional motif; structure prediction folds a given sequence; and inverse folding designs sequences for a target backbone. Researchers in computational protein engineering and generative biology can use a single model across these tasks rather than maintaining separate specialized pipelines.
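One way to see how a single model can cover all four tasks is to express each task as a choice of which tokens are given and which are masked. The sketch below is an assumption consistent with masked (absorbing) diffusion, not a description of HD-Prot's actual interface: `build_task_input`, the task names, the `<mask>` symbol, and the decision to fix both sequence and structure at motif positions are hypothetical.

```python
MASK = "<mask>"

def build_task_input(task, sequence, structure, motif=()):
    """Return (sequence_input, structure_input) with unknown positions masked.

    `sequence` and `structure` are equal-length lists of per-residue tokens;
    `motif` lists the residue indices kept fixed for motif-scaffolding.
    """
    n = len(sequence)
    everything, nothing = set(range(n)), set()
    if task == "co_generation":            # generate both modalities from scratch
        seq_known, struct_known = nothing, nothing
    elif task == "structure_prediction":   # sequence given, structure generated
        seq_known, struct_known = everything, nothing
    elif task == "inverse_folding":        # structure given, sequence generated
        seq_known, struct_known = nothing, everything
    elif task == "motif_scaffolding":      # motif fixed, scaffold generated
        seq_known = struct_known = set(motif)
    else:
        raise ValueError(f"unknown task: {task}")
    seq_in = [tok if i in seq_known else MASK for i, tok in enumerate(sequence)]
    struct_in = [tok if i in struct_known else MASK for i, tok in enumerate(structure)]
    return seq_in, struct_in

sequence = list("MKTAYI")
structure = [f"z{i}" for i in range(len(sequence))]  # placeholders for continuous latents

fold_seq, fold_struct = build_task_input("structure_prediction", sequence, structure)
inv_seq, inv_struct = build_task_input("inverse_folding", sequence, structure)
scaf_seq, scaf_struct = build_task_input("motif_scaffolding", sequence, structure, motif=(2, 3))
print(scaf_seq, scaf_struct)
```

Under this framing the tasks differ only in their conditioning pattern, which is why one set of weights can, in principle, serve all of them.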

Impact

HD-Prot offers evidence that multimodal protein language models can incorporate structure through continuous tokens instead of quantized codebooks, preserving geometric detail while remaining compatible with the language-modeling framework. By demonstrating that categorical and continuous distributions can be estimated together in one architecture, and under a small compute budget, it points to a practical alternative direction for multimodal pLM design. One limitation is the modest training-set size (~210K structures). Another is that, at the time of the preprint, pretrained weights had not yet been publicly released: code and preprocessed data are on GitHub, while the checkpoints (hdprot_155m and hdprot_670m) were to be released upon paper acceptance and were available on request in the interim.

Citation

Zhou, Y., et al. (2025). HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens. arXiv preprint. DOI: 10.48550/arXiv.2512.15133


Citations

Total Citations: 4
Influential: 0
References: 57

GitHub

Stars: 7
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 1mo ago
Language: C++

Fields of citing research

Share of papers citing this model.

Computer Science: 100%
Chemistry: 25%
Medicine: 25%

Openness

bio.rodeo openness: 14 (Closed; low usability and reproducibility)
Usability (can I run it?): 11
Reproducibility (can I retrain it?): 18
Model Openness Framework: Unclassified (restrictive license on core components)

