ProtProfileMD

Helmholtz Munich / Rostlab / Seoul National University

LoRA adapter on ProstT5 predicting per-residue distributions over Foldseek 3Di tokens, capturing conformational flexibility from MD trajectories.

Released: January 2026

ProtProfileMD is a protein language model that predicts conformational flexibility directly from amino acid sequence. Developed by Finn H. Lüth, Michael Heinzinger, and colleagues at Helmholtz Munich together with Burkhard Rost's group (Rostlab, Technical University of Munich) and Martin Steinegger's group at Seoul National University, it was posted to bioRxiv in January 2026. The central idea is that proteins are not static folds: their function often depends on motion, yet most sequence-to-structure models predict a single rigid conformation.

Rather than predict one structure, ProtProfileMD predicts a per-residue probability distribution over Foldseek's 3Di structural alphabet, the same 20-state vocabulary of local structural environments used by its base model, ProstT5. Where ProstT5 translates a sequence into a single 3Di token per residue, ProtProfileMD outputs a full distribution over 3Di states per residue. These distributions are learned from labels derived from molecular dynamics (MD) trajectories, so the spread of predicted states reflects how much each residue moves. The key empirical finding is that the entropy of the predicted 3Di distribution correlates with the structural fluctuations observed in MD, meaning a coarse structural alphabet can encode dynamics.
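To make the entropy signal concrete, the following is a minimal sketch of per-residue Shannon entropy over a 20-state distribution. The function names and the toy profile are illustrative and do not come from the released code; the paper's exact definition (log base, normalization) may differ.

```python
import math

NUM_3DI_STATES = 20  # size of the Foldseek 3Di alphabet, as stated in the text


def shannon_entropy(probs, base=2.0):
    """Entropy of one per-residue distribution.

    The input is renormalized to sum to 1; zero-probability states
    contribute nothing to the sum.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("distribution must have positive mass")
    return -sum((p / total) * math.log(p / total, base) for p in probs if p > 0)


def per_residue_entropy(profile):
    """Map an L x 20 profile (list of rows) to a list of L entropies."""
    return [shannon_entropy(row) for row in profile]


# Toy profile (hypothetical values): one rigid and one maximally uncertain residue.
rigid = [1.0] + [0.0] * (NUM_3DI_STATES - 1)
flexible = [1.0 / NUM_3DI_STATES] * NUM_3DI_STATES
entropies = per_residue_entropy([rigid, flexible])
print(entropies)
```

With base-2 logarithms the score ranges from 0 bits (all mass on one 3Di state) to log2(20), about 4.32 bits (uniform over all 20 states), so higher values indicate residues predicted to visit more local conformations.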

ProtProfileMD is implemented as a lightweight LoRA (low-rank adaptation) adapter on ProstT5, making it a parameter-efficient extension of an established bilingual sequence-structure model rather than a model trained from scratch. This positions it as a dynamics-aware companion to ProstT5 in the protein language model landscape and, crucially, its code, weights, and training data are all public.
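As a rough illustration of why a LoRA adapter is parameter-efficient, the sketch below counts the trainable parameters that a rank-r update adds to a single weight matrix. The matrix size and rank are illustrative assumptions only; the actual ProtProfileMD adapter configuration is not stated here.

```python
def lora_added_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA freezes the original matrix W and learns a low-rank update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in).
    """
    return rank * (d_in + d_out)


# Illustrative values only (assumptions, not the ProtProfileMD settings).
d_model = 1024
rank = 8

full = d_model * d_model                         # parameters in the frozen matrix
added = lora_added_params(d_model, d_model, rank)
fraction = added / full

print(f"added {added} trainable parameters ({fraction:.2%} of {full})")
```

Under these assumed values the adapter trains under 2% of the parameters of each matrix it touches, while the base model's weights stay frozen.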

Key Features

Per-residue 3Di distributions: Instead of a single predicted structural token per residue, the model outputs a probability distribution over Foldseek 3Di states, encoding the range of local conformations a residue may adopt.
Entropy as a flexibility signal: The entropy of each per-residue 3Di distribution correlates with structural fluctuations seen in molecular dynamics, providing an interpretable, sequence-derived flexibility score.
Parameter-efficient LoRA on ProstT5: ProtProfileMD is a low-rank adapter fine-tuned on top of ProstT5, inheriting the base model's bilingual sequence-structure knowledge while adding dynamics with minimal added parameters.
MD-derived training labels: Training targets are computed from molecular dynamics trajectories, teaching the model to associate sequence patterns with observed conformational motion.
Improved remote homology detection: The predicted dynamic "FlexProfiles" can be converted into 3Di profiles for Foldseek-based search, enhancing sensitivity in detecting distant relationships and flagging flexible or disordered regions.
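The idea behind MD-derived labels can be sketched as follows, assuming each MD frame has already been encoded as a 3Di string of equal length. This is one plausible construction (per-position state frequencies across frames), not a description of the authors' pipeline. The alphabet ordering and the toy frames are hypothetical.

```python
from collections import Counter

# 3Di states are written with 20 letters; this ordering is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def profile_from_frames(frames, alphabet=ALPHABET):
    """Turn a list of equal-length 3Di strings (one per MD frame) into an
    L x 20 profile of per-position state frequencies."""
    if not frames:
        raise ValueError("need at least one frame")
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(frame[pos] for frame in frames)
        profile.append([counts[state] / len(frames) for state in alphabet])
    return profile


# Toy trajectory (hypothetical): four frames of a three-residue peptide.
frames = ["DVV", "DVL", "DPL", "DVL"]
profile = profile_from_frames(frames)
print(profile[1][ALPHABET.index("V")])
```

In this toy case the first residue stays in one state across all frames, while the second and third spread their probability mass over two states, which is the kind of spread the model is trained to reproduce from sequence.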

Technical Details

ProtProfileMD is a LoRA adapter applied to the ProstT5 encoder-decoder protein language model. It is trained to predict per-residue probability distributions over the 20-state 3Di alphabet, with targets derived from molecular dynamics trajectories. The released dataset (finnlueth/ProtProfileMD on HuggingFace) contains roughly 135,000 examples across train and test splits, with sequences of 50-500 residues annotated with temperature and replica metadata and tokenized for both ProstT5 and ProtT5. The published workflow generates FlexProfiles from FASTA input, converts them to 3Di FASTA, builds Foldseek databases, and runs profile-based searches. The authors report that the entropy of predicted 3Di states tracks MD structural fluctuations and that the dynamic profiles improve remote homology detection sensitivity relative to static representations.
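The FlexProfile-to-3Di-FASTA step can be approximated with a simple stand-in: take the most probable state at each position and write the result as a FASTA record. The conversion used by the published workflow is not described here and may differ. The alphabet ordering, helper names, and toy profile below are all assumptions, and the subsequent Foldseek database and search steps are not shown.

```python
# 3Di states are written with 20 letters; this ordering is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def consensus_3di(profile, alphabet=ALPHABET):
    """Pick the most probable 3Di state at each position (argmax)."""
    states = []
    for row in profile:
        if len(row) != len(alphabet):
            raise ValueError("each row needs one probability per state")
        best = max(range(len(row)), key=lambda k: row[k])
        states.append(alphabet[best])
    return "".join(states)


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


def peaked_row(letter, peak=0.81, alphabet=ALPHABET):
    """Toy helper: a distribution with most mass on one state."""
    rest = (1.0 - peak) / (len(alphabet) - 1)
    return [peak if state == letter else rest for state in alphabet]


toy_profile = [peaked_row("D"), peaked_row("V"), peaked_row("L")]
record = to_fasta("toy_protein", consensus_3di(toy_profile))
print(record, end="")
```

Taking the argmax discards the distributional information that makes the profiles useful, so a profile-aware conversion would be needed to benefit from the predicted flexibility in search.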

Applications

ProtProfileMD is useful wherever a fast, structure-free estimate of protein flexibility is valuable. By scoring per-residue flexibility from sequence alone, it can flag intrinsically disordered or highly mobile regions without running expensive MD simulations or solving multiple structures. Its FlexProfiles integrate into Foldseek-based remote homology pipelines, helping identify distant evolutionary relationships that static 3Di tokens may miss, and the per-residue entropy can complement structure predictors that report only a single rigid conformation. The MIT-licensed code, the LoRA weights, and the training dataset are all publicly released, making the model directly usable and reproducible.
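A simple way to turn per-residue entropy into region-level annotations is to threshold the score and keep contiguous runs above it. The sketch below does this. The threshold, minimum run length, and entropy values are hypothetical and would need calibration against MD or experimental data.

```python
def flexible_regions(entropies, threshold, min_length=3):
    """Return (start, end) index pairs, 0-based and inclusive, for runs of
    at least min_length consecutive residues with entropy >= threshold."""
    regions = []
    start = None
    for i, value in enumerate(entropies):
        if value >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(entropies) - start >= min_length:
        regions.append((start, len(entropies) - 1))
    return regions


# Hypothetical per-residue entropies in bits.
scores = [0.1, 0.2, 2.5, 2.8, 3.0, 0.3, 2.9, 0.1, 2.6, 2.7, 2.9]
regions = flexible_regions(scores, threshold=2.0, min_length=3)
print(regions)
```

The minimum run length suppresses isolated high-entropy residues, such as position 6 in the toy data, so only sustained stretches are reported as candidate flexible or disordered regions.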

Impact

ProtProfileMD demonstrates that conformational dynamics, normally accessible only through molecular dynamics or experiment, can be partly captured by a discrete structural alphabet and predicted from sequence with a small adapter on an existing protein language model. By extending ProstT5 from single-structure translation to distributional, dynamics-aware prediction, it offers a practical route to flexibility annotation at proteome scale. Its openness is a notable strength relative to many contemporaneous preprints: the manuscript is CC BY licensed and the code (MIT), adapter weights, and dataset are publicly available. Important caveats remain: the work is a January 2026 preprint that has not yet been peer reviewed, the 3Di alphabet is a coarse approximation of true atomic motion, and the HuggingFace base_model field is mislabeled even though the README clearly identifies ProstT5 as the base.

Citation

Protein Language Modeling beyond static folds reveals sequence-encoded flexibility

Lüth, F. H., et al. (2026) Protein Language Modeling beyond static folds reveals sequence-encoded flexibility. bioRxiv.

DOI: 10.64898/2026.01.21.700698

Recent citations

Papers that recently cited this model.

CryoARC: Atomic-resolution conformational landscapes of protein assemblies from cryo-EM single particles with evolutionary priors
Rémi Vuillemot, S. Grudinin
bioRxiv · May 2026 · Citations: 0

ENSEMBITS: an alphabet of protein conformational ensembles
Kaiwen Shi, Carlos A. Oliver
May 2026 · Citations: 0

Large-scale exploration of protein space by automated NMR
Thomas Müntener, Dylan Abramson, Elsa Stern, et al.
bioRxiv · Feb 2026 · Citations: 1

Citations

Total Citations: 3

Influential: 0

References: 52

GitHub

Stars: 36

Forks: 4

Open Issues: 1

Contributors: 1

Last Push: 5mo ago

Language: Python

License: MIT

HuggingFace

Downloads: 0

Likes: 0

Last Modified: 6mo ago

Fields of citing research

Biology: 100%
Computer Science: 67%
Chemistry: 33%
Materials Science: 33%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

Overall score: 93 (Open)

Usability (can I run it?): 95

Reproducibility (can I retrain it?): 92

Model Openness Framework: Unclassified (missing required components)
