bio.rodeo

Protein foundation models

ProtProfileMD

Helmholtz Munich / Rostlab / Seoul National University

A LoRA adapter on ProstT5 that predicts per-residue probability distributions over Foldseek 3Di tokens, capturing sequence-encoded conformational flexibility from MD trajectories.

Released: January 2026

ProtProfileMD is a protein language model that predicts conformational flexibility directly from amino acid sequence. Developed by Finn H. Lueth, Michael Heinzinger, and colleagues at Helmholtz Munich together with Burkhard Rost's group (Rostlab, Technical University of Munich) and Martin Steinegger's group at Seoul National University, it was posted to bioRxiv in January 2026. The central idea is that proteins are not static folds: their function often depends on motion, yet most sequence-to-structure models predict a single rigid conformation.

Rather than predict one structure, ProtProfileMD predicts a per-residue probability distribution over Foldseek's 3Di structural alphabet, the same 20-state vocabulary of local structural environments used by its base model, ProstT5. Where ProstT5 translates a sequence into a single 3Di token per residue, ProtProfileMD outputs a full distribution over 3Di states per residue. These distributions are learned from labels derived from molecular dynamics (MD) trajectories, so the spread of predicted states reflects how much each residue moves. The key empirical finding is that the entropy of the predicted 3Di distribution correlates with the structural fluctuations observed in MD, meaning a coarse structural alphabet can encode dynamics.
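
The entropy signal described above can be illustrated with a minimal sketch. It assumes the model output has already been turned into one probability vector of length 20 per residue; the function names and the toy rows are hypothetical and are not taken from the ProtProfileMD code.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Shannon entropy of one discrete distribution (in bits by default)."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("distribution must have positive total mass")
    h = 0.0
    for p in probs:
        q = p / total  # renormalise defensively in case of rounding drift
        if q > 0:      # zero-probability states contribute nothing
            h -= q * math.log(q, base)
    return h

def per_residue_entropy(profile, base=2.0):
    """Entropy for every row of an (L x 20) per-residue 3Di profile."""
    return [shannon_entropy(row, base) for row in profile]

# Toy profile (hypothetical values): three residues over 20 3Di states.
rigid = [1.0] + [0.0] * 19          # all mass on one state
two_state = [0.5, 0.5] + [0.0] * 18  # switches between two states
uniform = [1.0 / 20] * 20            # maximally uncertain

entropies = per_residue_entropy([rigid, two_state, uniform])
print(entropies)
```

A residue locked into one 3Di state scores 0 bits, while a residue spread evenly over all 20 states reaches the maximum of log2(20), roughly 4.32 bits, so higher values point to more conformational variety.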

ProtProfileMD is implemented as a lightweight LoRA (low-rank adaptation) adapter on ProstT5, making it a parameter-efficient extension of an established bilingual sequence-structure model rather than a model trained from scratch. This positions it as a dynamics-aware companion to ProstT5 in the protein language model landscape, and, crucially, its code, weights, and training data are all public.
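
To make the "parameter-efficient" claim concrete, here is a generic sketch of the LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank product B·A. The layer size, rank, and scaling value below are illustrative assumptions, not the settings used by ProtProfileMD.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA adapter."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return full, adapter

def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical layer: 1024 x 1024 weights with a rank-8 adapter.
full, adapter = lora_param_counts(1024, 1024, 8)
print(full, adapter)

# With B initialised to zero (the usual LoRA choice), the adapted layer
# initially reproduces the frozen base layer exactly.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # r x d_in with r = 1
B = [[0.0], [0.0]]    # d_out x r, zero-initialised
y0 = lora_forward(W, A, B, [2.0, 3.0], alpha=1.0)
print(y0)
```

In this toy setting the adapter trains 16,384 parameters where full fine-tuning of the same layer would touch 1,048,576, which is the kind of saving that makes an adapter on an existing model attractive.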

#Key Features

  • Per-residue 3Di distributions: Instead of a single predicted structural token per residue, the model outputs a probability distribution over Foldseek 3Di states, encoding the range of local conformations a residue may adopt.
  • Entropy as a flexibility signal: The entropy of each per-residue 3Di distribution correlates with structural fluctuations seen in molecular dynamics, providing an interpretable, sequence-derived flexibility score.
  • Parameter-efficient LoRA on ProstT5: ProtProfileMD is a low-rank adapter fine-tuned on top of ProstT5, inheriting the base model's bilingual sequence-structure knowledge while adding dynamics with minimal added parameters.
  • MD-derived training labels: Training targets are computed from molecular dynamics trajectories, teaching the model to associate sequence patterns with observed conformational motion.
  • Improved remote homology detection: The predicted dynamic "FlexProfiles" can be converted into 3Di profiles for Foldseek-based search, enhancing sensitivity in detecting distant relationships and flagging flexible or disordered regions.
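
One plausible way to picture MD-derived training labels is to count, for each residue, how often each 3Di state appears across trajectory frames. The sketch below assumes each frame has already been encoded as a 3Di string (for example with Foldseek) and that the states are written with the 20 standard uppercase letters; the exact label construction used by the authors may differ, and the function name and toy frames are hypothetical.

```python
from collections import Counter

# Assumption: 3Di states written with the 20 standard amino-acid letters.
THREE_DI_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def empirical_3di_profile(frame_strings, alphabet=THREE_DI_ALPHABET):
    """Per-residue frequency of each 3Di state across trajectory frames."""
    if not frame_strings:
        raise ValueError("need at least one frame")
    length = len(frame_strings[0])
    if any(len(s) != length for s in frame_strings):
        raise ValueError("all frames must have the same length")
    n_frames = len(frame_strings)
    profile = []
    for pos in range(length):
        counts = Counter(s[pos] for s in frame_strings)
        profile.append([counts.get(letter, 0) / n_frames for letter in alphabet])
    return profile

# Toy trajectory (hypothetical): four frames of a three-residue protein.
frames = ["DVV", "DVL", "DPL", "DVL"]
profile = empirical_3di_profile(frames)

idx = THREE_DI_ALPHABET.index
print(profile[0][idx("D")], profile[1][idx("V")], profile[2][idx("L")])
```

Here the first residue never leaves state D, while the second and third each spend a quarter of the frames in an alternative state, giving target distributions with nonzero spread.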

#Technical Details

ProtProfileMD is a LoRA adapter applied to the ProstT5 encoder-decoder protein language model. It is trained to predict per-residue probability distributions over the 20-state 3Di alphabet, with targets derived from molecular dynamics trajectories. The released dataset (finnlueth/ProtProfileMD on HuggingFace) contains roughly 135,000 examples across train and test splits, with sequences of 50-500 residues annotated with temperature and replica metadata and tokenized for both ProstT5 and ProtT5. The published workflow generates FlexProfiles from FASTA input, converts them to 3Di FASTA, builds Foldseek databases, and runs profile-based searches. The authors report that the entropy of predicted 3Di states tracks MD structural fluctuations and that the dynamic profiles improve remote homology detection sensitivity relative to static representations.
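
The FlexProfile-to-3Di-FASTA step of the workflow can be pictured with a simplified sketch. It collapses each per-residue distribution to its most probable state and writes the result as a FASTA record. This argmax conversion is an assumption made for illustration: the published pipeline may preserve the full profile rather than a single string, and the helper names and the `toy_protein` record are hypothetical.

```python
# Assumption: 3Di states written with the 20 standard amino-acid letters.
THREE_DI_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def profile_to_3di_string(profile, alphabet=THREE_DI_ALPHABET):
    """Collapse an (L x 20) profile to its most probable 3Di state per residue."""
    letters = []
    for row in profile:
        if len(row) != len(alphabet):
            raise ValueError("each row needs one probability per 3Di state")
        best = max(range(len(row)), key=row.__getitem__)  # first index wins ties
        letters.append(alphabet[best])
    return "".join(letters)

def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"

def peaked(letter, p, alphabet=THREE_DI_ALPHABET):
    """Toy row: probability p on one state, the rest spread evenly."""
    rest = (1.0 - p) / (len(alphabet) - 1)
    return [p if a == letter else rest for a in alphabet]

toy_profile = [peaked("D", 0.9), peaked("V", 0.6), peaked("L", 0.5)]
three_di = profile_to_3di_string(toy_profile)
fasta_text = to_fasta([("toy_protein", three_di)])
print(fasta_text, end="")
```

A file written this way could then be passed to Foldseek's database-building and search commands, as the published workflow describes; the exact invocation is left to the repository's documentation.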

#Applications

ProtProfileMD is useful wherever a fast, structure-free estimate of protein flexibility is valuable. By scoring per-residue flexibility from sequence alone, it can flag intrinsically disordered or highly mobile regions without running expensive MD simulations or solving multiple structures. Its FlexProfiles integrate into Foldseek-based remote homology pipelines, helping identify distant evolutionary relationships that static 3Di tokens may miss, and the per-residue entropy can complement structure predictors that report only a single rigid conformation. The MIT-licensed code, the LoRA weights, and the training dataset are all publicly released, making the model directly usable and reproducible.
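
As a sketch of how per-residue entropy might be used to flag mobile regions, the snippet below groups consecutive residues whose entropy exceeds a cutoff into segments. The threshold of 2.0 bits and the minimum segment length are arbitrary illustrative choices, not values recommended by the authors, and the entropy values are made up.

```python
def flexible_regions(entropies, threshold, min_length=3):
    """Return 0-based inclusive (start, end) runs with entropy >= threshold."""
    regions = []
    start = None
    for i, h in enumerate(entropies):
        if h >= threshold:
            if start is None:
                start = i  # open a new candidate run
        else:
            if start is not None and i - start >= min_length:
                regions.append((start, i - 1))
            start = None   # close (or discard) the current run
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(entropies) - start >= min_length:
        regions.append((start, len(entropies) - 1))
    return regions

# Hypothetical per-residue entropies in bits.
entropies = [0.1, 0.2, 2.5, 2.8, 3.0, 0.3, 2.9, 0.2, 2.6, 2.7, 2.9, 3.1]
regions = flexible_regions(entropies, threshold=2.0, min_length=3)
print(regions)
```

The isolated high-entropy residue at index 6 is ignored because it does not form a run of at least three residues, which keeps single noisy positions from being reported as flexible segments.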

#Impact

ProtProfileMD demonstrates that conformational dynamics, normally accessible only through molecular dynamics or experiment, can be partly captured by a discrete structural alphabet and predicted from sequence with a small adapter on an existing protein language model. By extending ProstT5 from single-structure translation to distributional, dynamics-aware prediction, it offers a practical route to flexibility annotation at proteome scale. Its openness is a notable strength relative to many contemporaneous preprints: the manuscript is CC BY licensed and the code (MIT), adapter weights, and dataset are publicly available. Important caveats remain: the work is a January 2026 preprint that has not yet been peer reviewed, the 3Di alphabet is a coarse approximation of true atomic motion, and the HuggingFace base_model field is mislabeled even though the README clearly identifies ProstT5 as the base.

GitHub

Stars: 35
Forks: 4
Open Issues: 1
Contributors: 1
Last Push: 4mo ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 4mo ago

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 93 (Open)
Usability (can I run it?): 95
Reproducibility (can I retrain it?): 92
Model Openness Framework: Unclassified (missing required components)

Tags

variant_effect_prediction, remote_homology_detection, structure_prediction, transformer, transfer_learning, language_model, self_supervised, proteomics, molecular_dynamics

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset