Helmholtz Munich / Rostlab / Seoul National University
A LoRA adapter on ProstT5 that predicts per-residue probability distributions over Foldseek 3Di tokens, capturing sequence-encoded conformational flexibility from MD trajectories.
ProtProfileMD is a protein language model that predicts conformational flexibility directly from amino acid sequence. Developed by Finn H. Lueth, Michael Heinzinger, and colleagues at Helmholtz Munich together with Burkhard Rost's group (Rostlab, Technical University of Munich) and Martin Steinegger's group at Seoul National University, it was posted to bioRxiv in January 2026. The central idea is that proteins are not static folds: their function often depends on motion, yet most sequence-to-structure models predict a single rigid conformation.
Rather than predict one structure, ProtProfileMD predicts a per-residue probability distribution over Foldseek's 3Di structural alphabet, the same 20-state vocabulary of local structural environments used by its base model, ProstT5. Where ProstT5 translates a sequence into a single 3Di token per residue, ProtProfileMD outputs a full distribution over 3Di states per residue. These distributions are learned from labels derived from molecular dynamics (MD) trajectories, so the spread of predicted states reflects how much each residue moves. The key empirical finding is that the entropy of the predicted 3Di distribution correlates with the structural fluctuations observed in MD, meaning a coarse structural alphabet can encode dynamics.
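The entropy signal described above can be made concrete with a short sketch. It assumes each residue's prediction is available as a list of 20 non-negative values, one per 3Di state; the function names and the toy inputs are illustrative and are not taken from the ProtProfileMD code base.

```python
import math

def residue_entropy(probs):
    """Shannon entropy in bits of one residue's distribution over 3Di states."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    h = 0.0
    for p in probs:
        q = p / total  # renormalize in case the row does not sum exactly to 1
        if q > 0:      # 0 * log(0) is treated as 0
            h -= q * math.log2(q)
    return h

def profile_entropy(profile):
    """Per-residue entropies for a profile given as a list of 20-value rows."""
    return [residue_entropy(row) for row in profile]

# Toy profile (hypothetical values): one residue locked into a single state,
# one residue spread evenly over all 20 states.
rigid = [1.0] + [0.0] * 19
flexible = [1.0 / 20] * 20
entropies = profile_entropy([rigid, flexible])
max_entropy = math.log2(20)  # upper bound for a 20-state alphabet
```

A residue that always occupies the same 3Di state has zero entropy, while a residue spread uniformly over all 20 states reaches the maximum of log2(20) bits. Values in between are what the authors report as tracking MD fluctuations.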
ProtProfileMD is implemented as a lightweight LoRA (low-rank adaptation) adapter on ProstT5, making it a parameter-efficient extension of an established bilingual sequence-structure model rather than a model trained from scratch. This positions it as a dynamics-aware companion to ProstT5 in the protein language model landscape, with the code, weights, and training data all publicly available.
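To illustrate why a LoRA adapter is parameter-efficient, the following minimal sketch applies the generic low-rank update y = Wx + (alpha / r) * B(Ax) to a frozen weight matrix W. The matrix sizes, rank, and alpha below are illustrative assumptions, not the configuration used by ProtProfileMD, and all function names are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen layer W plus a scaled low-rank update B @ A.

    W: d_out x d_in (frozen), A: r x d_in, B: d_out x r (both trainable).
    """
    r = len(A)
    scale = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in the full matrix, parameters in the adapter)."""
    return d_out * d_in, rank * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weights (toy identity matrix)
A = [[1.0, 1.0]]              # rank-1 down-projection
B_zero = [[0.0], [0.0]]       # LoRA conventionally starts B at zero
x = [3.0, 4.0]

y_start = lora_forward(W, A, B_zero, x)                       # equals W @ x
y_tuned = lora_forward(W, A, [[1.0], [0.0]], x, alpha=2.0)    # after a toy update
full, adapter = lora_param_counts(1024, 1024, 8)              # illustrative sizes
```

With B initialized to zero the adapted layer reproduces the base model exactly, so training starts from the base model's behavior. For the illustrative 1024 x 1024 matrix at rank 8, the adapter holds 16,384 trainable values against 1,048,576 in the frozen matrix.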
ProtProfileMD is a LoRA adapter applied to the ProstT5 encoder-decoder protein language model. It is trained to predict per-residue probability distributions over the 20-state 3Di alphabet, with targets derived from molecular dynamics trajectories. The released dataset (finnlueth/ProtProfileMD on HuggingFace) contains roughly 135,000 examples across train and test splits, with sequences of 50-500 residues annotated with temperature and replica metadata and tokenized for both ProstT5 and ProtT5. The published workflow generates FlexProfiles from FASTA input, converts them to 3Di FASTA, builds Foldseek databases, and runs profile-based searches. The authors report that the entropy of predicted 3Di states tracks MD structural fluctuations and that the dynamic profiles improve remote homology detection sensitivity relative to static representations.
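One step of the workflow above, turning a per-residue profile into a 3Di FASTA record, could be sketched as follows. This is a simplification under stated assumptions: it takes the most probable state per residue (argmax), and it assumes the profile columns follow the letter order in `STATES`. The published pipeline may use a different conversion and column order, and the function names are hypothetical.

```python
# Assumption: columns are ordered like this 20-letter alphabet. Foldseek writes
# 3Di states with amino-acid letters, but the column order of the model's
# output is not stated in the text above.
STATES = "ACDEFGHIKLMNPQRSTVWY"

def profile_to_3di(profile, states=STATES):
    """Collapse a profile (list of 20-value rows) to one 3Di letter per residue."""
    letters = []
    for row in profile:
        if len(row) != len(states):
            raise ValueError("each row needs one value per 3Di state")
        best = max(range(len(row)), key=row.__getitem__)  # index of the top state
        letters.append(states[best])
    return "".join(letters)

def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"

# Toy two-residue profile (hypothetical values) peaking at columns 2 and 17.
row_a = [0.01] * 20
row_a[2] = 0.81
row_b = [0.02] * 20
row_b[17] = 0.62
record = to_fasta("demo", profile_to_3di([row_a, row_b]))
```

Taking the argmax discards the distributional information that makes FlexProfiles useful, so it is only suitable where a plain 3Di string is required; the profile-based searches described above operate on the full distributions.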
ProtProfileMD is useful wherever a fast, structure-free estimate of protein flexibility is valuable. By scoring per-residue flexibility from sequence alone, it can flag intrinsically disordered or highly mobile regions without running expensive MD simulations or solving multiple structures. Its FlexProfiles integrate into Foldseek-based remote homology pipelines, helping identify distant evolutionary relationships that static 3Di tokens may miss, and the per-residue entropy can complement structure predictors that report only a single rigid conformation. The MIT-licensed code, the LoRA weights, and the training dataset are all publicly released, making the model directly usable and reproducible.
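Flagging mobile regions from per-residue scores can be sketched as a simple thresholding pass. The threshold and minimum segment length here are arbitrary illustrative choices, not values from the preprint, and `flexible_regions` is a hypothetical helper rather than part of the released code.

```python
def flexible_regions(entropies, threshold, min_len=3):
    """Return (start, end) index pairs, 0-based and inclusive, for runs of at
    least min_len consecutive residues whose entropy is >= threshold."""
    regions = []
    start = None
    for i, h in enumerate(entropies):
        if h >= threshold:
            if start is None:
                start = i  # a new high-entropy run begins
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(entropies) - start >= min_len:
        regions.append((start, len(entropies) - 1))
    return regions

# Toy per-residue entropies in bits (hypothetical values).
scores = [0.2, 0.3, 2.5, 2.8, 3.1, 0.4, 2.9, 0.1, 3.0, 3.2, 3.3]
regions = flexible_regions(scores, threshold=2.0, min_len=3)
```

Requiring a minimum run length filters out isolated high-entropy residues, such as position 6 in the toy data, so that only contiguous stretches are reported. In practice the threshold would need to be calibrated against MD or experimental flexibility data.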
ProtProfileMD demonstrates that conformational dynamics, normally accessible only through molecular dynamics or experiment, can be partly captured by a discrete structural alphabet and predicted from sequence with a small adapter on an existing protein language model. By extending ProstT5 from single-structure translation to distributional, dynamics-aware prediction, it offers a practical route to flexibility annotation at proteome scale. Its openness is a notable strength relative to many contemporaneous preprints: the manuscript is CC BY licensed and the code (MIT), adapter weights, and dataset are publicly available. Important caveats remain: the work is a January 2026 preprint that has not yet been peer reviewed, the 3Di alphabet is a coarse approximation of true atomic motion, and the HuggingFace base_model field is mislabeled even though the README clearly identifies ProstT5 as the base.