bio.rodeo

© 2026 Pulsatance. All rights reserved.
Protein

ENSEMBITS

Vanderbilt University

A residual VQ-VAE tokenizer that learns a discrete alphabet of protein conformational ensembles from molecular dynamics data, usable as a frozen representation layer for downstream tasks.

Released: May 2026

Most protein structure tokenizers compress a single static conformation into a sequence of discrete tokens, discarding the correlated motions and alternative states that define how proteins actually function. ENSEMBITS, introduced by Kaiwen Shi and Carlos Oliver of the Oliver Laboratory at Vanderbilt University in May 2026, is presented as the first tokenizer of protein conformational ensembles. Rather than encoding one structure, it learns a discrete "alphabet" of conformational states directly from molecular dynamics (MD) trajectories, turning the dynamic behavior of a protein into a compact, reusable representation.

The central idea is to treat conformational dynamics the way protein language models treat sequence: as something that can be tokenized once and then reused as a frozen feature layer across many downstream problems. ENSEMBITS is trained on a large MD corpus and, once trained, the tokenizer is fixed — it can encode new conformations without any retraining. This separates the expensive, data-hungry step of learning dynamics representations from the comparatively cheap step of fitting task-specific probes on top of them.
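The split between a frozen tokenizer and a cheap task-specific probe can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the token ids are synthetic random integers standing in for tokenizer output, the vocabulary size of 8 and the helper names `one_hot` and `fit_ridge_probe` are hypothetical, and ridge regression is just one plausible probe choice.

```python
import numpy as np

def one_hot(tokens: np.ndarray, vocab_size: int) -> np.ndarray:
    """Turn integer token ids of shape (n,) into an (n, vocab_size) one-hot matrix."""
    out = np.zeros((tokens.shape[0], vocab_size))
    out[np.arange(tokens.shape[0]), tokens] = 1.0
    return out

def fit_ridge_probe(features: np.ndarray, targets: np.ndarray, l2: float = 1e-3) -> np.ndarray:
    """Closed-form ridge regression: w = (X^T X + l2 * I)^-1 X^T y."""
    dim = features.shape[1]
    gram = features.T @ features + l2 * np.eye(dim)
    return np.linalg.solve(gram, features.T @ targets)

# Synthetic stand-in for frozen tokenizer output: one token id per residue.
rng = np.random.default_rng(0)
vocab_size = 8  # hypothetical; the real alphabet size is not stated here
tokens = rng.integers(0, vocab_size, size=500)

# Hypothetical per-token value, used only to generate toy regression labels.
true_value = np.linspace(0.5, 4.0, vocab_size)
labels = true_value[tokens] + 0.05 * rng.normal(size=tokens.shape[0])

# Only the probe weights are fit; the token ids are never updated.
features = one_hot(tokens, vocab_size)
weights = fit_ridge_probe(features, labels)
predictions = features @ weights
```

In this toy setup the probe learns one scalar per token, so swapping the target (flexibility, function labels, binding) would only require refitting `weights`, not re-encoding the proteins.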

A key innovation is frame distillation, an objective that lets the model predict dynamics tokens from a single predicted structure. Because high-quality MD data remain sparse relative to the universe of known proteins, this distillation step reduces ENSEMBITS' dependence on having full trajectories available at inference time, broadening where the learned alphabet can be applied.
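One way to picture frame distillation is as a classification problem: a student that sees a single structure is trained to predict the token ids that the ensemble tokenizer (the teacher) assigned from the full trajectory. The sketch below assumes a plain cross-entropy loss over one codebook level. The exact loss used by ENSEMBITS is not described here, and the function name `distillation_loss` and the toy logits are hypothetical.

```python
import numpy as np

def distillation_loss(student_logits: np.ndarray, teacher_tokens: np.ndarray) -> float:
    """Mean cross-entropy between student logits (n_residues, vocab)
    and teacher token ids (n_residues,)."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(teacher_tokens.shape[0]), teacher_tokens]
    return float(-picked.mean())

# Teacher tokens would come from the ensemble tokenizer; these are toy values.
teacher_tokens = np.array([2, 0, 1])

# A student with no information spreads probability evenly over 4 tokens.
uniform_logits = np.zeros((3, 4))

# A student that has learned the mapping puts nearly all mass on the teacher token.
confident_logits = np.full((3, 4), -10.0)
confident_logits[np.arange(3), teacher_tokens] = 10.0

loss_uniform = distillation_loss(uniform_logits, teacher_tokens)      # equals log(4)
loss_confident = distillation_loss(confident_logits, teacher_tokens)  # close to 0
```

With a residual codebook stack, a loss of this form could be computed per codebook level and summed; that extension is also an assumption.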

#Key Features

  • Ensemble-level tokenization: Encodes correlated motion and alternative conformational states across an ensemble, rather than collapsing a protein to a single static structure.
  • Residual VQ-VAE backbone: A residual vector-quantized variational autoencoder produces a multi-codebook discrete representation. The design targets three challenges: extracting geometric descriptors, staying permutation-invariant over variable-size ensembles, and coping with sparse dynamics data.
  • Frozen, reusable tokenizer: After training the tokenizer is fixed and can encode new conformations without retraining, serving as a pretrained representation layer for diverse downstream tasks.
  • Frame distillation: An objective that predicts dynamics tokens from a single predicted structure, lessening reliance on scarce full MD trajectories at inference.
  • Broad downstream coverage: Demonstrated on RMSF prediction, EC/GO function annotation, binding site/affinity prediction, and zero-shot mutation-effect prediction from a single learned alphabet.
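The multi-codebook representation named in the feature list can be illustrated with a generic residual vector quantizer: each level picks the nearest code vector for whatever the previous levels left unexplained. This is a sketch of the general technique, not the ENSEMBITS implementation. The two tiny hand-written codebooks and the names `rvq_encode` and `rvq_decode` are hypothetical, whereas a trained model would learn its codebooks.

```python
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list[np.ndarray]) -> np.ndarray:
    """Quantize rows of x (n, d) with a stack of codebooks, each (k, d).
    Returns token ids of shape (n, n_levels)."""
    residual = x.copy()
    ids = []
    for codebook in codebooks:
        # Squared distance from every residual to every code vector.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        level_ids = dists.argmin(axis=1)
        ids.append(level_ids)
        # Pass what is still unexplained on to the next level.
        residual = residual - codebook[level_ids]
    return np.stack(ids, axis=1)

def rvq_decode(ids: np.ndarray, codebooks: list[np.ndarray]) -> np.ndarray:
    """Reconstruct by summing the selected code vector from each level."""
    return sum(codebook[ids[:, level]] for level, codebook in enumerate(codebooks))

# Hand-written toy codebooks: a coarse level followed by a fine correction level.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]),
]
x = np.array([[1.2, 0.8], [0.1, 0.0]])

ids = rvq_encode(x, codebooks)      # [[1, 1], [0, 0]]
recon = rvq_decode(ids, codebooks)  # [[1.25, 0.75], [0.0, 0.0]]
```

Each residue ends up with one token per level, so the first row above is described by the pair (1, 1): a coarse code plus a refinement.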

#Technical Details

ENSEMBITS is a residual VQ-VAE trained with a frame distillation objective. It derives per-residue geometric descriptors across multiple conformations and quantizes them into discrete tokens via a residual codebook stack, designed to be permutation-invariant to ensemble ordering and robust to sparse dynamics data. The released tokenizer is built on an ESM3-derived geometric encoding (the production checkpoint is named combined_esm3). Training draws on a large molecular dynamics corpus assembled from mdCATH, ATLAS, and MISATO.

In evaluation, the authors report that ENSEMBITS outperforms all compared methods on RMSF (root-mean-square fluctuation) prediction. It is further assessed on EC and GO function prediction, binding analysis, and zero-shot mutation-effect prediction, alongside token-conditioned ANOVA tests that relate individual tokens to per-residue motion amplitude. The reproduction package ships eight baseline implementations and six downstream probes for these tasks.
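The idea behind a token-conditioned ANOVA can be sketched with a standard one-way ANOVA in which the token id is the grouping factor and a per-residue motion amplitude is the response. The exact test design used by the authors is not specified here, so the grouping scheme, the function name `one_way_anova_f`, and the toy numbers are assumptions.

```python
import numpy as np

def one_way_anova_f(values: np.ndarray, groups: np.ndarray) -> float:
    """F statistic for a one-way ANOVA of `values` grouped by integer labels in `groups`.
    Assumes at least two groups and nonzero within-group variance."""
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        member = values[groups == label]
        ss_between += member.shape[0] * (member.mean() - grand_mean) ** 2
        ss_within += ((member - member.mean()) ** 2).sum()
    df_between = labels.shape[0] - 1
    df_within = values.shape[0] - labels.shape[0]
    return float((ss_between / df_between) / (ss_within / df_within))

# Toy data: nine residues, three tokens, one amplitude value per residue.
tokens = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
amplitude = np.array([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# F is 13 for this toy data, up to floating-point rounding.
f_stat = one_way_anova_f(amplitude, tokens)
```

A large F statistic indicates that mean amplitude differs across tokens by more than the spread within each token, which is the kind of token-to-motion relationship such a test is meant to detect.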

#Applications

ENSEMBITS targets researchers who need to reason about protein flexibility and dynamics without running new MD simulations for every protein. Because the frozen tokenizer turns conformational behavior into discrete features, it can supply inputs to lightweight probes for predicting residue-level flexibility (RMSF), annotating enzyme (EC) and Gene Ontology (GO) function, characterizing binding sites and affinities, and scoring the effects of point mutations in a zero-shot setting. This makes it relevant to structural bioinformatics, protein engineering, and drug-discovery workflows where motion — not just a single folded snapshot — carries the signal of interest.
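For readers less familiar with the flexibility target mentioned above, RMSF is the root-mean-square deviation of each residue from its own mean position across the frames of an ensemble. The sketch below computes it from raw coordinates using the standard definition. The function name `rmsf` and the two-frame toy trajectory are illustrative, and the frames are assumed to be already superimposed on a common reference.

```python
import numpy as np

def rmsf(coords: np.ndarray) -> np.ndarray:
    """Per-residue RMSF for coords of shape (n_frames, n_residues, 3).
    Frames are assumed to be aligned to a common reference beforehand."""
    mean_pos = coords.mean(axis=0, keepdims=True)
    # Squared displacement from the mean position, per frame and residue.
    sq_dev = ((coords - mean_pos) ** 2).sum(axis=2)
    return np.sqrt(sq_dev.mean(axis=0))

# Toy ensemble: residue 0 never moves, residue 1 swings between x = +1 and x = -1.
coords = np.array([
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
])

flexibility = rmsf(coords)            # [0.0, 1.0]
flexibility_rev = rmsf(coords[::-1])  # same values: frame order does not matter
```

Because RMSF averages over frames, it does not depend on the order in which conformations are listed.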

#Impact

ENSEMBITS extends the "tokenize once, reuse everywhere" paradigm that has shaped protein sequence and structure modeling into the domain of conformational dynamics, where labeled and simulated data are comparatively scarce. By packaging ensemble behavior into a discrete alphabet and using frame distillation to infer those tokens from single structures, it offers a path to bring dynamics-aware features to tasks that historically relied on static structure alone. The work is a 2026 arXiv preprint and has not yet undergone peer review, so reported benchmark advantages should be read as preliminary. The authors release code under an MIT license with a production tokenizer checkpoint and ~9 GB of supporting assets (including token caches) distributed via Zenodo, supporting reproduction and reuse.

Citation

Preprint

DOI: 10.48550/arXiv.2605.13789

Openness

Unclassified
Restrictive license on core components

Tags

function_prediction, molecular_dynamics, protein_dynamics, representation_learning, self_supervised, variant_effect_prediction, vector_quantized_autoencoder, zero_shot

Resources

GitHub Repository, Research Paper, Dataset