ENSEMBITS

Protein conformational ensemble tokenizer that learns a discrete alphabet of states from molecular dynamics, reusable as a frozen feature layer.

Released: May 2026

Most protein structure tokenizers compress a single static conformation into a sequence of discrete tokens, discarding the correlated motions and alternative states that define how proteins actually function. ENSEMBITS, introduced by Kaiwen Shi and Carlos Oliver of the Oliver Laboratory at Vanderbilt University in May 2026, is presented as the first tokenizer of protein conformational ensembles. Rather than encoding one structure, it learns a discrete "alphabet" of conformational states directly from molecular dynamics (MD) trajectories, turning the dynamic behavior of a protein into a compact, reusable representation.

The central idea is to treat conformational dynamics the way protein language models treat sequence: as something that can be tokenized once and then reused as a frozen feature layer across many downstream problems. ENSEMBITS is trained once on a large MD corpus; the tokenizer is then frozen and can encode new conformations without retraining. This separates the expensive, data-hungry step of learning dynamics representations from the comparatively cheap step of fitting task-specific probes on top of them.

A key innovation is frame distillation, a training objective under which the model learns to predict dynamics tokens from a single predicted structure. Because high-quality MD data remain scarce relative to the universe of known proteins, this distillation step reduces ENSEMBITS' dependence on having full trajectories available at inference time, broadening where the learned alphabet can be applied.
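One way to picture frame distillation is as a classification problem: tokens assigned from the full ensemble act as targets, and a model that sees only one structure is trained to predict them. The sketch below is a minimal illustration under that assumption. The source does not state the exact form of the loss, so the per-residue cross-entropy used here, the function names (`log_softmax`, `distillation_loss`), and the toy numbers are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def distillation_loss(student_logits, teacher_tokens):
    """Mean cross-entropy between student logits and ensemble-derived tokens.

    student_logits: one list of K scores per residue, predicted from a
                    single structure (K is the codebook size).
    teacher_tokens: one integer token id per residue, assigned from the
                    full ensemble. Assumed form of the objective, not the
                    authors' confirmed loss.
    """
    if len(student_logits) != len(teacher_tokens):
        raise ValueError("one teacher token is needed per residue")
    total = 0.0
    for logits, token in zip(student_logits, teacher_tokens):
        total -= log_softmax(logits)[token]
    return total / len(teacher_tokens)

# Toy example: 3 residues, codebook of size 4 (all values are made up).
teacher = [2, 0, 3]
confident = [[0.0, 0.0, 5.0, 0.0],
             [5.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 5.0]]
uniform = [[0.0] * 4 for _ in teacher]

loss_confident = distillation_loss(confident, teacher)
loss_uniform = distillation_loss(uniform, teacher)
```

A student that has learned nothing scores log(4) on a four-entry codebook, while one that matches the ensemble tokens drives the loss toward zero.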

Key Features

Ensemble-level tokenization: Encodes correlated motion and alternative conformational states across an ensemble, rather than collapsing a protein to a single static structure.
Residual VQ-VAE backbone: A residual vector-quantized variational autoencoder produces a multi-codebook discrete representation. The design addresses three challenges: extracting geometric descriptors, remaining permutation-invariant over variable-size ensembles, and coping with sparse dynamics data.
Frozen, reusable tokenizer: After training, the tokenizer is fixed and can encode new conformations without retraining, serving as a pretrained representation layer for diverse downstream tasks.
Frame distillation: An objective that predicts dynamics tokens from a single predicted structure, lessening reliance on scarce full MD trajectories at inference.
Broad downstream coverage: Demonstrated on RMSF prediction, EC/GO function annotation, binding site/affinity prediction, and zero-shot mutation-effect prediction from a single learned alphabet.
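The residual quantization and permutation-invariance ideas in the list above can be sketched in a few lines. This is an illustration only: the source does not say how ENSEMBITS pools frames, so mean pooling is an assumption, and the codebooks, descriptor values, and function names are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def residual_quantize(vec, codebooks):
    """Quantize vec with a stack of codebooks.

    Each level encodes what the previous levels left unexplained, so the
    output is one token per codebook plus the summed reconstruction.
    """
    residual = list(vec)
    recon = [0.0] * len(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens, recon

def mean_pool(frames):
    """Average per-frame descriptors; the result ignores frame order."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

# Toy data: 2-D descriptors for one residue across three frames (made up).
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]]
frames = [[1.0, 1.5], [1.5, 1.0], [1.25, 0.5]]

pooled = mean_pool(frames)
tokens, recon = residual_quantize(pooled, [coarse, fine])
tokens_reversed, _ = residual_quantize(mean_pool(frames[::-1]), [coarse, fine])
```

Here the coarse codebook captures the bulk of the descriptor and the fine codebook corrects the remainder, which is why a residual stack can stay small per level while still reconstructing accurately. Reordering the frames leaves the tokens unchanged because the pooling step is symmetric.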

Technical Details

ENSEMBITS is a residual VQ-VAE trained with a frame distillation objective. It derives per-residue geometric descriptors across multiple conformations and quantizes them into discrete tokens through a stack of residual codebooks. The design is intended to be invariant to the order of conformations within an ensemble and robust to sparse dynamics data. The released tokenizer is built on an ESM3-derived geometric encoding (the production checkpoint is named combined_esm3). Training draws on a large molecular dynamics corpus assembled from mdCATH, ATLAS, and MISATO.

In evaluation, the authors report that ENSEMBITS outperforms all compared methods on RMSF (root-mean-square fluctuation) prediction. It is further assessed on EC and GO function prediction, binding analysis, and zero-shot mutation-effect prediction, alongside token-conditioned ANOVA tests that relate individual tokens to per-residue motion amplitude. The reproduction package ships eight baseline implementations and six downstream probes for these tasks.
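To make the RMSF target and the token-conditioned ANOVA concrete, the sketch below computes per-residue RMSF from a toy trajectory and a one-way ANOVA F statistic for motion amplitudes grouped by token. It assumes frames are already superimposed on a common reference, uses the textbook definitions rather than the authors' evaluation code, and all names and numbers are hypothetical.

```python
import math

def rmsf(trajectory):
    """Per-residue root-mean-square fluctuation from aligned frames.

    trajectory: list of frames; each frame is a list of (x, y, z) tuples,
                one per residue. Frames are assumed to be pre-aligned.
    """
    n_frames = len(trajectory)
    out = []
    for i in range(len(trajectory[0])):
        coords = [frame[i] for frame in trajectory]
        mean = [sum(c[d] for c in coords) / n_frames for d in range(3)]
        msd = sum(
            sum((c[d] - mean[d]) ** 2 for d in range(3)) for c in coords
        ) / n_frames
        out.append(math.sqrt(msd))
    return out

def anova_f(values, tokens):
    """One-way ANOVA F statistic for values grouped by token id."""
    groups = {}
    for v, t in zip(values, tokens):
        groups.setdefault(t, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more samples than groups")
    grand = sum(values) / n
    means = {t: sum(g) / len(g) for t, g in groups.items()}
    ss_between = sum(len(g) * (means[t] - grand) ** 2 for t, g in groups.items())
    ss_within = sum((v - means[t]) ** 2 for t, g in groups.items() for v in g)
    if ss_within == 0:
        return math.inf
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two frames, two residues: residue 0 is rigid, residue 1 swings by 1 unit.
toy_trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
toy_rmsf = rmsf(toy_trajectory)

# Made-up amplitudes for six residues carrying two different tokens.
amplitudes = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
residue_tokens = [0, 0, 0, 1, 1, 1]
f_stat = anova_f(amplitudes, residue_tokens)
```

A large F statistic indicates that residues sharing a token have similar motion amplitudes compared with residues carrying other tokens, which is the kind of relationship a token-conditioned test probes.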

Applications

ENSEMBITS targets researchers who need to reason about protein flexibility and dynamics without running new MD simulations for every protein. Because the frozen tokenizer turns conformational behavior into discrete features, it can supply inputs to lightweight probes for predicting residue-level flexibility (RMSF), annotating enzyme (EC) and Gene Ontology (GO) function, characterizing binding sites and affinities, and scoring the effects of point mutations in a zero-shot setting. This makes it relevant to structural bioinformatics, protein engineering, and drug-discovery workflows where motion, not just a single folded snapshot, carries the signal of interest.
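As an illustration of what a lightweight probe on frozen tokens might look like, the sketch below fits the simplest possible one: a lookup table that predicts, for each token, the mean target value seen in training. This is equivalent to a least-squares fit on one-hot token features. It is not one of the six probes shipped with the reproduction package; the function names, token ids, and RMSF values are hypothetical.

```python
from collections import defaultdict

def fit_token_probe(tokens, targets):
    """Fit a per-token mean predictor on frozen token ids.

    Returns (table, fallback): table maps token id to mean target, and
    fallback is the global mean used for tokens unseen during fitting.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, y in zip(tokens, targets):
        sums[t] += y
        counts[t] += 1
    table = {t: sums[t] / counts[t] for t in sums}
    fallback = sum(targets) / len(targets)
    return table, fallback

def predict(probe, tokens):
    """Predict one value per token id, falling back to the global mean."""
    table, fallback = probe
    return [table.get(t, fallback) for t in tokens]

# Made-up training data: token ids per residue and their RMSF labels.
train_tokens = [0, 0, 1, 1, 2]
train_rmsf = [0.5, 0.7, 1.9, 2.1, 1.0]

probe = fit_token_probe(train_tokens, train_rmsf)
predictions = predict(probe, [1, 0, 7])  # token 7 was never seen in training
```

Only the probe is fitted here; the token ids are treated as fixed inputs, which mirrors the separation between the frozen tokenizer and task-specific heads described above.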

Impact

ENSEMBITS extends the "tokenize once, reuse everywhere" paradigm that has shaped protein sequence and structure modeling into the domain of conformational dynamics, where labeled and simulated data are comparatively scarce. By packaging ensemble behavior into a discrete alphabet and using frame distillation to infer those tokens from single structures, it offers a path to bring dynamics-aware features to tasks that historically relied on static structure alone. The work is a 2026 arXiv preprint and has not yet undergone peer review, so reported benchmark advantages should be read as preliminary. The authors release code under an MIT license with a production tokenizer checkpoint and ~9 GB of supporting assets (including token caches) distributed via Zenodo, supporting reproduction and reuse.

Citation

ENSEMBITS: an alphabet of protein conformational ensembles

Preprint

Shi, K. & Oliver, C. (2026) ENSEMBITS: an alphabet of protein conformational ensembles. arXiv.

DOI: 10.48550/arXiv.2605.13789


Citations

Total Citations: 1

Influential: 0

References: 49

GitHub

Stars: 7

Forks: 0

Open Issues: 0

Contributors: 2

Last Push: 1mo ago

Language: Python

License: MIT


Openness

bio.rodeo openness: Fully open · usable and reproducible

66 (Partial)

Usability (can I run it?): 72

Reproducibility (can I retrain it?): 58

Model Openness Framework

Unclassified

Restrictive license on core components

