A residual VQ-VAE tokenizer that learns a discrete alphabet of protein conformational ensembles from molecular dynamics data, usable as a frozen representation layer for downstream tasks.
Most protein structure tokenizers compress a single static conformation into a sequence of discrete tokens, discarding the correlated motions and alternative states that define how proteins actually function. ENSEMBITS, introduced by Kaiwen Shi and Carlos Oliver of the Oliver Laboratory at Vanderbilt University in May 2026, is presented as the first tokenizer of protein conformational ensembles. Rather than encoding one structure, it learns a discrete "alphabet" of conformational states directly from molecular dynamics (MD) trajectories, turning the dynamic behavior of a protein into a compact, reusable representation.
The central idea is to treat conformational dynamics the way protein language models treat sequence: as something that can be tokenized once and then reused as a frozen feature layer across many downstream problems. ENSEMBITS is trained on a large MD corpus, after which the tokenizer is fixed and can encode new conformations without any retraining. This separates the expensive, data-hungry step of learning dynamics representations from the comparatively cheap step of fitting task-specific probes on top of them.
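The "frozen tokenizer plus cheap probe" split can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the token ids are random stand-ins for tokenizer output, the target is synthetic, and the closed-form ridge regression is just one simple choice of probe, not the probes shipped with ENSEMBITS.

```python
import numpy as np

def one_hot_tokens(tokens: np.ndarray, vocab_size: int) -> np.ndarray:
    """Turn integer token ids of shape (n_residues,) into one-hot features."""
    feats = np.zeros((tokens.shape[0], vocab_size))
    feats[np.arange(tokens.shape[0]), tokens] = 1.0
    return feats

def fit_ridge_probe(features: np.ndarray, targets: np.ndarray, l2: float = 1e-2) -> np.ndarray:
    """Closed-form ridge regression: w = (X^T X + l2 * I)^-1 X^T y."""
    dim = features.shape[1]
    gram = features.T @ features + l2 * np.eye(dim)
    return np.linalg.solve(gram, features.T @ targets)

rng = np.random.default_rng(0)
vocab_size = 8  # hypothetical alphabet size, chosen only for the demo

# Stand-in for frozen tokenizer output: one token id per residue.
tokens = rng.integers(0, vocab_size, size=200)

# Synthetic per-residue target that depends on the token, plus small noise.
per_token_value = np.linspace(0.5, 4.0, vocab_size)
targets = per_token_value[tokens] + 0.01 * rng.normal(size=tokens.shape[0])

# Only the probe is fitted; the token ids are treated as fixed inputs.
X = one_hot_tokens(tokens, vocab_size)
w = fit_ridge_probe(X, targets)
pred = X @ w
mean_abs_error = float(np.abs(pred - targets).mean())
```

The point of the design is visible in the shapes: the probe has only `vocab_size` parameters and fits in a single linear solve, while the costly representation learning happens once, upstream.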
A key innovation is frame distillation, an objective that lets the model predict dynamics tokens from a single predicted structure. Because high-quality MD data remain sparse relative to the universe of known proteins, this distillation step reduces ENSEMBITS' dependence on having full trajectories available at inference time, broadening where the learned alphabet can be applied.
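One way to picture such an objective is as a classification loss: a student that sees only single-structure features is trained to predict the token ids that the ensemble tokenizer assigned. The sketch below assumes a plain cross-entropy formulation; the exact loss used by ENSEMBITS is not given here, and the function and variable names are hypothetical.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits: np.ndarray, teacher_tokens: np.ndarray) -> float:
    """Mean cross-entropy between student predictions and teacher token ids.

    student_logits: (n_residues, vocab_size), computed from a single structure.
    teacher_tokens: (n_residues,) integer ids assigned from the full ensemble.
    """
    log_probs = log_softmax(student_logits)
    picked = log_probs[np.arange(teacher_tokens.shape[0]), teacher_tokens]
    return float(-picked.mean())

rng = np.random.default_rng(0)
n_residues, vocab_size = 50, 16  # illustrative sizes only
teacher_tokens = rng.integers(0, vocab_size, size=n_residues)

# A student with no information predicts a uniform distribution.
uninformed = np.zeros((n_residues, vocab_size))

# A student that leans toward the teacher's token at every residue.
informed = uninformed.copy()
informed[np.arange(n_residues), teacher_tokens] = 5.0

loss_uninformed = distillation_loss(uninformed, teacher_tokens)  # log(vocab_size)
loss_informed = distillation_loss(informed, teacher_tokens)      # lower than the above
```

Under this formulation, trajectories are needed only to produce the teacher tokens during training; at inference the student needs just the single structure.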
ENSEMBITS is a residual VQ-VAE trained with a frame distillation objective. It derives per-residue geometric descriptors across multiple conformations and quantizes them into discrete tokens via a residual codebook stack. The design is intended to be invariant to the order in which ensemble members are presented and robust to sparse dynamics data. The released tokenizer is built on an ESM3-derived geometric encoding (the production checkpoint is named combined_esm3). Training draws on a large molecular dynamics corpus assembled from mdCATH, ATLAS, and MISATO. On evaluation, the authors report that ENSEMBITS outperforms all compared methods on RMSF (root-mean-square fluctuation) prediction. It is further assessed on EC and GO function prediction, binding analysis, and zero-shot mutation-effect prediction, alongside token-conditioned ANOVA tests that relate individual tokens to per-residue motion amplitude. The reproduction package ships eight baseline implementations and six downstream probes for these tasks.
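The quantization step described above can be sketched generically. This is an illustration under stated assumptions, not the released implementation: pooling frames with a mean and standard deviation is one simple order-independent choice (the actual encoder is not specified here), the codebooks are random stand-ins for learned ones, and all names are hypothetical.

```python
import numpy as np

def pool_ensemble(descriptors: np.ndarray) -> np.ndarray:
    """Pool (n_frames, n_residues, dim) descriptors over frames.

    Mean and standard deviation do not depend on frame order, so the
    result is the same for any permutation of the ensemble.
    """
    stats = [descriptors.mean(axis=0), descriptors.std(axis=0)]
    return np.concatenate(stats, axis=-1)

def residual_quantize(x: np.ndarray, codebooks: list) -> tuple:
    """Residual vector quantization.

    Each level picks the nearest code to the current residual, then
    passes what is left over to the next level.
    Returns token ids of shape (n, n_levels) and the reconstruction (n, dim).
    """
    residual = x.copy()
    recon = np.zeros_like(x)
    tokens = []
    for codebook in codebooks:
        # Squared distance from every residual vector to every code.
        dist2 = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dist2.argmin(axis=1)
        tokens.append(idx)
        recon += codebook[idx]
        residual -= codebook[idx]
    return np.stack(tokens, axis=1), recon

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(20, 30, 4))  # 20 frames, 30 residues, 4 features

# Three random codebooks of 16 codes each; in a trained model these are learned.
codebooks = [rng.normal(scale=s, size=(16, 8)) for s in (1.0, 0.5, 0.25)]

pooled = pool_ensemble(descriptors)
tokens, recon = residual_quantize(pooled, codebooks)

# Shuffling the frames leaves the pooled features, and so the tokens, unchanged.
shuffled = descriptors[rng.permutation(descriptors.shape[0])]
tokens_shuffled, _ = residual_quantize(pool_ensemble(shuffled), codebooks)
```

Each residue ends up with one token per codebook level, so a stack of three 16-code books can address 16^3 combinations while storing only 48 code vectors.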
ENSEMBITS targets researchers who need to reason about protein flexibility and dynamics without running new MD simulations for every protein. Because the frozen tokenizer turns conformational behavior into discrete features, it can supply inputs to lightweight probes for predicting residue-level flexibility (RMSF), annotating enzyme (EC) and Gene Ontology (GO) function, characterizing binding sites and affinities, and scoring the effects of point mutations in a zero-shot setting. This makes it relevant to structural bioinformatics, protein engineering, and drug-discovery workflows where motion, not just a single folded snapshot, carries the signal of interest.
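For readers unfamiliar with the flexibility target, RMSF is conventionally the root-mean-square deviation of each atom from its own time-averaged position over a trajectory. The sketch below assumes the frames have already been superposed onto a common reference; the function name and toy coordinates are illustrative and not taken from the ENSEMBITS code.

```python
import numpy as np

def rmsf(coords: np.ndarray) -> np.ndarray:
    """Per-atom root-mean-square fluctuation.

    coords: (n_frames, n_atoms, 3), assumed to be pre-aligned.
    Returns an array of shape (n_atoms,).
    """
    mean_pos = coords.mean(axis=0, keepdims=True)     # time-averaged position
    sq_dev = ((coords - mean_pos) ** 2).sum(axis=-1)  # squared displacement per frame
    return np.sqrt(sq_dev.mean(axis=0))

# Toy trajectory: atom 0 never moves, atom 1 oscillates between x = +1 and x = -1.
coords = np.zeros((4, 2, 3))
coords[:, 1, 0] = [1.0, -1.0, 1.0, -1.0]

values = rmsf(coords)  # atom 0 -> 0.0, atom 1 -> 1.0
```

In a probing setup, values like these, typically computed on C-alpha atoms, would serve as the regression targets that token-based features are asked to predict.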
ENSEMBITS extends the "tokenize once, reuse everywhere" paradigm that has shaped protein sequence and structure modeling into the domain of conformational dynamics, where labeled and simulated data are comparatively scarce. By packaging ensemble behavior into a discrete alphabet and using frame distillation to infer those tokens from single structures, it offers a path to bring dynamics-aware features to tasks that historically relied on static structure alone. The work is a 2026 arXiv preprint and has not yet undergone peer review, so reported benchmark advantages should be read as preliminary. The authors release code under an MIT license with a production tokenizer checkpoint and ~9 GB of supporting assets (including token caches) distributed via Zenodo, supporting reproduction and reuse.