APT (Adaptive Protein Tokenization)

Protein structure tokenizer that encodes a whole structure globally, with each successive token adding detail for adaptive-length representations.

Released: February 2026

Tokenizing three-dimensional protein structure into discrete or compressed codes has become a foundational building block for modern structure-generation and structure-aware language models. The dominant recipe pools information from local spatial neighborhoods, producing one token per residue (or per fixed block) regardless of how much structural information that region actually carries. This local strategy is convenient but has costs: it fixes the representation length to the sequence length, and it can let reconstruction errors accumulate when tokens are fed autoregressively into generative models.

APT (Adaptive Protein Tokenization), introduced by Rohit Dilip, Ayush Varshney, and David Van Valen at Caltech in a February 2026 arXiv preprint, proposes a different philosophy. Instead of pooling local neighborhoods, APT tokenizes a structure globally: each successive token contributes an additional increment of detail to a single, whole-structure representation. Early tokens sketch the coarse global shape, and later tokens progressively refine it, so the information content of a representation can be adapted to the task at hand rather than being tied rigidly to the number of residues.

This adaptive, coarse-to-fine scheme decouples representation length from sequence length and lets a model spend more or fewer tokens depending on how much structural detail a downstream task requires. The authors report that this design reduces error accumulation in generative settings and yields embeddings without sequence-reduction operations such as pooling over residues.
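The coarse-to-fine idea can be illustrated with a toy analogy. The sketch below is not APT's learned encoder, whose architecture has not been released. It swaps in a fixed cosine (DCT-II) basis over a made-up helix-like trace, so that each "token" is one global coefficient and any prefix of tokens decodes to a coarser version of the whole chain. All function names and coordinates are hypothetical.

```python
import math

def dct_basis(n):
    """Orthonormal DCT-II basis; row k is a cosine of frequency k over n positions."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows

def encode(coords):
    """Project the whole chain onto each basis row; token k is one (x, y, z) coefficient."""
    n = len(coords)
    basis = dct_basis(n)
    return [
        tuple(sum(basis[k][i] * coords[i][d] for i in range(n)) for d in range(3))
        for k in range(n)
    ]

def decode(tokens, n):
    """Rebuild n positions from a prefix of tokens (missing tokens count as zero)."""
    basis = dct_basis(n)
    return [
        tuple(sum(basis[k][i] * tok[d] for k, tok in enumerate(tokens)) for d in range(3))
        for i in range(n)
    ]

def coordinate_rmsd(a, b):
    """Root-mean-square deviation between matched points, with no superposition."""
    total = sum((p[d] - q[d]) ** 2 for p, q in zip(a, b) for d in range(3))
    return math.sqrt(total / len(a))

# Toy helix-like trace standing in for C-alpha coordinates (made-up numbers).
n_residues = 32
trace = [(2.3 * math.cos(0.6 * i), 2.3 * math.sin(0.6 * i), 1.5 * i) for i in range(n_residues)]
tokens = encode(trace)

# Decode from progressively longer prefixes: early tokens give the coarse shape.
budgets = [1, 4, 16, 32]
errors = [coordinate_rmsd(trace, decode(tokens[:k], n_residues)) for k in budgets]
for k, err in zip(budgets, errors):
    print(f"{k:2d} tokens -> RMSD {err:.3f}")
```

Because the basis is orthonormal, the error can only shrink as the prefix grows, which mirrors the property the text describes: every token touches the whole structure, and the token count becomes a tunable budget. In this toy the full token set happens to equal the residue count, whereas a learned tokenizer need not have that limit.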

Key Features

Global, coarse-to-fine tokenization: Successive tokens add increasing levels of detail to a single global representation, rather than encoding fixed local neighborhoods.
Adaptive representation length: The number of tokens can be tuned to the information content needed for a task, decoupling representation size from the number of residues.
Reduced generative error accumulation: Because its tokens are not pooled from local neighborhoods and then fed in sequentially, APT mitigates the compounding errors that can degrade autoregressive structure generation.
Strong representations: Non-linear probing on APT token sequences outperformed other tokenizers on CATH fold classification, indicating the tokens capture useful structural semantics.
Zero-shot design applications: The authors demonstrate zero-shot protein "shrinking" and affinity maturation, using the adaptive tokens to propose structural edits without task-specific retraining.
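To make the probing setup concrete, here is a minimal sketch of a non-linear probe on frozen token features. The source does not specify the probe architecture, and a small MLP would be the more typical choice. A k-nearest-neighbour classifier is swapped in here because it is short, deterministic, and still non-linear. The sketch also assumes that a fixed token budget gives every protein a same-size feature vector regardless of its length. The token values and the "alpha"/"beta" labels are invented for illustration and are not CATH data.

```python
import math
from collections import Counter

def flatten(tokens, budget):
    """Concatenate the first `budget` token vectors into one fixed-size feature vector."""
    return [value for token in tokens[:budget] for value in token]

def knn_predict(train, query, k=3):
    """Majority label among the k training vectors closest to `query` (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up token sequences of different lengths; labels are hypothetical fold classes.
labelled = [
    ([(0.9, 0.1), (0.8, 0.2), (0.1, 0.0)], "alpha"),
    ([(1.0, 0.0), (0.7, 0.3)], "alpha"),
    ([(0.8, 0.2), (0.9, 0.1), (0.0, 0.1), (0.2, 0.2)], "alpha"),
    ([(0.1, 0.9), (0.2, 0.8), (0.3, 0.1)], "beta"),
    ([(0.0, 1.0), (0.3, 0.7)], "beta"),
    ([(0.2, 0.8), (0.1, 0.9), (0.5, 0.5), (0.1, 0.1)], "beta"),
]

# With a budget of 2 tokens, every sequence yields a 4-number feature vector.
budget = 2
train = [(flatten(tokens, budget), label) for tokens, label in labelled]

query = flatten([(0.85, 0.15), (0.75, 0.25), (0.4, 0.4)], budget)
prediction = knn_predict(train, query, k=3)
print(prediction)
```

The tokenizer stays frozen throughout: only the probe sees labels, so probe accuracy reflects what the tokens already encode.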

Technical Details

APT is a structure tokenizer that learns to encode a protein into an ordered sequence of tokens in which information accrues globally and incrementally. The authors evaluated it across three task families. On reconstruction and generative tasks, APT matched or exceeded existing models built on local tokenizers. On representation tasks, non-linear probing of APT tokens outperformed competing tokenizers on CATH classification. On applications, the adaptive framework enabled zero-shot protein shrinking and affinity maturation, with the authors noting that adapting the number of tokens to information content "boosts designability." Because APT is so far described only in a February 2026 preprint, architectural specifics such as parameter count, codebook size, and training corpus are not yet available, and code and trained weights have not been published.
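One way to picture "adapting the number of tokens to information content" is a simple budget rule: keep the shortest token prefix whose reconstruction error is acceptable. The sketch below is an assumption about how such a rule could look, not the authors' procedure. The function name, the tolerance, and the error curves are all hypothetical, and the numbers are not results from the paper.

```python
def smallest_budget(errors_by_budget, tolerance):
    """Return the smallest token count whose reconstruction error is within tolerance.

    errors_by_budget maps token count -> error; returns None if no budget qualifies.
    """
    for budget in sorted(errors_by_budget):
        if errors_by_budget[budget] <= tolerance:
            return budget
    return None

# Hypothetical error curves (arbitrary units) for two structures of differing complexity.
simple_fold = {8: 2.9, 16: 1.4, 32: 0.8, 64: 0.6}
complex_fold = {8: 6.5, 16: 4.1, 32: 2.2, 64: 1.1}

simple_choice = smallest_budget(simple_fold, tolerance=1.5)
complex_choice = smallest_budget(complex_fold, tolerance=1.5)
print(simple_choice, complex_choice)
```

Under this rule, a structure that is easy to reconstruct receives fewer tokens than a harder one at the same tolerance, so representation length follows structural complexity instead of residue count.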

Applications

APT is aimed at researchers building protein structure-generation pipelines and structure-aware representation models, as well as protein engineers. Because its tokens are adaptive and global, it could serve as a front end for generative models that need controllable representation budgets, and as a structural embedding for classification and retrieval. The demonstrated zero-shot capabilities (shrinking proteins and maturing binding affinity) point to direct use in design campaigns where one wants to edit a structure toward a goal without training a bespoke model for each task.

Impact

APT reframes a widely used component of the protein-AI stack, arguing that how structure is tokenized (locally versus globally, fixed-length versus adaptive) materially affects both generation quality and designability. By showing competitive reconstruction, stronger fold-classification probes, and zero-shot design behavior from a single adaptive representation, it offers an alternative to the per-residue tokenizers now common in structure models. Because the work is a recent preprint with no released code or weights, its results still await peer review and independent reproduction, but the adaptive-tokenization idea is a notable contribution to how structural foundation models are designed.

Citation

Adaptive Protein Tokenization

Preprint

Dilip, R., Varshney, A., and Van Valen, D. (2026). Adaptive Protein Tokenization. arXiv preprint.

DOI: 10.48550/arXiv.2602.06418

Recent citations

Papers that recently cited this model.

Automated assembly of protein complexes from cryo-EM maps with structure-informed Monte Carlo Tree Search
Rohit Dilip, Songrong Qu, Zhen Chen, et al.
bioRxiv · May 2026
0
Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation
Nabin Giri, Steven Farrell, Kristofer E. Bouchard
May 2026
0

Top citations

The most-cited papers that cite this model.

Automated assembly of protein complexes from cryo-EM maps with structure-informed Monte Carlo Tree Search
Rohit Dilip, Songrong Qu, Zhen Chen, et al.
bioRxiv · May 2026
0
Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation
Nabin Giri, Steven Farrell, Kristofer E. Bouchard
May 2026
0

Citations

Total Citations: 2

Influential: 0

References: 67

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 50%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

6 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 0 (not reproducible)

Model Openness Framework: Unclassified

Restrictive license on core components

