A global protein structure tokenizer whose successive tokens add increasing detail, enabling adaptive-length representations, better generation, and zero-shot protein design.
Tokenizing three-dimensional protein structure into discrete or compressed codes has become a foundational building block for modern structure-generation and structure-aware language models. The dominant recipe pools information from local spatial neighborhoods, producing one token per residue (or per fixed block) regardless of how much structural information that region actually carries. This local strategy is convenient but has costs: it fixes the representation length to the sequence length, and it can let reconstruction errors accumulate when tokens are fed autoregressively into generative models.
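To make the length coupling concrete, here is a minimal sketch of a local tokenizer. It is a toy built for illustration and does not reproduce any published tokenizer: the function name `local_tokens`, the 8 Å radius, and the neighbor-count binning are all assumptions. Each residue gets one token derived from how many other residues lie nearby, so the token count always equals the residue count.

```python
import math

def local_tokens(coords, radius=8.0, n_bins=4):
    """Toy local tokenizer (hypothetical): one token per residue,
    computed from the number of neighbors within `radius`."""
    tokens = []
    for i, p in enumerate(coords):
        neighbors = sum(
            1 for j, q in enumerate(coords)
            if j != i and math.dist(p, q) <= radius
        )
        # Bin the neighbor count into a small discrete vocabulary.
        tokens.append(min(neighbors // 2, n_bins - 1))
    return tokens

# Two fully extended chains (3.8 units between consecutive residues).
short_chain = [(3.8 * i, 0.0, 0.0) for i in range(10)]
long_chain = [(3.8 * i, 0.0, 0.0) for i in range(50)]

short_tokens = local_tokens(short_chain)
long_tokens = local_tokens(long_chain)
print(len(short_tokens), len(long_tokens))
```

The longer chain is just as featureless as the shorter one, yet it costs five times as many tokens, which is the inefficiency the paragraph above describes.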
APT (Adaptive Protein Tokenization), introduced by Rohit Dilip, Ayush Varshney, and David Van Valen at Caltech in a February 2026 arXiv preprint, proposes a different philosophy. Instead of pooling local neighborhoods, APT tokenizes a structure globally: each successive token contributes an additional increment of detail to a single, whole-structure representation. Early tokens sketch the coarse global shape, and later tokens progressively refine it, so the information content of a representation can be adapted to the task at hand rather than being tied rigidly to the number of residues.
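The coarse-to-fine idea can be illustrated with a simple analogy. The sketch below is not APT's encoder, which is learned and whose architecture this summary does not specify. It swaps in a fixed orthonormal cosine (DCT-II) basis, where every coefficient depends on the whole chain and each additional coefficient adds finer detail. All names (`encode`, `decode`, `coord_rmsd`) and the toy helix parameters are assumptions made for illustration.

```python
import math

def dct_basis(n_points, k):
    """Orthonormal DCT-II basis vector k sampled at n_points positions."""
    scale = math.sqrt((1.0 if k == 0 else 2.0) / n_points)
    return [scale * math.cos(math.pi * (i + 0.5) * k / n_points)
            for i in range(n_points)]

def encode(coords):
    """Return an ordered list of global 'tokens': one (x, y, z)
    coefficient triple per basis function, low frequency first."""
    n = len(coords)
    tokens = []
    for k in range(n):
        basis = dct_basis(n, k)
        tokens.append(tuple(
            sum(b * p[axis] for b, p in zip(basis, coords))
            for axis in range(3)
        ))
    return tokens

def decode(tokens, n_points):
    """Reconstruct coordinates from any prefix of the token list."""
    recon = [[0.0, 0.0, 0.0] for _ in range(n_points)]
    for k, coeff in enumerate(tokens):
        basis = dct_basis(n_points, k)
        for i in range(n_points):
            for axis in range(3):
                recon[i][axis] += basis[i] * coeff[axis]
    return recon

def coord_rmsd(a, b):
    """Root-mean-square deviation between matched points (no superposition)."""
    sq = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(sq / len(a))

# Toy helix-like trace; the geometry is illustrative only.
coords = [(2.3 * math.cos(1.745 * i), 2.3 * math.sin(1.745 * i), 1.5 * i)
          for i in range(32)]
tokens = encode(coords)

budgets = (1, 4, 16, 32)
errors = [coord_rmsd(coords, decode(tokens[:k], len(coords))) for k in budgets]
for k, err in zip(budgets, errors):
    print(f"{k:2d} tokens -> RMSD {err:.3f}")
```

Because the basis is orthonormal, decoding a longer prefix can never increase the error, and the full token list reproduces the input up to floating-point precision. A learned tokenizer offers no such guarantee by construction; the analogy only conveys what it means for each token to be global and for detail to accrue with token count.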
This adaptive, coarse-to-fine scheme decouples representation length from sequence length and lets a model spend more or fewer tokens depending on how much structural detail a downstream task requires. The authors report that this design reduces error accumulation in generative settings and yields whole-structure embeddings directly, without a sequence-reduction step such as pooling over per-residue tokens.
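One way to picture an adaptive budget is a rule that picks the shortest token prefix meeting a task-specific error tolerance. The sketch below is hypothetical: `smallest_budget` is not part of any released APT interface, and `toy_error` is a made-up error curve standing in for a real reconstruction metric.

```python
def smallest_budget(error_at, max_tokens, tolerance):
    """Return the smallest prefix length k with error_at(k) <= tolerance.

    Falls back to max_tokens if no prefix meets the tolerance. A linear
    scan is used so the error curve need not be monotonic.
    """
    for k in range(1, max_tokens + 1):
        if error_at(k) <= tolerance:
            return k
    return max_tokens

def toy_error(k):
    """Hypothetical error curve (not measured data): error falls as 1/k."""
    return 10.0 / k

# A coarse task tolerates more error, so it needs fewer tokens.
coarse_budget = smallest_budget(toy_error, max_tokens=64, tolerance=2.0)
fine_budget = smallest_budget(toy_error, max_tokens=64, tolerance=0.5)
print(coarse_budget, fine_budget)
```

In practice the error function would be whatever the downstream task cares about, for example reconstruction RMSD for generation or validation accuracy for a classifier probe.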
APT is a structure tokenizer that learns to encode a protein into an ordered sequence of tokens in which information accrues globally and incrementally. The authors evaluated it across three task families. On reconstruction and generative tasks, APT matched or exceeded existing models built on local tokenizers. On representation tasks, non-linear probing of APT tokens outperformed competing tokenizers on CATH classification. On applications, the adaptive framework enabled zero-shot protein shrinking and affinity maturation, with the authors noting that adapting the number of tokens to information content "boosts designability." Because APT is so far described only in a February 2026 preprint, architectural specifics such as parameter count, codebook size, and training corpus await the complete release; code and trained weights have not yet been published.
APT is aimed at researchers building protein structure-generation pipelines and structure-aware representation models, as well as protein engineers. Because its tokens are adaptive and global, it can serve as a drop-in front end for generative models that need controllable representation budgets, and as a structural embedding for classification and retrieval. The demonstrated zero-shot capabilities — shrinking proteins and maturing binding affinity — point to direct use in design campaigns where one wants to edit a structure toward a goal without training a bespoke model for each task.
APT reframes a widely used component of the protein-AI stack, arguing that how structure is tokenized (locally versus globally, fixed-length versus adaptive) materially affects both generation quality and designability. By showing competitive reconstruction, stronger fold-classification probes, and zero-shot design behavior from a single adaptive representation, it offers an alternative to the per-residue tokenizers now common in structure models. Because the work is a recent preprint without released code or weights, its results still await peer review and independent reproduction, but the adaptive-tokenization idea is a notable contribution to how structural foundation models are designed.