A geometry-complete, SE(3)-equivariant vector-quantized autoencoder that tokenizes protein 3D backbones into discrete tokens while preserving orientation and chirality.
Discrete structure tokenizers have become a foundational building block for protein machine learning: by mapping continuous 3D coordinates to a small vocabulary of tokens, they let structure be modeled with the same language-model and generative machinery used for sequence. The quality of these tokenizers—how faithfully a backbone can be reconstructed from its tokens, and how efficiently—directly bounds the performance of everything built on top of them. GCP-VQVAE, from the University of Missouri, is a structure tokenizer designed to be geometry-complete: it preserves both orientation and chirality, properties that simpler invariant encoders can discard.
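To make the chirality point concrete, here is a minimal sketch in plain Python. The four-point fragment and the helper names are illustrative assumptions, not part of GCP-VQVAE. It shows that a purely distance-based (invariant) description is identical for a structure and its mirror image, while a signed quantity such as the scalar triple product flips sign.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """All inter-point distances, in a fixed pair order (an invariant feature set)."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def signed_volume(a, b, c, d):
    """Scalar triple product (b-a) . ((c-a) x (d-a)); its sign encodes handedness."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    v_cross_w = [
        v[1] * w[2] - v[2] * w[1],
        v[2] * w[0] - v[0] * w[2],
        v[0] * w[1] - v[1] * w[0],
    ]
    return sum(u[i] * v_cross_w[i] for i in range(3))

# Hypothetical four-atom fragment (coordinates chosen only for illustration).
fragment = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0), (3.3, 1.6, 1.1)]
# Mirror image: reflect through the xy-plane.
mirrored = [(x, y, -z) for x, y, z in fragment]

same_distances = all(
    math.isclose(d1, d2)
    for d1, d2 in zip(pairwise_distances(fragment), pairwise_distances(mirrored))
)
print("distances identical:", same_distances)
print("signed volume (original):", signed_volume(*fragment))
print("signed volume (mirrored):", signed_volume(*mirrored))
```

An encoder that sees only the distances cannot tell the two fragments apart; one that also carries signed or oriented features can.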
The model couples an SE(3)-equivariant GCPNet encoder with a vector-quantized codebook and a Transformer decoder, training the whole system as an autoencoder on 24 million monomer protein backbone structures from the AlphaFold Protein Structure Database. The result is a discrete representation that reconstructs backbones to sub-angstrom accuracy while running hundreds of times faster than prior state-of-the-art tokenizers, with model weights released under an MIT license.
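The quantization step at the heart of any vector-quantized autoencoder can be sketched as a nearest-neighbor lookup. The example below is the generic VQ-VAE lookup in plain Python, not GCP-VQVAE's own implementation: the function names, the two-dimensional embeddings, and the four-entry codebook are all assumptions for illustration, and training details such as how gradients pass through the lookup are omitted.

```python
def quantize(embeddings, codebook):
    """Return, for each embedding, the index of its nearest codebook vector."""
    tokens = []
    for z in embeddings:
        # Squared Euclidean distance from z to every codebook entry.
        dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens

def dequantize(tokens, codebook):
    """Look the token indices back up to recover the quantized vectors."""
    return [codebook[t] for t in tokens]

# Toy codebook standing in for the real vocabulary (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
# Toy per-residue encoder outputs (hypothetical values).
embeddings = [(0.1, 0.2), (0.9, 1.1), (0.8, -0.1)]

tokens = quantize(embeddings, codebook)
print(tokens)                        # discrete tokens, one per residue
print(dequantize(tokens, codebook))  # what the decoder would receive
```

In a real tokenizer the encoder produces one embedding per residue, the integer indices are the tokens handed to downstream models, and the decoder reconstructs coordinates from the looked-up codebook vectors.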
GCP-VQVAE encodes a protein backbone with an SE(3)-equivariant GCPNet, quantizes the result against a 4,096-token vocabulary using a rotation- and translation-invariant quantization scheme, and reconstructs coordinates with a Transformer decoder that includes a 6D rotation head. Trained on 24M AlphaFold monomer structures, it reports backbone RMSDs of 0.4377 Å on CAMEO2024, 0.5293 Å on CASP15, and 0.7567 Å on CASP16, plus 0.8193 Å RMSD and 0.9673 TM-score on a zero-shot set of 1,938 new structures, all at 100% codebook utilization. The implementation requires Python 3.10+ and PyTorch 2.5+, with roughly 12 GB of GPU memory recommended for inference.
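A 6D rotation head outputs six unconstrained numbers that are then turned into a valid rotation matrix. How GCP-VQVAE does this is not specified here; the sketch below assumes the commonly used Gram-Schmidt construction for the continuous 6D rotation representation, written in plain Python with illustrative function names and a made-up network output.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cross(u, v):
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]

def rotation_from_6d(x):
    """Map six raw numbers to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = list(x[:3]), list(x[3:6])
    b1 = normalize(a1)
    # Remove the component of a2 along b1, then normalize.
    overlap = dot(b1, a2)
    b2 = normalize([a - overlap * b for a, b in zip(a2, b1)])
    # The third axis is fixed by the right-hand rule.
    b3 = cross(b1, b2)
    # b1, b2, b3 become the columns of the matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

def determinant(m):
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )

raw_output = [0.3, -1.2, 0.5, 2.0, 0.1, -0.7]  # hypothetical head output
R = rotation_from_6d(raw_output)
print(R)
print("determinant:", determinant(R))
```

Because the third axis comes from a cross product, the result is always a proper rotation with determinant +1 and never a reflection, which is consistent with a decoder that is meant to respect chirality.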
A high-fidelity, fast structure tokenizer is most valuable as infrastructure: it turns protein backbones into discrete tokens that downstream models can generate, edit, search, or condition on. GCP-VQVAE can serve as the structural front-end for protein structure language models, generative backbone design, structure-based retrieval, and any pipeline that benefits from compressing 3D geometry into a compact, reconstructable code. The released large and lite checkpoints let groups trade accuracy for speed depending on their compute budget.
By combining geometric completeness with sub-angstrom reconstruction and large speedups, GCP-VQVAE pushes on reconstruction fidelity and throughput, the two axes that matter most for downstream generative and representation work built on discrete structure tokens. Its permissive MIT-licensed weights lower the barrier to adoption as a drop-in tokenizer. Because the work is a 2025 preprint (currently in its third revision), its standing relative to other structure tokenizers has yet to be settled by peer review and independent benchmarking.