GCP-VQVAE

Protein structure tokenizer that maps 3D backbones to discrete tokens with an SE(3)-equivariant encoder preserving orientation and chirality.

Released: October 2025

Discrete structure tokenizers have become a foundational building block for protein machine learning: by mapping continuous 3D coordinates to a small vocabulary of tokens, they let structure be modeled with the same language-model and generative machinery used for sequence. The quality of these tokenizers (how faithfully, and how efficiently, a backbone can be reconstructed from its tokens) directly bounds the performance of everything built on top of them. GCP-VQVAE, from the University of Missouri, is a structure tokenizer designed to be geometry-complete: it preserves both orientation and chirality, properties that simpler invariant encoders can discard.

The model couples an SE(3)-equivariant GCPNet encoder with a vector-quantized codebook and a Transformer decoder, and trains the whole system as an autoencoder on 24 million monomer protein backbone structures from the AlphaFold Protein Structure Database. The result is a discrete representation that reconstructs backbones to sub-angstrom accuracy while running a reported 408× and 530× faster than the previous state-of-the-art tokenizer. Model weights are released under an MIT license.

Key Features

Geometry-complete tokenization: An SE(3)-equivariant GCPNet encoder preserves orientation and chirality, avoiding the information loss of purely invariant encoders.
Released weights (MIT): Both a full "large" and a lightweight "lite" checkpoint are available on Hugging Face under a permissive MIT license.
High-fidelity reconstruction: Achieves sub-angstrom backbone RMSD on standard benchmarks and reports 100% codebook utilization.
Large-scale training: Trained on 24 million AlphaFold monomer backbone structures, giving broad structural coverage.
Efficiency: Reported as 408× and 530× faster than the previous state-of-the-art tokenizer on its evaluation settings.
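
To make the geometry-complete point above concrete, the following is a minimal sketch of one way chirality can be lost. It assumes a toy encoder that sees only pairwise distances, which is an illustrative simplification and not a description of GCPNet or of any specific prior tokenizer. The four points and all function names are hypothetical.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pairwise_distances(points):
    """All inter-point distances; unchanged by rotation, translation and reflection."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians; its sign flips under reflection."""
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)
    y = dot(cross(n1, n2), b1) / math.sqrt(dot(b1, b1))
    x = dot(n1, n2)
    return math.atan2(y, x)

# Four hypothetical, non-coplanar points standing in for consecutive backbone atoms.
points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0), (3.2, 1.6, 1.1)]
# Mirror image: reflect through the xy-plane by negating z.
mirrored = [(x, y, -z) for x, y, z in points]

same_distances = all(
    math.isclose(a, b)
    for a, b in zip(pairwise_distances(points), pairwise_distances(mirrored))
)
print(same_distances)                                  # distances cannot tell the two apart
print(dihedral(*points) > 0, dihedral(*mirrored) > 0)  # the signed angle can
```

A distance-only featurization gives identical inputs for a structure and its mirror image, while a signed quantity such as the dihedral angle separates them. This is the kind of handedness information a geometry-complete encoder is meant to retain.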

Technical Details

GCP-VQVAE encodes a protein backbone with an SE(3)-equivariant GCPNet, quantizes the result against a 4,096-token vocabulary using a rotation- and translation-invariant quantization scheme, and reconstructs coordinates with a Transformer decoder that includes a 6D rotation head. Trained on 24M AlphaFold monomer structures, it reports backbone RMSDs of 0.4377 Å on CAMEO2024, 0.5293 Å on CASP15, and 0.7567 Å on CASP16. On a zero-shot set of 1,938 new structures it reports an RMSD of 0.8193 Å and a TM-score of 0.9673. All of these results are reported at 100% codebook utilization. The implementation requires Python 3.10+ and PyTorch 2.5+, with roughly 12 GB of GPU memory recommended for inference.
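
The quantization step and the codebook-utilization metric described above can be sketched as a generic nearest-codeword lookup. This is a minimal illustration of standard vector quantization, not the model's actual rotation- and translation-invariant scheme. The tiny 4-entry, 2-dimensional codebook and all names are hypothetical; the released model uses a 4,096-entry vocabulary.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in embeddings
    ]

def codebook_utilization(token_ids, vocab_size):
    """Fraction of codebook entries that appear at least once in token_ids."""
    return len(set(token_ids)) / vocab_size

# Hypothetical toy codebook (the real vocabulary has 4,096 entries).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Hypothetical per-residue encoder outputs.
embeddings = [(0.1, -0.1), (0.9, 0.2), (0.8, 0.1), (0.2, 0.95)]

tokens = quantize(embeddings, codebook)
print(tokens)                                        # [0, 1, 1, 2]
print(codebook_utilization(tokens, len(codebook)))   # 0.75
```

In this toy run, entry 3 is never selected, so utilization is 75%. A reported utilization of 100% means every entry of the vocabulary is selected at least once over the evaluated structures.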

Applications

A high-fidelity, fast structure tokenizer is most valuable as infrastructure: it turns protein backbones into discrete tokens that downstream models can generate, edit, search, or condition on. GCP-VQVAE can serve as the structural front-end for protein structure language models, generative backbone design, structure-based retrieval, and any pipeline that benefits from compressing 3D geometry into a compact, reconstructable code. The released large and lite checkpoints let groups trade accuracy for speed depending on their compute budget.
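
As a rough illustration of how compact such a code can be, the arithmetic below compares token storage with raw coordinates. It is a back-of-the-envelope sketch under assumptions not stated in the source: one token per residue, three backbone atoms (N, CA, C) per residue, and 32-bit floats for raw coordinates. The function names are hypothetical.

```python
import math

def bits_per_token(vocab_size):
    """Information needed to index one entry of the vocabulary."""
    return math.log2(vocab_size)

def token_bytes(num_residues, vocab_size, tokens_per_residue=1):
    """Bytes for the token sequence; one token per residue is an assumption."""
    total_bits = num_residues * tokens_per_residue * bits_per_token(vocab_size)
    return math.ceil(total_bits / 8)

def raw_backbone_bytes(num_residues, atoms_per_residue=3, bytes_per_float=4):
    """Bytes for raw x, y, z coordinates; atom count and float32 are assumptions."""
    return num_residues * atoms_per_residue * 3 * bytes_per_float

print(bits_per_token(4096))        # 12.0
print(token_bytes(300, 4096))      # 450
print(raw_backbone_bytes(300))     # 10800
```

Under these assumptions, a 300-residue backbone needs 450 bytes as tokens versus 10,800 bytes as raw coordinates, a 24-fold reduction.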

Impact

By emphasizing geometric completeness and reporting both sub-angstrom reconstruction and large speedups, GCP-VQVAE pushes on the two axes that matter most for downstream generative and representation work built on discrete structure tokens: fidelity and speed. Its permissive MIT-licensed weights lower the barrier to adoption as a drop-in tokenizer. As a 2025 preprint (currently in its third revision), its standing relative to other structure tokenizers will be clarified through peer review and independent benchmarking.

Citation

GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure

Preprint

Pourmirzaei, M., et al. (2025) GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure. bioRxiv.

DOI: 10.1101/2025.10.01.679833

Recent citations

Papers that recently cited this model.

Ligand-Conditioned Discrete Diffusion for Protein Sequence-Structure Co-Design
Chen Wei, Fanding Xu, Minghao Sun, et al.
May 2026
0
RiboSphere: Learning Unified and Efficient Representations of RNA Structures
Zhou Zhang, Hanqun Cao, Cheng Tan, et al.
Mar 2026
0

Citations

Total Citations: 2
Influential: 0
References: 57

GitHub

Stars: 43
Forks: 2
Open Issues: 0
Contributors: 5
Last Push: 2 months ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 5 months ago

Fields of citing research

Share of papers citing this model.

Biology: 100%
Computer Science: 100%

Openness

bio.rodeo openness: Fully open (usable and reproducible)
Score: 86 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 78

Model Openness Framework: Class II, Open Tooling
