Protein language model combining amino acid and Foldseek 3Di structural tokens, outperforming ESM-2 across 10 downstream tasks including mutation effect prediction.
SaProt is a protein language model developed at Westlake University that addresses a fundamental limitation of sequence-only protein language models: the inability to directly encode three-dimensional structural information during pre-training. Rather than relying solely on amino acid tokens, SaProt introduces a structure-aware vocabulary that pairs each residue's one-letter amino acid code with a corresponding structural state token derived from Foldseek's 3Di alphabet. This dual-token representation allows the model to learn the joint language of protein sequence and structure simultaneously, rather than treating structure as a secondary prediction target.
The 3Di tokens are generated by running Foldseek on experimental PDB structures or AlphaFold2 predictions, discretizing the local geometry of each residue and its nearest spatial neighbor into one of 20 structural states. Each residue in the input is represented as a two-character token such as "Ac", where "A" is alanine and "c" is the corresponding 3Di state. In regions of low AlphaFold2 confidence (pLDDT below 70), the structural character is replaced with the placeholder "#", preventing noisy structural assignments from degrading the pre-training signal. This design allows SaProt to be applied to any protein with a known or predicted structure, which in practice means it can leverage the entire AlphaFold Database.
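As a minimal sketch of this tokenization, the pure-Python helper below pairs each amino acid with its 3Di state and applies the pLDDT mask. The function name and interface are hypothetical; SaProt's own preprocessing lives in its repository.

```python
def to_sa_tokens(aa_seq: str, di_seq: str, plddt: list, cutoff: float = 70.0) -> str:
    """Pair each residue with its 3Di state, masking low-confidence regions.

    aa_seq: one-letter amino acids, e.g. "MAE..."
    di_seq: Foldseek 3Di states (lower case), same length as aa_seq
    plddt:  per-residue AlphaFold2 confidence scores
    """
    assert len(aa_seq) == len(di_seq) == len(plddt)
    tokens = []
    for aa, di, conf in zip(aa_seq, di_seq, plddt):
        # Replace the 3Di state with "#" when pLDDT falls below the cutoff,
        # so noisy structural assignments do not corrupt the token stream.
        tokens.append(aa + (di if conf >= cutoff else "#"))
    return "".join(tokens)

print(to_sa_tokens("MAE", "dvp", [92.1, 65.3, 88.0]))  # -> "MdA#Ep"
```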
SaProt was presented as a spotlight paper at ICLR 2024, one of the most competitive machine learning venues. The work was authored by Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, and Fajie Yuan. A follow-on ecosystem, SaprotHub, was subsequently published in Nature Biotechnology, providing no-code fine-tuning and model-sharing infrastructure built on top of the SaProt foundation.
SaProt's 650M-parameter flagship model shares its architecture with ESM-2 (a transformer encoder trained with masked language modeling) but extends the tokenizer to accommodate the paired sequence-structure vocabulary. Pre-training proceeded in two phases: an initial phase on approximately 40 million AlphaFold2-predicted structures covering the breadth of known protein sequence space, followed by a second phase incorporating roughly 60,000 experimental PDB structures to ground the model in high-quality empirical data. The largest 1.3B variant extends further, incorporating 200 million OMG_prot50 sequences and 150 million NCBI sequences filtered at 70% identity. Training the 650M model required 64 NVIDIA A100 80GB GPUs running for approximately three months.
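Because the architecture follows ESM-2, the checkpoints can be loaded through HuggingFace's standard ESM classes. The sketch below follows the usage pattern documented in the SaProt repository; the checkpoint ID and the example sequence are assumptions for illustration.

```python
from transformers import EsmTokenizer, EsmForMaskedLM

# Checkpoint ID assumed here; SaProt publishes checkpoints on HuggingFace.
model_path = "westlake-repl/SaProt_650M_AF2"
tokenizer = EsmTokenizer.from_pretrained(model_path)
model = EsmForMaskedLM.from_pretrained(model_path)

# A structure-aware input: upper-case amino acids interleaved with
# lower-case 3Di states ("#" where the structure is masked or unknown).
sa_seq = "MdEvVpQpLrVyQdYaKv"
inputs = tokenizer(sa_seq, return_tensors="pt")
logits = model(**inputs).logits  # shape: [1, num_residues + 2, vocab_size]
```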
Benchmark results from the 650M PDB-trained checkpoint are consistently above ESM-2 baselines of the same size: Spearman's rho on thermostability improves from 0.680 to 0.724, Fmax on enzyme classification improves from 0.868 to 0.882, and subcellular localization accuracy improves from 82.09% to 85.57%. Contact prediction shows especially pronounced gains, reflecting the structural information directly encoded in the 3Di tokens. At inference time, structural tokens can be generated on the fly by running Foldseek on any input PDB or CIF file. The 1.3B model is documented to perform competitively even when only amino acid tokens are provided, making it applicable to proteins without available structures.
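A hedged sketch of that on-the-fly 3Di generation step, assuming the foldseek binary is on PATH and that its structureto3didescriptor module writes one tab-separated line per chain (the column layout below is an assumption, not a documented contract):

```python
import pathlib
import subprocess
import tempfile

def get_3di(structure_path: str) -> str:
    """Derive a 3Di string for a PDB/CIF file via the Foldseek CLI."""
    with tempfile.TemporaryDirectory() as tmp:
        out = pathlib.Path(tmp) / "out.tsv"
        subprocess.run(
            ["foldseek", "structureto3didescriptor", structure_path, str(out)],
            check=True,
            capture_output=True,
        )
        # Assumed layout: one line per chain, with the amino acid sequence
        # in column 2 and the 3Di string in column 3.
        first_chain = out.read_text().splitlines()[0].split("\t")
        return first_chain[2].lower()  # SaProt uses lower-case 3Di letters
```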
SaProt is well-suited for researchers working on protein function annotation, variant effect prediction, and protein engineering. Clinical researchers can apply its zero-shot mutation scoring to prioritize variants of uncertain significance in ClinVar-style databases. Protein engineers can use it to predict thermostability changes, enzyme activity, and binding affinity shifts without requiring labeled training data for each new protein family. Computational biologists can fine-tune task-specific models using SaprotHub's no-code Colab interface, which packages LoRA-based fine-tuning into a point-and-click workflow that runs on free-tier Colab hardware. The model is also a natural starting point for structural bioinformatics pipelines that combine Foldseek-based structure search with sequence-based machine learning.
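For the zero-shot mutation scoring mentioned above, a common recipe with masked language models is to mask the mutated site and compare the log-probabilities of the wild-type and mutant tokens. The sketch below adapts that recipe to SaProt's paired vocabulary by replacing only the amino-acid half of the token with the "#" placeholder while keeping the 3Di state; this masking convention is an interpretation of the paper's approach, and the helper is illustrative rather than SaProt's exact implementation.

```python
import torch
from transformers import EsmTokenizer, EsmForMaskedLM

model_path = "westlake-repl/SaProt_650M_AF2"  # assumed checkpoint ID
tokenizer = EsmTokenizer.from_pretrained(model_path)
model = EsmForMaskedLM.from_pretrained(model_path).eval()

def score_mutation(sa_tokens: list, pos: int, wt: str, mut: str) -> float:
    """Return log p(mut) - log p(wt) at the 0-based residue index `pos`.

    sa_tokens: list of two-character SA tokens, e.g. ["Md", "Ev", ...]
    wt, mut:   one-letter amino acid codes
    """
    struc = sa_tokens[pos][1]          # keep the 3Di state at the site
    masked = list(sa_tokens)
    masked[pos] = "#" + struc          # mask only the amino-acid channel
    inputs = tokenizer("".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    log_probs = logits[pos + 1].log_softmax(-1)  # +1 skips the BOS token
    wt_id = tokenizer.convert_tokens_to_ids(wt + struc)
    mut_id = tokenizer.convert_tokens_to_ids(mut + struc)
    return (log_probs[mut_id] - log_probs[wt_id]).item()
```

A positive score indicates the model assigns the mutant residue a higher likelihood than the wild type in its structural context, which can be used to rank candidate variants without any task-specific training.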
SaProt represents a conceptually important step toward truly multimodal protein representations, demonstrating that integrating structural information directly into the input vocabulary — rather than through auxiliary supervision or post-hoc structure modules — yields consistent and substantial performance improvements across diverse tasks. Its ICLR 2024 Spotlight designation reflects peer recognition of this contribution in the broader machine learning community, not only within computational biology. The SaprotHub follow-on, published in Nature Biotechnology in 2025, has extended the model's reach by removing the technical barriers to fine-tuning and sharing, making SaProt accessible to wet-lab biologists and researchers without deep ML expertise. A practical limitation is the dependency on structural input: the 35M and 650M models rely on 3Di tokens for optimal performance, requiring access to AlphaFold2 predictions or experimental structures for every protein of interest.
Su, J., Han, C., Zhou, Y., Shan, J., Zhou, X., & Yuan, F. (2024). SaProt: Protein Language Modeling with Structure-aware Vocabulary. ICLR 2024 (Spotlight). bioRxiv preprint, DOI: 10.1101/2023.10.01.560349.