Protein language model combining amino acid and Foldseek 3Di structural tokens, outperforming ESM-2 across 10 downstream tasks including mutation effect prediction.
SaProt is a protein language model developed at Westlake University that addresses a fundamental limitation of sequence-only protein language models: the inability to directly encode three-dimensional structural information during pre-training. Rather than relying solely on amino acid tokens, SaProt introduces a structure-aware vocabulary that pairs each residue's one-letter amino acid code with a corresponding structural state token derived from Foldseek's 3Di alphabet. This dual-token representation allows the model to learn the joint language of protein sequence and structure simultaneously, rather than treating structure as a secondary prediction target.
The 3Di tokens are generated by running Foldseek on experimental PDB structures or AlphaFold2 predictions, discretizing the local geometry of each residue and its nearest spatial neighbor into one of 20 structural states. Each residue in the input is represented as a two-character token such as "Ac", where "A" is alanine and "c" is the corresponding 3Di state. In regions of low AlphaFold2 confidence (pLDDT below 70), the structural character is replaced with the placeholder "#", preventing noisy structural assignments from degrading the pre-training signal. This design allows SaProt to be applied to any protein with a known or predicted structure, which in practice means it can leverage the entire AlphaFold Database.
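As a minimal sketch of this tokenization, the pure-Python helper below pairs each amino acid with its 3Di state and applies the pLDDT mask. The function name and interface are hypothetical; SaProt's own preprocessing lives in its repository.

```python
def to_sa_tokens(aa_seq: str, di_seq: str, plddt: list, cutoff: float = 70.0) -> str:
    """Pair each residue with its 3Di state, masking low-confidence regions.

    aa_seq: one-letter amino acids, e.g. "MAE..."
    di_seq: Foldseek 3Di states (lower case), same length as aa_seq
    plddt:  per-residue AlphaFold2 confidence scores
    """
    assert len(aa_seq) == len(di_seq) == len(plddt)
    tokens = []
    for aa, di, conf in zip(aa_seq, di_seq, plddt):
        # Replace the 3Di state with "#" when pLDDT falls below the cutoff,
        # so noisy structural assignments do not corrupt the token stream.
        tokens.append(aa + (di if conf >= cutoff else "#"))
    return "".join(tokens)

print(to_sa_tokens("MAE", "dvp", [92.1, 65.3, 88.0]))  # -> "MdA#Ep"
```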
SaProt was presented as a spotlight paper at ICLR 2024, one of the most competitive machine learning venues. The work was authored by Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, and Fajie Yuan. A follow-on ecosystem, SaprotHub, was subsequently published in Nature Biotechnology, providing no-code fine-tuning and model-sharing infrastructure built on top of the SaProt foundation.
SaProt's 650M-parameter flagship model shares its architecture with ESM-2 (a transformer encoder trained with masked language modeling) but extends the tokenizer to accommodate the paired sequence-structure vocabulary. Pre-training proceeded in two phases: an initial phase on approximately 40 million AlphaFold2-predicted structures covering the breadth of known protein sequence space, followed by a second phase incorporating roughly 60,000 experimental PDB structures to ground the model in high-quality empirical data. The largest 1.3B variant extends further, incorporating 200 million OMG_prot50 sequences and 150 million NCBI sequences filtered at 70% identity. Training the 650M model required 64 NVIDIA A100 80GB GPUs running for approximately three months.
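Because the architecture follows ESM-2, the checkpoints can be loaded through HuggingFace's standard ESM classes. The sketch below follows the usage pattern documented in the SaProt repository; the checkpoint ID and the example sequence are assumptions for illustration.

```python
from transformers import EsmTokenizer, EsmForMaskedLM

# Checkpoint ID assumed here; SaProt publishes checkpoints on HuggingFace.
model_path = "westlake-repl/SaProt_650M_AF2"
tokenizer = EsmTokenizer.from_pretrained(model_path)
model = EsmForMaskedLM.from_pretrained(model_path)

# A structure-aware input: upper-case amino acids interleaved with
# lower-case 3Di states ("#" where the structure is masked or unknown).
sa_seq = "MdEvVpQpLrVyQdYaKv"
inputs = tokenizer(sa_seq, return_tensors="pt")
logits = model(**inputs).logits  # shape: [1, num_residues + 2, vocab_size]
```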
Benchmark results from the 650M PDB-trained checkpoint are consistently above ESM-2 baselines of the same size: Spearman's rho on thermostability improves from 0.680 to 0.724, Fmax on enzyme classification improves from 0.868 to 0.882, and subcellular localization accuracy improves from 82.09% to 85.57%. Contact prediction shows especially pronounced gains, reflecting the structural information directly encoded in the 3Di tokens. At inference time, structural tokens can be generated on the fly by running Foldseek on any input PDB or CIF file. The 1.3B model is documented to perform competitively even when only amino acid tokens are provided, making it applicable to proteins without available structures.
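A hedged sketch of that on-the-fly 3Di generation step, assuming the foldseek binary is on PATH and that its structureto3didescriptor module writes one tab-separated line per chain (the column layout below is an assumption, not a documented contract):

```python
import pathlib
import subprocess
import tempfile

def get_3di(structure_path: str) -> str:
    """Derive a 3Di string for a PDB/CIF file via the Foldseek CLI."""
    with tempfile.TemporaryDirectory() as tmp:
        out = pathlib.Path(tmp) / "out.tsv"
        subprocess.run(
            ["foldseek", "structureto3didescriptor", structure_path, str(out)],
            check=True,
            capture_output=True,
        )
        # Assumed layout: one line per chain, with the amino acid sequence
        # in column 2 and the 3Di string in column 3.
        first_chain = out.read_text().splitlines()[0].split("\t")
        return first_chain[2].lower()  # SaProt uses lower-case 3Di letters
```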
SaProt is well-suited for researchers working on protein function annotation, variant effect prediction, and protein engineering. Clinical researchers can apply its zero-shot mutation scoring to prioritize variants of uncertain significance in ClinVar-style databases. Protein engineers can use it to predict thermostability changes, enzyme activity, and binding affinity shifts without requiring labeled training data for each new protein family. Computational biologists can fine-tune task-specific models using SaprotHub's no-code Colab interface, which packages LoRA-based fine-tuning into a point-and-click workflow that runs on free-tier Colab hardware. The model is also a natural starting point for structural bioinformatics pipelines that combine Foldseek-based structure search with sequence-based machine learning.
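For the zero-shot mutation scoring mentioned above, a common recipe with masked language models is to mask the mutated site and compare the log-probabilities of the wild-type and mutant tokens. The sketch below adapts that recipe to SaProt's paired vocabulary by replacing only the amino-acid half of the token with the "#" placeholder while keeping the 3Di state; this masking convention is an interpretation of the paper's approach, and the helper is illustrative rather than SaProt's exact implementation.

```python
import torch
from transformers import EsmTokenizer, EsmForMaskedLM

model_path = "westlake-repl/SaProt_650M_AF2"  # assumed checkpoint ID
tokenizer = EsmTokenizer.from_pretrained(model_path)
model = EsmForMaskedLM.from_pretrained(model_path).eval()

def score_mutation(sa_tokens: list, pos: int, wt: str, mut: str) -> float:
    """Return log p(mut) - log p(wt) at the 0-based residue index `pos`.

    sa_tokens: list of two-character SA tokens, e.g. ["Md", "Ev", ...]
    wt, mut:   one-letter amino acid codes
    """
    struc = sa_tokens[pos][1]          # keep the 3Di state at the site
    masked = list(sa_tokens)
    masked[pos] = "#" + struc          # mask only the amino-acid channel
    inputs = tokenizer("".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    log_probs = logits[pos + 1].log_softmax(-1)  # +1 skips the BOS token
    wt_id = tokenizer.convert_tokens_to_ids(wt + struc)
    mut_id = tokenizer.convert_tokens_to_ids(mut + struc)
    return (log_probs[mut_id] - log_probs[wt_id]).item()
```

A positive score indicates the model assigns the mutant residue a higher likelihood than the wild type in its structural context, which can be used to rank candidate variants without any task-specific training.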
SaProt represents a conceptually important step toward truly multimodal protein representations, demonstrating that integrating structural information directly into the input vocabulary — rather than through auxiliary supervision or post-hoc structure modules — yields consistent and substantial performance improvements across diverse tasks. Its ICLR 2024 Spotlight designation reflects peer recognition of this contribution in the broader machine learning community, not only within computational biology. The SaprotHub follow-on, published in Nature Biotechnology in 2025, has extended the model's reach by removing the technical barriers to fine-tuning and sharing, making SaProt accessible to wet-lab biologists and researchers without deep ML expertise. A practical limitation is the dependency on structural input: the 35M and 650M models rely on 3Di tokens for optimal performance, requiring access to AlphaFold2 predictions or experimental structures for every protein of interest.
Su, J., Han, C., Zhou, Y., Shan, J., Zhou, X., & Yuan, F. (2024). SaProt: Protein Language Modeling with Structure-aware Vocabulary. ICLR 2024 (Spotlight). bioRxiv preprint, DOI: 10.1101/2023.10.01.560349.