A unified multi-task framework that converts diverse protein prediction tasks into autoregressive next-token prediction using pre-trained protein language model encoders.
Prot2Token is a unified framework for protein prediction that reformulates diverse biological tasks as a single autoregressive next-token prediction problem. Developed by Mahdi Pourmirzaei, Duolin Wang, Dong Xu, and collaborators at the University of Missouri, the framework was introduced in a 2024 bioRxiv preprint and expanded in a 2025 arXiv revision. Its central insight is that nearly every protein prediction task — whether classification, regression, binding site detection, or structure prediction — can be expressed as a sequence of tokens, allowing a single general-purpose decoder to handle them all without task-specific architectural modifications.
The approach combines established protein language model encoders, primarily ESM2, with a lightweight autoregressive transformer decoder. The decoder is conditioned on encoder embeddings and guided by learnable task tokens, allowing the model to distinguish between different prediction objectives during a single multi-task training run. Target labels are tokenized in task-appropriate ways: class labels become discrete tokens, regression values are encoded digit-by-digit as character sequences, binding site residues are represented as sorted residue indices, and 3D structural coordinates are encoded using VQ-VAE structural tokens.
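The label-tokenization schemes described above can be sketched in a few lines. This is an illustrative sketch only: token names such as `<stability>` and `<eos>`, the digit formatting, and the helper names are assumptions for illustration, not the paper's actual vocabulary.

```python
# Illustrative sketch of Prot2Token-style label tokenization.
# Token names ("<eos>", task tokens) and formatting choices are
# hypothetical; the paper defines its own vocabularies.

def tokenize_classification(label: str) -> list[str]:
    # A class label maps to a single discrete token.
    return [f"<{label}>"]

def tokenize_regression(value: float, precision: int = 4) -> list[str]:
    # A regression target is serialized digit-by-digit as character tokens.
    return list(f"{value:.{precision}f}")

def tokenize_binding_sites(residue_indices: list[int]) -> list[str]:
    # Binding-site residues become a sorted sequence of residue-index tokens.
    return [str(i) for i in sorted(residue_indices)]

def build_target(task_token: str, label_tokens: list[str]) -> list[str]:
    # The decoder is steered by a learnable task token, then generates
    # the label tokens autoregressively until an end-of-sequence token.
    return [task_token] + label_tokens + ["<eos>"]
```

For example, `build_target("<stability>", tokenize_regression(0.93))` yields a token sequence the decoder can be trained to emit with ordinary next-token cross-entropy, which is what lets one decoder serve classification, regression, and site-prediction tasks alike.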
A key practical advantage of Prot2Token is its dramatically reduced inference time for structure-related tasks. By compressing structure prediction into the decoder's token generation pass, the framework achieves roughly 1,000-fold speedup over AlphaFold2 with MSA (1–2 seconds versus 18–25 minutes for a 384-residue protein), at the cost of some accuracy relative to specialized state-of-the-art methods. Across the broader task suite, Prot2Token matches or exceeds specialist models on many benchmarks, while offering the engineering advantage of a single deployable system.
A pip-installable package (prot2token) provides a simple interface for running inference on supported tasks without configuring training infrastructure.
Architecturally, Prot2Token pairs a frozen or fine-tuned ESM2 protein encoder with a causal transformer decoder connected via cross-attention. Decoder configurations range from 4 layers and 8 attention heads (Prot2Token-A, paired with ESM2-35M) to 16 layers, 16 heads, and a feed-forward dimension of 5,120 (Prot2Token-D, paired with ESM2-3B). FlashAttention-2 is used throughout for memory efficiency. Training uses the AdamW optimizer with cosine annealing between 1e-6 and 5e-5 after 256 warmup steps. Full multi-task training in the largest configuration requires four NVIDIA A100 80GB GPUs for approximately four days.
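The reported learning-rate schedule (256 warmup steps, cosine annealing between 1e-6 and 5e-5) can be sketched as follows. The linear-warmup-to-peak-then-cosine-decay shape is an assumption; the source states only the endpoints and the warmup length.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup: int = 256,
               peak: float = 5e-5, floor: float = 1e-6) -> float:
    """Sketch of the reported schedule: linear warmup for 256 steps,
    then cosine decay from peak 5e-5 to floor 1e-6. The warmup-to-peak
    shape is assumed, not stated in the paper."""
    if step < warmup:
        # Linear warmup: ramp from ~0 up to the peak learning rate.
        return peak * (step + 1) / warmup
    # Cosine factor goes 1 -> 0 as training progresses past warmup.
    progress = (step - warmup) / max(1, total_steps - warmup)
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (peak - floor) * cos
```

A schedule like this would typically be passed to a per-step `LambdaLR`-style hook around AdamW; the function itself is framework-agnostic.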
Across benchmark evaluations, Prot2Token achieves a Spearman correlation of 0.9294 on ProteinGym mutation stability (versus 0.613 for the prior best), a fluorescence Spearman of 0.78 with multi-task learning (versus 0.679 single-task), and an enzyme reaction classification accuracy improvement of 7.5 percentage points from multi-task learning. For localization prediction (DeepLoc 2.0), the model achieves a macro-F1 of 0.5364 versus a 0.46 baseline. Structure prediction on CAMEO 2024 yields a TM-score of 0.54, below ESMFold (0.79) but achieved roughly 1,000 times faster. The kinase phosphorylation site task reaches an F1 of 0.4966, outperforming GPS 6.0 (0.3076).
Prot2Token is suited for research groups that need to apply protein prediction across multiple task types without maintaining separate specialized pipelines. It is particularly relevant for mutation effect prediction in protein engineering campaigns, subcellular localization annotation in proteomics workflows, post-translational modification (PTM) site identification, protein-ligand and protein-protein binding site prediction, and rapid coarse-grained 3D structure estimation where throughput is more important than peak accuracy. The pip-installable package makes it accessible for bioinformatics researchers without deep machine learning infrastructure experience.
Prot2Token contributes a conceptual shift toward generalist protein prediction systems, demonstrating that a single autoregressive decoder architecture — already dominant in natural language processing — can be adapted to the heterogeneous label spaces of computational biology. The framework's multi-task learning gains provide practical evidence that joint training across protein prediction objectives yields measurable improvements over single-task specialization. The work has already stimulated follow-on preprints extending the approach to protein-ligand binding site prediction, kinase-substrate phosphorylation, and protein-protein structure similarity via post-training alignment. A noted limitation is that sequence-to-sequence tasks (e.g., secondary structure prediction) occasionally produce outputs of incorrect length, and the authors caution that the current implementation is not yet robust enough for production or commercial deployment.
Pourmirzaei, M., et al. (2024) Prot2Token: A multi-task framework for protein language processing using autoregressive language modeling. bioRxiv.
DOI: 10.1101/2024.05.31.596915
Pourmirzaei, M., et al. (2025) Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction. arXiv.
DOI: 10.48550/arXiv.2505.20589