A unified multi-task framework that converts diverse protein prediction tasks into autoregressive next-token prediction using pre-trained protein language model encoders.
Prot2Token is a unified framework for protein prediction that reformulates diverse biological tasks as a single autoregressive next-token prediction problem. Developed by Mahdi Pourmirzaei, Duolin Wang, Dong Xu, and collaborators at the University of Missouri, the framework was introduced in a 2024 bioRxiv preprint and expanded in a 2025 arXiv revision. Its central insight is that nearly every protein prediction task — whether classification, regression, binding site detection, or structure prediction — can be expressed as a sequence of tokens, allowing a single general-purpose decoder to handle them all without task-specific architectural modifications.
The approach combines established protein language model encoders, primarily ESM2, with a lightweight autoregressive transformer decoder. The decoder is conditioned on encoder embeddings and guided by learnable task tokens, allowing the model to distinguish between different prediction objectives during a single multi-task training run. Target labels are tokenized in task-appropriate ways: class labels become discrete tokens, regression values are encoded digit-by-digit as character sequences, binding site residues are represented as sorted residue indices, and 3D structural coordinates are encoded using VQ-VAE structural tokens.
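The label-tokenization schemes described above can be sketched in a few lines. This is an illustrative sketch only: token names such as `<stability>` and `<eos>`, the digit formatting, and the helper names are assumptions for illustration, not the paper's actual vocabulary.

```python
# Illustrative sketch of Prot2Token-style label tokenization.
# Token names ("<eos>", task tokens) and formatting choices are
# hypothetical; the paper defines its own vocabularies.

def tokenize_classification(label: str) -> list[str]:
    # A class label maps to a single discrete token.
    return [f"<{label}>"]

def tokenize_regression(value: float, precision: int = 4) -> list[str]:
    # A regression target is serialized digit-by-digit as character tokens.
    return list(f"{value:.{precision}f}")

def tokenize_binding_sites(residue_indices: list[int]) -> list[str]:
    # Binding-site residues become a sorted sequence of residue-index tokens.
    return [str(i) for i in sorted(residue_indices)]

def build_target(task_token: str, label_tokens: list[str]) -> list[str]:
    # The decoder is steered by a learnable task token, then generates
    # the label tokens autoregressively until an end-of-sequence token.
    return [task_token] + label_tokens + ["<eos>"]
```

For example, `build_target("<stability>", tokenize_regression(0.93))` yields a token sequence the decoder can be trained to emit with ordinary next-token cross-entropy, which is what lets one decoder serve classification, regression, and site-prediction tasks alike.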
A key practical advantage of Prot2Token is its dramatically reduced inference time for structure-related tasks. By compressing structure prediction into the decoder's token generation pass, the framework achieves roughly 1,000-fold speedup over AlphaFold2 with MSA (1–2 seconds versus 18–25 minutes for a 384-residue protein), at the cost of some accuracy relative to specialized state-of-the-art methods. Across the broader task suite, Prot2Token matches or exceeds specialist models on many benchmarks, while offering the engineering advantage of a single deployable system.
A pip-installable package (prot2token) provides a simple interface for running inference on supported tasks without configuring training infrastructure.
Architecturally, Prot2Token pairs a frozen or fine-tuned ESM2 protein encoder with a causal transformer decoder connected via cross-attention. Decoder configurations range from 4 layers and 8 attention heads (Prot2Token-A, paired with ESM2-35M) to 16 layers, 16 heads, and a feed-forward dimension of 5,120 (Prot2Token-D, paired with ESM2-3B). FlashAttention-2 is used throughout for memory efficiency. Training uses the AdamW optimizer with cosine annealing between 1e-6 and 5e-5 after 256 warmup steps. Full multi-task training in the largest configuration requires four NVIDIA A100 80GB GPUs for approximately four days.
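The reported learning-rate schedule (256 warmup steps, cosine annealing between 1e-6 and 5e-5) can be sketched as follows. The linear-warmup-to-peak-then-cosine-decay shape is an assumption; the source states only the endpoints and the warmup length.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup: int = 256,
               peak: float = 5e-5, floor: float = 1e-6) -> float:
    """Sketch of the reported schedule: linear warmup for 256 steps,
    then cosine decay from peak 5e-5 to floor 1e-6. The warmup-to-peak
    shape is assumed, not stated in the paper."""
    if step < warmup:
        # Linear warmup: ramp from ~0 up to the peak learning rate.
        return peak * (step + 1) / warmup
    # Cosine factor goes 1 -> 0 as training progresses past warmup.
    progress = (step - warmup) / max(1, total_steps - warmup)
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor + (peak - floor) * cos
```

A schedule like this would typically be passed to a per-step `LambdaLR`-style hook around AdamW; the function itself is framework-agnostic.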
Across benchmark evaluations, Prot2Token achieves a Spearman correlation of 0.9294 on ProteinGym mutation stability (versus 0.613 for the prior best), a fluorescence Spearman of 0.78 with multi-task learning (versus 0.679 single-task), and an enzyme reaction classification accuracy improvement of 7.5 percentage points from multi-task learning. For localization prediction (DeepLoc 2.0), the model achieves a macro-F1 of 0.5364 versus a 0.46 baseline. Structure prediction on CAMEO 2024 yields a TM-score of 0.54, below ESMFold (0.79) but achieved roughly 1,000 times faster. The kinase phosphorylation site task reaches an F1 of 0.4966, outperforming GPS 6.0 (0.3076).
Prot2Token is suited for research groups that need to apply protein prediction across multiple task types without maintaining separate specialized pipelines. It is particularly relevant for mutation effect prediction in protein engineering campaigns, subcellular localization annotation in proteomics workflows, post-translational modification (PTM) site identification, protein-ligand and protein-protein binding site prediction, and rapid coarse-grained 3D structure estimation where throughput is more important than peak accuracy. The pip-installable package makes it accessible for bioinformatics researchers without deep machine learning infrastructure experience.
Prot2Token contributes a conceptual shift toward generalist protein prediction systems, demonstrating that a single autoregressive decoder architecture — already dominant in natural language processing — can be adapted to the heterogeneous label spaces of computational biology. The framework's multi-task learning gains provide practical evidence that joint training across protein prediction objectives yields measurable improvements over single-task specialization. The work has already stimulated follow-on preprints extending the approach to protein-ligand binding site prediction, kinase-substrate phosphorylation, and protein-protein structure similarity via post-training alignment. A noted limitation is that sequence-to-sequence tasks (e.g., secondary structure prediction) occasionally produce outputs of incorrect length, and the authors caution that the current implementation is not yet robust enough for production or commercial deployment.
Pourmirzaei, M., et al. (2024) Prot2Token: A multi-task framework for protein language processing using autoregressive language modeling. bioRxiv.
DOI: 10.1101/2024.05.31.596915
Pourmirzaei, M., et al. (2025) Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction. arXiv.
DOI: 10.48550/arXiv.2505.20589