Multi-modal protein language model that jointly learns from protein sequences and biomedical text, enabling zero-shot functional prediction and retrieval.
ProtST is a multi-modal protein language model developed by the DeepGraphLearning group that bridges protein sequence representations with biomedical natural language descriptions. Introduced in early 2023, it addresses a fundamental gap in conventional protein language models: sequence-only models effectively capture co-evolutionary patterns, but they lack explicit grounding in the biological functions those sequences encode. ProtST remedies this by jointly learning from protein sequences and curated textual descriptions of protein function, subcellular location, molecular mechanisms, and involvement in biological processes.
The core innovation is the construction of the ProtDescribe dataset, which pairs protein sequences with multi-attribute textual property annotations drawn from the curated Swiss-Prot database. Using this dataset, ProtST employs a three-objective pre-training strategy: unimodal masked prediction, which recovers masked amino acids from sequence context alone; multimodal representation alignment, which contrastively aligns the two modalities in a shared embedding space; and cross-modal masked prediction, which recovers masked tokens using fused information from both modalities. Together these objectives produce representations that are simultaneously sequence-aware and functionally interpretable.
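To make the three-objective recipe concrete, here is a minimal PyTorch sketch of how the losses could be composed. The tensor interfaces, the -100 masking convention, and the equal loss weighting are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def protst_style_loss(seq_mlm_logits, seq_mlm_labels,
                      seq_emb, txt_emb,
                      fused_logits, fused_labels,
                      temperature=0.07):
    """Compose the three pre-training objectives described above.

    Assumed shapes: *_logits are flattened to (num_tokens, vocab_size),
    *_labels to (num_tokens,) with -100 at unmasked positions; seq_emb
    and txt_emb are (batch, dim) pooled representations.
    """
    # (1) Unimodal masked prediction: recover masked amino acids from
    #     sequence context alone.
    loss_unimodal = F.cross_entropy(seq_mlm_logits, seq_mlm_labels,
                                    ignore_index=-100)

    # (2) Multimodal representation alignment: symmetric InfoNCE over
    #     the batch, treating matched sequence-text pairs as positives.
    z_s = F.normalize(seq_emb, dim=-1)
    z_t = F.normalize(txt_emb, dim=-1)
    sim = z_s @ z_t.t() / temperature                       # (B, B) similarities
    targets = torch.arange(sim.size(0), device=sim.device)  # diagonal = positives
    loss_align = (F.cross_entropy(sim, targets)
                  + F.cross_entropy(sim.t(), targets)) / 2

    # (3) Cross-modal masked prediction: recover masked tokens from the
    #     fused representation of both modalities.
    loss_cross = F.cross_entropy(fused_logits, fused_labels, ignore_index=-100)

    # Equal weighting is an illustrative assumption, not the paper's setting.
    return loss_unimodal + loss_align + loss_cross
```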
The result is a model that can be used in both supervised and zero-shot settings, including retrieving proteins from large databases using free-text functional queries, without requiring explicit function labels at inference time.
ProtST combines two pretrained language model backbones: a protein language model (such as ESM) for encoding amino acid sequences and a biomedical language model (such as PubMedBERT) for encoding textual descriptions. Projection heads map each encoder's pooled output into a shared latent space, where the multimodal representation alignment objective contrastively pulls matched sequence-text pairs together and pushes mismatched pairs apart. A fusion module combines token-level features from the two encoders to support the cross-modal masked prediction objective, while unimodal masked prediction preserves the sequence backbone's single-modality representation power.
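A plausible skeleton of this dual-encoder layout is sketched below, assuming each backbone returns one pooled vector per input; the projection dimensionality, pooling strategy, and `protein_encoder`/`text_encoder` stand-ins (for ESM- and PubMedBERT-style models) are all assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Dual-encoder skeleton: pretrained backbones plus projection heads
    into a shared embedding space. `protein_encoder` and `text_encoder`
    are stand-ins assumed to return one pooled vector per input."""

    def __init__(self, protein_encoder, text_encoder,
                 protein_dim, text_dim, shared_dim=512):
        super().__init__()
        self.protein_encoder = protein_encoder
        self.text_encoder = text_encoder
        self.protein_proj = nn.Linear(protein_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def encode_protein(self, seq_tokens):
        h = self.protein_encoder(seq_tokens)          # (B, protein_dim)
        return F.normalize(self.protein_proj(h), dim=-1)

    def encode_text(self, txt_tokens):
        h = self.text_encoder(txt_tokens)             # (B, text_dim)
        return F.normalize(self.text_proj(h), dim=-1)
```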
Training data comes from the ProtDescribe dataset, which aggregates protein-text pairs from curated sources to provide multi-attribute annotations spanning a broad range of proteins and functional categories. At inference time, zero-shot classification is performed by encoding a text description of each candidate class label and ranking the labels by cosine similarity to the query protein's sequence embedding, with no additional training on the target task.
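Using the hypothetical `encode_protein`/`encode_text` interface sketched above, zero-shot classification reduces to a cosine-similarity ranking; the prompt wording for label descriptions is likewise an assumption for illustration.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, seq_tokens, label_tokens, label_names):
    """Rank candidate class descriptions by cosine similarity to a protein.

    `model` follows the hypothetical DualEncoder interface above;
    `seq_tokens` encodes one protein, `label_tokens` a batch of K label
    descriptions (e.g. "SUBCELLULAR LOCATION: nucleus"), already tokenized.
    """
    z_seq = model.encode_protein(seq_tokens)      # (1, d), L2-normalized
    z_txt = model.encode_text(label_tokens)       # (K, d), L2-normalized
    scores = (z_seq @ z_txt.t()).squeeze(0)       # (K,) cosine similarities
    order = scores.argsort(descending=True)
    return [(label_names[i], float(scores[i])) for i in order.tolist()]
```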
ProtST is well-suited for researchers who need to interrogate protein function beyond what raw sequence similarity can reveal. Computational biologists can use it to classify proteins into functional categories without hand-labeled training data, making it valuable in exploratory settings where annotation is sparse or costly. Database curators and bioinformaticians can leverage its bidirectional retrieval capability to find proteins matching natural-language functional queries, or to surface relevant text descriptions for unannotated sequences. More broadly, ProtST's joint representation space enables downstream transfer learning for tasks such as enzyme function prediction, subcellular localization, and protein-protein interaction classification, with the multimodal pre-training providing a richer initialization than sequence-only models.
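Text-to-protein retrieval follows the same pattern over a precomputed embedding index. Here is a minimal sketch, assuming the database embeddings were built offline with the same hypothetical interface.

```python
import torch

@torch.no_grad()
def text_to_protein_retrieval(model, query_tokens, db_embeddings, db_ids, top_k=10):
    """Return the top-k proteins for a free-text functional query.

    `db_embeddings` is an (N, d) matrix of L2-normalized sequence embeddings
    precomputed offline with `model.encode_protein`; `db_ids` holds the
    matching N protein identifiers. Interface names are assumptions.
    """
    z_q = model.encode_text(query_tokens)            # (1, d)
    sims = (db_embeddings @ z_q.t()).squeeze(-1)     # (N,) cosine similarities
    top = sims.topk(min(top_k, sims.numel()))
    return [(db_ids[i], float(v))
            for i, v in zip(top.indices.tolist(), top.values.tolist())]
```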
ProtST represents an important step toward grounding protein representation learning in biological knowledge expressed in natural language, an approach that has proven transformative in computer vision and general NLP. By demonstrating that text-guided pre-training meaningfully improves both zero-shot and supervised protein understanding, it opened a research direction that aligns with the broader trend toward multi-modal foundation models. The model has practical limitations worth noting: its zero-shot performance depends on the quality and completeness of the ProtDescribe annotations, and the choice of protein and biomedical language model backbones sets the ceiling of its capabilities. Proteins with rare functions or sparse annotation coverage in biomedical text may not benefit as substantially from the multimodal alignment. Nonetheless, ProtST established a compelling proof of concept that sequence and language representations can be productively unified for protein science.
Xu, M., Yuan, X., Miret, S., & Tang, J. (2023). ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. In International Conference on Machine Learning (ICML 2023). DOI: 10.48550/arXiv.2301.12040