Multi-modal protein language model that jointly learns from protein sequences and biomedical text, enabling zero-shot functional prediction and retrieval.
ProtST is a multi-modal protein language model developed by the DeepGraphLearning group that bridges protein sequence representations with biomedical natural language descriptions. Introduced in early 2023, it addresses a fundamental gap in conventional protein language models: sequence-only models effectively capture co-evolutionary patterns, but they lack explicit grounding in the biological functions those sequences encode. ProtST remedies this by jointly learning from protein sequences and curated textual descriptions of protein function, subcellular location, molecular mechanisms, and involvement in biological processes.
The core innovation is the construction of the ProtDescribe dataset, which pairs protein sequences with multi-attribute textual property annotations drawn from the curated Swiss-Prot database. Using this dataset, ProtST employs a three-objective pre-training strategy: unimodal masked prediction, which recovers masked amino acids from sequence context alone; multimodal representation alignment, which contrastively aligns the two modalities in a shared embedding space; and cross-modal masked prediction, which recovers masked tokens using fused information from both modalities. Together these objectives produce representations that are simultaneously sequence-aware and functionally interpretable.
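To make the three-objective recipe concrete, here is a minimal PyTorch sketch of how the losses could be composed. The tensor interfaces, the -100 masking convention, and the equal loss weighting are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def protst_style_loss(seq_mlm_logits, seq_mlm_labels,
                      seq_emb, txt_emb,
                      fused_logits, fused_labels,
                      temperature=0.07):
    """Compose the three pre-training objectives described above.

    Assumed shapes: *_logits are flattened to (num_tokens, vocab_size),
    *_labels to (num_tokens,) with -100 at unmasked positions; seq_emb
    and txt_emb are (batch, dim) pooled representations.
    """
    # (1) Unimodal masked prediction: recover masked amino acids from
    #     sequence context alone.
    loss_unimodal = F.cross_entropy(seq_mlm_logits, seq_mlm_labels,
                                    ignore_index=-100)

    # (2) Multimodal representation alignment: symmetric InfoNCE over
    #     the batch, treating matched sequence-text pairs as positives.
    z_s = F.normalize(seq_emb, dim=-1)
    z_t = F.normalize(txt_emb, dim=-1)
    sim = z_s @ z_t.t() / temperature                       # (B, B) similarities
    targets = torch.arange(sim.size(0), device=sim.device)  # diagonal = positives
    loss_align = (F.cross_entropy(sim, targets)
                  + F.cross_entropy(sim.t(), targets)) / 2

    # (3) Cross-modal masked prediction: recover masked tokens from the
    #     fused representation of both modalities.
    loss_cross = F.cross_entropy(fused_logits, fused_labels, ignore_index=-100)

    # Equal weighting is an illustrative assumption, not the paper's setting.
    return loss_unimodal + loss_align + loss_cross
```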
The result is a model that can be used in both supervised and zero-shot settings, including retrieving proteins from large databases using free-text functional queries, without requiring explicit function labels at inference time.
ProtST combines two pretrained language model backbones: a protein language model (such as ESM) for encoding amino acid sequences and a biomedical language model (such as PubMedBERT) for encoding textual descriptions. Projection heads map each encoder's pooled output into a shared latent space, where the multimodal representation alignment objective contrastively pulls matched sequence-text pairs together and pushes mismatched pairs apart. A fusion module combines token-level features from the two encoders to support the cross-modal masked prediction objective, while unimodal masked prediction preserves the sequence backbone's single-modality representation power.
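A plausible skeleton of this dual-encoder layout is sketched below, assuming each backbone returns one pooled vector per input; the projection dimensionality, pooling strategy, and `protein_encoder`/`text_encoder` stand-ins (for ESM- and PubMedBERT-style models) are all assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Dual-encoder skeleton: pretrained backbones plus projection heads
    into a shared embedding space. `protein_encoder` and `text_encoder`
    are stand-ins assumed to return one pooled vector per input."""

    def __init__(self, protein_encoder, text_encoder,
                 protein_dim, text_dim, shared_dim=512):
        super().__init__()
        self.protein_encoder = protein_encoder
        self.text_encoder = text_encoder
        self.protein_proj = nn.Linear(protein_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def encode_protein(self, seq_tokens):
        h = self.protein_encoder(seq_tokens)          # (B, protein_dim)
        return F.normalize(self.protein_proj(h), dim=-1)

    def encode_text(self, txt_tokens):
        h = self.text_encoder(txt_tokens)             # (B, text_dim)
        return F.normalize(self.text_proj(h), dim=-1)
```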
Training data comes from the ProtDescribe dataset, which aggregates protein-text pairs from curated sources to provide multi-attribute annotations spanning a broad range of proteins and functional categories. At inference time, zero-shot classification is performed by encoding a text description of each candidate class label and ranking the labels by cosine similarity to the query protein's sequence embedding, with no additional training on the target task.
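Using the hypothetical `encode_protein`/`encode_text` interface sketched above, zero-shot classification reduces to a cosine-similarity ranking; the prompt wording for label descriptions is likewise an assumption for illustration.

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, seq_tokens, label_tokens, label_names):
    """Rank candidate class descriptions by cosine similarity to a protein.

    `model` follows the hypothetical DualEncoder interface above;
    `seq_tokens` encodes one protein, `label_tokens` a batch of K label
    descriptions (e.g. "SUBCELLULAR LOCATION: nucleus"), already tokenized.
    """
    z_seq = model.encode_protein(seq_tokens)      # (1, d), L2-normalized
    z_txt = model.encode_text(label_tokens)       # (K, d), L2-normalized
    scores = (z_seq @ z_txt.t()).squeeze(0)       # (K,) cosine similarities
    order = scores.argsort(descending=True)
    return [(label_names[i], float(scores[i])) for i in order.tolist()]
```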
ProtST is well-suited for researchers who need to interrogate protein function beyond what raw sequence similarity can reveal. Computational biologists can use it to classify proteins into functional categories without hand-labeled training data, making it valuable in exploratory settings where annotation is sparse or costly. Database curators and bioinformaticians can leverage its bidirectional retrieval capability to find proteins matching natural-language functional queries, or to surface relevant text descriptions for unannotated sequences. More broadly, ProtST's joint representation space enables downstream transfer learning for tasks such as enzyme function prediction, subcellular localization, and protein-protein interaction classification, with the multimodal pre-training providing a richer initialization than sequence-only models.
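Text-to-protein retrieval follows the same pattern over a precomputed embedding index. Here is a minimal sketch, assuming the database embeddings were built offline with the same hypothetical interface.

```python
import torch

@torch.no_grad()
def text_to_protein_retrieval(model, query_tokens, db_embeddings, db_ids, top_k=10):
    """Return the top-k proteins for a free-text functional query.

    `db_embeddings` is an (N, d) matrix of L2-normalized sequence embeddings
    precomputed offline with `model.encode_protein`; `db_ids` holds the
    matching N protein identifiers. Interface names are assumptions.
    """
    z_q = model.encode_text(query_tokens)            # (1, d)
    sims = (db_embeddings @ z_q.t()).squeeze(-1)     # (N,) cosine similarities
    top = sims.topk(min(top_k, sims.numel()))
    return [(db_ids[i], float(v))
            for i, v in zip(top.indices.tolist(), top.values.tolist())]
```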
ProtST represents an important step toward grounding protein representation learning in biological knowledge expressed in natural language, an approach that has proven transformative in computer vision and general NLP. By demonstrating that text-guided pre-training meaningfully improves both zero-shot and supervised protein understanding, it opened a research direction that aligns with the broader trend toward multi-modal foundation models. The model has practical limitations worth noting: its zero-shot performance depends on the quality and completeness of the ProtDescribe annotations, and the choice of protein and biomedical language model backbones sets the ceiling of its capabilities. Proteins with rare functions or sparse annotation coverage in biomedical text may not benefit as substantially from the multimodal alignment. Nonetheless, ProtST established a compelling proof of concept that sequence and language representations can be productively unified for protein science.
Xu, M., Yuan, X., Miret, S., & Tang, J. (2023). ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. In International Conference on Machine Learning (ICML 2023). DOI: 10.48550/arXiv.2301.12040