Protein language model that integrates Gene Ontology knowledge via contrastive learning, improving protein function prediction, protein-protein interaction prediction, and performance on TAPE benchmark tasks.
OntoProtein is a knowledge-enhanced protein language model developed by researchers at Zhejiang University and presented at ICLR 2022. It addresses a fundamental limitation of standard protein language models: they are trained exclusively on amino acid sequences, ignoring the rich structured biological knowledge that decades of experimental work have accumulated in curated databases such as Gene Ontology (GO). By fusing sequence-based learning with knowledge graph reasoning, OntoProtein produces protein representations that are more biologically grounded than those from sequence-only pretraining.
To enable this fusion, the authors constructed ProteinKG25, a large-scale knowledge graph that links GO terms to one another through ontological relationships (is-a, part-of) and to protein sequences through gene annotation records. During pretraining, the model jointly optimizes a masked language modeling objective over protein sequences and a knowledge graph embedding objective over GO relationships. A contrastive learning strategy with knowledge-aware negative sampling ties these two objectives together, ensuring that proteins are aligned with their correct GO annotations while being pushed away from semantically distant ones.
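To make the joint objective concrete, here is a minimal PyTorch sketch of one way the weighted combination could look. The in-batch contrastive formulation, the temperature, and the weighting term `lambda_ke` are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(mlm_loss, protein_emb, go_emb,
                           lambda_ke=1.0, temperature=0.1):
    """Combine masked language modeling with a contrastive
    knowledge-embedding objective (illustrative sketch).

    protein_emb: (B, d) pooled embeddings of the proteins in the batch
    go_emb:      (B, d) embeddings of each protein's annotated GO term;
                 the other rows in the batch act as in-batch negatives.
    """
    # Cosine-similarity logits between every protein and every GO term.
    protein_emb = F.normalize(protein_emb, dim=-1)
    go_emb = F.normalize(go_emb, dim=-1)
    logits = protein_emb @ go_emb.t() / temperature

    # Each protein's positive GO term sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    ke_loss = F.cross_entropy(logits, targets)

    # Weighted sum of the two pretraining objectives.
    return mlm_loss + lambda_ke * ke_loss
```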
OntoProtein builds on ProtBERT — a BERT-style transformer pretrained on UniRef100 protein sequences — as its sequence encoder, and incorporates PubMedBERT for encoding GO term text descriptions. This design allows it to leverage strong pretrained sequence representations while extending them with structured functional knowledge.
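As a concrete starting point, both encoders can be instantiated from their public Hugging Face checkpoints. The model ids below are the standard ProtTrans and PubMedBERT releases; treating them as drop-in equivalents of the paper's encoders is a simplifying assumption.

```python
from transformers import AutoModel, AutoTokenizer

# Sequence encoder: ProtBERT from the ProtTrans release.
seq_tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
seq_encoder = AutoModel.from_pretrained("Rostlab/prot_bert")

# Text encoder for GO term names and definitions: PubMedBERT.
go_tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
go_encoder = AutoModel.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)

# The ProtBERT vocabulary expects amino acids separated by spaces.
tokens = seq_tokenizer("M K T A Y I A K Q R", return_tensors="pt")
protein_repr = seq_encoder(**tokens).last_hidden_state  # (1, L, 1024)
```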
Architecturally, OntoProtein pairs this transformer backbone with a knowledge graph encoder for GO terms: the sequence encoder processes amino acid tokens with standard multi-head self-attention and masked language modeling, while the knowledge graph encoder represents GO entities and relations as dense embeddings trained with a contrastive ranking loss.
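A common way to train such relation embeddings is a TransE-style margin ranking loss over triplets of GO terms. The sketch below uses that formulation purely as an illustration; OntoProtein's exact scoring function and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class GoGraphEmbedding(nn.Module):
    """TransE-style embeddings for (head, relation, tail) GO triplets
    (illustrative sketch; the paper's scoring function may differ)."""

    def __init__(self, num_entities, num_relations, dim=256, margin=1.0):
        super().__init__()
        self.entities = nn.Embedding(num_entities, dim)
        self.relations = nn.Embedding(num_relations, dim)
        self.margin = margin

    def score(self, head, rel, tail):
        # Lower distance means a more plausible triplet under TransE.
        h = self.entities(head)
        r = self.relations(rel)
        t = self.entities(tail)
        return (h + r - t).norm(p=2, dim=-1)

    def forward(self, pos, neg):
        # pos/neg: (batch, 3) integer tensors of (head, relation, tail) ids.
        pos_score = self.score(pos[:, 0], pos[:, 1], pos[:, 2])
        neg_score = self.score(neg[:, 0], neg[:, 1], neg[:, 2])
        # Margin ranking: positives should score lower than negatives.
        return torch.clamp(self.margin + pos_score - neg_score, min=0).mean()
```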
ProteinKG25 contains GO-GO triplets capturing ontological hierarchy and protein-GO triplets derived from gene annotation databases. Knowledge-aware negative sampling selects GO terms that are semantically distant from the positive annotation according to the GO graph, sharpening the contrastive signal. The two training objectives are combined through a weighted loss that is optimized jointly via backpropagation. The framework supports distributed training using DeepSpeed and is implemented in PyTorch with the Hugging Face Transformers library. Pretrained model weights are publicly available on Hugging Face.
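To illustrate the negative sampling idea, here is a hedged sketch that excludes a positive GO term's ancestors in the ontology before drawing negatives. The `go_parents` mapping and the ancestor-exclusion heuristic are hypothetical simplifications of whatever distance criterion the paper actually uses.

```python
import random

def knowledge_aware_negatives(positive_go, go_parents, all_go_terms, k=5):
    """Sample GO terms far from the positive annotation in the ontology
    (illustrative sketch; the paper's exact heuristic may differ).

    go_parents: dict mapping each GO id to its is-a/part-of parents,
                used here to exclude the positive term's ancestors.
    """
    # Collect the positive term and all of its ancestors in the GO DAG.
    related, frontier = {positive_go}, [positive_go]
    while frontier:
        term = frontier.pop()
        for parent in go_parents.get(term, ()):
            if parent not in related:
                related.add(parent)
                frontier.append(parent)

    # Negatives are drawn only from terms outside that related set.
    candidates = [t for t in all_go_terms if t not in related]
    return random.sample(candidates, k)
```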
OntoProtein is well suited for tasks where functional context improves predictive accuracy beyond what sequence alone provides. Researchers can apply it to protein function annotation (predicting biological process, molecular function, and cellular component GO terms for uncharacterized proteins) and to protein-protein interaction prediction, where shared functional relationships captured by GO provide informative priors. It also improves performance on structural property prediction tasks from the TAPE benchmark, making it useful in workflows that screen large sequence libraries for stability or secondary structure composition. Teams working on drug target characterization or pathway analysis can use OntoProtein embeddings as richer feature representations for downstream classifiers without retraining the full model from scratch, as sketched below.
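For teams that only need embeddings, extraction might look like the following minimal sketch, assuming the publicly released checkpoint id `zjunlp/OntoProtein` and simple mean pooling over residue hidden states (one reasonable pooling choice among several).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint id assumed from the authors' Hugging Face release.
MODEL_ID = "zjunlp/OntoProtein"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pooled OntoProtein embedding for one protein sequence."""
    # The ProtBERT-style tokenizer expects space-separated residues.
    inputs = tokenizer(" ".join(sequence), return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, L, d)
    # Simple mean pooling (includes special tokens; fine for a sketch).
    return hidden.mean(dim=1).squeeze(0)            # (d,)

features = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# `features` can now feed a scikit-learn or PyTorch classifier head.
```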
OntoProtein established a clear proof of concept that structured biological knowledge graphs can meaningfully improve protein language model representations, influencing subsequent work on knowledge-augmented pretraining in the protein domain. Its acceptance at ICLR 2022 brought the approach to the attention of the broader machine learning community at a time when protein language models were gaining rapid adoption. By releasing ProteinKG25 alongside the model weights, the team provided a resource that other groups could build on for knowledge-grounded protein representation learning. A key limitation is that the current framework integrates GO-level functional knowledge but does not incorporate structural databases such as the PDB, leaving room for future models to combine sequence, function, and structure in a unified pretraining objective.