Protein language model that integrates Gene Ontology knowledge via contrastive learning, improving protein function prediction, protein-protein interaction prediction, and performance on TAPE benchmark tasks.
OntoProtein is a knowledge-enhanced protein language model developed by researchers at Zhejiang University and presented at ICLR 2022. It addresses a fundamental limitation of standard protein language models: they are trained exclusively on amino acid sequences, ignoring the rich structured biological knowledge that decades of experimental work have accumulated in curated databases such as Gene Ontology (GO). By fusing sequence-based learning with knowledge graph reasoning, OntoProtein produces protein representations that are more biologically grounded than those from sequence-only pretraining.
To enable this fusion, the authors constructed ProteinKG25, a large-scale knowledge graph that links GO terms to one another through ontological relationships (is-a, part-of) and to protein sequences through gene annotation records. During pretraining, the model jointly optimizes a masked language modeling objective over protein sequences and a knowledge graph embedding objective over GO relationships. A contrastive learning strategy with knowledge-aware negative sampling ties these two objectives together, ensuring that proteins are aligned with their correct GO annotations while being pushed away from semantically distant ones.
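To make the joint objective concrete, here is a minimal PyTorch sketch of one way the weighted combination could look. The in-batch contrastive formulation, the temperature, and the weighting term `lambda_ke` are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def joint_pretraining_loss(mlm_loss, protein_emb, go_emb,
                           lambda_ke=1.0, temperature=0.1):
    """Combine masked language modeling with a contrastive
    knowledge-embedding objective (illustrative sketch).

    protein_emb: (B, d) pooled embeddings of the proteins in the batch
    go_emb:      (B, d) embeddings of each protein's annotated GO term;
                 the other rows in the batch act as in-batch negatives.
    """
    # Cosine-similarity logits between every protein and every GO term.
    protein_emb = F.normalize(protein_emb, dim=-1)
    go_emb = F.normalize(go_emb, dim=-1)
    logits = protein_emb @ go_emb.t() / temperature

    # Each protein's positive GO term sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    ke_loss = F.cross_entropy(logits, targets)

    # Weighted sum of the two pretraining objectives.
    return mlm_loss + lambda_ke * ke_loss
```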
OntoProtein builds on ProtBERT — a BERT-style transformer pretrained on UniRef100 protein sequences — as its sequence encoder, and incorporates PubMedBERT for encoding GO term text descriptions. This design allows it to leverage strong pretrained sequence representations while extending them with structured functional knowledge.
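As a concrete starting point, both encoders can be instantiated from their public Hugging Face checkpoints. The model ids below are the standard ProtTrans and PubMedBERT releases; treating them as drop-in equivalents of the paper's encoders is a simplifying assumption.

```python
from transformers import AutoModel, AutoTokenizer

# Sequence encoder: ProtBERT from the ProtTrans release.
seq_tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
seq_encoder = AutoModel.from_pretrained("Rostlab/prot_bert")

# Text encoder for GO term names and definitions: PubMedBERT.
go_tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
go_encoder = AutoModel.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)

# The ProtBERT vocabulary expects amino acids separated by spaces.
tokens = seq_tokenizer("M K T A Y I A K Q R", return_tensors="pt")
protein_repr = seq_encoder(**tokens).last_hidden_state  # (1, L, 1024)
```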
Architecturally, OntoProtein pairs this transformer backbone with a knowledge graph encoder for GO terms: the sequence encoder processes amino acid tokens with standard multi-head self-attention and masked language modeling, while the knowledge graph encoder represents GO entities and relations as dense embeddings trained with a contrastive ranking loss.
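A common way to train such relation embeddings is a TransE-style margin ranking loss over triplets of GO terms. The sketch below uses that formulation purely as an illustration; OntoProtein's exact scoring function and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class GoGraphEmbedding(nn.Module):
    """TransE-style embeddings for (head, relation, tail) GO triplets
    (illustrative sketch; the paper's scoring function may differ)."""

    def __init__(self, num_entities, num_relations, dim=256, margin=1.0):
        super().__init__()
        self.entities = nn.Embedding(num_entities, dim)
        self.relations = nn.Embedding(num_relations, dim)
        self.margin = margin

    def score(self, head, rel, tail):
        # Lower distance means a more plausible triplet under TransE.
        h = self.entities(head)
        r = self.relations(rel)
        t = self.entities(tail)
        return (h + r - t).norm(p=2, dim=-1)

    def forward(self, pos, neg):
        # pos/neg: (batch, 3) integer tensors of (head, relation, tail) ids.
        pos_score = self.score(pos[:, 0], pos[:, 1], pos[:, 2])
        neg_score = self.score(neg[:, 0], neg[:, 1], neg[:, 2])
        # Margin ranking: positives should score lower than negatives.
        return torch.clamp(self.margin + pos_score - neg_score, min=0).mean()
```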
ProteinKG25 contains GO-GO triplets capturing ontological hierarchy and protein-GO triplets derived from gene annotation databases. Knowledge-aware negative sampling selects GO terms that are semantically distant from the positive annotation according to the GO graph, sharpening the contrastive signal. The two training objectives are combined through a weighted loss that is optimized jointly via backpropagation. The framework supports distributed training using DeepSpeed and is implemented in PyTorch with the Hugging Face Transformers library. Pretrained model weights are publicly available on Hugging Face.
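To illustrate the negative sampling idea, here is a hedged sketch that excludes a positive GO term's ancestors in the ontology before drawing negatives. The `go_parents` mapping and the ancestor-exclusion heuristic are hypothetical simplifications of whatever distance criterion the paper actually uses.

```python
import random

def knowledge_aware_negatives(positive_go, go_parents, all_go_terms, k=5):
    """Sample GO terms far from the positive annotation in the ontology
    (illustrative sketch; the paper's exact heuristic may differ).

    go_parents: dict mapping each GO id to its is-a/part-of parents,
                used here to exclude the positive term's ancestors.
    """
    # Collect the positive term and all of its ancestors in the GO DAG.
    related, frontier = {positive_go}, [positive_go]
    while frontier:
        term = frontier.pop()
        for parent in go_parents.get(term, ()):
            if parent not in related:
                related.add(parent)
                frontier.append(parent)

    # Negatives are drawn only from terms outside that related set.
    candidates = [t for t in all_go_terms if t not in related]
    return random.sample(candidates, k)
```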
OntoProtein is well suited for tasks where functional context improves predictive accuracy beyond what sequence alone provides. Researchers can apply it to protein function annotation (predicting biological process, molecular function, and cellular component GO terms for uncharacterized proteins) and to protein-protein interaction prediction, where shared functional relationships captured by GO provide informative priors. It also improves performance on structural property prediction tasks from the TAPE benchmark, making it useful in workflows that screen large sequence libraries for stability or secondary structure composition. Teams working on drug target characterization or pathway analysis can use OntoProtein embeddings as richer feature representations for downstream classifiers without retraining the full model from scratch, as sketched below.
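For teams that only need embeddings, extraction might look like the following minimal sketch, assuming the publicly released checkpoint id `zjunlp/OntoProtein` and simple mean pooling over residue hidden states (one reasonable pooling choice among several).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint id assumed from the authors' Hugging Face release.
MODEL_ID = "zjunlp/OntoProtein"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Mean-pooled OntoProtein embedding for one protein sequence."""
    # The ProtBERT-style tokenizer expects space-separated residues.
    inputs = tokenizer(" ".join(sequence), return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, L, d)
    # Simple mean pooling (includes special tokens; fine for a sketch).
    return hidden.mean(dim=1).squeeze(0)            # (d,)

features = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
# `features` can now feed a scikit-learn or PyTorch classifier head.
```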
OntoProtein established a clear proof of concept that structured biological knowledge graphs can meaningfully improve protein language model representations, influencing subsequent work on knowledge-augmented pretraining in the protein domain. Its acceptance at ICLR 2022 brought the approach to the attention of the broader machine learning community at a time when protein language models were gaining rapid adoption. By releasing ProteinKG25 alongside the model weights, the team provided a resource that other groups could build on for knowledge-grounded protein representation learning. A key limitation is that the current framework integrates GO-level functional knowledge but does not incorporate structural databases such as the PDB, leaving room for future models to combine sequence, function, and structure in a unified pretraining objective.