DeepGO

Protein function prediction models that assign Gene Ontology terms using language model embeddings and neuro-symbolic reasoning over GO axioms.

Released: January 2024

DeepGO is a family of deep learning methods developed by the Bio-Ontology Research Group at KAUST for predicting protein functions expressed as Gene Ontology (GO) annotations. The most recent version, DeepGO-SE (Semantic Entailment), frames protein function prediction as a logical reasoning problem: rather than treating GO terms as independent labels to be classified, the model treats function assignment as approximate semantic entailment over GO's formal axiom system, integrating pretrained protein language model embeddings with neuro-symbolic inference.

The series has developed over three generations. The original DeepGO (2017) applied deep learning to protein sequence and interaction data for ontology-aware classification. DeepGOPlus (2019) shifted to convolutional sequence-only models and reached competitive performance in the CAFA (Critical Assessment of Functional Annotation) challenge. DeepGO-SE (2024), published in Nature Machine Intelligence, is the latest generation. By exploiting more than 100,000 GO axioms encoding subsumption, disjointness, and domain-range constraints, it produces logically consistent predictions and is especially robust for proteins lacking sequence similarity to any annotated training example.

This capability matters because a large fraction of proteins in newly sequenced genomes and metagenomes have no close homologs in annotated databases. Homology transfer, the standard approach in BLAST-based annotation pipelines, fails precisely where it is needed most. DeepGO-SE is designed to operate reliably in this regime.

Key Features

Neuro-symbolic architecture: Combines pretrained protein language model embeddings with a symbolic reasoning layer operating over multiple approximate models of the Gene Ontology, integrating data-driven learning with formal logical constraints.
Semantic entailment framework: Formulates function prediction as checking whether a GO term statement is entailed by the model of the protein, ensuring predictions respect the logical structure of GO rather than treating terms as independent labels.
Ontology-aware consistency: Exploits GO's hierarchical axioms — subsumption, disjointness, and domain-range restrictions — so that predicted annotations are logically coherent across the full GO graph.
Multi-ontology support: Provides predictions across all three GO sub-ontologies — Molecular Function (MFO), Biological Process (BPO), and Cellular Component (CCO) — with sub-ontology-specific adaptations.
High-throughput inference: DeepGOPlus annotates approximately 40 protein sequences per second on standard hardware, enabling practical use for proteome-scale annotation projects.
Robustness to novel proteins: Explicitly evaluated and designed for low-homology proteins, where sequence-similarity-based methods produce few or no annotations.
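
The ontology-aware consistency feature can be illustrated with a minimal sketch. It is not DeepGO-SE's reasoning layer: it shows a simpler, post hoc technique (propagating each term's score up to its ancestors) that enforces the same subsumption constraint, namely that a protein annotated with a term is also annotated with every parent of that term. The four-term hierarchy, the scores, and the function names are illustrative assumptions, not data or code from the DeepGO repository.

```python
from typing import Dict, List, Set

# Toy is-a hierarchy (child -> direct parents). Illustrative only;
# a real pipeline would load these relations from the GO release.
PARENTS: Dict[str, List[str]] = {
    "protein_binding": ["binding"],
    "binding": ["molecular_function"],
    "catalytic_activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable from `term` by following is-a edges."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate_scores(
    scores: Dict[str, float], parents: Dict[str, List[str]]
) -> Dict[str, float]:
    """Raise each ancestor's score to at least the score of its descendants."""
    consistent = dict(scores)
    for term, score in scores.items():
        for ancestor in ancestors(term, parents):
            consistent[ancestor] = max(consistent.get(ancestor, 0.0), score)
    return consistent


# Raw scores violate subsumption: the child outscores its parent.
raw = {"protein_binding": 0.9, "binding": 0.4, "catalytic_activity": 0.2}
consistent = propagate_scores(raw, PARENTS)
```

After propagation, `binding` and `molecular_function` both carry a score of 0.9, so no child term is scored higher than any of its ancestors, while `catalytic_activity` keeps its original 0.2.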

Technical Details

DeepGO-SE employs a two-component architecture. A pretrained protein language model processes amino acid sequences and produces per-protein embeddings that capture sequence-level patterns and evolutionary context. These embeddings feed into a neuro-symbolic reasoning module that constructs approximate models of the Gene Ontology — finite interpretations that satisfy GO's axioms — and predicts the truth value of GO term membership statements for each protein. This formulation allows the model to propagate logical constraints across GO's hierarchical structure during inference rather than treating each term as an isolated binary classifier.
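
The idea of approximate semantic entailment can be sketched in a few lines. Under exact semantic entailment, a statement follows from a theory only if it holds in every model of that theory; with a finite set of learned approximate models, one way to approximate this is to accept a GO term only when its truth degree is high in all of them. The sketch below assumes each model yields a score in [0, 1] per term and aggregates by taking the minimum. That aggregation rule, the 0.5 threshold, and all names are assumptions made for illustration, not the published implementation.

```python
from typing import Dict, List


def approximate_entailment(
    model_scores: List[Dict[str, float]], threshold: float = 0.5
) -> Dict[str, bool]:
    """Accept a term only if every approximate model scores it >= threshold.

    `model_scores` holds one dict per approximate model, mapping a GO term
    to the degree of truth of "this protein has function <term>".
    A term missing from a model is treated as scored 0.0 by that model.
    """
    terms = set().union(*(scores.keys() for scores in model_scores))
    entailed: Dict[str, bool] = {}
    for term in terms:
        # The least favourable model decides, mirroring "true in all models".
        worst_case = min(scores.get(term, 0.0) for scores in model_scores)
        entailed[term] = worst_case >= threshold
    return entailed


# Three hypothetical approximate models scoring the same protein.
models = [
    {"binding": 0.90, "catalytic_activity": 0.7},
    {"binding": 0.80, "catalytic_activity": 0.3},
    {"binding": 0.95, "catalytic_activity": 0.6},
]
result = approximate_entailment(models)
```

Here `binding` is accepted because all three models agree, whereas `catalytic_activity` is rejected because one model scores it at 0.3. Taking a mean instead of a minimum would be a softer alternative with the same overall structure.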

Training uses experimentally validated protein-function associations from the Gene Ontology Annotation (GOA) database, which aggregates curated annotations across model organisms and multiple evidence codes. The loss function accounts for the hierarchical structure of GO, keeping predictions for specific child terms consistent with those for their parent terms. On the CAFA3 evaluation measures, DeepGOPlus achieved Fmax scores of 0.390 (BPO), 0.557 (MFO), and 0.614 (CCO), placing it among the top methods. DeepGO-SE further improves these metrics, with the largest gains observed on low-homology protein subsets where other methods degrade substantially.
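
For readers unfamiliar with the Fmax metric quoted above, the following sketch computes it for a toy example. It follows the protein-centric definition used in CAFA evaluations: at each score threshold, precision is averaged over proteins with at least one prediction at or above the threshold, recall is averaged over all benchmark proteins, and Fmax is the best harmonic mean across thresholds. The function name, the threshold grid, and the toy data are assumptions for illustration, and the sketch omits the ancestor propagation that official evaluation tools apply to both predictions and ground truth.

```python
from typing import Dict, List, Optional, Set


def fmax(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    thresholds: Optional[List[float]] = None,
) -> float:
    """Protein-centric Fmax over a grid of score thresholds.

    `predictions` maps protein -> {term: score}; `truth` maps protein -> set
    of true terms and is assumed to be non-empty for every protein.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions: List[float] = []
        recalls: List[float] = []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            true_positives = len(predicted & true_terms)
            # Precision counts only proteins with at least one prediction.
            if predicted:
                precisions.append(true_positives / len(predicted))
            # Recall is averaged over every benchmark protein.
            recalls.append(true_positives / len(true_terms))
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy benchmark: two proteins, hypothetical term labels "a" and "b".
toy_truth = {"P1": {"a"}, "P2": {"a"}}
toy_predictions = {"P1": {"a": 0.9, "b": 0.1}, "P2": {"a": 0.8}}
toy_fmax = fmax(toy_predictions, toy_truth)
```

In this toy case any threshold between 0.11 and 0.80 keeps the correct term for both proteins and drops the spurious one, so Fmax reaches 1.0.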

Applications

DeepGO is suited to any task requiring automated functional annotation of proteins, particularly where experimental data is unavailable. Genome annotation pipelines for newly sequenced organisms can use DeepGO to assign preliminary GO terms to uncharacterized open reading frames. Metagenomics workflows — supported by the DeepGOMeta variant — can apply functional profiling to environmental samples containing many proteins with no reference in curated databases. Drug discovery teams can use predicted molecular function annotations to identify proteins with enzymatic activities or binding properties relevant to a target indication. Pathway reconstruction and comparative genomics analyses benefit from consistent, logically coherent annotations across multiple species, enabling functional ortholog identification even across large evolutionary distances.

Impact

DeepGO-SE, published in Nature Machine Intelligence in 2024, advances the field by demonstrating that neuro-symbolic approaches can outperform purely data-driven classifiers for structured prediction tasks where domain knowledge is available in formal ontological form. The Gene Ontology community has historically relied on homology-based propagation for unannotated proteins; DeepGO provides a principled alternative for the rapidly growing fraction of sequences without useful homologs. The model's public availability on GitHub and the existence of a companion web server lower the barrier for biologists who lack machine learning infrastructure. A key limitation is that DeepGO-SE inherits the quality and coverage of the GOA training corpus: functions with few experimentally validated examples remain difficult to predict reliably, and the model does not currently address context-dependent function (e.g., tissue-specific or condition-dependent activity).

Citation

Protein function prediction as approximate semantic entailment

Kulmanov M, Guzmán-Vega FJ, Duek Roggli P, Lane L, Arold ST, Hoehndorf R. Protein function prediction as approximate semantic entailment. Nat Mach Intell. 2024;6:220-228.

DOI: 10.1038/s42256-024-00795-w

Recent citations

Papers that recently cited this model.

BulkFormer: A large-scale foundation model for bulk transcriptomes.
Boming Kang, Rui Fan, M. Yi, et al.
Cell Systems · Jul 2026 · 0 citations

M2GO: Multimodal protein function prediction via heterogeneous expert interaction
Xiaoling Luo, Ruli Zheng, Peng Chen, et al.
Pattern Recognition · Jul 2026 · 0 citations

Retrieval-Augmented Multimodal Learning for Enzyme-Substrate Interaction Prediction Under Low-Homology Shift
Chen Liu, Bingxin Zhou, Xinyuan Wang, et al.
Jun 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

The Gene Ontology knowledgebase in 2026
Suzi A. Aleksander, J. Balhoff, et al.
Nucleic Acids Research · Dec 2025 · 56 citations

Evaluating the advancements in protein language models for encoding strategies in protein function prediction: a comprehensive review
Jiaying Chen, Jingfu Wang, Yue Hu, et al.
Frontiers in Bioengineering and Biotechnology · Jan 2025 · 43 citations

Ontology Embedding: A Survey of Methods, Applications and Resources
Jiaoyan Chen, O. Mashkova, Fernando Zhapa-Camacho, et al.
IEEE Transactions on Knowledge and Data Engineering · Jun 2024 · 28 citations

AI-driven de novo enzyme design: Strategies, applications, and future prospects.
Xi-Chen Cui, Yangqi Zheng, Ye Liu, et al.
Biotechnology Advances · May 2025 · 24 citations

Accelerating drug discovery, development, and clinical trials by artificial intelligence.
Yilun Zhang, Mohamed Mastouri, Yang Zhang
i Medicina · Aug 2024 · 21 citations

Citations

Total Citations: 106
Influential: 8
References: 66

GitHub

Stars: 61
Forks: 11
Open Issues: 10
Contributors: 3
Last Push: 1y ago
Language: Python
License: BSD-3-Clause

Fields of citing research

Biology: 88%
Computer Science: 88%
Medicine: 65%
Environmental Science: 11%
Chemistry: 9%
Engineering: 4%
Agricultural and Food Sciences: 4%
Materials Science: 3%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
63 · Partial
Usability (can I run it?): 72
Reproducibility (can I retrain it?): 54

Model Openness Framework: Unclassified
Restrictive license on core components

