bio.rodeo
Protein

DeepGO

Bio-Ontology Research Group

Deep learning method for protein function prediction using Gene Ontology annotations, combining protein language models with neuro-symbolic reasoning over GO axioms.

Released: 2024

Overview

DeepGO is a family of deep learning methods developed by the Bio-Ontology Research Group at KAUST for predicting protein functions expressed as Gene Ontology (GO) annotations. The most recent version, DeepGO-SE (Semantic Entailment), frames protein function prediction as a logical reasoning problem: rather than treating GO terms as independent labels to be classified, the model treats function assignment as approximate semantic entailment over GO's formal axiom system, integrating pretrained protein language model embeddings with neuro-symbolic inference.

The series has advanced through successive generations. The original DeepGO (2017) applied deep learning to protein sequence and interaction data for ontology-aware classification. DeepGOPlus (2019) shifted to convolutional sequence-only models and reached competitive performance in the CAFA (Critical Assessment of Functional Annotation) challenge. DeepGO-SE (2024), published in Nature Machine Intelligence, is the most recent generation. By exploiting more than 100,000 GO axioms encoding subsumption, disjointness, and domain-range constraints, it produces logically consistent predictions and is reported to be especially robust for proteins that lack sequence similarity to any annotated training example.

This capability matters because a large fraction of proteins in newly sequenced genomes and metagenomes have no close homologs in annotated databases. Homology transfer, the standard approach in BLAST-based annotation pipelines, fails precisely where it is needed most. DeepGO-SE is designed to operate reliably in this regime.

Key Features

  • Neuro-symbolic architecture: Combines pretrained protein language model embeddings with a symbolic reasoning layer operating over multiple approximate models of the Gene Ontology, integrating data-driven learning with formal logical constraints.
  • Semantic entailment framework: Formulates function prediction as checking whether a GO term statement is entailed by the model of the protein, ensuring predictions respect the logical structure of GO rather than treating terms as independent labels.
  • Ontology-aware consistency: Exploits GO's formal axioms (subsumption, disjointness, and domain-range restrictions) so that predicted annotations are logically coherent across the full GO graph.
  • Multi-ontology support: Provides predictions across all three GO sub-ontologies — Molecular Function (MFO), Biological Process (BPO), and Cellular Component (CCO) — with sub-ontology-specific adaptations.
  • High-throughput inference: The earlier DeepGOPlus model annotates approximately 40 protein sequences per second on standard hardware, enabling practical use for proteome-scale annotation projects.
  • Robustness to novel proteins: Explicitly evaluated and designed for low-homology proteins, where sequence-similarity-based methods produce few or no annotations.
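
The ontology-aware consistency feature can be illustrated with a much simpler stand-in than DeepGO-SE's reasoning layer: propagating scores up the subsumption hierarchy so that a parent term never scores lower than any of its descendants. The sketch below is only that generic post-processing step, not the authors' method. The `PARENTS` fragment, the scores, and the function names are assumptions made for illustration, and the fragment omits intermediate GO terms.

```python
# Simplified is_a fragment (child -> parents). The IDs are real GO terms, but
# intermediate terms are omitted, so this is NOT the actual GO graph.
PARENTS = {
    "GO:0004672": ["GO:0016301"],  # protein kinase activity -> kinase activity
    "GO:0016301": ["GO:0003824"],  # kinase activity -> catalytic activity
    "GO:0003824": [],              # catalytic activity
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate_scores(scores, parents):
    """Raise each ancestor's score to at least the score of its descendants."""
    consistent = dict(scores)
    for term, score in scores.items():
        for ancestor in ancestors(term, parents):
            consistent[ancestor] = max(consistent.get(ancestor, 0.0), score)
    return consistent


# Hypothetical raw scores in which the child outranks its parent.
raw = {"GO:0004672": 0.9, "GO:0016301": 0.4}
consistent = propagate_scores(raw, PARENTS)
# Every ancestor of GO:0004672 now scores at least as high as GO:0004672.
```

Propagation of this kind only repairs subsumption violations after the fact. Disjointness and domain-range axioms need the richer reasoning that the entry attributes to DeepGO-SE.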

Technical Details

DeepGO-SE employs a two-component architecture. A pretrained protein language model processes amino acid sequences and produces per-protein embeddings that capture sequence-level patterns and evolutionary context. These embeddings feed into a neuro-symbolic reasoning module that constructs approximate models of the Gene Ontology — finite interpretations that satisfy GO's axioms — and predicts the truth value of GO term membership statements for each protein. This formulation allows the model to propagate logical constraints across GO's hierarchical structure during inference rather than treating each term as an isolated binary classifier.
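
In logic, a statement is entailed when it holds in every model of the axioms. With only a handful of approximate models available, one way to approximate this is to aggregate each statement's truth value across those models. The sketch below assumes that every approximate model has already produced a truth value in [0, 1] per GO term for a single protein. The function names, the example scores, and the choice of `min` as the default aggregator are assumptions for illustration; this entry does not say which aggregation DeepGO-SE uses.

```python
def approximate_entailment(model_scores, aggregate=min):
    """Combine per-model truth values into one score per GO term.

    model_scores: list of dicts, one per approximate model, mapping a GO term
        to the truth value in [0, 1] of "this protein has this function".
    aggregate: function reducing a list of floats to a single float.
        min is the strictest choice: a term scores high only if every
        model agrees. Terms missing from a model count as 0.0.
    """
    terms = set()
    for scores in model_scores:
        terms.update(scores)
    return {
        term: aggregate([scores.get(term, 0.0) for scores in model_scores])
        for term in sorted(terms)
    }


def mean(values):
    """Arithmetic mean, a more lenient aggregator than min."""
    return sum(values) / len(values)


# Hypothetical truth values from three approximate models for one protein.
models = [
    {"GO:0003824": 0.9, "GO:0016301": 0.8},
    {"GO:0003824": 0.7, "GO:0016301": 0.2},
    {"GO:0003824": 0.8},
]
strict = approximate_entailment(models)                  # all models must agree
lenient = approximate_entailment(models, aggregate=mean)  # average agreement
```

The strict variant mirrors the logical definition most closely, while the mean is more forgiving when a single approximate model is poor.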

Training uses experimentally validated protein-function associations from the Gene Ontology Annotation (GOA) database, which aggregates curated annotations across model organisms and multiple evidence codes. The loss function accounts for the hierarchical structure of GO, so that the score for a specific child term stays consistent with the scores of its parent terms. On CAFA benchmark evaluations, DeepGOPlus achieved Fmax (maximum F-measure over all decision thresholds) scores of 0.390 (BPO), 0.557 (MFO), and 0.614 (CCO), placing among top methods. DeepGO-SE further improves these metrics, with the largest gains observed on low-homology protein subsets where other methods degrade substantially.
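
For readers unfamiliar with the Fmax figures quoted above, the following is a minimal sketch of the protein-centric Fmax used in CAFA-style evaluation. At each threshold, precision is averaged over the proteins that receive at least one prediction, recall is averaged over all annotated proteins, and Fmax is the best resulting F-measure. The function name, the input layout, and the toy data are assumptions for illustration. Official evaluations also propagate annotations to ancestor terms first, which this sketch skips.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric maximum F-measure over a sweep of score thresholds.

    predictions: {protein_id: {go_term: score in [0, 1]}}
    truth:       {protein_id: set of experimentally annotated go_terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    # Only proteins with at least one true annotation are evaluated.
    evaluated = {p: terms for p, terms in truth.items() if terms}
    if not evaluated:
        return 0.0

    best = 0.0
    for threshold in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in evaluated.items():
            scored = predictions.get(protein, {})
            predicted = {t for t, s in scored.items() if s >= threshold}
            true_positives = len(predicted & true_terms)
            if predicted:  # precision counts only proteins with predictions
                precisions.append(true_positives / len(predicted))
            recall_sum += true_positives / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(evaluated)
        if precision + recall > 0:
            f_measure = 2 * precision * recall / (precision + recall)
            best = max(best, f_measure)
    return best


# Toy example with placeholder term names, not real GO identifiers.
toy_truth = {"P1": {"a", "b"}, "P2": {"c"}}
toy_predictions = {"P1": {"a": 0.9, "b": 0.4, "x": 0.6}, "P2": {"c": 0.8}}
toy_fmax = fmax(toy_predictions, toy_truth)
```

Because the threshold is chosen per evaluation to maximise the score, Fmax is an upper bound on what a single fixed cutoff would deliver in practice.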

Applications

DeepGO is suited to any task requiring automated functional annotation of proteins, particularly where experimental data is unavailable. Genome annotation pipelines for newly sequenced organisms can use DeepGO to assign preliminary GO terms to uncharacterized open reading frames. Metagenomics workflows — supported by the DeepGOMeta variant — can apply functional profiling to environmental samples containing many proteins with no reference in curated databases. Drug discovery teams can use predicted molecular function annotations to identify proteins with enzymatic activities or binding properties relevant to a target indication. Pathway reconstruction and comparative genomics analyses benefit from consistent, logically coherent annotations across multiple species, enabling functional ortholog identification even across large evolutionary distances.

Impact

DeepGO-SE, published in Nature Machine Intelligence in 2024, advances the field by demonstrating that neuro-symbolic approaches can outperform purely data-driven classifiers for structured prediction tasks where domain knowledge is available in formal ontological form. The Gene Ontology community has historically relied on homology-based propagation for unannotated proteins; DeepGO provides a principled alternative for the rapidly growing fraction of sequences without useful homologs. The model's public availability on GitHub and the existence of a companion web server lower the barrier for biologists who lack machine learning infrastructure. A key limitation is that DeepGO-SE inherits the quality and coverage of the GOA training corpus: functions with few experimentally validated examples remain difficult to predict reliably, and the model does not currently address context-dependent function (e.g., tissue-specific or condition-dependent activity).

Citation

Kulmanov M, Guzmán-Vega FJ, Duek Roggli P, Lane L, Arold ST, Hoehndorf R. Protein function prediction as approximate semantic entailment. Nat Mach Intell. 2024;6:220-228.

DOI: 10.1038/s42256-024-00795-w

Metrics

GitHub

Stars: 57
Forks: 10
Open Issues: 9
Contributors: 3
Last Push: 1y ago
Language: Python
License: BSD-3-Clause

Citations

Total Citations: 95
Influential: 8
References: 66

Tags

gene ontology, protein function prediction

Resources

GitHub Repository
Research Paper