Deep learning method for protein function prediction using Gene Ontology annotations, combining protein language models with neuro-symbolic reasoning over GO axioms.
DeepGO is a family of deep learning methods developed by the Bio-Ontology Research Group at KAUST for predicting protein functions expressed as Gene Ontology (GO) annotations. The most recent version, DeepGO-SE (Semantic Entailment), frames protein function prediction as a logical reasoning problem: rather than treating GO terms as independent labels to be classified, the model treats function assignment as approximate semantic entailment over GO's formal axiom system, integrating pretrained protein language model embeddings with neuro-symbolic inference.
The series has advanced through successive generations. The original DeepGO (2017) applied deep learning to protein sequence and interaction data for ontology-aware classification. DeepGOPlus (2019) shifted to convolutional sequence-only models and reached competitive performance in the CAFA (Critical Assessment of Functional Annotation) challenge. DeepGO-SE (2024), published in Nature Machine Intelligence, is the most recent generation. By exploiting more than 100,000 GO axioms encoding subsumption, disjointness, and domain-range constraints, it produces logically consistent predictions and is especially robust for proteins that lack sequence similarity to any annotated training example.
This capability matters because a large fraction of proteins in newly sequenced genomes and metagenomes have no close homologs in annotated databases. Homology transfer, the standard approach in BLAST-based annotation pipelines, fails precisely where it is needed most, since it has nothing to transfer from when no annotated homolog exists. DeepGO-SE is designed to operate reliably in this regime.
DeepGO-SE employs a two-component architecture. A pretrained protein language model processes amino acid sequences and produces per-protein embeddings that capture sequence-level patterns and evolutionary context. These embeddings feed into a neuro-symbolic reasoning module that constructs approximate models of the Gene Ontology — finite interpretations that satisfy GO's axioms — and predicts the truth value of GO term membership statements for each protein. This formulation allows the model to propagate logical constraints across GO's hierarchical structure during inference rather than treating each term as an isolated binary classifier.
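As a much simpler illustration of what propagating constraints across the GO hierarchy means, the sketch below applies the true path rule to per-term scores: if a protein is predicted to have a specific function, every ancestor of that term must score at least as high. This is not DeepGO-SE's reasoning module, only a minimal post-processing example; the term identifiers, the `PARENTS` map, and the function names are hypothetical.

```python
from typing import Dict, Set

# Toy is-a hierarchy: child term -> direct parents (hypothetical identifiers).
PARENTS: Dict[str, Set[str]] = {
    "GO:child_a": {"GO:mid"},
    "GO:child_b": {"GO:mid"},
    "GO:mid": {"GO:root"},
    "GO:root": set(),
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return all strict ancestors of `term` by walking parent links."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate_scores(scores: Dict[str, float],
                     parents: Dict[str, Set[str]]) -> Dict[str, float]:
    """Raise each ancestor's score to at least the max score of its descendants."""
    out = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            out[anc] = max(out.get(anc, 0.0), score)
    return out


# Raw scores from independent per-term classifiers can violate the hierarchy:
# here the child scores 0.9 while its parent scores only 0.5.
raw = {"GO:child_a": 0.9, "GO:child_b": 0.2, "GO:mid": 0.5, "GO:root": 0.4}
consistent = propagate_scores(raw, PARENTS)
```

After propagation, `GO:mid` and `GO:root` are both lifted to 0.9, while `GO:child_b` keeps its own score of 0.2. An independent binary classifier per term offers no such guarantee, which is the inconsistency the paragraph above describes.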
Training uses experimentally validated protein-function associations from the Gene Ontology Annotation (GOA) database, which aggregates curated annotations across model organisms and multiple evidence codes. The loss function accounts for the hierarchical structure of GO, so that scores for more specific child terms are kept consistent with those of their parent terms. On CAFA benchmark evaluations, DeepGOPlus achieved Fmax scores of 0.390 for the biological process ontology (BPO), 0.557 for the molecular function ontology (MFO), and 0.614 for the cellular component ontology (CCO), placing it among the top methods. DeepGO-SE further improves these metrics, with the largest gains observed on low-homology protein subsets where other methods degrade substantially.
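For readers unfamiliar with the metric, Fmax is the maximum, over all decision thresholds, of the harmonic mean of protein-centric precision and recall. The sketch below follows the usual CAFA-style definition: precision is averaged only over proteins with at least one prediction at the threshold, and recall over all benchmark proteins. It assumes every benchmark protein has at least one true term; the function name, the threshold grid, and the toy data are illustrative assumptions, not the official evaluation code.

```python
from typing import Dict, Set


def fmax(predictions: Dict[str, Dict[str, float]],
         truth: Dict[str, Set[str]],
         steps: int = 100) -> float:
    """Protein-centric Fmax over thresholds 1/steps, 2/steps, ..., 1.0."""
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []      # one entry per protein with >= 1 prediction
        recall_sum = 0.0     # summed over all benchmark proteins
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {t for t, s in scored.items() if s >= threshold}
            true_positives = len(predicted & true_terms)
            if predicted:
                precisions.append(true_positives / len(predicted))
            recall_sum += true_positives / len(true_terms)
        if not precisions:
            continue  # no protein has any prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy benchmark with two proteins and made-up term labels.
truth = {"P1": {"A", "B"}, "P2": {"A"}}
preds = {
    "P1": {"A": 0.9, "B": 0.4, "C": 0.3},
    "P2": {"A": 0.8, "C": 0.6},
}
score = fmax(preds, truth)
```

On this toy data the best trade-off is precision 0.75 with recall 1.0 (for example at a threshold of 0.35), giving an Fmax of 6/7, roughly 0.857. Because the threshold is chosen per evaluation, Fmax reports a method's best achievable operating point, not its performance at a fixed cutoff.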
DeepGO is suited to any task requiring automated functional annotation of proteins, particularly where experimental data is unavailable. Genome annotation pipelines for newly sequenced organisms can use DeepGO to assign preliminary GO terms to uncharacterized open reading frames. Metagenomics workflows — supported by the DeepGOMeta variant — can apply functional profiling to environmental samples containing many proteins with no reference in curated databases. Drug discovery teams can use predicted molecular function annotations to identify proteins with enzymatic activities or binding properties relevant to a target indication. Pathway reconstruction and comparative genomics analyses benefit from consistent, logically coherent annotations across multiple species, enabling functional ortholog identification even across large evolutionary distances.
DeepGO-SE advances the field by demonstrating that neuro-symbolic approaches can outperform purely data-driven classifiers on structured prediction tasks where domain knowledge is available in formal ontological form. The Gene Ontology community has historically relied on homology-based propagation for unannotated proteins; DeepGO provides a principled alternative for the rapidly growing fraction of sequences without useful homologs. The model's public availability on GitHub and the existence of a companion web server lower the barrier for biologists who lack machine learning infrastructure. A key limitation is that DeepGO-SE inherits the quality and coverage of the GOA training corpus: functions with few experimentally validated examples remain difficult to predict reliably, and the model does not currently address context-dependent function (e.g., tissue-specific or condition-dependent activity).
Kulmanov M, Guzmán-Vega FJ, Duek Roggli P, Lane L, Arold ST, Hoehndorf R. Protein function prediction as approximate semantic entailment. Nat Mach Intell. 2024;6:220-228.
DOI: 10.1038/s42256-024-00795-w