GATSBI

Graph attention model that learns context-aware protein embeddings from protein-protein interaction, co-expression, and tissue association networks.

Released: April 2026

GATSBI (Graph Attention for Tissue- and Sequence-aware Biological Integration) is a graph-attention framework for learning context-aware protein embeddings from heterogeneous biological networks. Rather than representing a protein solely by its sequence, GATSBI propagates information across a network that integrates physical protein-protein interactions, co-expression relationships, and tissue-specific functional associations, producing embeddings that reflect the cellular and functional context in which a protein operates. It was developed by Gowri Nayar and Russ B. Altman in the Helix Research Lab at Stanford University's Department of Biomedical Data Science and posted to bioRxiv in early 2026.

The central contribution of GATSBI is not only the architecture but an argument about evaluation: the authors show that how a benchmark splits its data fundamentally shapes which embeddings appear to perform well. They introduce two biologically motivated splitting strategies — a transductive edge split and an inductive, sequence-similarity-aware node split — that better reflect realistic prediction scenarios and prevent information leakage between training and test partitions.
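
The idea behind the inductive node split can be illustrated with a small sketch. It assumes cluster labels have already been computed by a sequence-clustering tool, and it moves whole clusters to one side of the split so that no test protein has a close homolog in training. The function name, the input format, and the toy IDs are hypothetical; this is not the authors' pipeline.

```python
import random
from collections import defaultdict


def cluster_aware_split(cluster_of, test_fraction=0.2, seed=0):
    """Split proteins into train/test so that no cluster spans both sides.

    cluster_of: dict mapping protein ID -> cluster ID (assumed precomputed
    by a sequence-clustering tool; hypothetical input format).
    """
    members = defaultdict(list)
    for protein, cluster in cluster_of.items():
        members[cluster].append(protein)

    clusters = sorted(members)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_of)
    train, test = set(), set()
    for cluster in clusters:
        # Whole clusters move together, which keeps homologs of a
        # test protein out of the training set.
        side = test if len(test) < target else train
        side.update(members[cluster])
    return train, test


# Toy example with made-up protein and cluster IDs.
toy = {"P1": "c1", "P2": "c1", "P3": "c2", "P4": "c3", "P5": "c3", "P6": "c4"}
train, test = cluster_aware_split(toy, test_fraction=0.3)
```

Because assignment happens per cluster, the realized test fraction only approximates the requested one when clusters differ in size.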

Across interaction, function, and functional-set prediction tasks, GATSBI consistently outperforms existing pretrained network embeddings (notably Pinnacle), with the largest gains observed for understudied proteins and under the most demanding inductive evaluation. This positions GATSBI as a resource for transferring knowledge from well-characterized proteins to poorly annotated ones, where sequence-only representations often struggle.

Key Features

Heterogeneous network integration: Combines physical interactions and co-expression (STRING) with tissue-specific associations (HumanBase) across 18,049 proteins and roughly 1.58 million edges, giving low-degree proteins complementary evidence from multiple sources.
Tissue- and edge-type-aware attention: The graph attention mechanism factorizes attention coefficients with learnable edge-type priors and tissue-consistency priors, biasing message passing toward biologically coherent neighbors and contexts.
Biologically motivated data splits: A transductive edge split (with a minimum 10-hop separation between test pairs) and an inductive node split (with a strict <30% sequence-identity threshold via MMseqs2) expose how evaluation design changes apparent performance.
Gains for understudied proteins: Averaged across tasks, GATSBI reaches AUROC 0.781 / AUPRC 0.801 on understudied proteins versus 0.522 / 0.511 for Pinnacle, an absolute gain of 0.259 AUROC and 0.290 AUPRC.
Downloadable embeddings and weights: Pretrained node- and edge-split models (~102 MB and ~127 MB) and precomputed embeddings are released on Zenodo under CC-BY 4.0 for direct reuse in transfer tasks.
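
The text above does not give the exact functional form of the prior-aware attention, so the following is only a minimal sketch under one assumption: the edge-type and tissue-consistency priors act as positive multiplicative weights on the exponentiated attention scores, which is the same as adding their logarithms to the logits before the softmax. The function name, argument names, and prior values are all hypothetical.

```python
import math


def prior_weighted_attention(scores, edge_types, tissue_match,
                             edge_type_prior, tissue_prior):
    """Normalize one node's neighbor attention with multiplicative priors.

    scores:          raw attention logits, one per neighbor
    edge_types:      edge-type label per neighbor
    tissue_match:    True if the edge's tissue agrees with the node's context
    edge_type_prior: dict edge type -> positive weight (learned in a real model)
    tissue_prior:    dict bool -> positive weight (learned in a real model)
    """
    # Assumption: priors enter as additive log-biases on the logits.
    logits = [
        s + math.log(edge_type_prior[t]) + math.log(tissue_prior[m])
        for s, t, m in zip(scores, edge_types, tissue_match)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values: three neighbors with identical raw scores.
alpha = prior_weighted_attention(
    scores=[0.0, 0.0, 0.0],
    edge_types=["physical", "coexpression", "tissue"],
    tissue_match=[True, True, False],
    edge_type_prior={"physical": 2.0, "coexpression": 1.0, "tissue": 1.0},
    tissue_prior={True: 1.0, False: 0.5},
)
```

With equal raw scores, the resulting coefficients are proportional to the product of the two priors, so the physical-interaction neighbor receives the most weight and the tissue-inconsistent neighbor the least.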

Technical Details

GATSBI is a heterogeneous graph attention network whose nodes are initialized with 1,280-dimensional ESM-2 sequence embeddings and whose edges are annotated by source, association score, and tissue type. The network spans 18,049 protein nodes and 1,575,310 edges drawn from STRING physical interactions (217,092 edges, score ≥0.6), STRING co-expression (151,067 edges, score ≥0.6), and HumanBase tissue-specific associations (1,207,151 edges across 144 tissues and cell types, posterior probability ≥0.6).

Attention coefficients are modulated by type-specific transformation matrices, learnable edge-type priors, and tissue-consistency priors, and the model is trained self-supervised via masked link prediction with degree-matched negative sampling (5:1 negative-to-positive ratio). Downstream tasks attach lightweight heads to the frozen embeddings: a shallow network on concatenated pairs for interaction prediction, a 3-layer MLP for enzyme-class (EC) function prediction, and multi-head attention pooling into a 3-layer MLP for Reactome functional-set prediction.

On benchmarks, the edge-split model reaches AUROC 0.878 / AUPRC 0.869 for interaction prediction and AUROC 0.804 / AUPRC 0.821 for pathway-set prediction, while the inductive node-split model reaches AUROC 0.746 for interactions and AUROC 0.679 for EC function on entirely unseen proteins.
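
The degree-matched negative sampling mentioned for training can be sketched as follows. This is one plausible reading, not the authors' implementation: for each positive edge (u, v), it draws up to five non-edges (u, w) where w has a degree close to that of v, so that node degree alone cannot separate positives from negatives. The function name, the `pool` parameter, and the toy ring graph are assumptions made for illustration.

```python
import random
from collections import defaultdict


def degree_matched_negatives(edges, ratio=5, pool=10, seed=0):
    """For each positive edge (u, v), draw up to `ratio` non-edges (u, w)
    where w has a degree close to the degree of v."""
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    nodes = sorted(degree)

    negatives = []
    for u, v in edges:
        # Candidates: nodes not already linked to u, ranked by degree gap to v.
        candidates = [w for w in nodes if w != u and w not in neighbors[u]]
        candidates.sort(key=lambda w: abs(degree[w] - degree[v]))
        closest = candidates[:pool]
        picks = rng.sample(closest, min(ratio, len(closest)))
        negatives.extend((u, w) for w in picks)
    return negatives


# Toy graph: a ring of 12 nodes, so every node has degree 2.
ring = [(i, (i + 1) % 12) for i in range(12)]
negatives = degree_matched_negatives(ring, ratio=5)
```

The full scan over candidate nodes for every edge is quadratic and only suitable for small graphs; at the scale of about 1.58 million edges, bucketing nodes by degree ahead of time would be the more practical choice.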

Applications

GATSBI embeddings are intended as drop-in protein representations for downstream prediction tasks, particularly where contextual, network-level information complements sequence. Researchers can download the pretrained embeddings or model weights and apply them to protein-protein interaction prediction, enzyme-function annotation, and pathway/functional-set assignment, or use them as features in their own classifiers. Because the largest performance gains accrue to understudied, low-degree proteins, the model is especially useful for prioritizing experiments and generating hypotheses about poorly characterized proteins — for example in target discovery, functional annotation of orphan proteins, and tissue-aware analyses of protein function.
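
As a rough sketch of using the embeddings as features, the snippet below builds concatenated pair vectors for an interaction classifier. The file format of the released embeddings is not described here, so the example assumes they have already been loaded into a dictionary from protein ID to vector; the helper names, IDs, and values are hypothetical.

```python
def pair_features(embeddings, protein_a, protein_b):
    """Concatenate two protein embeddings into one feature vector."""
    return list(embeddings[protein_a]) + list(embeddings[protein_b])


def build_dataset(embeddings, labeled_pairs):
    """Turn (protein_a, protein_b, label) triples into features X and labels y,
    skipping pairs for which an embedding is missing."""
    X, y = [], []
    for a, b, label in labeled_pairs:
        if a in embeddings and b in embeddings:
            X.append(pair_features(embeddings, a, b))
            y.append(label)
    return X, y


# Toy 4-dimensional embeddings with made-up IDs and values.
toy_embeddings = {
    "P1": [0.1, 0.2, 0.3, 0.4],
    "P2": [0.5, 0.6, 0.7, 0.8],
    "P3": [0.9, 1.0, 1.1, 1.2],
}
pairs = [("P1", "P2", 1), ("P1", "P3", 0), ("P2", "P9", 1)]  # P9 has no embedding
X, y = build_dataset(toy_embeddings, pairs)
```

The resulting `X` and `y` can be passed to any standard classifier. Concatenation is order-dependent, so for an undirected interaction task it may help to include each pair in both orders.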

Impact

GATSBI advances the conversation about how protein-network embeddings should be evaluated, demonstrating that transductive and inductive splits can yield reversed performance rankings and that no single benchmark captures all real-world use cases. By coupling this methodological critique with a model that integrates interaction, co-expression, and tissue context — and by releasing embeddings and weights under an open CC-BY 4.0 license — the work provides both a practical resource and a template for more rigorous, leakage-aware evaluation of context-aware protein representations. As a recent preprint from a single lab, its downstream adoption is still emerging, and its embeddings are currently centered on human proteins, which bounds applicability to other organisms.

Citation

GATSBI: Improving context-aware protein embeddings through biologically motivated data splits

Nayar, G. & Altman, R. (2026) GATSBI: Improving context-aware protein embeddings through biologically motivated data splits. bioRxiv.

DOI: 10.64898/2026.02.13.705830

Citations

Total Citations: 0
Influential: 0
References: 25

GitHub

Stars: 13
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 3 months ago
Language: Python

Openness

bio.rodeo openness: Fully open (usable and reproducible)
Score: 94 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 87
Model Openness Framework: Class II, Open Tooling
