A graph-attention model producing context-aware protein embeddings from protein-protein interaction, co-expression, and tissue networks, with biologically motivated data splits.
GATSBI (Graph Attention for Tissue- and Sequence-aware Biological Integration) is a graph-attention framework for learning context-aware protein embeddings from heterogeneous biological networks. Rather than representing a protein solely by its sequence, GATSBI propagates information across a network that integrates physical protein-protein interactions, co-expression relationships, and tissue-specific functional associations, producing embeddings that reflect the cellular and functional context in which a protein operates. It was developed by Gowri Nayar and Russ B. Altman in the Helix Research Lab at Stanford University's Department of Biomedical Data Science and posted to bioRxiv in early 2026.
GATSBI's central contribution is not only an architecture but also an argument about evaluation: the authors show that the way a benchmark splits its data shapes which embeddings appear to perform well. They introduce two biologically motivated splitting strategies, a transductive edge split and an inductive, sequence-similarity-aware node split, that better reflect realistic prediction scenarios and prevent information leakage between training and test partitions.
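The difference between the two splitting strategies can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `test_frac` parameter, and the `cluster_of` mapping (protein ID to a precomputed sequence-similarity cluster ID) are hypothetical, and the choice to treat any edge touching a held-out protein as a test edge is an assumption made for illustration.

```python
import random


def transductive_edge_split(edges, test_frac=0.2, seed=0):
    """Hold out a random fraction of edges.

    Every protein may still appear in both partitions; only edges are hidden.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]


def inductive_node_split(edges, cluster_of, test_frac=0.2, seed=0):
    """Hold out whole sequence-similarity clusters of proteins.

    `cluster_of` (hypothetical) maps protein ID -> cluster ID. Holding out
    entire clusters keeps close homologs of test proteins out of training.
    """
    rng = random.Random(seed)
    clusters = sorted(set(cluster_of.values()))
    rng.shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    test_nodes = {p for p, c in cluster_of.items() if c in test_clusters}

    # Assumption: an edge is a test edge if it touches any held-out protein,
    # so no training edge ever involves a test protein.
    train_edges = [(u, v) for u, v in edges
                   if u not in test_nodes and v not in test_nodes]
    test_edges = [(u, v) for u, v in edges
                  if u in test_nodes or v in test_nodes]
    return train_edges, test_edges, test_nodes


if __name__ == "__main__":
    toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "E")]
    toy_clusters = {"A": 0, "B": 0, "C": 1, "D": 2, "E": 3}
    print(transductive_edge_split(toy_edges, test_frac=0.4))
    print(inductive_node_split(toy_edges, toy_clusters, test_frac=0.25))
```

In the edge split a model has seen every test protein during training and only has to recover hidden links, whereas in the node split the test proteins and their near-homologs are absent from training altogether, which is why the inductive setting is the more demanding one.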
Across interaction, function, and functional-set prediction tasks, GATSBI consistently outperforms existing pretrained network embeddings (notably Pinnacle), with the largest gains observed for understudied proteins and under the most demanding inductive evaluation. This positions GATSBI as a resource for transferring knowledge from well-characterized proteins to poorly annotated ones, where sequence-only representations often struggle.
GATSBI is a heterogeneous graph attention network whose nodes are initialized with 1,280-dimensional ESM-2 sequence embeddings and whose edges are annotated by source, association score, and tissue type. The network spans 18,049 protein nodes and 1,575,310 edges drawn from STRING physical interactions (217,092 edges, score ≥0.6), STRING co-expression (151,067 edges, score ≥0.6), and HumanBase tissue-specific associations (1,207,151 edges across 144 tissues and cell types, posterior probability ≥0.6). Attention coefficients are modulated by type-specific transformation matrices, learnable edge-type priors, and tissue-consistency priors, and the model is trained in a self-supervised manner via masked link prediction with degree-matched negative sampling (5:1 negative-to-positive ratio). For downstream tasks, lightweight heads are attached to the frozen embeddings: a shallow network on concatenated pairs for interaction prediction, a 3-layer MLP for Enzyme Commission (EC) function prediction, and multi-head attention pooling into a 3-layer MLP for Reactome functional-set prediction. In the reported benchmarks, the edge-split model reaches AUROC 0.878 / AUPRC 0.869 for interaction prediction and AUROC 0.804 / AUPRC 0.821 for pathway-set prediction, while the inductive node-split model reaches AUROC 0.746 for interactions and AUROC 0.679 for EC function on entirely unseen proteins.
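As a rough illustration of degree-matched negative sampling at a 5:1 ratio, the sketch below corrupts one endpoint of each positive edge with a non-neighbor of similar degree. The text above does not specify how degrees are matched, so the log2 degree binning, the single-endpoint corruption, and the names `degree_matched_negatives`, `ratio`, and `max_tries` are all assumptions rather than the authors' procedure.

```python
import math
import random
from collections import defaultdict


def degree_matched_negatives(edges, ratio=5, seed=0, max_tries=100):
    """Sample `ratio` negative pairs per positive edge.

    For a positive edge (u, v), v is swapped for a node w whose degree falls
    in the same log2 bin as v's degree (assumed matching rule), provided that
    (u, w) is not a known edge.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}

    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Group nodes into coarse degree bins: 1, 2-3, 4-7, 8-15, ...
    bins = defaultdict(list)
    for node, d in degree.items():
        bins[int(math.log2(d))].append(node)

    negatives = []
    for u, v in edges:
        pool = bins[int(math.log2(degree[v]))]
        found, tries = 0, 0
        # Give up after max_tries draws; small bins can yield fewer negatives.
        while found < ratio and tries < max_tries:
            tries += 1
            w = rng.choice(pool)
            if w != u and frozenset((u, w)) not in edge_set:
                negatives.append((u, w))
                found += 1
    return negatives


if __name__ == "__main__":
    # Toy ring graph: 20 nodes, each with degree 2.
    ring = [(i, (i + 1) % 20) for i in range(20)]
    negs = degree_matched_negatives(ring)
    print(len(ring), "positives,", len(negs), "negatives")
```

Matching on degree keeps a model from separating positives and negatives simply by learning that hub proteins have more interactions, which uniform random negatives would otherwise allow.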
GATSBI embeddings are intended as drop-in protein representations for downstream prediction tasks, particularly where contextual, network-level information complements sequence. Researchers can download the pretrained embeddings or model weights and apply them to protein-protein interaction prediction, enzyme-function annotation, and pathway/functional-set assignment, or use them as features in their own classifiers. Because the largest performance gains accrue to understudied, low-degree proteins, the model is especially useful for prioritizing experiments and generating hypotheses about poorly characterized proteins, for example in target discovery, functional annotation of orphan proteins, and tissue-aware analyses of protein function.
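Using frozen embeddings as features for a pairwise classifier might look like the following sketch. It uses toy 4-dimensional vectors in place of the released embeddings, and the helper names (`pair_features`, `init_head`, `score`), the hidden size, and the random untrained weights are hypothetical; the file format and loading API of the released embeddings are not described here, so loading is left out.

```python
import math
import random


def pair_features(emb, a, b):
    """Concatenate two frozen protein embeddings into one feature vector.

    Concatenation is order-dependent: (a, b) and (b, a) give different inputs.
    """
    return emb[a] + emb[b]


def init_head(in_dim, hidden=3, seed=0):
    """Randomly initialize a one-hidden-layer head (untrained placeholder)."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.gauss(0, 0.1) for _ in range(hidden)],
        "b2": 0.0,
    }


def score(head, x):
    """Forward pass: linear -> ReLU -> linear -> sigmoid, giving a value in (0, 1)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(head["w1"], head["b1"])]
    z = sum(w * hi for w, hi in zip(head["w2"], h)) + head["b2"]
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Toy stand-ins for pretrained embeddings, keyed by hypothetical protein IDs.
    emb = {"P1": [0.1, 0.2, 0.3, 0.4], "P2": [0.5, 0.6, 0.7, 0.8]}
    x = pair_features(emb, "P1", "P2")
    head = init_head(len(x))
    print(len(x), score(head, x))
```

Only the head's weights would be trained in this setup; the embeddings stay fixed, so the same vectors can be reused across interaction, function, and set-level tasks.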
GATSBI contributes to the discussion of how protein-network embeddings should be evaluated, demonstrating that transductive and inductive splits can yield reversed performance rankings and that no single benchmark captures all real-world use cases. By coupling this methodological critique with a model that integrates interaction, co-expression, and tissue context, and by releasing embeddings and weights under an open CC-BY 4.0 license, the work provides both a practical resource and a template for more rigorous, leakage-aware evaluation of context-aware protein representations. As a recent preprint from a single lab, its downstream adoption is still emerging, and its embeddings are currently centered on human proteins, which limits applicability to other organisms.