Geometric deep learning model generating context-aware protein representations across 156 cell-type contexts from a multi-organ single-cell atlas.
PINNACLE (Protein Interaction Networks with Context-Aware Learning Embeddings) is a geometric deep learning framework developed at the Zitnik Lab, Harvard Medical School, that generates context-aware protein representations at single-cell resolution. Published in Nature Methods in August 2024, it addresses a fundamental limitation of conventional protein representation models: they assign a single embedding to each protein regardless of the cellular context in which that protein is expressed. Because the same protein can participate in very different interaction networks and regulatory programs depending on the cell type and tissue, context-free representations conflate biologically distinct states.
PINNACLE resolves this by integrating three complementary data sources — a global protein interaction network, cell-type-specific protein interaction networks derived from single-cell transcriptomics, and a tissue hierarchy metagraph — to produce a unique embedding for each protein in each cell type where it is active. Trained on a multi-organ single-cell atlas spanning 24 human tissues and organs, PINNACLE generates 394,760 protein representations distributed across 156 cell-type contexts, constituting the largest contextualized protein embedding space of its kind at the time of publication.
The work was led by Michelle M. Li, Yepeng Huang, and Marissa Sumathipala, with coauthors including Marinka Zitnik and Alberto Valdeolivas and collaborators with expertise in rheumatology and gastroenterology at Brigham and Women's Hospital. The model is freely available through GitHub and HuggingFace, and pretrained checkpoints can be fine-tuned for downstream tasks.
PINNACLE is built on geometric deep learning, a family of methods that generalize neural networks to graph-structured data. The model operates on a hierarchical graph construction: cell-type-specific protein interaction networks (derived by weighting a global reference interactome with single-cell gene expression data) are connected via a metagraph encoding cell-type-to-cell-type and cell-type-to-tissue relationships. This nested, multi-resolution graph is the input over which PINNACLE learns.
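The hierarchical construction described above can be illustrated with a small, hedged sketch. It is not the authors' pipeline: the function names, the toy proteins and cell types, the expression values, and the rule used here (keep an edge when both endpoints are expressed above a threshold, and weight it by the lower expression value) are all assumptions chosen only to make the idea concrete.

```python
# Minimal sketch: derive cell-type-specific networks from a global interactome
# and link the cell types through a metagraph. Every name, value, and the
# thresholding rule below is an illustrative assumption, not PINNACLE's code.

def cell_type_networks(global_edges, expression, min_expr=0.5):
    """Return {cell_type: [(u, v, weight), ...]}.

    An edge is kept for a cell type when both endpoint proteins are
    expressed at or above `min_expr`; its weight is the lower of the
    two expression values.
    """
    networks = {}
    for cell_type, expr in expression.items():
        kept = []
        for u, v in global_edges:
            weight = min(expr.get(u, 0.0), expr.get(v, 0.0))
            if weight >= min_expr:
                kept.append((u, v, weight))
        networks[cell_type] = kept
    return networks


def build_metagraph(cell_type_tissue, cell_type_pairs):
    """Return typed metagraph edges linking cell types and tissues."""
    edges = [(a, b, "cell_type-cell_type") for a, b in cell_type_pairs]
    edges += [(c, t, "cell_type-tissue") for c, t in cell_type_tissue.items()]
    return edges


# Toy global interactome (hypothetical proteins P1..P6).
global_edges = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]

# Toy normalized expression per cell type (hypothetical values).
expression = {
    "cell_type_A": {"P1": 0.9, "P2": 0.7, "P5": 0.8, "P6": 0.6},
    "cell_type_B": {"P3": 0.6, "P4": 0.9, "P5": 0.4, "P6": 0.9},
}

networks = cell_type_networks(global_edges, expression)
metagraph = build_metagraph(
    {"cell_type_A": "tissue_X", "cell_type_B": "tissue_Y"},
    [("cell_type_A", "cell_type_B")],
)

for cell_type, edges in networks.items():
    print(cell_type, edges)
print(metagraph)
```

In this toy setup the same global edge list yields different networks for the two cell types, which is the property that lets one protein receive a different representation in each context.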
Training is self-supervised and employs protein-, cell-type-, and tissue-level objective functions to simultaneously encode local protein neighborhood structure and global cellular organization. The multi-scale attention architecture allows information to propagate across levels of biological organization — from individual protein interactions up through cell-type identity and tissue-level programs — in a unified embedding space. Single-cell transcriptomic data from a comprehensive multi-organ human atlas (covering 24 tissues and 156 cell types) provides the expression-based context that shapes cell-type-specific networks. The pretrained model and embeddings are hosted on HuggingFace under the Therapeutics Data Commons (TDC) organization, making them straightforward to load for fine-tuning.
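To make the idea of attention-based propagation concrete, the following is a minimal sketch of one generic attention step: a node scores each neighbor, normalizes the scores with a softmax, and takes the weighted average of the neighbor vectors. It uses plain dot-product attention as a stand-in and should not be read as PINNACLE's actual layers or objectives; the vectors and function names are hypothetical.

```python
import math

# Generic dot-product attention over a node's neighbors. This is an
# illustrative stand-in, not PINNACLE's architecture; all values are toy data.

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_aggregate(query, neighbors):
    """Return (pooled_vector, attention_weights) for one node.

    Each neighbor is scored by its dot product with `query`; the pooled
    vector is the attention-weighted average of the neighbor vectors.
    """
    weights = softmax([dot(query, n) for n in neighbors])
    pooled = [
        sum(w * n[i] for w, n in zip(weights, neighbors))
        for i in range(len(query))
    ]
    return pooled, weights


# Protein level: a protein attends over its neighbors in one cell-type network.
protein_vec = [1.0, 0.0]
protein_neighbors = [[1.0, 0.0], [0.0, 1.0]]
pooled_protein, protein_weights = attention_aggregate(protein_vec, protein_neighbors)

# Cell-type level: a cell type attends over its metagraph neighbors
# (other cell types and tissues), using the same mechanism.
cell_type_vec = pooled_protein
metagraph_neighbors = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
pooled_cell_type, cell_type_weights = attention_aggregate(cell_type_vec, metagraph_neighbors)

print(protein_weights)
print(pooled_cell_type)
```

Applying the same aggregation at the protein level and again at the cell-type level is one simple way to picture how information can move between levels of the hierarchy.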
PINNACLE is designed for researchers working at the intersection of computational protein biology, systems pharmacology, and single-cell genomics. Its primary demonstrated application is therapeutic target identification: by providing cell-type-resolved protein representations, PINNACLE helps prioritize not just which proteins to target but in which cell types intervention is most likely to be effective — a critical consideration for immune-mediated inflammatory diseases such as rheumatoid arthritis and inflammatory bowel disease. Beyond target nomination, the framework supports studies of drug mechanism of action by modeling how a drug's transcriptomic perturbation propagates differently across cell types. Researchers can also use PINNACLE's embeddings to augment structure-based methods, such as docking or protein interaction prediction, with cell-type context that sequence- or structure-only models cannot provide.
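As a rough illustration of cell-type-resolved target prioritization, the sketch below scores each (protein, cell type) embedding with a small linear head and ranks the pairs. The embeddings, the weights of the scoring head, and all names are hypothetical placeholders; in practice the embeddings would come from the pretrained model and the head would be fine-tuned on labeled targets for a given disease.

```python
import math

# Hypothetical contextualized embeddings keyed by (protein, cell_type).
# Real embeddings would be loaded from the pretrained model instead.
embeddings = {
    ("P1", "cell_type_A"): [0.9, 0.1],
    ("P1", "cell_type_B"): [0.1, 0.8],
    ("P2", "cell_type_A"): [0.2, 0.2],
}

# Hypothetical parameters of a linear scoring head for one disease.
WEIGHTS = [2.0, -1.0]
BIAS = -0.5


def score(vec):
    """Logistic score in (0, 1) for one contextualized embedding."""
    z = sum(w * x for w, x in zip(WEIGHTS, vec)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(table):
    """Return (score, protein, cell_type) tuples, highest score first."""
    scored = [(score(vec), protein, cell_type)
              for (protein, cell_type), vec in table.items()]
    return sorted(scored, reverse=True)


ranked = rank_candidates(embeddings)
for s, protein, cell_type in ranked:
    print(f"{protein} in {cell_type}: {s:.3f}")
```

Because each protein has one embedding per context, the same protein (P1 here) can rank high in one cell type and low in another, which is the distinction a context-free embedding cannot express.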
PINNACLE represents a conceptual advance in how the field approaches protein representation learning by demonstrating that cellular context is a first-class feature, not an afterthought. The paper's benchmarks show consistent gains over context-free models on disease-relevant tasks, providing a concrete argument for incorporating single-cell data into protein AI pipelines. The model's release through HuggingFace and its compatibility with the Therapeutics Data Commons ecosystem lower adoption barriers for groups without specialized infrastructure. A key limitation is that PINNACLE's representations are currently anchored to the specific cell types and tissues present in its training atlas; extending coverage to rarer cell types, diseased tissue states, or non-human organisms will require retraining or fine-tuning on new atlases. The framework nonetheless establishes an important template for context-sensitive biological foundation models that subsequent work in this area is likely to build upon.
Li, M. M., et al. (2024). Contextual AI models for single-cell protein biology. Nature Methods.
DOI: 10.1038/s41592-024-02341-3