scPROTEIN

Deep graph contrastive learning framework for single-cell proteomics embedding, handling peptide uncertainty, missingness, and batch effects.

Released: January 2024

scPROTEIN (single-cell PROTeomics EmbeddINg) is a deep learning framework developed by researchers at Nankai University and Tencent AI Lab Healthcare to address the distinctive analytical challenges of single-cell proteomics data. Published in Nature Methods in March 2024, the framework tackles four compounding problems that have long limited single-cell protein analysis: uncertainty in peptide quantification, pervasive missing values, high measurement noise, and batch effects arising from different experimental runs or platforms.

The framework uses a two-stage design. The first stage uses a multitask heteroscedastic regression model to aggregate peptide-level intensity measurements into protein-level abundance estimates, weighting each peptide's contribution by its estimated measurement uncertainty. The second stage takes the resulting protein abundance matrices and applies graph contrastive learning: it constructs cell-cell similarity graphs and learns low-dimensional embeddings that are robust to noise and batch variation. Both stages are integrated into a single pipeline rather than requiring separate, manually chained preprocessing tools.
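As a point of reference for the first stage, heteroscedastic regression is commonly trained with a Gaussian negative log-likelihood in which the model predicts both a mean and a variance for every measurement. The form below is that textbook objective, not necessarily the exact loss used by scPROTEIN. The notation is assumed here for illustration: y_i is an observed peptide intensity, x_i its input features, and mu_theta and sigma_theta^2 are the predicted mean and variance.

```latex
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\bigl(y_i - \mu_\theta(x_i)\bigr)^2}{2\,\sigma_\theta^2(x_i)} + \frac{1}{2} \log \sigma_\theta^2(x_i) \right]
```

Measurements with a large predicted variance contribute less to the squared-error term, while the log term penalizes inflating the variance everywhere.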

This approach positions scPROTEIN as one of the first frameworks to treat peptide quantification uncertainty as a first-class input to the embedding process, rather than discarding uncertain measurements or imputing missing values with simple heuristics.

Key Features

Peptide uncertainty estimation: A multitask heteroscedastic regression model estimates both aleatoric and epistemic uncertainty for each peptide measurement, using that uncertainty to weight the aggregation of peptide intensities into protein-level abundance matrices.
Graph contrastive learning: Cell-cell similarity graphs are constructed from protein expression profiles and processed through a graph neural network with a contrastive learning objective, producing embeddings that capture complex intercellular relationships.
Unified preprocessing pipeline: Data denoising, missing value handling, and batch effect removal are handled within a single coherent framework, eliminating the need to chain separate preprocessing tools before downstream analysis.
Broad downstream task support: A single embedding space supports cell clustering, cell type annotation, batch correction, clinical stratification, and spatially resolved proteomics analysis without task-specific retraining.
Flexible input formats: The framework accepts both CSV and H5AD input formats, covering common outputs from mass spectrometry-based single-cell proteomics workflows.
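The uncertainty-weighted aggregation described above can be illustrated with inverse-variance weighting, a standard way to combine measurements of unequal reliability. This is a minimal sketch under that assumption, not scPROTEIN's actual aggregation rule. The function name `aggregate_protein` and the example values are hypothetical.

```python
from typing import Optional, Sequence


def aggregate_protein(
    intensities: Sequence[Optional[float]],
    variances: Sequence[Optional[float]],
) -> Optional[float]:
    """Inverse-variance weighted mean of peptide intensities (one protein, one cell).

    Missing peptides (None) are skipped. Returns None if no peptide was observed.
    Assumption: this weighting scheme is illustrative, not taken from scPROTEIN.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, var in zip(intensities, variances):
        if x is None or var is None:
            continue  # peptide not quantified in this cell
        if var <= 0:
            raise ValueError("variance must be positive")
        w = 1.0 / var  # low-variance (confident) peptides get larger weights
        weighted_sum += w * x
        weight_total += w
    if weight_total == 0.0:
        return None
    return weighted_sum / weight_total


# Hypothetical example: three peptides of one protein. The second is missing
# and the third has four times the variance of the first.
peptides = [10.0, None, 14.0]
uncertainty = [1.0, None, 4.0]
protein_abundance = aggregate_protein(peptides, uncertainty)
print(protein_abundance)
```

In this example the estimate lands closer to the low-variance peptide (10.0) than a plain average of the two observed values would, and the missing peptide is ignored rather than imputed.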

Technical Details

scPROTEIN is built around two neural network components applied sequentially. Stage 1 implements a multitask heteroscedastic regression architecture that models the distributional uncertainty of each peptide intensity measurement. Peptide-level data are provided as a matrix with one column per cell and one row per peptide; the model outputs protein-level abundance matrices alongside per-measurement uncertainty scores that guide the aggregation step. Stage 2 constructs a cell graph in which nodes are individual cells and edge weights reflect protein expression similarity. A graph neural network with a contrastive learning module then learns embeddings in which biologically similar cells cluster together while technical variation, including batch effects, is suppressed.
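The two ideas behind Stage 2, a cell similarity graph and a contrastive objective, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the scPROTEIN implementation. It assumes cosine similarity, a k-nearest-neighbor graph, and an InfoNCE-style loss, and it omits the graph neural network encoder entirely. The function names `knn_graph` and `info_nce`, the temperature value, and the toy data are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def knn_graph(cells: List[Vector], k: int) -> List[List[int]]:
    """For each cell, return the indices of its k most similar other cells."""
    neighbors = []
    for i, ci in enumerate(cells):
        scored = [(cosine(ci, cj), j) for j, cj in enumerate(cells) if j != i]
        scored.sort(key=lambda t: (-t[0], t[1]))  # most similar first, ties by index
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


def info_nce(view_a: List[Vector], view_b: List[Vector], temperature: float = 0.5) -> float:
    """InfoNCE-style loss: cell i in view A should match cell i in view B."""
    total = 0.0
    for i, za in enumerate(view_a):
        logits = [cosine(za, zb) / temperature for zb in view_b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # negative log-softmax of the positive pair
    return total / len(view_a)


# Toy protein profiles for four cells: cells 0 and 1 are alike, as are 2 and 3.
cells = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
graph = knn_graph(cells, k=1)

# Matching views should score a lower loss than mismatched (reversed) views.
aligned = info_nce(cells, cells)
shuffled = info_nce(cells, cells[::-1])
print(graph, aligned, shuffled)
```

In a full pipeline the two views would typically come from augmented versions of the graph passed through a shared encoder; here raw profiles stand in for embeddings so the loss behavior is easy to inspect.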

The model was evaluated across multiple single-cell proteomics datasets spanning diverse cell types, experimental protocols, and biological conditions, consistently outperforming prior methods on clustering accuracy and batch integration benchmarks. No parameter count is reported in the primary publication, as the model's capacity scales with dataset size and graph structure rather than a fixed architecture.

Applications

scPROTEIN is designed for researchers working with mass spectrometry-based single-cell proteomics platforms such as SCoPE-MS, nanoPOTS, or similar technologies. The framework enables accurate cell type identification and clustering in heterogeneous samples, integration of data across multiple experimental batches, and clinical analysis linking proteomic cell states to patient-level outcomes. The spatial proteomics extension allows researchers to analyze tissue sections by overlaying embedding-derived cell type assignments onto spatial coordinates, revealing tissue architecture and local cell-cell interactions. The uncertainty estimates produced in Stage 1 are particularly valuable in clinical contexts where unreliable predictions carry downstream consequences.

Impact

scPROTEIN represents a methodological advance for a field where robust computational tools have historically lagged behind sequencing-based single-cell methods. Its publication in Nature Methods signals peer recognition of the framework's contribution to the proteomics analysis toolkit. By framing peptide uncertainty as a quantitative signal rather than noise to be discarded, the work opens a direction for future methods that more faithfully propagate measurement confidence through the full analysis pipeline. A practical limitation is that the framework's performance depends on the quality and coverage of the mass spectrometry data; very sparse datasets with extreme dropout may still challenge even uncertainty-aware aggregation. The codebase is publicly available on GitHub, enabling adoption and extension by the community.

Citation

scPROTEIN: a versatile deep graph contrastive learning framework for single-cell proteomics embedding

Li, W., Yang, F., Wang, F. et al. scPROTEIN: a versatile deep graph contrastive learning framework for single-cell proteomics embedding. Nat Methods 21, 623–634 (2024).

DOI: 10.1038/s41592-024-02214-9

Recent citations

Papers that recently cited this model.

Gradient-Guided Graph Contrastive Learning for Mass Spectrometry-Based Proteomics Clustering.
Yan Liu, Tai-Yuan Xia, Guo Wei, et al.
Journal of Chemical Information and Modeling · Jul 2026 · Citations: 0

Exploring drug mechanisms through single-cell proteomics.
Yuzhi Sun, Renjie Liu, Jichong Mu, et al.
British Journal of Pharmacology · May 2026 · Citations: 0

Mass spectrometry-based single-cell proteomics technologies, trends, and biological insights.
Rui Hu, Christian Montes, J. Walley
TIBS - Trends in Biochemical Sciences, Regular ed. · May 2026 · Citations: 1

Top citations

The most-cited papers that cite this model.

Foundation models in bioinformatics
Fei Guo, Renchu Guan, Yaohang Li, et al.
National Science Review · Jan 2025 · Citations: 44

From Images to Genes: Radiogenomics Based on Artificial Intelligence to Achieve Non-Invasive Precision Medicine in Cancer Patients
Yusheng Guo, Tianxiang Li, Bingxin Gong, et al.
Advancement of science · Nov 2024 · Citations: 42

Protein Large Language Models: A Comprehensive Survey
Yijia Xiao, Wanjia Zhao, Junkai Zhang, et al.
Conference on Empirical Methods in Natural Language Processing · Feb 2025 · Citations: 38

Graph neural networks for single-cell omics data: a review of approaches and applications
Sijie Li, Heyang Hua, Shengquan Chen
Briefings Bioinform. · Mar 2025 · Citations: 36

Unbiasedly decoding the tumor microenvironment with single-cell multiomics analysis in pancreatic cancer
Yifan Fu, Jinxin Tao, Tao Liu, et al.
Molecular Cancer · Jul 2024 · Citations: 32

Citations

Total Citations: 33
Influential: 4
References: 68

GitHub

Stars: 55
Forks: 13
Open Issues: 1
Contributors: 1
Last Push: 1y ago
Language: Jupyter Notebook
License: Apache-2.0

Fields of citing research

Biology: 81%
Computer Science: 81%
Medicine: 78%
Chemistry: 16%
Engineering: 3%
Mathematics: 3%
Environmental Science: 3%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall score: 86 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 87
Model Openness Framework: Class III (Open Model)

Resources

GitHub Repository · Research Paper · Documentation
