Deep graph contrastive learning framework for single-cell proteomics embedding, handling peptide uncertainty, missingness, and batch effects.
scPROTEIN (single-cell PROTeomics EmbeddINg) is a deep learning framework developed by researchers at Nankai University and Tencent AI Lab Healthcare to address the distinctive analytical challenges of single-cell proteomics data. Published in Nature Methods in March 2024, the framework tackles four compounding problems that have long limited single-cell protein analysis: uncertainty in peptide quantification, pervasive missing values, high measurement noise, and batch effects arising from different experimental runs or platforms.
The framework operates in two stages. The first stage uses a multitask heteroscedastic regression model to aggregate peptide-level intensity measurements into protein-level abundance estimates, weighting each peptide's contribution according to its estimated measurement uncertainty. The second stage applies graph contrastive learning to the resulting protein abundance matrix: it constructs a cell-cell similarity graph and learns low-dimensional embeddings that are robust to noise and batch variation. Both stages are integrated into a single pipeline rather than requiring separate, manually chained preprocessing tools.
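To make the idea of uncertainty-weighted aggregation concrete, the following is a minimal sketch using plain inverse-variance weighting, a standard statistical technique swapped in here for illustration. It is not the scPROTEIN implementation: the function name aggregate_protein, the use of None for missing peptides, and the assumption that a model supplies one log-variance per peptide are all hypothetical.

```python
import math


def aggregate_protein(intensities, log_variances):
    """Combine the peptide intensities of one protein in one cell.

    intensities:   peptide-level values; None marks a missing measurement.
    log_variances: predicted log-variance per peptide (hypothetical model output).
    Returns (weighted mean, variance of that mean), or (None, None) when
    every peptide is missing.
    """
    weights, values = [], []
    for x, log_var in zip(intensities, log_variances):
        if x is None:
            continue  # skip missing peptides instead of imputing them
        weights.append(math.exp(-log_var))  # precision = 1 / variance
        values.append(x)
    if not weights:
        return None, None
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return mean, 1.0 / total


# Three peptides mapped to one protein: the second is noisy, the third missing.
peptides = [10.0, 14.0, None]
log_vars = [math.log(1.0), math.log(4.0), 0.0]
abundance, variance = aggregate_protein(peptides, log_vars)
print(abundance, variance)  # approximately 10.8 and 0.8
```

The noisy peptide (variance 4) pulls the estimate only slightly away from the confident one (variance 1), which is the behavior the uncertainty weighting is meant to provide. A plain average of the two observed values would give 12.0 instead.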
This approach positions scPROTEIN as one of the first frameworks to treat peptide quantification uncertainty as a first-class input to the embedding process, rather than discarding uncertain measurements or imputing missing values with simple heuristics.
scPROTEIN is built around two neural network components applied in sequence. Stage 1 implements a multitask heteroscedastic regression architecture that models the uncertainty of each peptide intensity measurement. Peptide-level data are provided as a matrix with one row per peptide and one column per cell; the model outputs a protein-level abundance matrix alongside per-measurement uncertainty scores that guide the aggregation step. Stage 2 constructs a cell graph in which nodes are individual cells and edge weights reflect similarity in protein expression. A graph neural network with a contrastive learning module then learns embeddings in which biologically similar cells cluster together while technical variation, including batch effects, is suppressed.
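The two ingredients of Stage 2, a similarity-based cell graph and a contrastive objective, can be sketched in a few lines. This is an illustrative toy under stated assumptions, not scPROTEIN's code: it uses cosine similarity with a k-nearest-neighbor rule for the graph and an InfoNCE-style loss for the contrastive term, it omits the graph neural network encoder entirely, and the names knn_graph and info_nce as well as the values of k and temperature are hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(cells, k):
    """Map each cell index to its k most similar cells as (index, similarity)."""
    graph = {}
    for i, u in enumerate(cells):
        sims = [(j, cosine(u, v)) for j, v in enumerate(cells) if j != i]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        graph[i] = sims[:k]  # edge weight = similarity
    return graph


def info_nce(view_a, view_b, temperature=0.5):
    """InfoNCE-style loss: row i of view_a should match row i of view_b."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy protein abundance profiles: cells 0 and 1 are alike, as are cells 2 and 3.
cells = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
graph = knn_graph(cells, k=1)

# The loss is low when paired views agree and higher when pairs are mismatched.
aligned = info_nce(cells, cells)
shuffled = info_nce(cells, [cells[2], cells[3], cells[0], cells[1]])
```

In a real graph contrastive setup the two views would be embeddings of differently augmented versions of the same graph produced by a trained encoder; here the raw profiles stand in for those embeddings so that the behavior of the loss is easy to inspect.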
The model was evaluated across multiple single-cell proteomics datasets spanning diverse cell types, experimental protocols, and biological conditions, consistently outperforming prior methods on clustering accuracy and batch integration benchmarks. No parameter count is reported in the primary publication; the model's size depends on the dimensions of the input data and the graph structure rather than being fixed in advance.
scPROTEIN is designed for researchers working with mass spectrometry-based single-cell proteomics platforms such as SCoPE-MS, nanoPOTS, or similar technologies. The framework enables accurate cell type identification and clustering in heterogeneous samples, integration of data across multiple experimental batches, and clinical analysis linking proteomic cell states to patient-level outcomes. The framework also extends to spatial proteomics, where researchers can analyze tissue sections by overlaying embedding-derived cell type assignments onto spatial coordinates, revealing tissue architecture and local cell-cell interactions. The uncertainty estimates produced in Stage 1 are particularly valuable in clinical contexts where unreliable predictions carry downstream consequences.
scPROTEIN represents a methodological advance for a field whose computational tools have historically lagged behind those available for sequencing-based single-cell methods. Its publication in Nature Methods signals peer recognition of the framework's contribution to the proteomics analysis toolkit. By treating peptide uncertainty as a quantitative signal rather than noise to be discarded, the work opens a direction for future methods that more faithfully propagate measurement confidence through the full analysis pipeline. A practical limitation is that the framework's performance depends on the quality and coverage of the mass spectrometry data; very sparse datasets with extreme dropout may still challenge even uncertainty-aware aggregation. The codebase is publicly available on GitHub, enabling adoption and extension by the community.
Li, W., Yang, F., Wang, F. et al. scPROTEIN: a versatile deep graph contrastive learning framework for single-cell proteomics embedding. Nat Methods 21, 623–634 (2024).
DOI: 10.1038/s41592-024-02214-9