bio.rodeo

Single-cell

scPROTEIN

TencentAILabHealthcare

Deep graph contrastive learning framework for single-cell proteomics embedding, handling peptide uncertainty, missingness, and batch effects.

Released: 2024

Overview

scPROTEIN (single-cell PROTeomics EmbeddINg) is a deep learning framework developed by researchers at Nankai University and Tencent AI Lab Healthcare to address the distinctive analytical challenges of single-cell proteomics data. Published in Nature Methods in March 2024, the framework tackles four compounding problems that have long limited single-cell protein analysis: uncertainty in peptide quantification, pervasive missing values, high measurement noise, and batch effects arising from different experimental runs or platforms.

The framework operates through a two-stage design. The first stage uses a multitask heteroscedastic regression model to aggregate peptide-level intensity measurements into protein-level abundance estimates, weighting each peptide's contribution according to its estimated measurement uncertainty. The second stage takes those protein abundance matrices and applies graph contrastive learning: it constructs cell-cell similarity graphs and learns low-dimensional embeddings that are robust to noise and batch variation. Both stages are integrated into a unified pipeline rather than requiring separate, manually chained preprocessing tools.
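The first stage can be illustrated with a minimal sketch. It assumes a Gaussian heteroscedastic loss and inverse-variance weighting, which are standard choices and not necessarily the exact formulation used by scPROTEIN. The function names `gaussian_nll` and `aggregate_protein` and all numbers are hypothetical.

```python
import math

def gaussian_nll(y, mu, log_var):
    """Gaussian negative log-likelihood of y given a predicted mean and
    log-variance, with the constant 0.5*log(2*pi) term dropped."""
    return 0.5 * (log_var + (y - mu) ** 2 / math.exp(log_var))

def aggregate_protein(intensities, variances):
    """Inverse-variance weighted mean of the observed peptide intensities.
    Missing peptides are passed as None and skipped; returns None if all
    peptides are missing. (Assumed weighting scheme, for illustration.)"""
    pairs = [(x, v) for x, v in zip(intensities, variances) if x is not None]
    if not pairs:
        return None
    weights = [1.0 / v for _, v in pairs]
    return sum(w * x for w, (x, _) in zip(weights, pairs)) / sum(weights)

# A large residual is penalised less when the model also predicts high variance.
confident_loss = gaussian_nll(y=5.0, mu=1.0, log_var=0.0)
uncertain_loss = gaussian_nll(y=5.0, mu=1.0, log_var=math.log(16.0))

# Hypothetical protein with three peptides in one cell; the third is noisy.
intensities = [10.0, 12.0, 30.0]
variances = [1.0, 1.0, 100.0]
weighted = aggregate_protein(intensities, variances)
plain_mean = sum(intensities) / len(intensities)
print(round(weighted, 2), round(plain_mean, 2))  # prints 11.09 17.33
```

In this toy case the noisy third peptide barely moves the weighted estimate, whereas the unweighted mean is pulled strongly toward it.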

This approach positions scPROTEIN as one of the first frameworks to treat peptide quantification uncertainty as a first-class input to the embedding process, rather than discarding uncertain measurements or imputing missing values with simple heuristics.

Key Features

  • Peptide uncertainty estimation: A multitask heteroscedastic regression model estimates the uncertainty of each peptide measurement and uses it to weight the aggregation of peptide intensities into protein-level abundance matrices.
  • Graph contrastive learning: Cell-cell similarity graphs are constructed from protein expression profiles and processed through a graph neural network with a contrastive learning objective, producing embeddings that capture complex intercellular relationships.
  • Unified preprocessing pipeline: Data denoising, missing value handling, and batch effect removal are handled within a single coherent framework, eliminating the need to chain separate preprocessing tools before downstream analysis.
  • Broad downstream task support: A single embedding space supports cell clustering, cell type annotation, batch correction, clinical stratification, and spatially resolved proteomics analysis without task-specific retraining.
  • Flexible input formats: The framework accepts both CSV and H5AD input formats, covering common outputs from mass spectrometry-based single-cell proteomics workflows.
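The contrastive objective behind the graph contrastive learning feature can be sketched as follows. This assumes an InfoNCE-style loss over two views of the same cells; the list above does not name the exact objective, so treat this as an illustration. `info_nce`, `temperature`, and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss where view_a[i] and view_b[i] embed the same cell.
    Every other row of view_b acts as a negative for anchor i."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Two toy cells, each embedded under two slightly different views.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(view_a, view_b)         # matching views agree
shuffled_loss = info_nce(view_a, view_b[::-1])  # views paired with the wrong cell
```

The loss is low when the two views of each cell agree and rises when the pairing is scrambled, which is the signal that pulls embeddings of the same cell together during training.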

Technical Details

scPROTEIN is built around two neural network components applied sequentially. Stage 1 implements a multitask heteroscedastic regression architecture that models the distributional uncertainty of each peptide intensity measurement. Peptide-level data are provided as a matrix with one row per peptide and one column per cell; the model outputs protein-level abundance matrices alongside per-measurement uncertainty scores that guide the aggregation step. Stage 2 constructs a cell graph in which nodes are individual cells and edge weights reflect protein expression similarity. A graph neural network with a contrastive learning module then learns embeddings in which biologically similar cells cluster together while technical variation, including batch effects, is suppressed.
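The Stage 2 graph construction can be sketched as a k-nearest-neighbour graph over protein profiles. This is a minimal sketch that assumes cosine similarity and a fixed `k`; the description above does not specify the similarity measure or neighbourhood size, and `knn_graph` and the toy profiles are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_graph(profiles, k):
    """Map each cell index to its k most similar cells as
    (neighbour_index, similarity) pairs, most similar first."""
    graph = {}
    for i, profile in enumerate(profiles):
        sims = [(j, cosine(profile, other))
                for j, other in enumerate(profiles) if j != i]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        graph[i] = sims[:k]
    return graph

# Four toy cells measured on three proteins: two cells of each of two types.
profiles = [
    [5.0, 1.0, 0.0],
    [4.8, 1.2, 0.1],
    [0.1, 0.9, 5.2],
    [0.0, 1.1, 4.9],
]
graph = knn_graph(profiles, k=1)
nearest = [graph[i][0][0] for i in range(len(profiles))]
```

Here each cell's nearest neighbour is the other cell of the same toy type, and the stored similarities can serve as edge weights for a downstream graph neural network.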

The model was evaluated across multiple single-cell proteomics datasets spanning diverse cell types, experimental protocols, and biological conditions, and the authors report that it outperforms prior methods on clustering accuracy and batch integration benchmarks. No parameter count is reported in the primary publication. A fixed figure would also be of limited use, since the size of such a network likely varies with each dataset (for example, the input width of a graph encoder follows the number of proteins measured).
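Clustering accuracy of this kind is often scored with the adjusted Rand index (ARI), which compares predicted clusters with known cell type labels. The passage above does not say which metrics were used, so ARI is an assumption here, and `adjusted_rand_index` is a hypothetical helper written with the standard library only.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.
    1.0 means identical partitions; values near 0 mean chance-level agreement."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Count pairs of cells that share a cluster in both, in truth, and in prediction.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0
    return (sum_both - expected) / (max_index - expected)

cell_types = ["T", "T", "B", "B"]
perfect = adjusted_rand_index(cell_types, [1, 1, 0, 0])    # same partition, renamed
scrambled = adjusted_rand_index(cell_types, [0, 1, 0, 1])  # splits every cell type
```

ARI ignores the names of the clusters, so a perfect clustering scores 1.0 regardless of how clusters are numbered, while a clustering that splits every cell type scores below zero.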

Applications

scPROTEIN is designed for researchers working with mass spectrometry-based single-cell proteomics platforms such as SCoPE-MS, nanoPOTS, or similar technologies. The framework enables accurate cell type identification and clustering in heterogeneous samples, integration of data across multiple experimental batches, and clinical analysis linking proteomic cell states to patient-level outcomes. The spatial proteomics extension allows researchers to analyze tissue sections by overlaying embedding-derived cell type assignments onto spatial coordinates, revealing tissue architecture and local cell-cell interactions. The uncertainty estimates produced in Stage 1 are particularly valuable in clinical contexts where unreliable predictions carry downstream consequences.

Impact

scPROTEIN represents a methodological advance for a field where robust computational tools have historically lagged behind sequencing-based single-cell methods. Its publication in Nature Methods signals peer recognition of the framework's contribution to the proteomics analysis toolkit. By framing peptide uncertainty as a quantitative signal rather than noise to be discarded, the work opens a direction for future methods that more faithfully propagate measurement confidence through the full analysis pipeline. A practical limitation is that the framework's performance depends on the quality and coverage of the mass spectrometry data; very sparse datasets with extreme dropout may still challenge even uncertainty-aware aggregation. The codebase is publicly available on GitHub, enabling adoption and extension by the community.

Citation

scPROTEIN: a versatile deep graph contrastive learning framework for single-cell proteomics embedding

Li, W., Yang, F., Wang, F. et al. scPROTEIN: a versatile deep graph contrastive learning framework for single-cell proteomics embedding. Nat Methods 21, 623–634 (2024).

DOI: 10.1038/s41592-024-02214-9

Metrics

GitHub

Stars: 56
Forks: 13
Open Issues: 1
Contributors: 1
Last Push: 1y ago
Language: Jupyter Notebook
License: Apache-2.0

Citations

Total Citations: 29
Influential: 3
References: 68

Tags

foundation model

Resources

GitHub Repository
Research Paper