Overview

Cell type annotation is a fundamental step in single-cell RNA sequencing (scRNA-seq) analysis, yet it remains challenging because of biological noise, batch effects across sequencing platforms, and cell populations that are not represented in reference atlases. scPML addresses these challenges with a pathway-based multi-view learning framework that incorporates biological prior knowledge, specifically gene signaling pathways, directly into the annotation process. Rather than treating all genes as an undifferentiated feature space, scPML groups genes according to established pathway databases, constructs a distinct cell-cell graph from each database, and learns from these complementary biological views simultaneously.

Published in Communications Biology in December 2023 by researchers at Shenzhen University and Northwestern Polytechnical University, scPML combines self-supervised graph autoencoders with multi-view latent subspace learning and a downstream classification module. This design enables the model to denoise single-cell expression data, capture high-order relationships between cells, and unify information across diverse biological views into a coherent low-dimensional representation. A key practical advantage is the model's ability to detect unknown cell types — cells whose labels are absent from the reference dataset — making it directly applicable to exploratory studies where the full cellular landscape is not yet characterized.

The framework draws on four major pathway databases (KEGG, Reactome, WikiPathways, and Yan), computing per-cell pathway activity scores using AUCell and constructing adjacency matrices via mutual nearest neighbors (MNN). This structured, biology-aware approach to graph construction distinguishes scPML from purely data-driven single-cell classification methods.
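
The mutual-nearest-neighbor idea can be illustrated with a minimal sketch. It assumes each cell is described by a short vector of pathway activity scores (the kind of output AUCell produces) and that neighbors are ranked by Euclidean distance. The function names, the toy scores, and the choice of k are hypothetical and are not taken from the scPML code base.

```python
import math

def knn_indices(scores, k):
    """For each cell, return the set of its k nearest other cells (Euclidean distance)."""
    n = len(scores)
    neighbors = []
    for i in range(n):
        dists = sorted((math.dist(scores[i], scores[j]), j) for j in range(n) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def mnn_adjacency(scores, k):
    """Binary symmetric adjacency: link i and j only if each is in the other's k-NN set."""
    n = len(scores)
    nbrs = knn_indices(scores, k)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in nbrs[i]:
            if i in nbrs[j]:  # the "mutual" condition
                adj[i][j] = adj[j][i] = 1
    return adj

# Toy pathway activity scores: 5 cells x 2 pathways (made-up values).
scores = [[0.10, 0.90], [0.12, 0.88], [0.80, 0.20], [0.82, 0.18], [0.45, 0.55]]
adj = mnn_adjacency(scores, k=1)
for row in adj:
    print(row)
```

With k=1, cells 0 and 1 link to each other, as do cells 2 and 3. Cell 4 stays isolated because its nearest neighbor does not choose it back, which is how the mutual condition filters out one-sided, potentially spurious edges.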

Key Features

Pathway-Guided Graph Construction: Gene expression data is summarized using curated pathway databases (KEGG, Reactome, WikiPathways, Yan), and a separate cell-cell graph is built for each database using AUCell scores and MNN-based adjacency matrices, so that each database contributes an independent biological view.
Multi-View Graph Autoencoder: Each pathway view is processed by a self-supervised graph convolutional network (GCN) autoencoder that learns denoised, low-dimensional cell representations incorporating hierarchical neighborhood information. Non-zero masking during training improves robustness to dropout noise common in scRNA-seq.
Latent Subspace Fusion: Representations from all pathway views are unified through a multi-view latent subspace learning step with combined reconstruction and classification losses, ensuring that the integrated embedding captures complementary information from each biological perspective.
Unknown Cell Type Detection: scPML incorporates a rejection mechanism that identifies cells with low confidence scores as "unknown," enabling robust annotation in datasets containing novel or rare cell populations absent from reference atlases.
Cross-Platform and Cross-Species Generalization: The model is designed for transfer scenarios, maintaining high accuracy when the reference and query datasets come from different sequencing platforms or even different species.
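
The rejection mechanism described above can be sketched as a confidence threshold on classifier output. This is an illustration under stated assumptions: the confidence score is taken to be the maximum softmax probability, and the threshold of 0.5, the label names, and the function names are hypothetical rather than values from scPML.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def annotate_with_rejection(logits, labels, threshold=0.5):
    """Return the top label, or "unknown" when the top probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "unknown"

labels = ["T cell", "B cell", "Monocyte"]
confident = annotate_with_rejection([4.0, 0.5, 0.2], labels)  # one class clearly dominates
ambiguous = annotate_with_rejection([1.0, 0.9, 1.1], labels)  # near-uniform probabilities
print(confident, ambiguous)
```

The threshold trades sensitivity to novel populations against coverage of known ones: raising it rejects more cells as "unknown", lowering it forces more cells into reference labels.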

Technical Details

scPML's architecture proceeds through four stages. First, pathway-based cell-cell graphs are constructed for each of the four supported pathway databases. Second, individual GCN autoencoders process each graph, with encoders aggregating local and high-order neighbor information and decoders reconstructing masked non-zero expression values — a self-supervised objective that forces the model to learn meaningful cell representations. Third, multi-view latent subspace learning fuses all per-view embeddings by jointly optimizing reconstruction and classification objectives. Fourth, a two-layer fully connected network with softmax assigns final cell type probabilities.
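
The self-supervised objective in the second stage can be made concrete with a small sketch. It hides a fraction of the non-zero expression values, runs one step of neighborhood aggregation over a symmetrically normalized adjacency matrix, and scores the result only at the hidden positions. The aggregation step stands in for a trained GCN encoder and decoder, which would also apply learned weights and nonlinearities. The masking fraction, the toy matrices, and all function names are assumptions made for illustration.

```python
import random

def mask_nonzero(expr, frac, seed=0):
    """Zero out a random fraction of non-zero entries; return masked copy and positions."""
    rng = random.Random(seed)
    nonzero = [(i, j) for i, row in enumerate(expr) for j, v in enumerate(row) if v != 0]
    n_mask = max(1, int(len(nonzero) * frac))
    masked_pos = rng.sample(nonzero, n_mask)
    masked = [row[:] for row in expr]
    for i, j in masked_pos:
        masked[i][j] = 0.0
    return masked, masked_pos

def normalized_adjacency(adj):
    """Symmetric normalization D^(-1/2) (A + I) D^(-1/2), as used in standard GCN layers."""
    n = len(adj)
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)] for i in range(n)]

def propagate(a_norm, x):
    """One aggregation step, A_norm @ X, with no learned weights (stand-in for a GCN layer)."""
    n, d = len(x), len(x[0])
    return [[sum(a_norm[i][k] * x[k][j] for k in range(n)) for j in range(d)] for i in range(n)]

def masked_mse(original, reconstructed, masked_pos):
    """Mean squared error evaluated only at the masked positions."""
    return sum((original[i][j] - reconstructed[i][j]) ** 2 for i, j in masked_pos) / len(masked_pos)

# Toy data: 3 cells x 3 genes; cells 0 and 1 are neighbors with identical profiles.
expr = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.0, 3.0, 0.0]]
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
masked, pos = mask_nonzero(expr, frac=0.2)
smoothed = propagate(normalized_adjacency(adj), masked)
loss = masked_mse(expr, smoothed, pos)
print(pos, loss)
```

Restricting the loss to masked non-zero entries means the model is rewarded for recovering real signal from neighboring cells rather than for reproducing zeros, many of which are dropout artifacts.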

On cross-platform benchmarks using 11 to 12 PBMC datasets spanning six sequencing technologies, scPML achieved a mean accuracy of 0.87, compared to 0.81 for Seurat and 0.78 for scGCN. In cross-species pancreatic cell annotation (mouse and human), scPML reached an average accuracy of 0.94 versus 0.88 for Seurat and 0.927 for scGCN. The unknown cell type detection benchmark — evaluated on four tumor datasets — produced a macro F1 of 0.807 for scPML, substantially outperforming CHETAH (0.587), scGCN (0.530), and scmap (0.282). Ablation experiments demonstrated that integrating four pathway views improved accuracy from 0.877 (single view) to 0.951 on cross-species tasks, confirming that multi-view fusion provides meaningful complementarity beyond any single pathway database.
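
For readers unfamiliar with the metric, macro F1 averages the per-class F1 scores with equal weight, so a rare class such as "unknown" counts as much as an abundant one. The sketch below shows the standard definition on made-up labels. Whether the benchmark above treated "unknown" exactly this way is an assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in truth or prediction."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Made-up labels for six cells, two of which are truly absent from the reference.
y_true = ["T", "T", "B", "unknown", "unknown", "B"]
y_pred = ["T", "B", "B", "unknown", "T", "B"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))  # 0.656
```

Here the per-class F1 values are 0.8 for B, 0.5 for T, and about 0.667 for "unknown", so a method that never predicts "unknown" would be penalized heavily even if its overall accuracy looked acceptable.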

Applications

scPML is suited for researchers performing cell type annotation in exploratory single-cell studies, particularly when reference datasets are drawn from different sequencing platforms, tissues, or organisms than the query data. Its unknown cell type detection capability makes it valuable for tumor microenvironment analysis, developmental biology, and any context where novel cell populations may be present. The framework can also serve as a pre-processing step upstream of trajectory inference, differential expression, or gene regulatory network analysis, providing high-confidence cell type labels that improve downstream interpretability.

Impact

scPML demonstrates that incorporating structured biological knowledge — in the form of gene pathway databases — into graph construction substantially improves single-cell classification over purely data-driven approaches. By outperforming established tools such as Seurat and scGCN across cross-platform, cross-species, and unknown cell type benchmarks, it establishes a compelling case for biology-informed multi-view learning in the single-cell field. The model's main acknowledged limitation is interpretability: as a neural network, its internal reasoning is not directly transparent, though the authors recommend downstream differential gene expression and enrichment analysis as practical workarounds. The code and pretrained models are publicly available on GitHub, enabling adoption and extension by the single-cell community.

Citation

scPML: pathway-based multi-view learning for cell type annotation from single-cell RNA-seq data
