Single-cell transformer that treats cells as tokens and tissues as sentences, encoding cell-cell relationships while delivering roughly 100x faster inference than prior pre-trained models.
CellPLM is a single-cell foundation model that inverts the design paradigm of prior transformer-based approaches. Earlier models such as scGPT and Geneformer treat individual genes as tokens and single cells as sentences, following the natural language convention directly. CellPLM instead treats cells as tokens and tissues as sentences, allowing the model to explicitly learn relationships between cells rather than only within them. The work was published as a conference paper at ICLR 2024 and developed at Michigan State University by Hongzhi Wen and colleagues in the OmicsML group.
The core motivation stems from three structural differences between single-cell RNA sequencing (scRNA-seq) data and natural language. First, gene expression profiles are unordered bags of measurements, not sequences — violating a key assumption of standard language modeling. Second, relationships between neighboring cells in a tissue are biologically meaningful in ways that inter-sentence relationships rarely are in text. Third, single-cell data is far scarcer and noisier than the text corpora used to train large language models. CellPLM addresses all three challenges through its architecture and training strategy.
To handle variable-length inputs and avoid the quadratic cost of full self-attention, CellPLM replaces the standard transformer encoder with Flowformer, a linear-complexity attention variant that sidesteps computing pairwise attention across all cells. The model is pre-trained using spatially-resolved transcriptomic data, which provides ground-truth co-localization of cells within tissues and enables the model to learn biologically grounded cell-cell relationships.
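To make the complexity argument concrete, the sketch below implements a generic kernelized linear-attention layer over a batch of cell tokens. It illustrates how the quadratic cell-by-cell attention matrix can be avoided; it is not Flowformer's exact flow-conservation formulation, and the tensor shapes, projections, and elu-based feature map are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearCellAttention(nn.Module):
    """Illustrative kernelized linear attention over cell tokens.

    A generic O(n) attention sketch, not Flowformer's Flow-Attention;
    it only shows how full pairwise attention across cells is avoided.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    @staticmethod
    def feature_map(x: torch.Tensor) -> torch.Tensor:
        # Positive feature map (elu + 1), a common choice in linear attention.
        return torch.nn.functional.elu(x) + 1.0

    def forward(self, cells: torch.Tensor) -> torch.Tensor:
        # cells: (batch, n_cells, dim) -- one "sentence" of cell tokens per tissue sample.
        q = self.feature_map(self.q_proj(cells))            # (B, N, D)
        k = self.feature_map(self.k_proj(cells))            # (B, N, D)
        v = self.v_proj(cells)                              # (B, N, D)

        # Summarize keys/values once: cost scales linearly in the number of cells.
        kv = torch.einsum("bnd,bne->bde", k, v)             # (B, D, D)
        k_sum = k.sum(dim=1)                                # (B, D)

        num = torch.einsum("bnd,bde->bne", q, kv)           # (B, N, D)
        den = torch.einsum("bnd,bd->bn", q, k_sum).clamp_min(1e-6)
        return num / den.unsqueeze(-1)
```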
CellPLM is an 85-million-parameter encoder-decoder model. The encoder uses Flowformer, a linear-complexity attention mechanism, to jointly process the set of cell tokens drawn from a tissue sample. The decoder reconstructs masked gene expression profiles conditioned on the representations of neighboring cells, a masked-modeling objective analogous to masked language modeling but operating at the cell level rather than the gene level.
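A minimal sketch of this cell-level masked modeling idea: corrupt the expression profiles of a random subset of cell tokens, encode all cells of the tissue jointly so unmasked neighbors provide context, and score reconstruction only on the masked cells. The function below is an illustrative assumption (masking ratio, zero-fill corruption, and MSE loss are stand-ins), not the released training code.

```python
import torch
import torch.nn as nn

def masked_cell_loss(encoder: nn.Module,
                     decoder: nn.Module,
                     expr: torch.Tensor,
                     mask_ratio: float = 0.25) -> torch.Tensor:
    """Cell-level masked reconstruction objective (illustrative sketch).

    expr: (batch, n_cells, n_genes) expression matrix for one tissue "sentence".
    encoder/decoder: any modules mapping (B, N, G) -> (B, N, H) -> (B, N, G).
    """
    batch, n_cells, _ = expr.shape

    # Mask a random fraction of cell tokens (their entire expression profiles).
    mask = torch.rand(batch, n_cells, device=expr.device) < mask_ratio   # (B, N)
    corrupted = expr.masked_fill(mask.unsqueeze(-1), 0.0)

    # Encode all cells together so neighboring cells supply context,
    # then reconstruct a full profile for every cell.
    latent = encoder(corrupted)          # (B, N, H)
    recon = decoder(latent)              # (B, N, G)

    # Score reconstruction only on the masked cells.
    mse = (recon - expr).pow(2).mean(dim=-1)                             # (B, N)
    return (mse * mask.float()).sum() / mask.float().sum().clamp_min(1.0)
```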
Pre-training uses spatially-resolved transcriptomic datasets in which the spatial positions of cells are known, allowing the model to group cells by tissue origin and learn from their co-occurrence context. The publicly released checkpoint is labeled 20230926_85M and is compatible with Python 3.9, PyTorch >= 1.13, and CUDA >= 11.7.
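A quick way to confirm a local environment matches that compatibility note, assuming only that PyTorch is installed (the expected values in the comments simply mirror the requirements above):

```python
import sys
import torch

print(f"Python : {sys.version.split()[0]}")   # expected: 3.9.x
print(f"PyTorch: {torch.__version__}")        # expected: >= 1.13
print(f"CUDA   : {torch.version.cuda}")       # expected: >= 11.7 (None for CPU-only builds)
print(f"GPU    : {torch.cuda.is_available()}")
```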
On cell-type annotation benchmarks, CellPLM achieves accuracy scores ranging from 0.902 to 0.983 across six standard datasets — PBMC12K (0.975), Pancreas (0.983), HLCA (0.929), Immune (0.902), Brain (0.967), and Liver (0.913) — consistently matching or exceeding scGPT, Geneformer, scDiff, scANVI, and CellTypist. Inference speed for generating cell embeddings is reported as approximately 100 times higher than that of prior pre-trained baselines.
CellPLM is well-suited for researchers working on large-scale single-cell atlas analysis where inference cost is a constraint. Its primary validated application is cell-type annotation across diverse tissue types, with demonstrated accuracy on blood, pancreatic, pulmonary, immune, neural, and hepatic cell populations. The cell-as-token design also makes it a natural fit for studying cell-cell interactions and tissue microenvironments — use cases where the spatial or organizational context of a cell matters as much as its individual gene expression profile. The model is installable via pip install cellplm and is positioned for downstream fine-tuning on tasks such as perturbation response prediction and disease-state classification.
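As a rough sketch of the intended embedding workflow, the snippet below loads and lightly preprocesses an AnnData object with scanpy and hands it to an embedding pipeline keyed to the released checkpoint label. The CellEmbeddingPipeline name, its import path, and its arguments are hypothetical placeholders rather than the package's confirmed interface; only the scanpy calls are standard, so consult the CellPLM repository for the actual API.

```python
# Hypothetical workflow sketch; class name, import path, and arguments are assumptions.
import scanpy as sc

adata = sc.read_h5ad("my_tissue.h5ad")        # cells x genes expression matrix
sc.pp.normalize_total(adata, target_sum=1e4)  # typical light preprocessing
sc.pp.log1p(adata)

from cellplm import CellEmbeddingPipeline     # hypothetical import path

pipeline = CellEmbeddingPipeline(pretrain_prefix="20230926_85M")  # released checkpoint label
embeddings = pipeline.predict(adata)          # one embedding vector per cell
print(embeddings.shape)                       # (n_cells, embedding_dim)
```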
CellPLM was accepted to ICLR 2024 and represents a conceptually distinct branch in the single-cell foundation model landscape, challenging the assumption that the gene-as-token paradigm is the most natural mapping from language modeling to transcriptomics. The 100x inference speedup over existing pre-trained models is a meaningful practical contribution for users working with millions of cells. The primary limitation is that spatial transcriptomics data remains less abundant than standard scRNA-seq data, which constrains the scale of pre-training relative to models like scGPT or Geneformer that can draw on larger unpaired datasets. Whether the cell-as-token paradigm generalizes robustly to tasks beyond annotation, such as gene regulatory inference or perturbation prediction, remains an open question that the field continues to investigate.
Wen, H., et al. (2023). CellPLM: Pre-training of Cell Language Model Beyond Single Cells. bioRxiv. DOI: 10.1101/2023.10.03.560734