Single-cell transformer that treats cells as tokens and tissues as sentences, encoding cell-cell relationships while delivering roughly 100x faster inference than prior pre-trained models.
CellPLM is a single-cell foundation model that inverts the design paradigm of prior transformer-based approaches. Earlier models such as scGPT and Geneformer treat individual genes as tokens and single cells as sentences, following the natural language convention directly. CellPLM instead treats cells as tokens and tissues as sentences, allowing the model to explicitly learn relationships between cells rather than only within them. The work was published as a conference paper at ICLR 2024 and developed at Michigan State University by Hongzhi Wen and colleagues in the OmicsML group.
The core motivation stems from three structural differences between single-cell RNA sequencing (scRNA-seq) data and natural language. First, gene expression profiles are unordered bags of measurements, not sequences — violating a key assumption of standard language modeling. Second, relationships between neighboring cells in a tissue are biologically meaningful in ways that inter-sentence relationships rarely are in text. Third, single-cell data is far scarcer and noisier than the text corpora used to train large language models. CellPLM addresses all three challenges through its architecture and training strategy.
To handle variable-length inputs and avoid the quadratic cost of full self-attention, CellPLM replaces the standard transformer encoder with Flowformer, a linear-complexity attention variant that sidesteps computing pairwise attention across all cells. The model is pre-trained using spatially-resolved transcriptomic data, which provides ground-truth co-localization of cells within tissues and enables the model to learn biologically grounded cell-cell relationships.
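To make the complexity argument concrete, the sketch below implements a generic kernelized linear-attention layer over a batch of cell tokens. It illustrates how the quadratic cell-by-cell attention matrix can be avoided; it is not Flowformer's exact flow-conservation formulation, and the tensor shapes, projections, and elu-based feature map are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearCellAttention(nn.Module):
    """Illustrative kernelized linear attention over cell tokens.

    A generic O(n) attention sketch, not Flowformer's Flow-Attention;
    it only shows how full pairwise attention across cells is avoided.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    @staticmethod
    def feature_map(x: torch.Tensor) -> torch.Tensor:
        # Positive feature map (elu + 1), a common choice in linear attention.
        return torch.nn.functional.elu(x) + 1.0

    def forward(self, cells: torch.Tensor) -> torch.Tensor:
        # cells: (batch, n_cells, dim) -- one "sentence" of cell tokens per tissue sample.
        q = self.feature_map(self.q_proj(cells))            # (B, N, D)
        k = self.feature_map(self.k_proj(cells))            # (B, N, D)
        v = self.v_proj(cells)                              # (B, N, D)

        # Summarize keys/values once: cost scales linearly in the number of cells.
        kv = torch.einsum("bnd,bne->bde", k, v)             # (B, D, D)
        k_sum = k.sum(dim=1)                                # (B, D)

        num = torch.einsum("bnd,bde->bne", q, kv)           # (B, N, D)
        den = torch.einsum("bnd,bd->bn", q, k_sum).clamp_min(1e-6)
        return num / den.unsqueeze(-1)
```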
CellPLM is an 85-million-parameter encoder-decoder model. The encoder uses Flowformer, a linear-complexity attention mechanism, to jointly process the set of cell tokens drawn from a tissue sample. The decoder reconstructs masked gene expression profiles conditioned on the representations of neighboring cells, a masked-modeling objective analogous to masked language modeling but operating at the cell level rather than the gene level.
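A minimal sketch of this cell-level masked modeling idea: corrupt the expression profiles of a random subset of cell tokens, encode all cells of the tissue jointly so unmasked neighbors provide context, and score reconstruction only on the masked cells. The function below is an illustrative assumption (masking ratio, zero-fill corruption, and MSE loss are stand-ins), not the released training code.

```python
import torch
import torch.nn as nn

def masked_cell_loss(encoder: nn.Module,
                     decoder: nn.Module,
                     expr: torch.Tensor,
                     mask_ratio: float = 0.25) -> torch.Tensor:
    """Cell-level masked reconstruction objective (illustrative sketch).

    expr: (batch, n_cells, n_genes) expression matrix for one tissue "sentence".
    encoder/decoder: any modules mapping (B, N, G) -> (B, N, H) -> (B, N, G).
    """
    batch, n_cells, _ = expr.shape

    # Mask a random fraction of cell tokens (their entire expression profiles).
    mask = torch.rand(batch, n_cells, device=expr.device) < mask_ratio   # (B, N)
    corrupted = expr.masked_fill(mask.unsqueeze(-1), 0.0)

    # Encode all cells together so neighboring cells supply context,
    # then reconstruct a full profile for every cell.
    latent = encoder(corrupted)          # (B, N, H)
    recon = decoder(latent)              # (B, N, G)

    # Score reconstruction only on the masked cells.
    mse = (recon - expr).pow(2).mean(dim=-1)                             # (B, N)
    return (mse * mask.float()).sum() / mask.float().sum().clamp_min(1.0)
```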
Pre-training uses spatially-resolved transcriptomic datasets in which the spatial positions of cells are known, allowing the model to group cells by tissue origin and learn from their co-occurrence context. The publicly released checkpoint is labeled 20230926_85M and is compatible with Python 3.9, PyTorch >= 1.13, and CUDA >= 11.7.
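A quick way to confirm a local environment matches that compatibility note, assuming only that PyTorch is installed (the expected values in the comments simply mirror the requirements above):

```python
import sys
import torch

print(f"Python : {sys.version.split()[0]}")   # expected: 3.9.x
print(f"PyTorch: {torch.__version__}")        # expected: >= 1.13
print(f"CUDA   : {torch.version.cuda}")       # expected: >= 11.7 (None for CPU-only builds)
print(f"GPU    : {torch.cuda.is_available()}")
```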
On cell-type annotation benchmarks, CellPLM achieves accuracy scores ranging from 0.902 to 0.983 across six standard datasets — PBMC12K (0.975), Pancreas (0.983), HLCA (0.929), Immune (0.902), Brain (0.967), and Liver (0.913) — consistently matching or exceeding scGPT, Geneformer, scDiff, scANVI, and CellTypist. Inference speed for generating cell embeddings is reported as approximately 100 times higher than that of prior pre-trained baselines.
CellPLM is well-suited for researchers working on large-scale single-cell atlas analysis where inference cost is a constraint. Its primary validated application is cell-type annotation across diverse tissue types, with demonstrated accuracy on blood, pancreatic, pulmonary, immune, neural, and hepatic cell populations. The cell-as-token design also makes it a natural fit for studying cell-cell interactions and tissue microenvironments — use cases where the spatial or organizational context of a cell matters as much as its individual gene expression profile. The model is installable via pip install cellplm and is positioned for downstream fine-tuning on tasks such as perturbation response prediction and disease-state classification.
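As a rough sketch of the intended embedding workflow, the snippet below loads and lightly preprocesses an AnnData object with scanpy and hands it to an embedding pipeline keyed to the released checkpoint label. The CellEmbeddingPipeline name, its import path, and its arguments are hypothetical placeholders rather than the package's confirmed interface; only the scanpy calls are standard, so consult the CellPLM repository for the actual API.

```python
# Hypothetical workflow sketch; class name, import path, and arguments are assumptions.
import scanpy as sc

adata = sc.read_h5ad("my_tissue.h5ad")        # cells x genes expression matrix
sc.pp.normalize_total(adata, target_sum=1e4)  # typical light preprocessing
sc.pp.log1p(adata)

from cellplm import CellEmbeddingPipeline     # hypothetical import path

pipeline = CellEmbeddingPipeline(pretrain_prefix="20230926_85M")  # released checkpoint label
embeddings = pipeline.predict(adata)          # one embedding vector per cell
print(embeddings.shape)                       # (n_cells, embedding_dim)
```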
CellPLM was accepted to ICLR 2024 and represents a conceptually distinct branch in the single-cell foundation model landscape, challenging the assumption that the gene-as-token paradigm is the most natural mapping from language modeling to transcriptomics. The 100x inference speedup over existing pre-trained models is a meaningful practical contribution for users working with millions of cells. The primary limitation is that spatial transcriptomics data remains less abundant than standard scRNA-seq data, which constrains the scale of pre-training relative to models like scGPT or Geneformer that can draw on larger unpaired datasets. Whether the cell-as-token paradigm generalizes robustly to tasks beyond annotation, such as gene regulatory inference or perturbation prediction, remains an open question that the field continues to investigate.
Wen, H., et al. (2023). CellPLM: Pre-training of Cell Language Model Beyond Single Cells. bioRxiv. DOI: 10.1101/2023.10.03.560734