A 368M-parameter generative language model for single-cell transcriptomics, enabling zero-shot cell type annotation, batch integration, and conditional cell generation.
scMulan is a multitask generative pre-trained language model developed at Tsinghua University for comprehensive single-cell transcriptomic analysis. Released in early 2024, the model addresses a fundamental challenge in the field: most existing single-cell tools are designed for individual tasks, requiring researchers to assemble and coordinate multiple specialized models to complete a typical analysis workflow. scMulan replaces this fragmented approach with a single, unified framework capable of performing cell type annotation, batch integration, and conditional cell generation within the same model.
The core innovation is a structured cell representation scheme called "cell sentences" (c-sentences). Rather than treating a cell's transcriptome as a flat vector of gene expression values, scMulan encodes each cell as a sequence of tuples, where each element pairs an entity (a gene, a metadata term, or a task specification) with its corresponding value. This representation allows the model to incorporate biological metadata — such as tissue of origin, experimental condition, and assay type — directly into the input, giving it the contextual grounding needed to generalize across diverse datasets and experimental settings.
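To make the c-sentence idea concrete, the sketch below serializes one cell as a sequence of (entity, value) tuples mixing a task token, metadata fields, and expressed genes. The entity names, the task string, and the top-k gene truncation are illustrative assumptions, not scMulan's actual vocabulary or serialization rules.

```python
def make_c_sentence(task, metadata, expression, top_k=3):
    """Serialize one cell as a list of (entity, value) tuples.

    task: task-specification string prepended as a prompt
    metadata: dict of metadata fields (tissue, condition, assay, ...)
    expression: dict mapping gene symbols to (binned) expression values
    top_k: keep only the k most highly expressed genes in this toy example
    """
    # Task token first, so the same model can be steered per cell.
    sentence = [("<task>", task)]
    # Metadata entities give the model its biological context.
    for field, value in metadata.items():
        sentence.append((field, value))
    # Order genes by expression, highest first - one simple serialization choice.
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    sentence.extend(ranked[:top_k])
    return sentence

cell = make_c_sentence(
    task="cell_type_annotation",
    metadata={"tissue": "liver", "assay": "10x 3' v3"},
    expression={"ALB": 9, "APOA1": 7, "CD3E": 0, "TTR": 5},
)
# cell holds the task tuple, two metadata tuples, then (ALB, 9), (APOA1, 7), (TTR, 5)
```

The key design point this illustrates is that metadata and genes live in one flat token sequence, so a standard autoregressive transformer can attend over both without architectural changes.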
Trained on 10 million single-cell transcriptomic profiles with associated metadata, scMulan's 368 million parameters capture both fine-grained gene-level regulatory relationships and broader tissue-level patterns. Task-specific behavior is controlled through natural-language-style task prompts, enabling zero-shot inference without any additional fine-tuning.
scMulan is a generative transformer language model with 368 million parameters, pre-trained on 10 million human single-cell RNA-seq profiles sourced from publicly available atlases spanning multiple tissues and experimental protocols. Input cells are formatted as c-sentences: structured sequences of (entity, value) tuples encoding gene expression levels, associated metadata fields (tissue, condition, donor), and the desired downstream task. This unified tokenization scheme allows the same forward pass to serve annotation, generation, and integration objectives, with task selection controlled by a prompt prepended to each c-sentence.
Pre-training employed a multi-task objective designed to simultaneously learn gene expression patterns, metadata associations, and generative dynamics. The result is a model that can produce coherent cell-state representations in a shared embedding space suitable for downstream tasks including UMAP visualization, nearest-neighbor classification, and batch-corrected integration. For zero-shot annotation across the seven supported organs, the model leverages task prompts that specify the target tissue and annotation granularity, then generates cell type labels autoregressively from the learned distribution.
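The prompt-controlled behavior described above can be caricatured in a few lines: the task and tissue tokens form a prefix, and the model's next-token distribution yields the label. Here a lookup table stands in for the trained model; scMulan's real inference API and token format will differ, so treat every name below as a placeholder.

```python
def mock_next_token(prefix):
    # Stand-in for the language model: return the most likely next token
    # given a c-sentence prefix. A lookup table plays the model's role here.
    table = {
        ("<task>=annotate", "tissue=liver"): "hepatocyte",
        ("<task>=annotate", "tissue=heart"): "cardiomyocyte",
    }
    return table.get(prefix, "<unk>")

def annotate(tissue):
    # Build the prompt prefix that specifies the task and target tissue,
    # then "decode" the cell type label from the (mocked) model.
    prompt = ("<task>=annotate", f"tissue={tissue}")
    return mock_next_token(prompt)

label = annotate("liver")  # "hepatocyte" under this toy lookup
```

What matters is the control flow, not the table: changing only the prompt prefix re-purposes the same forward pass, which is why no fine-tuning is needed for the zero-shot setting.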
scMulan is intended for researchers working with human single-cell RNA-seq data who need to annotate cell types, integrate datasets from different experimental batches, or generate synthetic transcriptomic profiles for data augmentation or in silico perturbation studies. Its zero-shot annotation capability is particularly useful for atlas-scale projects where manually curated reference labels are unavailable or incomplete. The batch integration function allows harmonization of data across studies with differing library preparation protocols, sequencing platforms, or donor backgrounds. Conditional cell generation opens avenues for simulating cellular responses to perturbations or for balancing underrepresented cell populations in training datasets for downstream classifiers.
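As a rough intuition for conditional generation, the toy below draws a synthetic expression profile around fixed per-condition means selected by (tissue, cell type) metadata. The real model instead decodes (gene, value) tuples autoregressively from its learned distribution; the gene panel, means, and noise scale here are invented for illustration.

```python
import random

# Hypothetical per-condition expression means; not real scMulan parameters.
PROFILE_MEANS = {
    ("liver", "hepatocyte"): {"ALB": 8.0, "APOA1": 6.0, "CD3E": 0.1},
    ("blood", "T cell"): {"ALB": 0.1, "APOA1": 0.2, "CD3E": 7.5},
}

def generate_cell(tissue, cell_type, seed=0):
    """Sample one synthetic profile conditioned on metadata tokens."""
    rng = random.Random(seed)
    means = PROFILE_MEANS[(tissue, cell_type)]
    # Small Gaussian jitter around each conditional mean, clipped at zero
    # since expression values cannot be negative.
    return {g: max(0.0, m + rng.gauss(0, 0.5)) for g, m in means.items()}

synthetic = generate_cell("liver", "hepatocyte")
```

Balancing an underrepresented population then amounts to calling the generator with that population's metadata until the class counts even out, which is the data-augmentation use case described above.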
scMulan represents an important step toward truly general-purpose foundation models for single-cell biology, demonstrating that a single generative model can unify tasks that have historically required separate tools. By framing single-cell analysis as a language modeling problem over structured cell sentences, the work establishes a principled architecture for incorporating heterogeneous metadata into transcriptomic models. At the time of release, the approach was novel in its simultaneous support for discriminative (annotation), integrative (batch correction), and generative tasks within one pre-trained checkpoint. The model's current scope is limited to seven human organs and does not yet cover non-human species or non-transcriptomic modalities such as chromatin accessibility or protein expression, areas that represent natural extensions for future work.
Bian, H., Chen, Y., Dong, X., Li, C., Hao, M., Chen, S., Hu, J., Sun, M., Wei, L., & Zhang, X. (2024). scMulan: a multitask generative pre-trained language model for single-cell analysis. bioRxiv, 2024.01.25.577152.
DOI: 10.1101/2024.01.25.577152