A 368M-parameter generative language model for single-cell transcriptomics, enabling zero-shot cell type annotation, batch integration, and conditional cell generation.
scMulan is a multitask generative pre-trained language model developed at Tsinghua University for comprehensive single-cell transcriptomic analysis. Released in early 2024, the model addresses a fundamental challenge in the field: most existing single-cell tools are designed for individual tasks, requiring researchers to assemble and coordinate multiple specialized models to complete a typical analysis workflow. scMulan replaces this fragmented approach with a single, unified framework capable of performing cell type annotation, batch integration, and conditional cell generation within the same model.
The core innovation is a structured cell representation scheme called "cell sentences" (c-sentences). Rather than treating a cell's transcriptome as a flat vector of gene expression values, scMulan encodes each cell as a sequence of tuples, where each element pairs an entity (a gene, a metadata term, or a task specification) with its corresponding value. This representation allows the model to incorporate biological metadata — such as tissue of origin, experimental condition, and assay type — directly into the input, giving it the contextual grounding needed to generalize across diverse datasets and experimental settings.
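To make the c-sentence idea concrete, the sketch below serializes one cell as a sequence of (entity, value) tuples mixing a task token, metadata fields, and expressed genes. The entity names, the task string, and the top-k gene truncation are illustrative assumptions, not scMulan's actual vocabulary or serialization rules.

```python
def make_c_sentence(task, metadata, expression, top_k=3):
    """Serialize one cell as a list of (entity, value) tuples.

    task: task-specification string prepended as a prompt
    metadata: dict of metadata fields (tissue, condition, assay, ...)
    expression: dict mapping gene symbols to (binned) expression values
    top_k: keep only the k most highly expressed genes in this toy example
    """
    # Task token first, so the same model can be steered per cell.
    sentence = [("<task>", task)]
    # Metadata entities give the model its biological context.
    for field, value in metadata.items():
        sentence.append((field, value))
    # Order genes by expression, highest first - one simple serialization choice.
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    sentence.extend(ranked[:top_k])
    return sentence

cell = make_c_sentence(
    task="cell_type_annotation",
    metadata={"tissue": "liver", "assay": "10x 3' v3"},
    expression={"ALB": 9, "APOA1": 7, "CD3E": 0, "TTR": 5},
)
# cell holds the task tuple, two metadata tuples, then (ALB, 9), (APOA1, 7), (TTR, 5)
```

The key design point this illustrates is that metadata and genes live in one flat token sequence, so a standard autoregressive transformer can attend over both without architectural changes.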
Trained on 10 million single-cell transcriptomic profiles with associated metadata, scMulan's 368 million parameters capture both fine-grained gene-level regulatory relationships and broader tissue-level patterns. Task-specific behavior is controlled through natural-language-style task prompts, enabling zero-shot inference without any additional fine-tuning.
scMulan is a generative transformer language model with 368 million parameters, pre-trained on 10 million human single-cell RNA-seq profiles sourced from publicly available atlases spanning multiple tissues and experimental protocols. Input cells are formatted as c-sentences: structured sequences of (entity, value) tuples encoding gene expression levels, associated metadata fields (tissue, condition, donor), and the desired downstream task. This unified tokenization scheme allows the same forward pass to serve annotation, generation, and integration objectives, with task selection controlled by a prompt prepended to each c-sentence.
Pre-training employed a multi-task objective designed to simultaneously learn gene expression patterns, metadata associations, and generative dynamics. The result is a model that can produce coherent cell-state representations in a shared embedding space suitable for downstream tasks including UMAP visualization, nearest-neighbor classification, and batch-corrected integration. For zero-shot annotation across the seven supported organs, the model leverages task prompts that specify the target tissue and annotation granularity, then generates cell type labels autoregressively from the learned distribution.
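The prompt-controlled behavior described above can be caricatured in a few lines: the task and tissue tokens form a prefix, and the model's next-token distribution yields the label. Here a lookup table stands in for the trained model; scMulan's real inference API and token format will differ, so treat every name below as a placeholder.

```python
def mock_next_token(prefix):
    # Stand-in for the language model: return the most likely next token
    # given a c-sentence prefix. A lookup table plays the model's role here.
    table = {
        ("<task>=annotate", "tissue=liver"): "hepatocyte",
        ("<task>=annotate", "tissue=heart"): "cardiomyocyte",
    }
    return table.get(prefix, "<unk>")

def annotate(tissue):
    # Build the prompt prefix that specifies the task and target tissue,
    # then "decode" the cell type label from the (mocked) model.
    prompt = ("<task>=annotate", f"tissue={tissue}")
    return mock_next_token(prompt)

label = annotate("liver")  # "hepatocyte" under this toy lookup
```

What matters is the control flow, not the table: changing only the prompt prefix re-purposes the same forward pass, which is why no fine-tuning is needed for the zero-shot setting.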
scMulan is intended for researchers working with human single-cell RNA-seq data who need to annotate cell types, integrate datasets from different experimental batches, or generate synthetic transcriptomic profiles for data augmentation or in silico perturbation studies. Its zero-shot annotation capability is particularly useful for atlas-scale projects where manually curated reference labels are unavailable or incomplete. The batch integration function allows harmonization of data across studies with differing library preparation protocols, sequencing platforms, or donor backgrounds. Conditional cell generation opens avenues for simulating cellular responses to perturbations or for balancing underrepresented cell populations in training datasets for downstream classifiers.
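As a rough intuition for conditional generation, the toy below draws a synthetic expression profile around fixed per-condition means selected by (tissue, cell type) metadata. The real model instead decodes (gene, value) tuples autoregressively from its learned distribution; the gene panel, means, and noise scale here are invented for illustration.

```python
import random

# Hypothetical per-condition expression means; not real scMulan parameters.
PROFILE_MEANS = {
    ("liver", "hepatocyte"): {"ALB": 8.0, "APOA1": 6.0, "CD3E": 0.1},
    ("blood", "T cell"): {"ALB": 0.1, "APOA1": 0.2, "CD3E": 7.5},
}

def generate_cell(tissue, cell_type, seed=0):
    """Sample one synthetic profile conditioned on metadata tokens."""
    rng = random.Random(seed)
    means = PROFILE_MEANS[(tissue, cell_type)]
    # Small Gaussian jitter around each conditional mean, clipped at zero
    # since expression values cannot be negative.
    return {g: max(0.0, m + rng.gauss(0, 0.5)) for g, m in means.items()}

synthetic = generate_cell("liver", "hepatocyte")
```

Balancing an underrepresented population then amounts to calling the generator with that population's metadata until the class counts even out, which is the data-augmentation use case described above.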
scMulan represents an important step toward truly general-purpose foundation models for single-cell biology, demonstrating that a single generative model can unify tasks that have historically required separate tools. By framing single-cell analysis as a language modeling problem over structured cell sentences, the work establishes a principled architecture for incorporating heterogeneous metadata into transcriptomic models. At the time of release, the approach was novel in its simultaneous support for discriminative (annotation), integrative (batch correction), and generative tasks within one pre-trained checkpoint. The model's current scope is limited to seven human organs and does not yet cover non-human species or non-transcriptomic modalities such as chromatin accessibility or protein expression, areas that represent natural extensions for future work.
Bian, H., Chen, Y., Dong, X., Li, C., Hao, M., Chen, S., Hu, J., Sun, M., Wei, L., & Zhang, X. (2024). scMulan: a multitask generative pre-trained language model for single-cell analysis. bioRxiv, 2024.01.25.577152.
DOI: 10.1101/2024.01.25.577152