Pretrained transformer for cell type annotation of scRNA-seq data. Trained on 1.1M cells; outperforms supervised methods on cross-dataset transfer.
scBERT (single-cell Bidirectional Encoder Representations from Transformers) is a large-scale pretrained language model developed by Tencent AI Lab for automated cell type annotation of single-cell RNA sequencing (scRNA-seq) data. The model applies the pretraining-then-fine-tuning paradigm, pioneered by BERT in natural language processing, to the challenge of identifying cell types from transcriptomic profiles — a task that conventionally requires labor-intensive manual curation by domain experts. Published in Nature Machine Intelligence in September 2022, scBERT demonstrated that transferring representations learned from millions of unlabelled cells to annotated downstream datasets yields consistent performance gains over task-specific supervised methods.
The central challenge scBERT addresses is the combinatorial complexity of gene expression space. The human genome encodes roughly 20,000 protein-coding genes, and raw scRNA-seq data is high-dimensional, sparse, and affected by technical noise and batch effects across experiments. Classical annotation pipelines reduce this complexity through dimensionality reduction and highly variable gene (HVG) selection, which discards potentially informative signal. scBERT sidesteps these preprocessing steps by modeling gene expression across the full transcriptome, using a Performer-based encoder capable of handling inputs of up to 20,000 gene tokens while preserving gene-level interpretability throughout.
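For context, here is a minimal sketch of the conventional HVG step that scBERT avoids, written with the scanpy toolkit; the input file name and the `n_top_genes` value are illustrative choices, not details from the paper:

```python
import scanpy as sc

# Load a scRNA-seq count matrix (cells x genes) in AnnData format.
adata = sc.read_h5ad("pbmc_counts.h5ad")  # hypothetical input file

# Classical pipeline: normalize, log-transform, then keep only a few
# thousand highly variable genes (HVGs), discarding the rest.
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var["highly_variable"]]

# scBERT instead keeps the full transcriptome (~20,000 gene tokens)
# and lets the encoder attend over all genes.
```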
Rather than treating each gene's expression value as an isolated scalar, scBERT decomposes the input into two complementary representations: a gene identity embedding derived from gene2vec, which encodes co-expression patterns across the genome, and a discretized expression embedding obtained by binning continuous expression levels into a small set of categories. This dual representation captures both what a gene is and how actively it is expressed, allowing the model to learn context-aware representations of cellular transcriptional states during self-supervised pretraining.
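A minimal PyTorch sketch of this dual embedding, assuming gene2vec vectors have been precomputed into a `(num_genes, 200)` float matrix; the module and its names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class ScBERTInputEmbedding(nn.Module):
    """Sum of a fixed gene-identity embedding (from gene2vec) and a
    learned embedding of the binned expression level."""

    def __init__(self, gene2vec_matrix: torch.Tensor, num_bins: int = 7, dim: int = 200):
        super().__init__()
        # Gene identity: one 200-d gene2vec vector per gene, kept frozen.
        self.gene_embed = nn.Embedding.from_pretrained(gene2vec_matrix, freeze=True)
        # Expression: one learned vector per discrete bin (+1 reserved for a mask token).
        self.expr_embed = nn.Embedding(num_bins + 1, dim)

    def forward(self, gene_ids: torch.Tensor, expr_bins: torch.Tensor) -> torch.Tensor:
        # gene_ids, expr_bins: (batch, num_genes) integer tensors.
        return self.gene_embed(gene_ids) + self.expr_embed(expr_bins)
```

The elementwise sum mirrors how BERT combines token and position embeddings: the frozen gene2vec vector plays the role of a position-like identity signal, while the expression-bin embedding carries the cell-specific measurement.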
scBERT's encoder is built on the Performer architecture, which replaces the quadratic-complexity softmax attention of standard transformers with a linear-complexity kernel approximation (FAVOR+). This is critical for scRNA-seq inputs, where sequence lengths correspond to the number of genes (up to 20,000), making standard attention computationally prohibitive. The model stacks 6 Performer encoder layers with 10 attention heads per layer, operating on 200-dimensional embeddings. Expression values are discretized into 7 bins prior to embedding, converting a continuous sparse matrix into a sequence of categorical tokens that the model processes analogously to word tokens in NLP.
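A sketch of the encoder at the stated hyperparameters, assuming the open-source `performer-pytorch` package as a stand-in for the paper's FAVOR+ attention; the bin edges and the `dim_head` value are illustrative assumptions:

```python
import torch
from performer_pytorch import Performer  # FAVOR+ linear-complexity attention

def bin_expression(expr: torch.Tensor, num_bins: int = 7) -> torch.Tensor:
    """Discretize continuous expression into num_bins categorical tokens.
    num_bins - 1 interior edges => integer bin ids in [0, num_bins - 1]."""
    edges = torch.linspace(0.0, expr.max().item(), steps=num_bins + 1)[1:-1]
    return torch.bucketize(expr, edges)

# Encoder at the paper's stated size: 6 layers, 10 heads, 200-d embeddings.
encoder = Performer(dim=200, depth=6, heads=10, dim_head=20)

# Linear attention keeps full-transcriptome sequence lengths tractable.
tokens = torch.randn(1, 20000, 200)   # (batch, gene tokens, embedding dim)
hidden = encoder(tokens)              # same shape, O(n) attention cost
```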
For pretraining, the model was trained on 1,126,580 cells from PanglaoDB using a masked expression modeling objective: a fraction of gene expression values are masked, and the model learns to reconstruct them from the unmasked context. For fine-tuning on cell type annotation, a classification head is appended and trained on labelled datasets. On the Zheng68K PBMC benchmark, scBERT achieved an accuracy of 0.759 and a macro F1 score of 0.691, compared with 0.704 and 0.659 for the best competing method. On cross-cohort pancreas annotation, accuracy reached 0.992, well above scNym (0.904) and marginally ahead of Seurat (0.984). For novel cell type detection, scBERT achieved an accuracy of 0.329 versus 0.174 for competing methods such as SciBet and scmap, nearly doubling performance on this challenging task.
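A minimal sketch of the masked expression objective, reusing the illustrative embedder and encoder from the sketches above; the 15% masking rate and the weight-tied output head are assumptions for exposition, not details from the paper:

```python
import torch
import torch.nn.functional as F

NUM_BINS, MASK_ID, MASK_RATE = 7, 7, 0.15  # bin ids 0-6; extra id 7 for [MASK]

def masked_expression_step(embedder, encoder, gene_ids, expr_bins):
    # Randomly choose ~15% of gene positions to mask.
    mask = torch.rand_like(expr_bins, dtype=torch.float) < MASK_RATE
    corrupted = expr_bins.masked_fill(mask, MASK_ID)

    # Encode the corrupted cell and score each position against the bins.
    hidden = encoder(embedder(gene_ids, corrupted))            # (B, G, 200)
    logits = hidden @ embedder.expr_embed.weight[:NUM_BINS].T  # (B, G, 7)

    # Reconstruction loss is computed only at the masked positions.
    return F.cross_entropy(logits[mask], expr_bins[mask])
```

For fine-tuning, this reconstruction head would be swapped for a cell-type classification head, as the paper describes.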
scBERT is designed for researchers working with scRNA-seq data who need accurate, scalable cell type annotation across diverse tissues and experimental contexts. It is particularly well-suited for scenarios involving limited labelled training data, cross-dataset transfer where batch effects complicate direct supervised learning, and atlas-scale studies where manual annotation is impractical. The model can also serve as a foundation for identifying previously uncharacterized cell populations in disease tissue or developmental time courses, where novel cell types are expected but not catalogued in reference atlases. Because its attention weights over gene tokens can be inspected, it is also a useful tool for hypothesis generation about the gene signatures that define specific cell states.
scBERT was among the first models to apply large-scale self-supervised pretraining to single-cell transcriptomics, establishing a conceptual template that subsequent single-cell foundation models — including scGPT, Geneformer, and Universal Cell Embeddings — have built upon. Its publication in Nature Machine Intelligence in 2022 contributed to a wave of interest in treating the transcriptome as a "language" amenable to NLP-inspired modeling. A 2023 reusability report in Nature Machine Intelligence independently reproduced key benchmark results on the Zheng68K and MacParland liver datasets, supporting the robustness of the original claims. Limitations include the model's focus on human data from PanglaoDB, which may limit generalization to non-human species or highly specialized tissue types underrepresented in the pretraining corpus. Like other single-cell models, scBERT does not integrate multi-omic modalities (e.g., chromatin accessibility or protein abundance), and annotation quality depends on the completeness of the reference cell type ontology used during fine-tuning.
Yang, F., et al. (2022). scBERT as a large-scale pretrained deep language model for cell-type annotation of single-cell RNA-seq data. Nature Machine Intelligence.
DOI: 10.1038/s42256-022-00534-z