bio.rodeo

Single-cell

scGPT

Bowang Lab

A generative pre-trained transformer for single-cell multi-omics, pretrained on 33 million human cells for cell annotation, batch correction, and perturbation prediction.

Released: 2024
Parameters: 53,000,000

Overview

scGPT is a generative pre-trained transformer designed specifically for single-cell biology, developed by Bo Wang's group at the University of Toronto and published in Nature Methods in February 2024. It addresses a core challenge in single-cell genomics: despite the accumulation of massive datasets spanning tens of millions of cells across tissues, diseases, and perturbation conditions, no unified model had previously been able to leverage this data to learn generalizable cellular representations. scGPT fills this gap by adapting the transformer architecture to the non-sequential nature of gene expression data, enabling joint learning of both cell and gene representations from large-scale single-cell RNA sequencing corpora.

The key conceptual innovation is treating each cell's transcriptome as an unordered set of gene-expression tokens rather than a linear sequence. scGPT introduces a specially designed attention masking scheme within its generative pretraining objective: the model learns to predict masked gene expression values conditioned on observed genes, while simultaneously producing a cell-level embedding that captures the global state of the cell. This two-step masked generative pretraining strategy allows the model to absorb regulatory structure, co-expression patterns, and cell-state information from tens of millions of diverse cells without requiring explicit biological annotations.
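
To make the objective concrete, here is a minimal single-pass sketch of masked expression prediction in PyTorch. All names (`GeneExprEncoder`, `masked_expression_loss`), the bin count, and the mask ratio are illustrative assumptions, and the iterative second step that conditions on the inferred cell embedding is omitted; this is a sketch of the idea, not the scgpt package's implementation.

```python
import torch
import torch.nn as nn

class GeneExprEncoder(nn.Module):
    """Toy backbone: gene-identity plus binned-expression embeddings feed an
    order-agnostic transformer encoder (no positional encoding)."""
    def __init__(self, n_genes=60_000, n_bins=51, d_model=512):
        super().__init__()
        self.mask_bin = n_bins                      # reserve last index as [MASK]
        self.gene_emb = nn.Embedding(n_genes, d_model)
        self.expr_emb = nn.Embedding(n_bins + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.head = nn.Linear(d_model, n_bins)      # predict the expression bin

def masked_expression_loss(model, gene_ids, expr_bins, mask_ratio=0.15):
    """Hide a random subset of expression bins; predict them from the rest."""
    mask = torch.rand(expr_bins.shape, device=expr_bins.device) < mask_ratio
    visible = expr_bins.masked_fill(mask, model.mask_bin)
    hidden = model.encoder(model.gene_emb(gene_ids) + model.expr_emb(visible))
    logits = model.head(hidden)
    return nn.functional.cross_entropy(logits[mask], expr_bins[mask])
```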

The flagship whole-human model was pretrained on 33 million normal human cells drawn from diverse tissues. Additional organ-specific models — including a brain model (13.2 million cells) and a blood and bone marrow model (10.3 million cells) — are distributed through the team's pretrained model zoo, enabling more context-appropriate initialization for tissue-focused studies. The pretrained models can be fine-tuned efficiently on downstream datasets, following the transfer learning paradigm that has proven powerful in natural language processing.

Key Features

  • Generative pretraining on non-sequential data: Unlike language models that operate on ordered token sequences, scGPT uses a masked multi-head attention design tailored for unordered gene sets, enabling it to learn from transcriptomic data without imposing an artificial ordering on genes.
  • Joint cell and gene embeddings: The model simultaneously learns cell-level embeddings (capturing global cell state) and gene-level embeddings (capturing per-gene context), enabling both cell-centric and gene-centric downstream analyses from a single model.
  • Flexible condition tokens: Gene, expression, and condition tokens (encoding modality, batch identity, perturbation condition, and other metadata) are all incorporated as input, allowing the model to disentangle biological signal from technical covariates during both pretraining and fine-tuning.
  • Multi-task fine-tuning: A single pretrained checkpoint can be adapted for cell type annotation, multi-batch integration, multi-omic integration, gene regulatory network inference, and genetic perturbation response prediction — covering the most common tasks in single-cell workflows.
  • Organ-specific model zoo: Alongside the whole-human model, tissue-specific variants pretrained on brain and blood/bone marrow data are publicly available, providing a better initialization prior when working within a specific organ context.
  • Gene network inference: scGPT's attention weights can be interpreted as gene-gene interaction scores, enabling unsupervised extraction of gene regulatory networks as a byproduct of the learned representations (a minimal sketch follows this list).
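
The last feature above can be illustrated in a few lines: average a layer's attention weights over heads, symmetrize, and read off the strongest gene-gene links. The tensor shapes, function name, and top-k heuristic here are assumptions for illustration, not the scgpt package's API.

```python
import torch

def attention_to_edges(attn, gene_names, top_k=5):
    """attn: (n_heads, n_genes, n_genes) attention weights for one cell's genes."""
    scores = attn.mean(dim=0)               # average across attention heads
    scores = 0.5 * (scores + scores.T)      # symmetrize into undirected influence
    scores.fill_diagonal_(0.0)              # drop self-attention
    edges = []
    for i, gene in enumerate(gene_names):
        vals, idx = scores[i].topk(top_k)   # strongest partners for this gene
        edges.extend((gene, gene_names[j], v.item())
                     for j, v in zip(idx.tolist(), vals))
    return edges
```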

Technical Details

scGPT is a 53-million-parameter transformer with 12 stacked transformer blocks, each using 8 attention heads, and a model embedding dimension of 512. The architecture extends the standard transformer by incorporating three categories of input tokens per cell: gene identity tokens, binned expression value tokens, and condition tokens that encode experimental metadata such as batch label or perturbation condition. A specially designed attention mask is used during generative pretraining to prevent information leakage across masked positions while still allowing the model to build a holistic cell embedding.
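
As a concrete illustration of this input scheme, the sketch below combines the three token categories additively before they enter the encoder. The additive combination and all names are assumptions for illustration; scGPT's internal implementation may differ in detail.

```python
import torch
import torch.nn as nn

class CellInputEmbedding(nn.Module):
    """Per-token input: gene identity + binned expression + condition metadata."""
    def __init__(self, n_genes, n_bins, n_conditions, d_model=512):
        super().__init__()
        self.gene = nn.Embedding(n_genes, d_model)
        self.expr = nn.Embedding(n_bins, d_model)
        self.cond = nn.Embedding(n_conditions, d_model)  # e.g. batch or perturbation id

    def forward(self, gene_ids, expr_bins, cond_ids):
        # The three embeddings are summed per position; no positional encoding
        # is added, because each cell's gene set is unordered.
        return self.gene(gene_ids) + self.expr(expr_bins) + self.cond(cond_ids)
```

The combined embeddings would then feed the 12-layer, 8-head, 512-dimensional encoder described above.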

Pretraining used 33 million normal human cells spanning scRNA-seq atlases drawn from the Chan Zuckerberg CELLxGENE Discover corpus. The pretraining objective is a two-step masked gene expression prediction: the model first predicts masked gene expression bins and a global cell embedding from the unmasked genes, then iteratively refines those predictions using the inferred cell embedding. For downstream fine-tuning, task-specific heads are added to the pretrained backbone. The model accepts up to several thousand gene tokens per cell, accommodating the high dimensionality of single-cell transcriptomes without hard truncation of the gene vocabulary.
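
For intuition about the binned expression tokens, here is one common way to discretize a cell's expression vector into value bins. The per-cell quantile scheme, bin count, and function name are assumptions for illustration and not necessarily scGPT's exact recipe.

```python
import numpy as np

def bin_expression(counts, n_bins=51):
    """Map a cell's nonzero expression values to integer bins 1..n_bins-1.

    Zeros stay in bin 0, so the model can distinguish absent from low expression.
    """
    binned = np.zeros_like(counts, dtype=np.int64)
    nz = counts > 0
    if nz.any():
        # Quantile edges computed per cell; np.unique keeps them strictly increasing.
        edges = np.unique(np.quantile(counts[nz], np.linspace(0, 1, n_bins)))
        binned[nz] = np.clip(np.digitize(counts[nz], edges), 1, n_bins - 1)
    return binned
```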

Benchmark evaluations reported in the paper show competitive or superior performance relative to models such as Geneformer and scBERT on fine-tuned cell type annotation tasks across multiple tissue datasets. However, independent zero-shot benchmarks published after the paper's release have found that scGPT's embeddings, like those of other single-cell foundation models, can underperform simpler baseline methods (e.g., highly variable gene selection combined with scVI) in certain clustering and batch correction settings — a limitation the community is actively investigating.

Applications

scGPT is applicable across the standard analytical pipeline in single-cell genomics. Researchers use the fine-tuned model for automated cell type annotation of new datasets, particularly when labeled reference data are limited. The model's batch-aware condition tokens enable multi-batch and multi-omic integration without requiring dataset-specific re-preprocessing. Drug discovery groups have applied scGPT to perturbation response prediction, asking how gene expression shifts after CRISPR knockouts or drug treatments, with the model fine-tuned on paired perturbation datasets such as the Norman et al. or Replogle et al. screens. The gene regulatory network inference capability provides an annotation-free route to discovering transcription factor-target gene relationships from any new dataset by examining attention weights of the pretrained model. The Chan Zuckerberg Institute for Science hosts scGPT on its Virtual Cells Platform (v1.0), reflecting its adoption as a community resource.
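
As one illustration of the fine-tuning pattern, the sketch below attaches a linear cell-type classification head to a pretrained backbone. The `backbone` callable, the mean pooling (standing in for however the backbone actually produces its cell embedding), and the class count are assumptions; this is a generic PyTorch pattern, not the scgpt package's fine-tuning API.

```python
import torch
import torch.nn as nn

class CellTypeClassifier(nn.Module):
    """Pretrained encoder + new linear head for cell type annotation."""
    def __init__(self, backbone, d_model=512, n_cell_types=50):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(d_model, n_cell_types)

    def forward(self, gene_ids, expr_bins):
        h = self.backbone(gene_ids, expr_bins)   # (batch, n_genes, d_model)
        cell_emb = h.mean(dim=1)                 # pool token states into a cell embedding
        return self.head(cell_emb)
```

A typical fine-tuning loop would train with cross-entropy against curated labels, often with a lower learning rate on the backbone than on the newly added head.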

Impact

scGPT is among the most widely adopted single-cell foundation models, having rapidly accumulated hundreds of citations since its preprint appeared on bioRxiv in April 2023 and its formal publication in Nature Methods in February 2024. It has stimulated a wave of follow-on work, including scGPT-spatial — a continual pretraining extension on 30 million spatial transcriptomic profiles released in early 2025 — and numerous independent benchmarks evaluating single-cell language models at scale. The model's publicly available weights and model zoo have lowered the barrier to transfer learning in single-cell analysis, and its influence is visible in downstream architectures that adopt similar masked generative pretraining strategies. A key limitation, highlighted by post-publication evaluations, is that zero-shot performance lags behind supervised baselines, and the model's cell embeddings remain sensitive to batch effects in some settings. These findings have productively shifted community focus toward understanding when and how single-cell foundation models deliver genuine gains over established methods.

Citation

scGPT: toward building a foundation model for single-cell multi-omics using generative AI

Cui, H., et al. (2024). scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nature Methods.

DOI: 10.1038/s41592-024-02201-0

Metrics

GitHub

  • Stars: 1.5K
  • Forks: 329
  • Open Issues: 177
  • Contributors: 4
  • Last Push: 5 months ago
  • Language: Jupyter Notebook
  • License: MIT

Citations

  • Total Citations: 935
  • Influential: 83
  • References: 51

Tags

gene expression, foundation model, perturbation, transcriptomics

Resources

  • GitHub Repository
  • Research Paper