Tianjin Medical University Cancer Institute and Hospital
GPT-based generative model pre-trained on 22 million single-cell transcriptomes using rank-based gene encoding for single-cell clustering, trajectory inference, and bulk tumor analysis.
Large language models pre-trained on massive text corpora acquire general linguistic knowledge that transfers to diverse downstream tasks with minimal task-specific fine-tuning. tGPT (transcriptome GPT) applies this principle to single-cell transcriptomics by pre-training a GPT-style autoregressive transformer on millions of single-cell gene expression profiles, treating each transcriptome as an ordered sequence of genes that the model learns to predict autoregressively. The core hypothesis is that by training a generative model to predict the next gene in a rank-ordered transcriptome sequence, the model will internalize statistical regularities in gene co-expression that can be leveraged for downstream single-cell analysis tasks.
tGPT was developed by Xiangchun Li and colleagues at Tianjin Medical University Cancer Institute and Hospital and the National Clinical Research Center for Cancer in Tianjin, China, and published in iScience in April 2023. The model's pre-training corpus of 22.3 million single-cell transcriptomes places it among the early large-scale single-cell foundation models, predating or contemporaneous with other GPT-style single-cell models such as scGPT. The specific contribution of tGPT is the demonstration that purely autoregressive pre-training on rank-encoded transcriptomes produces representations that are competitive with supervised methods for cell type clustering and trajectory inference, while also generalizing usefully to bulk tissue analysis for clinical outcome prediction.
The use of rank-based rather than raw-count gene encoding is a deliberate design choice shared with earlier work such as Geneformer. By converting each cell's expression profile into an ordered list of genes ranked from most to least expressed, tGPT greatly reduces sensitivity to technical factors such as sequencing depth and removes the need for library-size normalization — a significant advantage when training across heterogeneous public single-cell datasets generated with different protocols and read depths. This encoding also creates a natural sequential structure that autoregressive language modeling can exploit: the identity of a highly expressed gene constrains the identities of other genes that tend to be co-expressed.
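The rank encoding can be sketched in a few lines. This is an illustrative implementation of the idea, not the published tGPT preprocessing code; the sequence-length cap and the dropping of zero-count genes are assumptions:

```python
import numpy as np

def rank_encode(expression, gene_names, max_len=64):
    """Convert a cell's expression vector into a rank-ordered gene sequence.

    Genes are sorted from most to least expressed; undetected (zero-count)
    genes are dropped and the sequence is truncated to max_len tokens.
    """
    expression = np.asarray(expression, dtype=float)
    order = np.argsort(-expression)                 # descending expression
    order = [i for i in order if expression[i] > 0]  # drop zero counts
    return [gene_names[i] for i in order[:max_len]]

# Two cells with identical relative expression but different sequencing
# depths produce the same token sequence: only the ranking matters.
genes = ["CD3D", "MS4A1", "GAPDH", "ACTB"]
cell_a = [10, 0, 50, 30]
cell_b = [100, 0, 500, 300]   # same profile, 10x the read depth
print(rank_encode(cell_a, genes))  # ['GAPDH', 'ACTB', 'CD3D']
print(rank_encode(cell_b, genes))  # ['GAPDH', 'ACTB', 'CD3D']
```

The invariance under depth scaling is exactly what makes this encoding robust across heterogeneous public datasets.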
tGPT is implemented using the GPT-2 architecture from the Hugging Face transformers library. The input to the model is a sequence of gene tokens, where each token corresponds to a gene identifier, and genes are ordered by their expression rank within the cell (most highly expressed first). The model is trained with a standard autoregressive next-token prediction objective: given the first k genes in the rank-ordered sequence, predict the (k+1)-th gene. Cell-level representations are derived by extracting the hidden state of the final transformer layer at the last token position, analogous to the CLS token representation used in BERT-style models, or by averaging across token positions.
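The training objective and embedding extraction can be sketched with the Hugging Face GPT-2 classes the paper builds on. The tiny configuration below (vocabulary size, depth, width, sequence length) is an illustrative assumption, not tGPT's published hyperparameters:

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny GPT-2 for illustration; tGPT's actual gene vocabulary and
# layer/width settings differ.
config = GPT2Config(vocab_size=2000, n_positions=64,
                    n_embd=64, n_layer=2, n_head=4)
model = GPT2LMHeadModel(config)

# A batch of rank-encoded cells: each row is a sequence of gene token
# ids ordered from most to least expressed.
cells = torch.randint(0, 2000, (3, 32))

# Autoregressive pre-training: labels = inputs; the model shifts them
# internally so position k predicts the (k+1)-th gene.
out = model(input_ids=cells, labels=cells, output_hidden_states=True)
loss = out.loss                       # next-gene prediction loss

# Cell embedding: final-layer hidden state at the last token position
# (mean pooling over positions is the alternative described above).
h_last = out.hidden_states[-1]        # (batch, seq_len, n_embd)
cell_emb = h_last[:, -1, :]           # (batch, n_embd)
print(cell_emb.shape)                 # torch.Size([3, 64])
```

Because GPT-2's attention is causal, the last position attends over the full rank-ordered gene sequence, which is why its hidden state serves as a whole-cell summary.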
The pre-training corpus consisted of 22.3 million single-cell transcriptomes assembled from publicly available human single-cell RNA-seq datasets spanning diverse tissues and cell types. The model was evaluated on four single-cell datasets for clustering and trajectory inference tasks. On clustering benchmarks, tGPT representations were compared against PCA-, scVI-, and Seurat-derived representations using normalized mutual information (NMI) and adjusted Rand index (ARI), achieving competitive or superior performance. Trajectory inference was evaluated by correlating pseudotime against known developmental orders. For bulk tumor analysis, tGPT features were extracted from TCGA (The Cancer Genome Atlas) RNA-seq profiles and evaluated for their ability to predict somatic mutation burden, tumor purity, clinical stage, and immunotherapy response in published cohorts.
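The NMI/ARI evaluation protocol is standard and can be sketched with scikit-learn. The synthetic blobs below stand in for tGPT cell embeddings with known cell type labels, and KMeans stands in for whatever clustering algorithm is applied to the embeddings (graph-based Leiden/Louvain clustering is the more common single-cell choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Stand-in for tGPT cell embeddings: synthetic 2-D points whose
# ground-truth labels play the role of annotated cell types.
X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)

# Cluster the embedding space, then compare the predicted clusters
# against the ground-truth labels.
y_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

nmi = normalized_mutual_info_score(y_true, y_pred)  # 1.0 = perfect match
ari = adjusted_rand_score(y_true, y_pred)           # 0 ~ random labeling
print(f"NMI={nmi:.3f}  ARI={ari:.3f}")
```

Both metrics are invariant to how cluster ids are permuted, which is why they are the standard way to score unsupervised clusterings against reference annotations.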
tGPT's primary applications are single-cell transcriptomics analysis tasks that benefit from pre-trained representations: cell type clustering, cell type annotation against reference atlases, and dimensionality reduction for visualization. The model provides an alternative to count-matrix-based methods like Seurat and scVI for these tasks, particularly in settings where training data is limited or heterogeneous. The bulk tumor analysis applications are a distinctive feature of tGPT relative to most other single-cell foundation models, which focus exclusively on single-cell data: by demonstrating that tGPT features from bulk RNA-seq associate with clinical outcomes, the authors show that the model's representations capture biologically meaningful variation at a level relevant to translational research. This makes tGPT potentially useful for biomarker discovery in cohorts where only bulk RNA-seq is available rather than single-cell data.
tGPT contributed to the growing evidence in 2023 that autoregressive pre-training on large single-cell corpora could produce useful representations for downstream analysis, alongside contemporaneous work including scGPT and Geneformer. The model's publication in iScience expanded the conversation beyond elite journals and contributed to broader community awareness of the single-cell foundation model paradigm. Its use of the GPT-2 backbone — a widely understood and well-tooled architecture — made it accessible to researchers already familiar with the NLP literature who wished to apply language model ideas to transcriptomics. The bulk tumor analysis results pointed to an underexplored direction: that models pre-trained on single-cell data may also capture features relevant to bulk tissue biology, potentially bridging the single-cell and bulk genomics communities. The open-source release on GitHub with clear documentation facilitated adoption and reproduction.