Broad Institute / Dana-Farber Cancer Institute
Transformer-based foundation model pretrained on ~30 million single-cell transcriptomes for context-aware gene network predictions and therapeutic target discovery.
Geneformer is a context-aware, attention-based foundation model pretrained on Genecorpus-30M — a large-scale corpus of approximately 29.9 million human single-cell transcriptomes spanning a broad range of tissues and cell states. Developed by Christina Theodoris and colleagues at the Broad Institute of MIT and Harvard, Dana-Farber Cancer Institute, and Boston Children's Hospital, and published in Nature in May 2023, Geneformer addresses a fundamental bottleneck in network biology: the difficulty of making accurate predictions about gene regulatory networks when disease-specific or tissue-specific training data is scarce.
The central innovation is an input representation called rank-value encoding. Rather than feeding raw expression counts into the model, each single-cell transcriptome is converted into an ordered list of genes ranked from most to least expressed, after normalizing each gene by its typical expression level across the full pretraining corpus so that ubiquitously high-expressed housekeeping genes do not dominate every ranking. This encoding captures relative gene activity within each cell while discarding technical noise introduced by sequencing depth variation, and it allows the model to represent gene states in terms of network context rather than raw measurement values. Pretraining on this ranked representation with a masked gene prediction objective allowed Geneformer to internalize the logic of gene regulatory networks in a fully self-supervised manner, encoding network hierarchy directly into its attention weights.
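A minimal sketch of the rank-value encoding idea, assuming per-gene corpus-wide normalization factors are already available; the function name, example values, and the median-based normalization are illustrative stand-ins, not the released tokenizer.

```python
import numpy as np

def rank_value_encode(counts, gene_ids, gene_medians, max_len=2048):
    """Illustrative rank-value encoding of a single cell.

    counts       : raw counts for each gene in the cell (1D array)
    gene_ids     : gene identifiers aligned with counts
    gene_medians : per-gene corpus-wide normalization factors (the paper
                   uses each gene's nonzero median expression across the
                   pretraining corpus; treated here as given)
    Returns the cell as gene IDs ordered from most to least expressed
    after normalization, truncated to the model's input length.
    """
    counts = np.asarray(counts, dtype=float)
    gene_medians = np.asarray(gene_medians, dtype=float)
    expressed = counts > 0
    # Normalize expressed genes by their corpus-wide factor so ubiquitously
    # high-count genes do not dominate every cell's ranking.
    norm = counts[expressed] / gene_medians[expressed]
    order = np.argsort(norm)[::-1]                      # descending rank
    ranked = np.asarray(gene_ids)[expressed][order]
    return list(ranked[:max_len])

# Toy example: the housekeeping-like gene (large corpus-wide median) is
# down-weighted and no longer ranks first.
counts = [500, 40, 0]
medians = [1000.0, 10.0, 5.0]
print(rank_value_encode(counts, ["GENE_A", "GENE_B", "GENE_C"], medians))
# ['GENE_B', 'GENE_A']
```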
The result is a model that can be fine-tuned on small, task-specific datasets to perform predictions that would otherwise require far more labeled examples. Geneformer demonstrated this capability across a diverse set of downstream tasks in chromatin and network dynamics, and it identified candidate therapeutic targets for cardiomyopathy that were subsequently validated experimentally in an iPSC-derived cardiomyocyte model.
The original Geneformer (V1) is a BERT-style transformer encoder with 6 layers, 4 attention heads per layer, 256-dimensional embeddings, and approximately 10 million parameters. The model accepts an input context of 2048 gene tokens, sufficient to fully represent the transcriptomes of 93% of cells in Genecorpus-30M. The vocabulary consists of approximately 25,000 protein-coding and non-coding RNA genes. Pretraining applied a standard masked-language-modeling objective to the rank-ordered gene sequences of Genecorpus-30M, which was assembled from publicly available human single-cell RNA-seq datasets spanning diverse tissues.
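For orientation, a comparable encoder can be instantiated with the Hugging Face transformers library. This is a from-scratch sketch using the hyperparameters listed above, not the released checkpoint or its exact configuration file; the feed-forward width and precise vocabulary size are assumptions.

```python
from transformers import BertConfig, BertForMaskedLM

# Build a BERT-style encoder matching the V1 hyperparameters described above.
config = BertConfig(
    vocab_size=25_426,             # ~25k gene tokens plus special tokens (assumed)
    hidden_size=256,               # embedding dimension
    num_hidden_layers=6,           # transformer layers
    num_attention_heads=4,
    intermediate_size=512,         # feed-forward width (assumed)
    max_position_embeddings=2048,  # gene tokens per cell
)
model = BertForMaskedLM(config)    # masked gene prediction objective
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```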
Later V2 models scale the architecture substantially: the 95M-parameter variants use up to 20 transformer layers, 512–896-dimensional embeddings, 8–14 attention heads, and a context length of 4096 tokens, trained on an updated Genecorpus-103M corpus. A quantized QLoRA fine-tuning approach was shown to match full-precision performance across four biologically diverse downstream tasks while reducing GPU memory requirements by approximately one-third.
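The sketch below illustrates the quantized-LoRA recipe in general terms using the transformers, bitsandbytes, and peft libraries; the checkpoint path, label count, LoRA rank, and target modules are placeholders rather than the settings used in that study.

```python
import torch
from transformers import BertForSequenceClassification, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load a pretrained encoder in 4-bit precision and train small low-rank
# adapters instead of the full weight matrices (the QLoRA recipe).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = BertForSequenceClassification.from_pretrained(
    "path/to/pretrained-checkpoint",  # placeholder; point at the actual model directory
    num_labels=2,                     # e.g. a binary cell-state or gene-program label
    quantization_config=bnb_config,
)
base = prepare_model_for_kbit_training(base)

lora_config = LoraConfig(
    r=8,                                # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections in a BERT encoder
    task_type="SEQ_CLS",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()      # only the adapter weights are updated
```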
Downstream fine-tuning tasks validated in the original publication include prediction of chromatin accessibility dynamics, transcription factor dosage sensitivity, and gene network centrality. Applied to dilated cardiomyopathy with limited patient samples, Geneformer prioritized candidate therapeutic targets that were subsequently validated in iPSC-derived cardiomyocytes, with measurable improvements in contractile force generation.
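Gene-level tasks of this kind can be framed as token classification over the rank-ordered gene sequence, with a per-gene label and the loss masked for unlabeled genes. The sketch below uses a small randomly initialized encoder and one synthetic cell so that it runs standalone; in practice the pretrained weights and real tokenized cells would be used, and the hyperparameters shown are illustrative.

```python
import torch
from transformers import (BertConfig, BertForTokenClassification,
                          Trainer, TrainingArguments)

# Gene-level labels (e.g. dosage-sensitive vs. insensitive) are attached to
# positions in the rank-ordered gene sequence; -100 marks unlabeled genes so
# they are ignored by the loss.
config = BertConfig(vocab_size=25_426, hidden_size=256, num_hidden_layers=6,
                    num_attention_heads=4, intermediate_size=512,
                    max_position_embeddings=2048, num_labels=2)
model = BertForTokenClassification(config)   # randomly initialized stand-in;
                                             # in practice, load the pretrained checkpoint

input_ids = torch.randint(1, 25_426, (2048,))               # one synthetic "cell"
labels = torch.full((2048,), -100, dtype=torch.long)
labels[:10] = torch.randint(0, 2, (10,))                    # labels for ten genes of interest
train_dataset = [{"input_ids": input_ids,
                  "attention_mask": torch.ones(2048, dtype=torch.long),
                  "labels": labels}]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="geneformer_finetune_demo",
                           num_train_epochs=1, per_device_train_batch_size=1),
    train_dataset=train_dataset,
)
trainer.train()
```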
Geneformer is well suited for researchers who need to extract biological insight from single-cell RNA-seq data in contexts where labeled training examples are limited. Typical use cases include identifying disease-relevant gene regulatory networks, predicting the consequences of genetic perturbations without running large-scale CRISPR screens, and classifying cell states or gene expression programs using transfer learning from the pretrained model. The in silico perturbation framework is particularly useful for therapeutic target discovery in rare diseases or conditions with small patient cohorts. The model has also been applied to batch integration tasks across datasets, and its attention-based architecture makes it interpretable in terms of which gene-gene interactions drive a given prediction.
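The in silico perturbation idea can be sketched as follows: delete a gene's token from a cell's rank encoding, re-embed the cell, and measure how far the embedding shifts. This is a simplified illustration rather than the released perturbation tooling; the mean-pooling choice, the cosine-distance readout, and the demo model are assumptions.

```python
import torch
from transformers import BertConfig, BertModel

def cell_embedding(model, input_ids):
    """Mean-pool the final hidden layer into a single cell embedding."""
    with torch.no_grad():
        out = model(input_ids=input_ids.unsqueeze(0))
    return out.last_hidden_state.squeeze(0).mean(dim=0)

def in_silico_delete(model, input_ids, gene_token):
    """Delete one gene token from the rank encoding, re-embed the cell, and
    return the cosine distance between original and perturbed embeddings.
    Larger shifts suggest the gene is more central to this cell's network
    state; the published workflow further compares shifts toward or away
    from reference cell states (e.g. healthy vs. disease)."""
    baseline = cell_embedding(model, input_ids)
    perturbed = cell_embedding(model, input_ids[input_ids != gene_token])
    return 1.0 - torch.cosine_similarity(baseline, perturbed, dim=0).item()

# Demo with a small randomly initialized encoder and a random cell; in
# practice the pretrained model and a real rank-encoded cell would be used.
model = BertModel(BertConfig(vocab_size=25_426, hidden_size=64,
                             num_hidden_layers=2, num_attention_heads=2,
                             max_position_embeddings=2048)).eval()
cell = torch.randint(1, 25_426, (2048,))
print(in_silico_delete(model, cell, gene_token=cell[0].item()))
```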
Geneformer represents one of the first foundation models for single-cell transcriptomics, demonstrating that the pretraining-then-fine-tuning paradigm from NLP can be adapted to gene expression data with genuine biological payoff. The Nature publication has accumulated substantial citations and established a blueprint for subsequent single-cell foundation models, including scGPT and scFoundation. The model's prediction of cardiomyopathy therapeutic targets — later validated experimentally — stands as a concrete proof-of-concept for AI-accelerated drug target identification. A key limitation is that rank-value encoding discards absolute expression magnitude, which may reduce sensitivity for tasks where fold-change information is biologically important. The model was originally pretrained on human data only, though subsequent work has extended the approach to mouse transcriptomes. As with all foundation models, fine-tuning performance depends on the biological proximity of the pretraining corpus to the target task.
Theodoris, C. V., et al. (2023). Transfer learning enables predictions in network biology. Nature.
DOI: 10.1038/s41586-023-06139-9