A billion-parameter single-cell foundation model that performs full self-attention across approximately 28,000 human genes and integrates Gene Ontology priors through a graph convolutional network (GCN) to capture long-range gene context in transcriptomics.
scLong is a billion-parameter single-cell foundation model published in Nature Communications in 2026. It performs full self-attention across all approximately 28,000 human genes, removing the gene-selection step that prior single-cell foundation models such as scGPT and Geneformer rely on. The model integrates Gene Ontology (GO) knowledge through a graph convolutional network (GCN) whose embeddings are concatenated to the gene tokens, providing biological priors that complement the data-driven attention signal.
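As a rough illustration of how GO-derived embeddings might be attached to gene tokens, the following minimal sketch applies one graph-convolution layer to a toy gene graph and concatenates the result to per-gene token embeddings. The graph construction (genes as nodes, edges between genes that share a GO term), the dimensions, the random weights, and all function names are assumptions made for illustration; none of them are taken from the scLong code or paper.

```python
import math
import random


def normalize_adjacency(edges, n):
    """Return D^-1/2 (A + I) D^-1/2 as a dense matrix for an undirected graph."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0  # self-loops keep each node's own features
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain dense matrix product for lists of lists."""
    return [[sum(row[k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for row in x]


def gcn_layer(a_hat, h, w):
    """One graph-convolution layer: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


rng = random.Random(0)
n_genes, d_in, d_go, d_tok = 4, 3, 2, 5  # toy sizes (assumed)

# Hypothetical gene graph: genes 0-1 and 1-2 share a GO term, gene 3 is isolated.
edges = [(0, 1), (1, 2)]
a_hat = normalize_adjacency(edges, n_genes)

# Random stand-ins for node features and (normally learned) GCN weights.
features = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_genes)]
weights = [[rng.gauss(0, 1) for _ in range(d_go)] for _ in range(d_in)]
go_emb = gcn_layer(a_hat, features, weights)

# Random stand-ins for the data-driven gene token embeddings.
tokens = [[rng.gauss(0, 1) for _ in range(d_tok)] for _ in range(n_genes)]

# Concatenate the GO embedding to each gene token.
augmented = [tok + go for tok, go in zip(tokens, go_emb)]
print(len(augmented), len(augmented[0]))
```

Concatenation keeps the knowledge-derived and data-derived components in separate dimensions, so downstream layers can weight them independently; in a trained model the GCN weights would be learned rather than sampled.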
scLong is the first single-cell foundation model to operate over the complete human transcriptome at this scale. It reports state-of-the-art performance on perturbation response prediction, cancer drug response prediction, cell-type annotation, and batch integration.
scLong uses a transformer architecture with sparse-attention adaptations that keep attention over the full transcriptome computationally tractable. Each gene token is augmented with a GO-derived embedding produced by a GCN trained on the GO biological-process hierarchy. The model is pretrained with masked-gene prediction on a large pan-tissue scRNA-seq corpus. The published paper reports the architecture, training corpus, ablations, and benchmark comparisons against scGPT, Geneformer, scFoundation, and scBERT.
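The masked-gene prediction objective can be sketched as follows, under stated assumptions: a fraction of a cell's expression values is hidden, and the loss is computed only on the hidden positions. The mask rate, the sentinel mask value, the mean-squared-error loss, and the trivial stand-in predictor are all hypothetical choices for illustration and are not taken from the paper.

```python
import random


def mask_expression(values, mask_rate, rng, mask_value=-1.0):
    """Hide a random subset of gene values; return (masked copy, masked indices)."""
    n_mask = max(1, round(mask_rate * len(values)))
    idx = sorted(rng.sample(range(len(values)), n_mask))
    masked = list(values)
    for i in idx:
        masked[i] = mask_value  # sentinel marking a hidden gene (assumed)
    return masked, idx


def masked_mse(predicted, target, idx):
    """Mean squared error computed only over the masked positions."""
    return sum((predicted[i] - target[i]) ** 2 for i in idx) / len(idx)


rng = random.Random(0)

# Toy expression profile for one cell (one value per gene).
expression = [0.0, 2.3, 0.0, 1.1, 4.0, 0.0, 0.7, 3.2, 0.0, 1.9]
masked_input, masked_idx = mask_expression(expression, mask_rate=0.3, rng=rng)

# Stand-in "model": predicts the mean of the visible values for every gene.
visible = [v for i, v in enumerate(expression) if i not in masked_idx]
baseline = sum(visible) / len(visible)
predictions = [baseline] * len(expression)

loss = masked_mse(predictions, expression, masked_idx)
print(masked_idx, loss)
```

Restricting the loss to masked positions means the model is scored only on genes it could not see, which is what pushes it to infer a gene's expression from the context of the remaining genes.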
scLong is suited for translational single-cell research groups working on perturbation response, drug response, and cell-type annotation in heterogeneous tissues. The full-transcriptome attention is particularly valuable for studies where pathway-level effects are expected and where pre-selected gene lists may miss relevant signal.
scLong demonstrates that scaling single-cell foundation models to full-transcriptome attention is technically feasible and delivers measurable gains over the prior generation of foundation models that operate on selected gene subsets. The integration of curated biological knowledge through GO-derived embeddings provides a useful template for combining data-driven and knowledge-driven signals in single-cell modeling.
Bai, D., et al. (2026) scLong: a billion-parameter foundation model for capturing long-range gene context in single-cell transcriptomics. Nature Communications.
DOI: 10.1038/s41467-026-69102-y