bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

Single-cell

scLong

Chinese Academy of Sciences

Billion-parameter single-cell foundation model performing full self-attention across all approximately 28,000 protein-coding human genes, integrating Gene Ontology priors via a graph convolutional network to capture long-range gene context in transcriptomics.

Released: 2026
Parameters: 1,000,000,000

Overview

scLong is a billion-parameter single-cell foundation model published in Nature Communications in 2026 that performs full self-attention across all approximately 28,000 protein-coding human genes, removing the gene-selection step that prior single-cell foundation models such as scGPT and Geneformer rely on. The model integrates Gene Ontology (GO) knowledge through a graph convolutional network whose embeddings are concatenated to gene tokens, providing biological priors that complement the data-driven attention signal.

scLong is the first single-cell foundation model to operate over the complete human transcriptome at this scale and demonstrates state-of-the-art performance on perturbation response prediction, cancer drug response, cell-type annotation, and batch integration.

Key Features

  • Full-transcriptome attention: Attends over all approximately 28,000 protein-coding human genes per cell, removing the gene-selection step required by scGPT, Geneformer, and scFoundation.
  • Gene Ontology integration: GO priors injected via GCN-derived gene embeddings concatenated to learned tokens, supplementing data-driven signal with curated knowledge.
  • Billion-parameter scale: One of the largest single-cell FMs to date.
  • Strong perturbation prediction: Outperforms prior single-cell FMs on held-out perturbation prediction benchmarks.
  • Cancer drug response transfer: Effective for predicting cellular response to anti-cancer drugs in zero-shot and fine-tuned settings.
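The full-transcriptome attention highlighted above is what makes scLong costly relative to subset-based models. A back-of-envelope calculation shows the scale of the difference; the gene counts are from the text, while the subset size is an illustrative assumption (a typical highly-variable-gene selection), not a figure from the paper:

```python
# Illustrative cost of full self-attention over the whole transcriptome.
# The subset size (2,048) is an assumed typical pre-selected gene list,
# not a number reported for scGPT/Geneformer/scFoundation specifically.

def attention_matrix_entries(n_genes: int) -> int:
    """Entries in one n x n attention score matrix (per head, per layer)."""
    return n_genes * n_genes

n_full = 28_000    # approx. all protein-coding human genes (scLong)
n_subset = 2_048   # assumed pre-selected gene subset

full = attention_matrix_entries(n_full)
subset = attention_matrix_entries(n_subset)

print(f"full transcriptome: {full:,} score entries")   # 784,000,000
print(f"gene subset:        {subset:,} score entries") # 4,194,304
print(f"ratio: ~{full / subset:.0f}x")                 # ~187x
```

The quadratic growth of the score matrix is why the Technical Details below mention efficiency adaptations rather than naive dense attention.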

Technical Details

scLong uses a transformer architecture with sparse-attention adaptations to manage the cost of full-transcriptome attention. Each gene token is augmented with a GO-derived embedding produced by a GCN trained on the GO biological-process hierarchy. The model is pretrained with masked-gene prediction on a large pan-tissue scRNA-seq corpus. The published paper reports architecture, training corpus, ablations, and benchmark comparisons against scGPT, Geneformer, scFoundation, and scBERT.
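The input construction described above can be sketched as follows. This is a minimal illustration of the two ideas named in the text (concatenating a GO-derived embedding to each gene token, then masking genes for the pretraining objective); the embedding widths, mask rate, and zero-out masking scheme are all assumptions for the sketch, not scLong's published hyperparameters:

```python
import numpy as np

# Sketch: per-gene token = learned gene embedding ++ GO-derived (GCN) embedding,
# followed by masked-gene pretraining masking. Dimensions and mask_rate are
# illustrative assumptions, not values from the scLong paper.

rng = np.random.default_rng(0)

n_genes, d_gene, d_go = 28_000, 128, 64          # assumed widths
gene_tokens = rng.normal(size=(n_genes, d_gene))  # data-driven gene/expression tokens
go_embeddings = rng.normal(size=(n_genes, d_go))  # stand-in for GCN output on the GO graph

# Concatenate the curated-knowledge channel onto each gene token.
tokens = np.concatenate([gene_tokens, go_embeddings], axis=-1)
assert tokens.shape == (n_genes, d_gene + d_go)

# Masked-gene prediction: hide a random subset of genes; the model is
# trained to reconstruct them from the remaining full-transcriptome context.
mask_rate = 0.15                                  # assumed
mask = rng.random(n_genes) < mask_rate
masked_tokens = tokens.copy()
masked_tokens[mask] = 0.0                         # simple zero-out masking (assumed)
print(f"masked {int(mask.sum())} of {n_genes} genes")
```

The concatenation (rather than, say, summation) matches the paper's description of GO embeddings being concatenated to gene tokens, keeping the knowledge-driven and data-driven channels separable at the input layer.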

Applications

scLong is suited for translational single-cell research groups working on perturbation response, drug response, and cell-type annotation in heterogeneous tissues. The full-transcriptome attention is particularly valuable for studies where pathway-level effects are expected and where pre-selected gene lists may miss relevant signal.

Impact

scLong demonstrates that scaling single-cell foundation models to full-transcriptome attention is technically feasible and delivers measurable gains over the prior generation of FMs that operate on selected gene subsets. The integration of curated biological knowledge through GO-derived embeddings provides a useful template for combining data-driven and knowledge-driven signal in single-cell modeling.

Citation

scLong: a billion-parameter foundation model for capturing long-range gene context in single-cell transcriptomics

Bai, D., et al. (2026) scLong: a billion-parameter foundation model for capturing long-range gene context in single-cell transcriptomics. Nature Communications.

DOI: 10.1038/s41467-026-69102-y

Metrics

Citations

Total Citations: 0
Influential: 0
References: 95

Tags

perturbation prediction, cancer drug response, cell type annotation, batch integration, transformer, graph neural network, self-supervised, foundation model, single-cell transcriptome, gene ontology

Resources

Research Paper