bio.rodeo
The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.

scLong

Chinese Academy of Sciences

Billion-parameter single-cell foundation model that performs full self-attention across approximately 28,000 human genes and integrates Gene Ontology priors through a graph convolutional network (GCN) to capture long-range gene context in transcriptomics.

Released: April 2026
Parameters: 1 billion

scLong is a billion-parameter single-cell foundation model, published in Nature Communications in 2026, that performs full self-attention across approximately 28,000 human genes. This removes the gene-selection step that prior single-cell foundation models such as scGPT and Geneformer rely on. The gene count exceeds the roughly 20,000 protein-coding genes in the human genome, so the input vocabulary is not limited to protein-coding genes. The model integrates Gene Ontology (GO) knowledge through a graph convolutional network (GCN) whose embeddings are concatenated to gene tokens, providing biological priors that complement the data-driven attention signal.

scLong is presented as the first single-cell foundation model to operate over the complete human transcriptome at this scale, and the paper reports state-of-the-art performance on perturbation response prediction, cancer drug response prediction, cell-type annotation, and batch integration.

#Key Features

  • Full-transcriptome attention: Attends over approximately 28,000 human genes per cell, removing the gene-selection step required by scGPT, Geneformer, and scFoundation.
  • Gene Ontology integration: GO priors injected via GCN-derived gene embeddings concatenated to learned tokens, supplementing data-driven signal with curated knowledge.
  • Billion-parameter scale: One of the largest single-cell foundation models to date.
  • Strong perturbation prediction: Outperforms prior single-cell FMs on held-out perturbation prediction benchmarks.
  • Cancer drug response transfer: Effective for predicting cellular response to anti-cancer drugs in zero-shot and fine-tuned settings.
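
To give a sense of why attending over every gene is expensive, the following back-of-the-envelope sketch compares the size of one dense attention score matrix for the full gene set against a pre-selected gene list. The subset size of 2,048 tokens and the 2 bytes per entry (half precision) are illustrative assumptions, not figures from the scLong paper, and the function names are hypothetical.

```python
def attention_matrix_entries(num_tokens: int) -> int:
    """Entries in one dense self-attention score matrix (one head, one layer)."""
    return num_tokens * num_tokens


def attention_matrix_gib(num_tokens: int, bytes_per_entry: int = 2) -> float:
    """Memory for that matrix in GiB, assuming `bytes_per_entry` per score."""
    return attention_matrix_entries(num_tokens) * bytes_per_entry / 2**30


FULL = 28_000   # approximate gene count quoted on this page
SUBSET = 2_048  # assumed size of a pre-selected gene list (illustrative only)

for n in (SUBSET, FULL):
    entries = attention_matrix_entries(n)
    print(f"{n:>6} tokens: {entries:>13,} entries, {attention_matrix_gib(n):.3f} GiB")

# Dense attention grows quadratically with the number of gene tokens.
ratio = attention_matrix_entries(FULL) / attention_matrix_entries(SUBSET)
print(f"full / subset: {ratio:.0f}x")
```

Under these assumptions a single score matrix for 28,000 tokens has 784 million entries, roughly 187 times more than for 2,048 tokens, and that cost repeats for every head and layer. This quadratic growth is the reason efficient-attention techniques matter at this sequence length.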

#Technical Details

scLong uses a transformer architecture with sparse-attention adaptations to manage the cost of full-transcriptome attention. Each gene token is augmented with a GO-derived embedding produced by a GCN trained on the GO biological-process hierarchy. The model is pretrained with masked-gene prediction on a large pan-tissue scRNA-seq corpus. The published paper reports architecture, training corpus, ablations, and benchmark comparisons against scGPT, Geneformer, scFoundation, and scBERT.
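
The GO augmentation described above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes a standard GCN layer of the form ReLU(Â X W) with a symmetrically normalized adjacency matrix, a hypothetical three-gene graph in which an edge means two genes are related through GO, and made-up weights and token embeddings. All names (`normalized_adjacency`, `gcn_layer`, `token_embeddings`) are illustrative.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 as a dense list-of-lists matrix."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loops
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


def matmul(x, y):
    """Plain dense matrix product for list-of-lists matrices."""
    return [[sum(xi * yi for xi, yi in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(adj_norm, features, weight):
    """One graph convolution: ReLU(adj_norm @ features @ weight)."""
    hidden = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical toy graph: genes 0 and 1 are linked through GO, gene 2 is isolated.
adj_norm = normalized_adjacency(3, edges=[(0, 1)])
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot gene IDs
weight = [[1.0, -1.0], [0.0, 2.0], [-0.5, 0.5]]                 # made-up GCN weights

go_embeddings = gcn_layer(adj_norm, features, weight)

# Made-up learned token embeddings; concatenation yields the augmented gene tokens.
token_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
augmented = [tok + go for tok, go in zip(token_embeddings, go_embeddings)]

for gene, vec in enumerate(augmented):
    print(gene, vec)
```

In this toy setup the two linked genes receive the same GO embedding because the graph convolution averages over neighbors, while their learned token embeddings remain distinct. That is the sense in which the knowledge-derived part supplements rather than replaces the data-driven part.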

#Applications

scLong is suited for translational single-cell research groups working on perturbation response, drug response, and cell-type annotation in heterogeneous tissues. The full-transcriptome attention is particularly valuable for studies where pathway-level effects are expected and where pre-selected gene lists may miss relevant signal.

#Impact

scLong demonstrates that scaling single-cell foundation models to full-transcriptome attention is technically feasible and delivers measurable gains over the prior generation of foundation models that operate on selected gene subsets. The integration of curated biological knowledge through GO-derived embeddings provides a useful template for combining data-driven and knowledge-driven signal in single-cell modeling.

Citation

Bai, D., et al. (2026) scLong: a billion-parameter foundation model for capturing long-range gene context in single-cell transcriptomics. Nature Communications.

DOI: 10.1038/s41467-026-69102-y

Citations

Total Citations: 0
Influential: 0
References: 95

GitHub

Stars: 21
Forks: 6
Open Issues: 7
Contributors: 1
Last Push: 7mo ago
Language: Python

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 29 (Closed)
Usability (can I run it?): 24
Reproducibility (can I retrain it?): 21
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

batch_integration, cancer_drug_response, cell_type_annotation, foundation_model, gene_ontology, graph_neural_network, perturbation_prediction, self_supervised, single_cell_transcriptome, transformer

Resources

GitHub Repository · Research Paper