Single-cell

CellFM

Sun Yat-sen University

An 800M-parameter single-cell foundation model, built on a RetNet architecture and pre-trained on 100 million human cells, supporting cell type annotation, perturbation prediction, and gene-level analysis.

Released: 2024
Parameters: 800,000,000

Overview

CellFM is a large-scale single-cell foundation model developed by researchers at Sun Yat-sen University and collaborating Chinese institutions. At 800 million parameters trained on approximately 102 million human cells, it represents an eight-fold increase in scale over prior single-species single-cell models and is among the largest models trained exclusively on human transcriptomics data.

The central design argument is human specificity. Prior single-cell foundation models such as UCE and GeneCompass were trained on multi-species datasets; CellFM's authors hypothesize that mixing human and non-human data dilutes the representation of human-specific gene programs and cellular states. By restricting the training corpus to human scRNA-seq data, the model devotes its full capacity to the structure of human gene expression.

CellFM was first posted to bioRxiv in June 2024 and published in Nature Communications in May 2025. Pre-trained weights are available on HuggingFace, and fine-tuning code is available via the project's GitHub repository.

Key Features

  • 800M-parameter scale: With 800 million parameters, CellFM is substantially larger than contemporary single-cell models, providing greater representational capacity for the diversity of human cell states.
  • Human-only training corpus: Trained on 102,304,686 cells from 19,914 samples across multiple organs, tissues, and sequencing technologies — restricted entirely to human data to maximize relevance for human biology.
  • RetNet backbone with linear complexity: Uses a modified Enhanced Retentive Network (ERetNet) architecture that replaces quadratic self-attention with a recurrent retention mechanism, making training at this scale computationally tractable.
  • Masked gene expression pre-training: Learns cellular representations by predicting masked gene expression values from surrounding context, an objective analogous to masked language modeling in NLP and well-suited to the sparse, high-dimensional structure of scRNA-seq data (a minimal sketch of this objective follows the list).
  • Broad task generalization: A single pre-trained model transfers to cell type annotation, perturbation response prediction, gene function prediction, and gene-gene relationship inference without task-specific architectural modifications.
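
To make the pre-training objective concrete, the sketch below shows a minimal masked continuous-expression loss in PyTorch. It illustrates the general technique rather than CellFM's actual code: the encoder interface, the 15% masking ratio, zero-filling as the masking strategy, and mean squared error on masked positions are all assumptions.

```python
# Minimal sketch of masked gene-expression pre-training (illustrative only).
# Genes are tokens; expression values are continuous and regressed directly.
import torch
import torch.nn as nn

def masked_expression_loss(encoder, gene_ids, expr, mask_ratio=0.15):
    """Hide a fraction of expression values and predict them from context.

    gene_ids: (batch, n_genes) integer gene-token IDs
    expr:     (batch, n_genes) continuous expression values
    encoder:  any module mapping (gene_ids, masked_expr) -> per-gene predictions
    """
    mask = torch.rand_like(expr) < mask_ratio     # positions to hide (assumed ratio)
    masked_expr = expr.masked_fill(mask, 0.0)     # zero-fill is one masking choice
    pred = encoder(gene_ids, masked_expr)         # (batch, n_genes)
    # Regression on masked positions only: the continuous analogue of MLM.
    return nn.functional.mse_loss(pred[mask], expr[mask])
```

Regressing continuous values rather than predicting discrete tokens is what adapts the masked-language-modeling idea to expression data.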

Technical Details

CellFM is built on an ERetNet backbone — a modified Retentive Network that adapts the retention mechanism of RetNet for single-cell transcriptomics. Two key architectural changes distinguish CellFM from the base RetNet: a gated bilinear network replaces the standard feedforward sublayer to improve representational capacity for sparse gene expression profiles, and DeepNorm normalization substitutes conventional LayerNorm to stabilize training at depth. Ablation studies in the published paper confirm each modification contributes independently, with their removal degrading average AUPR by 0.8% and 0.9% respectively on gene function prediction benchmarks.
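
The block below sketches one way these two modifications could compose, as a hedged PyTorch illustration rather than a reproduction of ERetNet. The bilinear GLU form of the gated feedforward and the DeepNorm residual scaling alpha = (2 * n_layers) ** 0.25 are standard formulations assumed here; the linear-complexity retention mixer is abstracted behind a callable.

```python
# Illustrative ERetNet-style block (assumed forms noted inline, not CellFM's code).
import torch.nn as nn

class GatedBilinearFFN(nn.Module):
    """Feedforward sublayer as a bilinear GLU: W2((W1 x) * (Wg x)) -- assumed form."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.wg = nn.Linear(dim, hidden)   # gating branch
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(self.w1(x) * self.wg(x))

class RetentionBlock(nn.Module):
    def __init__(self, dim, hidden, n_layers, retention):
        super().__init__()
        self.retention = retention             # any linear-complexity token mixer
        self.ffn = GatedBilinearFFN(dim, hidden)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.alpha = (2 * n_layers) ** 0.25    # DeepNorm scaling for deep stacks

    def forward(self, x):
        # DeepNorm applies LayerNorm(alpha * x + sublayer(x)) on each residual.
        x = self.norm1(self.alpha * x + self.retention(x))
        x = self.norm2(self.alpha * x + self.ffn(x))
        return x
```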

Genes are treated as tokens with expression levels encoded as continuous input features. The model was pre-trained on Huawei's MindSpore framework using distributed training across a large compute cluster; PyTorch-compatible weights are available via HuggingFace. On downstream benchmarks, CellFM outperforms Geneformer, scGPT, scFoundation, UCE, and GeneCompass on cell type annotation (1.6–1.94% AUPR improvement over nearest competitors), perturbation prediction (~1% PCC improvement over scFoundation), and gene function prediction, while also achieving top performance on gene-gene relationship inference tasks.
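
Returning to the input scheme described at the start of this section, the sketch below shows one minimal way to combine gene-token embeddings with continuous expression values. The additive composition and the scalar-to-vector projection are assumptions for illustration, not CellFM's published pipeline.

```python
# Hypothetical input encoding: gene identity as a learned embedding plus the
# expression level projected from a scalar into the same space (assumed additive).
import torch
import torch.nn as nn

class GeneExpressionEmbedding(nn.Module):
    def __init__(self, n_genes, dim):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, dim)   # one vector per gene token
        self.expr_proj = nn.Linear(1, dim)           # continuous scalar -> dim

    def forward(self, gene_ids, expr):
        # gene_ids: (batch, n_genes) int64; expr: (batch, n_genes) float
        return self.gene_emb(gene_ids) + self.expr_proj(expr.unsqueeze(-1))
```

Encoding expression as a continuous feature avoids discretizing counts into bins, a common alternative in this model family.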

Applications

CellFM is suited for computational biologists analyzing human single-cell RNA sequencing data. Its primary use cases include automated cell type annotation for large-scale atlas projects and rare cell type identification, perturbation response prediction for drug discovery and functional genomics screens, gene ontology function inference from expression context, and gene regulatory network reconstruction from learned co-expression embeddings. The model can also generate unified cell embeddings across heterogeneous datasets for batch-corrected comparison of cell states.
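
As a hypothetical illustration of the annotation workflow, the sketch below attaches a linear classifier to mean-pooled cell embeddings from a pre-trained backbone. The encoder interface, its output shape, and mean pooling are all assumed scaffolding; the project's GitHub repository documents the actual fine-tuning interface.

```python
# Hypothetical cell type annotation head (assumed interfaces throughout).
import torch
import torch.nn as nn

class AnnotationHead(nn.Module):
    def __init__(self, pretrained_encoder, dim, n_cell_types):
        super().__init__()
        self.encoder = pretrained_encoder      # pre-trained backbone (may be frozen)
        self.classifier = nn.Linear(dim, n_cell_types)

    def forward(self, gene_ids, expr):
        h = self.encoder(gene_ids, expr)       # assumed shape: (batch, n_genes, dim)
        cell_emb = h.mean(dim=1)               # pool genes into one cell embedding
        return self.classifier(cell_emb)       # logits over cell type labels
```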

Impact

CellFM's publication in Nature Communications and the availability of pre-trained weights on HuggingFace have made it accessible to the broader single-cell community. Its scale and human specificity offer a meaningful benchmark advance over prior models, particularly on perturbation and gene function tasks where biological signal is subtle. Notable limitations constrain its scope: the model is not applicable to non-human organisms; it targets scRNA-seq specifically and does not natively handle ATAC-seq, spatial transcriptomics, or protein-level data; and performance may degrade with very low-depth sequencing where dropout effects are severe. The original MindSpore training environment adds friction for PyTorch-native workflows, though this is mitigated by the HuggingFace weight release.

Citation

CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells

Zeng, Y., et al. (2025) CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells. Nature Communications.

DOI: 10.1038/s41467-025-59926-5

Metrics

GitHub

Stars: 103
Forks: 16
Open Issues: 10
Contributors: 6
Last Push: 8 months ago
Language: Jupyter Notebook

Citations

Total Citations: 52
Influential: 4
References: 80

HuggingFace

Downloads: 0
Likes: 6
Last Modified: 1 year ago

Tags

transformer, foundation model, cell biology, transcriptomics

Resources

GitHub Repository
Research Paper
HuggingFace Model