GeneCompass

Knowledge-informed cross-species foundation model pre-trained on 101 million human and mouse single-cell transcriptomes to decipher gene regulation.

Released: October 2024

GeneCompass is a cross-species foundation model for single-cell transcriptomics. It addresses a core limitation in the field: most single-cell models are trained on data from a single organism, which limits their ability to uncover regulatory principles conserved across species. The model was developed by researchers at three institutes of the Chinese Academy of Sciences (the Institute of Zoology, the Institute of Computing Technology, and the Institute of Automation) and pre-trained on the scCompass-126M dataset. After quality filtering, the training corpus comprised 101,768,420 single-cell transcriptomes (53.6 million human cells and 48.2 million mouse cells) spanning 17,465 homologous genes, making it one of the largest cross-species training corpora assembled for a single-cell foundation model at the time of publication.

What distinguishes GeneCompass from contemporaries such as Geneformer and scGPT is its explicit integration of four categories of biological prior knowledge directly into the pre-training procedure. Rather than treating gene expression profiles as sequences of unlabeled tokens, GeneCompass encodes each gene with embeddings that capture its regulatory relationships, promoter sequence context, gene family membership, and co-expression patterns. This knowledge-informed design allows the model to learn biologically grounded gene representations during self-supervised pre-training, without requiring task-specific supervision. The model was published in Cell Research in December 2024 after an earlier preprint in September 2023.

The work includes a notable experimental validation: GeneCompass-predicted transcription factors (NR5A1, GATA4, WT1, TCF21, NR2F1) successfully induced differentiation of human embryonic stem cells toward gonadal progenitor lineages, confirmed by immunofluorescence and transcriptome analysis showing upregulation of steroid synthesis genes. This wet-lab validation provides direct evidence that computational predictions from the model can generate actionable experimental hypotheses.

Key Features

Four-category knowledge integration: Prior biological knowledge from gene regulatory networks (GRNs built from ENCODE data via PECA2), promoter sequences (DNA-BERT fine-tuned on 2,500-base windows around the TSS), gene family annotations (HGNC database), and co-expression relationships (Pearson correlation > 0.8) is encoded as gene-level embeddings during pre-training.
Cross-species pre-training: Joint training on 101M human and mouse transcriptomes lets the model capture regulatory logic conserved across mammals. On cross-species cell annotation of retina data using CAME, it outperforms single-species models by up to 7.5%.
Two model scales: GeneCompass is released as GeneCompass-Small (6-layer transformer) and GeneCompass-Base (12-layer transformer with 768-dimensional embeddings), allowing users to balance computational cost against performance.
Versatile fine-tuning: The pre-trained backbone can be adapted to cell-type annotation, gene regulatory network inference, gene perturbation prediction, drug dose-response modeling, and gene dosage sensitivity prediction.
Experimental validation: Candidate transcription factors identified by GeneCompass for gonadal lineage specification were experimentally confirmed to induce the target cell fate in human embryonic stem cells, closing the loop between computational prediction and biological validation.
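
The co-expression criterion in the feature list (Pearson correlation > 0.8) can be illustrated with a minimal sketch. This is not the authors' pipeline: the gene names, the toy expression values, and the helper names `pearson` and `coexpressed_pairs` are all hypothetical, and any normalization or filtering the real workflow applies is omitted.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant gene carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(expression, threshold=0.8):
    """Return (gene1, gene2, r) for pairs whose correlation exceeds `threshold`.

    `expression` maps a gene name to its expression values across cells.
    """
    pairs = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r > threshold:
            pairs.append((g1, g2, r))
    return pairs


# Hypothetical toy data: three genes measured in five cells.
toy_expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [1.1, 2.1, 2.9, 4.2, 5.0],
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = coexpressed_pairs(toy_expression)
```

In this toy example only GENE_A and GENE_B rise and fall together, so they form the single pair that passes the threshold. At the scale of thousands of genes, a vectorized correlation matrix would replace the pairwise loop.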

Technical Details

GeneCompass is built on a transformer encoder architecture, available in two configurations: a 6-layer Small model and a 12-layer Base model with 768-dimensional hidden representations. Both variants use self-attention, which can capture long-range pairwise relationships between genes within a cellular context. Each input cell is represented as a ranked list of up to 2,048 genes, ordered by expression level, with a species token prepended to condition the model on the organism of origin. Before entering the transformer stack, each gene token is augmented with embeddings from the four prior knowledge sources: GRN connections (derived from 84 mouse and 76 human cell-type-specific regulatory networks), promoter context vectors (from DNA-BERT applied to 2,500 bp windows around transcription start sites), gene family membership (1,645 human and 1,539 mouse families from HGNC), and co-expression associations.
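
The input representation described above (a species token followed by genes ranked by expression, truncated to 2,048) can be sketched as follows. The vocabulary, the `<human>`/`<mouse>` token spellings, the tie-breaking rule, and the choice to drop unexpressed or out-of-vocabulary genes are assumptions made for illustration. The sketch also leaves out any expression normalization and the prior-knowledge embeddings.

```python
MAX_GENES = 2048  # maximum number of ranked genes per cell, as stated in the text


def rank_encode(cell, species, vocab, max_genes=MAX_GENES):
    """Encode one cell as [species token id] + gene token ids ranked by expression.

    `cell` maps gene name -> expression value. Genes with zero expression or
    missing from `vocab` are dropped (an assumption of this sketch).
    """
    expressed = [(g, v) for g, v in cell.items() if v > 0 and g in vocab]
    # Highest expression first; ties broken by gene name for determinism.
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    gene_ids = [vocab[g] for g, _ in expressed[:max_genes]]
    return [vocab[f"<{species}>"]] + gene_ids


# Hypothetical vocabulary and cell, for illustration only.
vocab = {"<human>": 0, "<mouse>": 1, "GENE_A": 2, "GENE_B": 3, "GENE_C": 4}
cell = {"GENE_A": 0.5, "GENE_B": 3.0, "GENE_C": 0.0, "GENE_X": 9.0}
tokens = rank_encode(cell, "human", vocab)
```

Here GENE_B outranks GENE_A, GENE_C is dropped because it is not expressed, and GENE_X is dropped because it is not in the toy vocabulary, so the cell becomes the species token followed by two gene tokens.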

Pre-training used a masked gene modeling objective on the scCompass-126M corpus, which aggregated data from GEO, ArrayExpress, CNCB, and CELLxGENE. On downstream benchmarks, GeneCompass-Base improved macro-F1 for cell-type annotation by 3–8% over Geneformer on human datasets and by 10–19% on mouse datasets. For gene perturbation prediction, the model achieved a 15.4% reduction in MSE and a 2.2% improvement in Spearman correlation on the top-20 differentially expressed genes relative to baseline models. On gene dosage sensitivity prediction it reached an AUC of 0.95, and on GRN inference it outperformed scGPT and DeepSEM as measured by AUPRC.
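
The masked gene modeling objective mentioned above can be illustrated with a small sketch of the masking step only. The 15% masking rate, the `MASK_ID` value, and the rule that the leading species token is never masked are assumptions borrowed from common masked-modeling practice, not details given in the source.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the mask token


def mask_tokens(tokens, mask_fraction=0.15, seed=0, protected=1):
    """Hide a random subset of gene tokens and return (inputs, labels).

    The first `protected` positions (for example a species token) are never
    masked. labels[i] holds the original id at masked positions, else None.
    """
    rng = random.Random(seed)
    candidates = list(range(protected, len(tokens)))
    n_mask = max(1, round(mask_fraction * len(candidates))) if candidates else 0
    masked_positions = set(rng.sample(candidates, n_mask))
    inputs = [MASK_ID if i in masked_positions else t for i, t in enumerate(tokens)]
    labels = [t if i in masked_positions else None for i, t in enumerate(tokens)]
    return inputs, labels


# One species token (id 0) followed by 20 hypothetical gene token ids.
tokens = [0] + list(range(10, 30))
inputs, labels = mask_tokens(tokens)
```

During pre-training a model would see `inputs` and be scored only on the positions where `labels` is not None, which is what lets it learn gene context without task-specific supervision.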

Applications

GeneCompass is intended for researchers working in single-cell genomics who need to characterize transcriptional regulation, annotate cell types, or predict the effects of genetic perturbations. Cell-type annotation workflows benefit from the model's cross-species representations, which enable transfer of annotations from well-studied organisms to less-characterized ones. Developmental biologists can use the GRN inference and in silico perturbation modules to generate ranked lists of candidate transcription factors for driving cell fate transitions, as demonstrated by the gonadal lineage experiment. Pharmacogenomics groups can apply the drug dose-response module to predict cellular responses to compound treatment at the transcriptomic level. The model is compatible with the GEARS framework for perturbation prediction, broadening integration with existing computational pipelines.
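
To make the idea of ranking candidate transcription factors by in silico perturbation concrete, here is a minimal sketch under strong assumptions. `embed_cell` is a toy stand-in for a trained model (a rank-weighted mean of made-up gene vectors). Modeling "overexpression" by moving a gene to the top of the ranked list is an illustrative choice, and the gene names and vectors are hypothetical. The actual GeneCompass perturbation module may work differently.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed_cell(ranked_genes, gene_vectors):
    """Toy stand-in for a model: rank-weighted mean of per-gene vectors."""
    dim = len(next(iter(gene_vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for rank, gene in enumerate(ranked_genes):
        w = 1.0 / (rank + 1)  # higher-ranked (more expressed) genes weigh more
        weight_sum += w
        for k, x in enumerate(gene_vectors[gene]):
            total[k] += w * x
    return [t / weight_sum for t in total]


def rank_candidates(ranked_genes, candidates, target_embedding, gene_vectors):
    """Sort candidates by how much moving each to rank 1 shifts the cell toward the target."""
    baseline = cosine(embed_cell(ranked_genes, gene_vectors), target_embedding)
    scores = {}
    for gene in candidates:
        perturbed = [gene] + [g for g in ranked_genes if g != gene]
        shifted = cosine(embed_cell(perturbed, gene_vectors), target_embedding)
        scores[gene] = shifted - baseline
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical two-dimensional gene vectors and a target state along the first axis.
gene_vectors = {"TF_A": [1.0, 0.0], "TF_B": [0.0, 1.0], "GENE_C": [0.5, 0.5]}
start_cell = ["GENE_C", "TF_B", "TF_A"]
ranking = rank_candidates(start_cell, ["TF_A", "TF_B"], [1.0, 0.0], gene_vectors)
```

In the toy setup, promoting TF_A moves the cell embedding toward the target while promoting TF_B moves it away, so TF_A is ranked first. With a real model, the same loop would produce the kind of ranked candidate list that a wet-lab experiment could then test.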

Impact

GeneCompass contributes to a growing body of work demonstrating that foundation models pre-trained on large single-cell corpora can generalize across diverse biological tasks. By combining scale (101M cells) with explicit biological prior knowledge, the model offers an alternative design philosophy to purely data-driven approaches like scGPT and Geneformer, treating biological knowledge and scale as complementary rather than mutually exclusive. The experimental validation of predicted transcription factors in a living cell system is an important proof of concept that model outputs can guide real experimental decisions. One limitation is that GeneCompass, like all current single-cell foundation models, focuses on transcriptomics and does not directly model epigenomics, spatial context, or protein-level phenotypes. The cross-species design is also currently restricted to human and mouse; extending it to other organisms would require additional data collection and homolog mapping. Code and pre-trained checkpoints for both the Small and Base variants are publicly available through the xCompass-AI GitHub organization.

Citation

GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model

Yang, X., et al. (2024) GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Research.

DOI: 10.1038/s41422-024-01034-y

Recent citations

Papers that recently cited this model.

SCTGE infers transformer-based graph embeddings to improve cell-cell interaction identification and cell identity annotations.
Yichong Si, Chenxi Li, Mingguang Shi
Computational biology and chemistry · Aug 2026
0
A Survey of Single-Cell Clustering Methods Based on Encoder Types: Embedded, Topological, and Multi-Modal
Dayu Hu, Fengyue Zhang, Zhixiang Wang, et al.
Archives of Computational Methods in Engineering · Jul 2026
0
Tokenizing single-cell transcriptomes as a native language for large language models
Chuxi Xiao, Yuang Ding, Haiyang Bian, et al.
bioRxiv · Jul 2026
0

Top citations

The most-cited papers that cite this model.

Profiling cell identity and tissue architecture with single-cell and spatial transcriptomics
G. Gulati, J. P. D'Silva, Yunhe Liu, et al.
Nature reviews. Molecular cell biology · Aug 2024
186
Transformers in single-cell omics: a review and new perspectives
Artur Szałata, Karin Hrovatin, Sören Becker, et al.
Nature Methods · Aug 2024
141 · Influential
Nicheformer: a foundation model for single-cell and spatial omics
A. Schaar, Alejandro Tejada-Lapuerta, G. Palla, et al.
bioRxiv · Oct 2024
124 · Influential
A Comprehensive Survey of Foundation Models in Medicine
Wasif Khan, Seowung Leem, Kyle B. See, et al.
IEEE Reviews in Biomedical Engineering · Jun 2024
115
AI-driven multi-omics integration for multi-scale predictive modeling of genotype-environment-phenotype relationships
You Wu, Lei Xie
Computational and Structural Biotechnology Journal · Jul 2024
114 · Influential

Citations

Total Citations: 137
Influential: 13
References: 76

GitHub

Stars: 119
Forks: 22
Open Issues: 20
Contributors: 3
Last Push: 5mo ago
Language: Jupyter Notebook

Fields of citing research

Computer Science: 94%
Biology: 87%
Medicine: 64%
Environmental Science: 3%
Engineering: 3%
Materials Science: 2%
Linguistics: 2%
Chemistry: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 32 (Closed)
Usability (can I run it?): 27
Reproducibility (can I retrain it?): 23
Model Openness Framework: Unclassified (restrictive license on core components)

