Knowledge-informed cross-species foundation model pre-trained on 101M human and mouse single-cell transcriptomes to decipher universal gene regulatory mechanisms.
GeneCompass is a cross-species foundation model for single-cell transcriptomics that addresses a core challenge in the field: most single-cell models are trained on data from a single organism, which limits their ability to uncover regulatory principles conserved across species. Developed by researchers across three institutes of the Chinese Academy of Sciences (the Institute of Zoology, the Institute of Computing Technology, and the Institute of Automation), GeneCompass was pre-trained on the scCompass-126M dataset. After quality filtering, 101,768,420 single-cell transcriptomes remained: roughly 53.6 million human cells and 48.2 million mouse cells spanning 17,465 homologous genes, making it one of the largest cross-species training corpora assembled for a single-cell foundation model at the time of publication.
What distinguishes GeneCompass from contemporaries such as Geneformer and scGPT is its explicit integration of four categories of biological prior knowledge directly into the pre-training procedure. Rather than treating gene expression profiles as sequences of unlabeled tokens, GeneCompass encodes each gene with embeddings that capture its regulatory relationships, promoter sequence context, gene family membership, and co-expression patterns. This knowledge-informed design allows the model to learn biologically grounded gene representations during self-supervised pre-training, without requiring task-specific supervision. The model was published in Cell Research in December 2024 after an earlier preprint in September 2023.
The work includes a notable experimental validation: GeneCompass-predicted transcription factors (NR5A1, GATA4, WT1, TCF21, NR2F1) successfully induced differentiation of human embryonic stem cells toward gonadal progenitor lineages, confirmed by immunofluorescence and transcriptome analysis showing upregulation of steroid synthesis genes. This wet-lab validation provides direct evidence that computational predictions from the model can generate actionable experimental hypotheses.
GeneCompass is built on a transformer encoder architecture, available in two configurations: a 6-layer Small model and a 12-layer Base model with 768-dimensional hidden representations. Both variants use a self-attention mechanism that can capture long-range pairwise relationships between genes within a cellular context. Input cells are represented as ranked lists of up to 2,048 genes, ordered by expression level. Before entering the transformer stack, each gene token is augmented with embeddings from the four prior knowledge sources: GRN connections (derived from 84 mouse and 76 human cell-type-specific regulatory networks), promoter context vectors (from DNABERT applied to 2,500 bp windows around transcription start sites), gene family membership (1,645 human and 1,539 mouse families from HGNC), and co-expression associations. A species token is prepended to each input cell to condition the model on the organism of origin.
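The input construction described above can be illustrated with a minimal NumPy sketch. It is not taken from the GeneCompass code base. The token ids, the choice to drop unexpressed genes, the toy embedding widths, and the concatenate-then-project fusion are all assumptions made for illustration; only the 2,048-gene limit, the 768-dimensional hidden size, the expression-based ranking, and the prepended species token come from the description.

```python
import numpy as np

MAX_GENES = 2048                           # context length given in the text
HIDDEN_DIM = 768                           # hidden size given in the text
SPECIES_TOKENS = {"human": 0, "mouse": 1}  # hypothetical token ids
GENE_ID_OFFSET = len(SPECIES_TOKENS)       # hypothetical: gene ids follow species ids


def encode_cell(expression, species, max_genes=MAX_GENES):
    """Return [species token, gene ids ranked by descending expression]."""
    expression = np.asarray(expression, dtype=float)
    expressed = np.flatnonzero(expression > 0)  # assumption: skip unexpressed genes
    # Stable sort on negated values: highest expression first, ties keep gene order.
    order = expressed[np.argsort(-expression[expressed], kind="stable")]
    gene_tokens = order[:max_genes] + GENE_ID_OFFSET
    return np.concatenate(([SPECIES_TOKENS[species]], gene_tokens))


def fuse_gene_embeddings(gene_emb, priors, projection):
    """Concatenate gene and prior-knowledge embeddings, then project linearly.

    Assumed fusion scheme; the actual model may combine the priors differently.
    """
    stacked = np.concatenate([gene_emb, *priors], axis=1)
    return stacked @ projection


tokens = encode_cell([0.0, 5.0, 1.0, 3.0], "mouse")

rng = np.random.default_rng(0)
n_genes = len(tokens) - 1                                   # exclude the species token
gene_emb = rng.normal(size=(n_genes, 16))                   # toy gene-ID embeddings
priors = [rng.normal(size=(n_genes, 8)) for _ in range(4)]  # GRN, promoter, family, co-expression
projection = rng.normal(size=(16 + 4 * 8, HIDDEN_DIM))
fused = fuse_gene_embeddings(gene_emb, priors, projection)
print(tokens.tolist(), fused.shape)  # [1, 3, 5, 4] (3, 768)
```

In this toy cell, gene 1 has the highest expression and so follows the mouse species token directly, while gene 0 is unexpressed and is dropped under the stated assumption.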
Pre-training used a masked gene modeling objective on the scCompass-126M corpus, which aggregated data from GEO, ArrayExpress, CNCB, and CELLxGENE. On downstream benchmarks, GeneCompass-Base improved macro-F1 for cell-type annotation by 3–8% over Geneformer on human datasets and 10–19% on mouse datasets. For gene perturbation prediction, the model achieved a 15.4% reduction in MSE and a 2.2% improvement in Spearman correlation for top-20 differentially expressed genes relative to baseline models. On gene dosage sensitivity prediction, it reached an AUC of 0.95, and on GRN inference it outperformed scGPT and DeepSEM on AUPRC.
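As a rough illustration of a masked gene modeling objective, the sketch below hides a random subset of gene tokens and records the original ids as prediction targets. The 15% mask rate, the `MASK_ID` and `IGNORE` values, and the function name `mask_genes` are assumptions chosen for the example rather than details confirmed by the source.

```python
import numpy as np

MASK_ID = -1    # hypothetical id for the mask token
IGNORE = -100   # hypothetical label for positions excluded from the loss


def mask_genes(tokens, mask_rate=0.15, rng=None):
    """Mask a random subset of gene tokens, leaving the species token intact.

    Returns (masked_tokens, labels). Labels hold the original id at masked
    positions and IGNORE everywhere else.
    """
    rng = np.random.default_rng() if rng is None else rng
    tokens = np.asarray(tokens)
    n_genes = len(tokens) - 1                  # position 0 is the species token
    if n_genes < 1:
        raise ValueError("need at least one gene token after the species token")
    n_mask = max(1, round(mask_rate * n_genes))
    picked = 1 + rng.choice(n_genes, size=n_mask, replace=False)
    masked = tokens.copy()
    labels = np.full_like(tokens, IGNORE)
    labels[picked] = tokens[picked]
    masked[picked] = MASK_ID
    return masked, labels


tokens = np.array([1] + list(range(10, 30)))   # species token followed by 20 gene ids
masked, labels = mask_genes(tokens, rng=np.random.default_rng(0))
print(int((masked == MASK_ID).sum()))          # 3 of the 20 gene tokens are masked
```

During pre-training, a model would be asked to recover the entries of `labels` that are not `IGNORE` from the surrounding unmasked context.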
GeneCompass is intended for researchers working in single-cell genomics who need to characterize transcriptional regulation, annotate cell types, or predict the effects of genetic perturbations. Cell-type annotation workflows benefit from the model's cross-species representations, which enable transfer of annotations from well-studied organisms to less-characterized ones. Developmental biologists can use the GRN inference and in silico perturbation modules to generate ranked lists of candidate transcription factors for driving cell fate transitions, as demonstrated by the gonadal lineage experiment. Pharmacogenomics groups can apply the drug dose-response module to predict cellular responses to compound treatment at the transcriptomic level. The model is compatible with the GEARS framework for perturbation prediction, broadening integration with existing computational pipelines.
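One common way to turn in silico perturbation into a ranked candidate list is to perturb each gene in turn and score how far the cell embedding moves toward a target state. The sketch below shows that idea under stated assumptions: moving a gene to the top of the ranked list stands in for overexpression, `toy_embed` is a hypothetical placeholder for a real model's cell embedding, and none of the function names come from the GeneCompass code base.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def overexpress(genes, gene):
    """Move `gene` to the top of the ranked list (rank-based proxy for overexpression)."""
    return [gene] + [g for g in genes if g != gene]


def rank_candidates(genes, candidates, target, embed):
    """Order candidates by how much overexpressing each moves the cell toward `target`."""
    base = cosine(embed(genes), target)
    gains = {g: cosine(embed(overexpress(genes, g)), target) - base for g in candidates}
    return sorted(gains, key=gains.get, reverse=True)


def toy_embed(genes, n_genes=5):
    """Hypothetical stand-in for the model: weight each gene by 1 / rank."""
    vec = np.zeros(n_genes)
    vec[list(genes)] = 1.0 / np.arange(1, len(genes) + 1)
    return vec


cell = [0, 1, 2, 3, 4]     # gene ids, highest expression first
target = np.eye(5)[4]      # toy target state dominated by gene 4
ranked = rank_candidates(cell, candidates=[2, 3, 4], target=target, embed=toy_embed)
print(ranked[0])           # 4: promoting gene 4 moves the cell closest to the target
```

With a trained model in place of `toy_embed`, the same loop would yield a ranked shortlist of transcription factors to test experimentally.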
GeneCompass contributes to a growing body of work demonstrating that foundation models pre-trained on large single-cell corpora can generalize across diverse biological tasks. By combining scale (101M cells) with explicit biological prior knowledge, the model offers an alternative design philosophy to purely data-driven approaches like scGPT and Geneformer: biological knowledge and scale are complementary, not mutually exclusive. The experimental validation of predicted transcription factors in a living cell system is an important proof of concept showing that model outputs can guide real experimental decisions. One limitation is that GeneCompass, like most current single-cell foundation models, focuses on transcriptomics and does not directly model epigenomics, spatial context, or protein-level phenotypes. The cross-species design is currently restricted to human and mouse; extension to other organisms would require additional data collection and homolog mapping. Code and pre-trained checkpoints for both Small and Base variants are publicly available through the xCompass-AI GitHub organization.
Yang, X., et al. (2024) GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Research. Preprint: bioRxiv (2023).
DOI: 10.1038/s41422-024-01034-y