bio.rodeo

Single-cell

GeneCompass

Chinese Academy of Sciences

Knowledge-informed cross-species foundation model pre-trained on 101M human and mouse single-cell transcriptomes to decipher universal gene regulatory mechanisms.

Released: 2024

Overview

GeneCompass is a cross-species foundation model for single-cell transcriptomics that addresses a core challenge in the field: most single-cell models are trained on data from a single organism, which limits their ability to uncover regulatory principles conserved across species. Developed by researchers across three institutes of the Chinese Academy of Sciences (the Institute of Zoology, the Institute of Computing Technology, and the Institute of Automation), GeneCompass was pre-trained on the scCompass-126M corpus of human and mouse single-cell transcriptomes. After quality filtering, 101,768,420 cells were retained, about 53.6 million human and 48.2 million mouse, spanning 17,465 homologous genes. At the time of publication this was one of the largest cross-species training corpora assembled for a single-cell foundation model.

What distinguishes GeneCompass from contemporaries such as Geneformer and scGPT is its explicit integration of four categories of biological prior knowledge directly into the pre-training procedure. Rather than treating gene expression profiles as sequences of unlabeled tokens, GeneCompass encodes each gene with embeddings that capture its regulatory relationships, promoter sequence context, gene family membership, and co-expression patterns. This knowledge-informed design allows the model to learn biologically grounded gene representations during self-supervised pre-training, without requiring task-specific supervision. The model was published in Cell Research in December 2024 after an earlier preprint in September 2023.

The work includes a notable experimental validation: GeneCompass-predicted transcription factors (NR5A1, GATA4, WT1, TCF21, NR2F1) successfully induced differentiation of human embryonic stem cells toward gonadal progenitor lineages, confirmed by immunofluorescence and transcriptome analysis showing upregulation of steroid synthesis genes. This wet-lab validation provides direct evidence that computational predictions from the model can generate actionable experimental hypotheses.

Key Features

  • Four-category knowledge integration: Four kinds of prior biological knowledge are encoded as gene-level embeddings during pre-training: gene regulatory networks (GRNs built from ENCODE data via PECA2), promoter sequences (DNABERT fine-tuned on 2,500-base windows around the TSS), gene family annotations (HGNC database), and co-expression relationships (Pearson correlation > 0.8).
  • Cross-species pre-training: Training on 101M jointly processed human and mouse transcriptomes enables the model to capture regulatory logic conserved across mammals. On cross-species cell annotation, a CAME-based evaluation on retina data shows gains of up to 7.5% over single-species models.
  • Two model scales: GeneCompass is released as GeneCompass-Small (6-layer transformer) and GeneCompass-Base (12-layer transformer with 768-dimensional embeddings), allowing users to balance computational cost against performance.
  • Versatile fine-tuning: The pre-trained backbone can be adapted to cell-type annotation, gene regulatory network inference, gene perturbation prediction, drug dose-response modeling, and gene dosage sensitivity prediction.
  • Experimental validation: Candidate transcription factors identified by GeneCompass for gonadal lineage specification were experimentally confirmed to induce the target cell fate in human embryonic stem cells, closing the loop between computational prediction and biological validation.
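The co-expression criterion in the feature list (Pearson correlation > 0.8) can be illustrated with a minimal sketch. This is not the authors' pipeline: the gene names, the toy expression values, and the helper names `pearson` and `coexpressed_pairs` are all hypothetical, and the real analysis presumably runs over far more cells and genes.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of expression values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant gene carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(expression, threshold=0.8):
    """Return gene pairs whose correlation across cells exceeds the threshold.

    expression: dict mapping gene name -> list of values over the same cells.
    """
    pairs = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r > threshold:
            pairs.append((g1, g2, round(r, 3)))
    return pairs


# Hypothetical toy data: three genes measured in four cells.
toy = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0],
    "GENE_C": [4.0, 1.0, 3.5, 0.5],
}
pairs = coexpressed_pairs(toy)
print(pairs)
```

Only the GENE_A/GENE_B pair passes the threshold here; both pairs involving GENE_C are negatively correlated and are dropped.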

Technical Details

GeneCompass is built on a transformer encoder architecture, available in two configurations: a 6-layer Small model and a 12-layer Base model with 768-dimensional hidden representations. Both variants use self-attention, which can capture long-range pairwise relationships between genes within a cellular context. Input cells are represented as ranked lists of up to 2,048 genes, ordered by expression level. Before entering the transformer stack, each gene token is augmented with embeddings from the four prior knowledge sources: GRN connections (derived from 84 mouse and 76 human cell-type-specific regulatory networks), promoter context vectors (from DNABERT applied to 2,500 bp windows around transcription start sites), gene family membership (1,645 human and 1,539 mouse families from HGNC), and co-expression associations. A species token is prepended to each input cell to condition the model on the organism of origin.
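The input representation described above (a species token followed by genes ranked by expression, capped at 2,048) can be sketched as follows. This is an illustration under stated assumptions, not the released tokenizer: the function name `encode_cell`, the `<species>` token format, the dropping of unexpressed genes, the tie-breaking rule, and the choice not to count the species token toward the 2,048 limit are all assumptions.

```python
MAX_GENES = 2048  # gene limit per cell stated in the text


def encode_cell(expression, species, max_genes=MAX_GENES):
    """Turn one cell's expression profile into a ranked token list.

    expression: dict mapping gene name -> expression value for this cell.
    species:    label used to build the prepended species token (assumed format).
    """
    # Assumption: genes with zero expression are omitted from the ranking.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Highest expression first; ties broken by gene name so output is deterministic.
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    ranked_genes = [gene for gene, _ in expressed[:max_genes]]
    return [f"<{species}>"] + ranked_genes


# Toy cell with made-up expression values.
cell = {"GATA4": 5.0, "WT1": 9.0, "NR5A1": 0.0, "TCF21": 5.0}
tokens = encode_cell(cell, "human")
print(tokens)
```

In the real model each gene token would additionally carry the four prior-knowledge embeddings; the sketch covers only the ranking and the species token.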

Pre-training used a masked gene modeling objective on the scCompass-126M corpus, which aggregated data from GEO, ArrayExpress, CNCB, and CELLxGENE. On downstream benchmarks, GeneCompass-Base improved macro-F1 for cell-type annotation by 3–8% over Geneformer on human datasets and 10–19% on mouse datasets. For gene perturbation prediction, the model achieved a 15.4% reduction in MSE and a 2.2% improvement in Spearman correlation for top-20 differentially expressed genes relative to baseline models. On gene dosage sensitivity prediction it reached an AUC of 0.95, and on GRN inference it outperformed scGPT and DeepSEM as measured by AUPRC.
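For readers less familiar with the metrics quoted above, the sketch below shows how macro-F1 and a relative error reduction are computed. The labels and error values are invented for illustration and are not results from the paper. Reading the 15.4% MSE figure as a relative reduction is an assumption, and `macro_f1` and `relative_reduction` are hypothetical helper names.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_reduction(baseline, model):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - model) / baseline


# Invented cell-type labels for six cells.
y_true = ["T", "T", "B", "B", "NK", "NK"]
y_pred = ["T", "T", "B", "NK", "NK", "NK"]
print(macro_f1(y_true, y_pred))

# Invented MSE values chosen so that the relative reduction is 15.4%.
print(relative_reduction(1.0, 0.846))
```

Because macro-F1 weights every class equally, it penalizes mistakes on rare cell types more than plain accuracy does.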

Applications

GeneCompass is intended for researchers working in single-cell genomics who need to characterize transcriptional regulation, annotate cell types, or predict the effects of genetic perturbations. Cell-type annotation workflows benefit from the model's cross-species representations, which enable transfer of annotations from well-studied organisms to less-characterized ones. Developmental biologists can use the GRN inference and in silico perturbation modules to generate ranked lists of candidate transcription factors for driving cell fate transitions, as demonstrated by the gonadal lineage experiment. Pharmacogenomics groups can apply the drug dose-response module to predict cellular responses to compound treatment at the transcriptomic level. The model is compatible with the GEARS framework for perturbation prediction, broadening integration with existing computational pipelines.
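One common way to turn in silico perturbation into a ranked candidate list is to score each perturbed transcription factor by how far it moves a cell embedding toward a target state. The sketch below illustrates that general idea only: it is not the GeneCompass perturbation module, the two-dimensional vectors stand in for model embeddings, and the names `cosine`, `rank_candidates`, `TF_X`, `TF_Y`, and `TF_Z` are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(source, target, perturbed):
    """Rank factors by the gain in similarity to the target state.

    source:    embedding of the unperturbed starting cell state.
    target:    embedding of the desired cell state.
    perturbed: dict mapping factor name -> embedding after perturbing it.
    """
    baseline = cosine(source, target)
    shifts = {tf: cosine(vec, target) - baseline for tf, vec in perturbed.items()}
    return sorted(shifts.items(), key=lambda kv: kv[1], reverse=True)


# Toy embeddings; real ones would come from the fine-tuned model.
source = [1.0, 0.0]
target = [0.0, 1.0]
perturbed = {
    "TF_X": [0.5, 0.5],
    "TF_Y": [1.0, 0.1],
    "TF_Z": [0.9, -0.2],
}
ranking = rank_candidates(source, target, perturbed)
for tf, shift in ranking:
    print(tf, round(shift, 3))
```

A factor with a negative shift, like TF_Z here, moves the cell away from the target state and would fall to the bottom of the list.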

Impact

GeneCompass contributes to a growing body of work demonstrating that foundation models pre-trained on large single-cell corpora can generalize across diverse biological tasks. By combining scale (101M cells) with explicit biological prior knowledge, the model offers an alternative design philosophy to purely data-driven approaches like scGPT and Geneformer: biological knowledge and scale are complementary, not mutually exclusive. The experimental validation of predicted transcription factors in a living cell system is an important proof of concept, showing that model outputs can guide real experimental decisions. One limitation is that GeneCompass, like most current single-cell foundation models, focuses on transcriptomics and does not directly model epigenomics, spatial context, or protein-level phenotypes. The cross-species design is also currently restricted to human and mouse; extending it to other organisms would require additional data collection and homolog mapping. Code and pre-trained checkpoints for both the Small and Base variants are publicly available through the xCompass-AI GitHub organization.

Citation

Yang, X., et al. (2024). GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Research. Earlier preprint: bioRxiv (2023).

DOI: 10.1038/s41422-024-01034-y

Metrics

GitHub

Stars: 115
Forks: 23
Open Issues: 20
Contributors: 3
Last Push: 2mo ago
Language: Jupyter Notebook

Citations

Total Citations: 116
Influential: 11
References: 75

Tags

regulatory genomics, foundation model, cross-species, transcriptomics

Resources

GitHub Repository
Research Paper