Guangzhou Medical University / Guangzhou National Laboratory
A foundation model using Mamba and cross-attention to capture genome-wide CpG methylation dependencies in single-cell whole-genome bisulfite sequencing data.
scDNAm-GPT is a foundation model for single-cell DNA methylation analysis, developed by researchers at Guangzhou Medical University and Guangzhou National Laboratory and first posted to bioRxiv in February 2025. It addresses a persistent gap in single-cell epigenomics: while transformer-based foundation models have reshaped single-cell RNA sequencing analysis, single-cell whole-genome bisulfite sequencing (scWGBS) has lacked a comparable general-purpose model. The central challenge is scale: a single methylome contains millions of CpG sites, far exceeding the context length that standard transformers can process efficiently.
The model captures genome-wide CpG methylation dependencies across extremely long genomic sequences, enabling a single pretrained backbone to support multiple downstream tasks without task-specific retraining. Rather than treating methylation as a fixed feature matrix, scDNAm-GPT learns representations directly from the raw methylome, allowing it to generalize across tissues, species, and analysis goals.
scDNAm-GPT was trained on over one million single cells spanning 35 human and mouse tissues, making it one of the first broadly applicable foundation models purpose-built for the single-cell methylation modality. It is released as open source, positioning it alongside RNA-focused single-cell foundation models while extending the foundation-model paradigm into the epigenetic layer of cellular identity.
scDNAm-GPT pairs a Mamba selective state space model with cross-attention to model long-range dependencies among CpG sites efficiently. State space models scale near-linearly with sequence length, allowing the architecture to ingest sequences far longer than standard transformers can handle while retaining the ability to relate distant methylation events. Pretraining used scWGBS data from over one million single cells across 35 human and mouse tissues, and the authors report strong cell-type classification accuracy across human-body and brain cell types. The repository provides three model variants: a human/mouse brain model, a human body/mouse model, and a compact "small" model. Their weights are distributed via Google Drive, and tutorial notebooks demonstrate clustering, expression prediction, trajectory inference, and cfDNA deconvolution.
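To make the pairing of a selective state space scan with cross-attention more concrete, the following is a minimal toy sketch in plain Python. It is not the scDNAm-GPT implementation: the scalar state, the parameter names (`A`, `B`, `C`, `w_delta`, `b_delta`), the single scalar query, and the way the two parts are wired together are all assumptions made for illustration. The point it shows is that the scan visits each position once, so its cost grows linearly with sequence length, and that a query can then pool over the scanned features with softmax weights.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def selective_scan(x, A=-1.0, B=1.0, C=1.0, w_delta=1.0, b_delta=0.0):
    """Toy scalar selective state space recurrence (illustrative only).

    The step size delta depends on the current input, which is what makes
    the recurrence "selective". One pass over x, so cost is O(len(x)).
    """
    h = 0.0
    ys = []
    for x_t in x:
        delta = softplus(w_delta * x_t + b_delta)  # input-dependent step size
        a_bar = math.exp(delta * A)                # decay in (0, 1) because A < 0
        h = a_bar * h + delta * B * x_t            # update the hidden state
        ys.append(C * h)                           # read out one feature per position
    return ys

def cross_attention(query, keys, values):
    """Single scalar query attending over a sequence of scalar keys/values."""
    scores = [query * k for k in keys]
    m = max(scores)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]            # softmax over positions
    pooled = sum(w * v for w, v in zip(weights, values))
    return pooled, weights

# Hypothetical binary methylation calls at eight consecutive CpG sites.
methylation = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
features = selective_scan(methylation)
summary, weights = cross_attention(query=2.0, keys=features, values=features)
print(features)
print(summary)
```

In a real model the state, inputs, and queries would be vectors with learned parameters; the scalar version only keeps the control flow visible.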
scDNAm-GPT supports researchers studying epigenetic regulation, cell-type identity, and development through single-cell methylation data. Its zero-shot gene expression prediction lets investigators link methylation states to transcriptional output without paired multi-omic measurements, while trajectory inference aids studies of differentiation and lineage commitment. The cell-free DNA deconvolution capability is particularly relevant to liquid-biopsy and non-invasive diagnostics, where estimating the tissue of origin of circulating methylation signals can inform cancer detection and monitoring.
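The tissue-of-origin idea behind cfDNA deconvolution can be illustrated with a classical baseline: treat the observed methylation level at each marker site as a weighted mix of per-tissue reference levels, then solve for non-negative weights. The sketch below does this with projected gradient descent in plain Python. It is not how scDNAm-GPT performs deconvolution, which is not described here; the function name `deconvolve`, the reference values, and the learning rate are hypothetical and chosen only for this toy example.

```python
def deconvolve(reference, observed, steps=200, lr=0.3):
    """Estimate non-negative tissue fractions by projected gradient descent.

    reference[i][j]: methylation level of tissue j at marker site i.
    observed[i]: methylation level measured in the mixture at site i.
    The learning rate is tuned for the small toy matrix below.
    """
    n_sites, n_tissues = len(reference), len(reference[0])
    p = [1.0 / n_tissues] * n_tissues  # start from a uniform guess
    for _ in range(steps):
        # Residual of the linear mixture model at each site.
        residual = [
            sum(reference[i][j] * p[j] for j in range(n_tissues)) - observed[i]
            for i in range(n_sites)
        ]
        # Gradient of 0.5 * ||reference @ p - observed||^2 with respect to p.
        grad = [
            sum(reference[i][j] * residual[i] for i in range(n_sites))
            for j in range(n_tissues)
        ]
        # Gradient step, then project back onto the non-negative orthant.
        p = [max(0.0, p[j] - lr * grad[j]) for j in range(n_tissues)]
    total = sum(p)
    return [v / total for v in p]  # report fractions that sum to 1

# Made-up reference profiles: six marker sites (rows) by three tissues (columns).
reference = [
    [0.9, 0.1, 0.1],
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.1, 0.1, 0.9],
]
true_fractions = [0.6, 0.3, 0.1]
# Simulate a noise-free mixture from the known fractions.
observed = [sum(r * f for r, f in zip(row, true_fractions)) for row in reference]
estimated = deconvolve(reference, observed)
print(estimated)
```

With noise-free input and well-separated reference profiles, the estimate converges to the simulated fractions; real cfDNA data would add sequencing noise, missing sites, and unknown tissue components.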
By bringing the foundation-model paradigm to single-cell whole-genome bisulfite sequencing, scDNAm-GPT helps close a gap between the rapidly maturing ecosystem of RNA-based single-cell models and the comparatively underserved methylation modality. Its use of a state space backbone to handle genome-scale CpG context offers a practical template for modeling other ultra-long biological sequences. The work is currently a preprint with openly released code (MIT license) and pretrained weights, so its long-term influence and benchmark standing remain to be established through peer review and independent evaluation. It nonetheless represents a notable early step toward general-purpose single-cell epigenomic models.
Liang, C., et al. (2025) scDNAm-GPT Captures Genome-wide CpG Dependencies in Single-cell DNA methylomes to Revolutionize Epigenetic Analysis. bioRxiv.
DOI: 10.1101/2025.02.19.638959
Papers that recently cited this model.
Aymane Aghziel, M. A. Mahraz, H. Tairi, et al.
Briefings Bioinform. · Aug 2025