bio.rodeo
DNA & Gene

CpG Transformer

Ghent University

Transformer model for imputing single-cell DNA methylation from sparse bisulfite sequencing data, combining axial attention with sliding window self-attention for scalable CpG-level imputation.

Released: 2022

Overview

CpG Transformer is a transformer-based model developed by Gaetan De Waele, Jim Clauwaert, Gerben Menschaert, and Willem Waegeman at Ghent University's Department of Data Analysis and Mathematical Modelling, published in Bioinformatics in January 2022 (volume 38, issue 3, pages 597–603). It addresses a pervasive technical limitation in single-cell DNA methylation sequencing: bisulfite sequencing protocols — both single-cell bisulfite sequencing (scBS-seq) and single-cell reduced representation bisulfite sequencing (scRRBS-seq) — typically cover only a small fraction of CpG sites in each cell, leaving the large majority of positions unmeasured. This sparsity renders standard downstream analyses such as clustering, trajectory inference, and association studies unreliable, motivating the development of robust imputation methods.

The core challenge in methylation imputation is simultaneously modeling two dimensions of structure: the genomic dimension (correlation between nearby CpG sites, often mediated by chromatin domains and DNA sequence context) and the cell dimension (correlation between cells with similar methylation programs). Earlier methods based on RNNs or gradient boosted trees either handled one dimension at a time or required substantial parameter overhead for multi-cell modeling. CpG Transformer addresses this by introducing an adapted transformer architecture that treats the methylation data as a two-dimensional matrix — cells as rows, CpG sites as columns — and applies separate attention mechanisms along each axis in a computationally efficient way.
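The two-axis factorization described above can be sketched in a few lines. The following is an illustrative NumPy toy, not the published implementation: it uses full single-head attention along both axes, whereas the actual model restricts the row-wise pass to a sliding window.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Single-head scaled dot-product self-attention along axis -2.
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def axial_attention(x):
    # x: (cells, sites, dim) -- the methylation matrix with feature vectors.
    # Column-wise pass: for each CpG site, attend across all cells.
    x = attention(x.swapaxes(0, 1)).swapaxes(0, 1)
    # Row-wise pass: for each cell, attend across CpG sites (the model
    # restricts this to a sliding window; full attention here for brevity).
    return attention(x)

out = axial_attention(np.random.default_rng(0).normal(size=(4, 10, 8)))
print(out.shape)  # (4, 10, 8)
```

Because each pass attends along only one axis, the cost is a sum of the two per-axis costs rather than the product required by full 2D attention over all cell/site pairs.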

Inspired by masked language modeling from BERT, the model is trained by randomly masking observed methylation states and learning to predict them from context, making it self-supervised and applicable across datasets without requiring external labels. The resulting model demonstrates state-of-the-art imputation accuracy across multiple benchmarks and exhibits rapid transfer learning: weights learned on one dataset can be fine-tuned on a new dataset approximately 20 times faster than training from random initialization, substantially reducing the computational cost of applying the model to new experiments.
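The corruption scheme can be illustrated with a small NumPy sketch. The state encoding, mask token id, and target-selection fraction below are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy methylation matrix: -1 = unobserved, 0 = unmethylated, 1 = methylated.
states = rng.choice([-1, 0, 1], size=(8, 50), p=[0.7, 0.15, 0.15])

MASK = 2  # hypothetical token id for the masked state
observed = np.argwhere(states >= 0)
# Select a fraction of observed entries as prediction targets (20% here).
targets = observed[rng.random(len(observed)) < 0.2]

corrupted = states.copy()
for i, j in targets:
    if rng.random() < 0.8:
        corrupted[i, j] = MASK                # most targets: mask the state
    else:
        corrupted[i, j] = rng.integers(0, 2)  # remainder: random state
# The model is trained to recover states[i, j] at the target positions.
```

Randomizing a portion of the targets, rather than always masking, follows the BERT recipe and discourages the model from relying on the mask token as a trivial cue.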

Key Features

  • Axial attention for methylation matrices: CpG Transformer combines column-wise self-attention (operating across cells at each CpG site, capturing cellular correlations) with row-wise sliding window self-attention (operating across nearby CpG sites within each cell, capturing genomic correlations). This axial design reduces the computational complexity from O(n²m²), as required by full 2D attention, to O(mn(n+w)), where n is the number of cells, m the number of CpG sites, and w the sliding window size (set to 41), making the approach tractable for genome-scale data.
  • Three-component input embedding: The model constructs a joint representation from CpG embeddings (encoding the methylation state: unknown, unmethylated, or methylated), cell embeddings (row-wise, encoding cell identity), and DNA sequence embeddings (column-wise, derived from a CNN processing the 1001-nucleotide sequence context around each CpG site). This integration allows the model to use both epigenomic and sequence information.
  • Self-supervised training via masked methylation modeling: Following the BERT masked language modeling paradigm, a random subset of observed CpG states is selected as prediction targets during training; 80% of these are masked and 20% are replaced with random states. The model learns to reconstruct the original values from genomic and cellular context, enabling training without external labels.
  • Rapid transfer learning: CpG Transformer's learned cell and CpG embeddings generalize effectively across datasets. Fine-tuning a pre-trained model on a new scBS-seq or scRRBS-seq dataset converges approximately 20 times faster than training from random initialization, with comparable or better final performance.
  • Gradient-based interpretability: The model supports saliency-map-based feature attribution, allowing researchers to identify which neighboring CpG sites and which cells contribute most to the imputation of a given position, providing a mechanism to audit model behavior on biologically important loci.
  • Multi-dataset benchmark evaluation: The model was systematically evaluated on five single-cell methylation datasets covering diverse biological contexts, including serum-grown embryonic stem cells, hepatocellular carcinoma cells, and hematopoietic cells, demonstrating consistent improvements over competing methods.
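The three-component input embedding can be sketched as follows, assuming BERT-style summation of the components. The lookup tables and the stubbed sequence embedding (which in the real model comes from a CNN over 1001-nucleotide windows) are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_sites, dim = 8, 50, 16

# Illustrative lookup tables; the real model learns these.
cpg_table = rng.normal(size=(3, dim))        # unknown / unmethylated / methylated
cell_table = rng.normal(size=(n_cells, dim)) # one identity vector per cell
seq_embed = rng.normal(size=(n_sites, dim))  # stand-in for the CNN's output
                                             # over each site's sequence context

states = rng.integers(0, 3, size=(n_cells, n_sites))
x = (cpg_table[states]           # per-entry methylation-state embedding
     + cell_table[:, None, :]    # broadcast cell identity along sites
     + seq_embed[None, :, :])    # broadcast sequence context along cells
print(x.shape)  # (8, 50, 16)
```

Summing row-wise, column-wise, and per-entry components gives every matrix position a representation that reflects its state, its cell, and its genomic context simultaneously.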

Technical Details

CpG Transformer is implemented as a stack of four identical transformer layers. Each layer contains two attention sublayers: a column-wise self-attention block with 8 attention heads of 8 hidden dimensions each, and a row-wise sliding window self-attention block with relative positional encodings. The sliding window of width 41 ensures that row-wise attention focuses on nearby CpG sites, where methylation correlations are strongest, rather than attending globally across the full genome, which would be computationally prohibitive. A standard feedforward sublayer with ReLU activation follows each pair of attention sublayers. The DNA sequence embedding module is a separate 1D convolutional network that processes 1001-nucleotide sequence windows and projects them to the same embedding dimension as the CpG and cell embeddings.
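The sliding window constraint can be expressed as a boolean attention mask over the CpG-site axis; a minimal sketch:

```python
import numpy as np

def sliding_window_mask(m, w=41):
    """Boolean mask over m CpG sites: site i may attend to site j
    only if |i - j| <= w // 2, i.e. a window of width w (41 here)."""
    idx = np.arange(m)
    return np.abs(idx[:, None] - idx[None, :]) <= w // 2

mask = sliding_window_mask(100)
print(mask.sum(axis=1).max())  # 41: interior sites see the full window
```

In practice such a mask (or an equivalent banded computation) zeroes out attention scores for site pairs outside the window, which is what brings the row-wise cost down to O(nmw).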

Benchmark results on five single-cell methylation datasets showed substantial improvements over the two main competing methods (DeepCpG and CaMelia). On the serum ES cell (Ser) dataset, CpG Transformer achieved ROC AUC of 91.55% and PR AUC of 93.87%. On the hepatocellular carcinoma (HCC) dataset it achieved ROC AUC of 97.96% and PR AUC of 95.19%. On the hematopoietic (Hemato) dataset it achieved ROC AUC of 90.65% and PR AUC of 96.43%. Transfer learning experiments demonstrated that initializing from a model trained on a different dataset reduced training time to convergence by approximately 20-fold, while reaching equivalent or better final imputation accuracy.

Applications

CpG Transformer's primary application is imputing missing methylation values in single-cell bisulfite sequencing experiments, enabling downstream analyses that require dense genome-wide methylation profiles. Researchers performing single-cell epigenomic clustering, pseudotime trajectory analysis, or identifying differentially methylated regions benefit directly from imputed matrices that more completely represent the cell-to-cell epigenomic landscape. The model is also applicable in multi-omics integration workflows where sparse methylation data need to be aligned with RNA-seq or chromatin accessibility profiles. Its rapid transfer learning capability is particularly valuable in studies of rare cell populations where only limited experimental data can be generated, reducing the cost of fitting a reliable imputation model.

Impact

CpG Transformer demonstrated that transformer attention, originally developed for one-dimensional language sequences, could be adapted to biological measurement matrices in which both axes carry biologically meaningful correlations. The work helped establish axial attention as a practical design pattern for epigenomics, influencing subsequent approaches to methylation and chromatin accessibility modeling. Its self-supervised training framework, which requires no external annotation, means the method scales naturally to new datasets without label curation overhead. A limitation is that imputation accuracy degrades in regions of very sparse coverage across cells, where neither the genomic nor the cellular context provides sufficient signal. The model also requires reasonably similar coverage profiles between pre-training and target datasets for transfer learning to be most effective.

Tags

epigenomic prediction, DNA methylation imputation, transformer, self-supervised, transfer learning, DNA methylation, epigenomics

Resources

  • GitHub Repository
  • Research Paper