A cross-species-pretrained, human-specialized CNN that predicts single-CpG DNA methylation directly from genomic sequence and interprets the cis-regulatory motifs that govern it.
DNA methylation at CpG dinucleotides is a central epigenetic mark that helps govern transcription, genomic imprinting, and chromatin state, yet predicting it from sequence alone — and explaining why a given site is methylated — has remained difficult. Most methylation profiles are measured experimentally with whole-genome bisulfite sequencing (WGBS), an approach that is expensive and cannot be applied to hypothetical or patient-specific sequences. A sequence-to-methylation model that is both accurate and interpretable would let researchers read out the regulatory logic embedded in DNA and ask counterfactual questions that wet-lab assays cannot easily address.
MethylAI, developed by the Yu Lab at Guangzhou National Laboratory and released as a bioRxiv preprint in November 2025, is a deep learning framework that predicts single-CpG methylation states directly from genomic sequence with nucleotide-level interpretability. Its central design choice is cross-species pretraining: the model first learns generalizable sequence-to-methylation rules across a large multi-species compendium, then is fine-tuned to specialize on human data. This strategy transfers conserved regulatory grammar into the human-specific model and improves generalization beyond what CpG composition alone explains.
Beyond accurate prediction, MethylAI is built to dissect the determinants of methylation. Through quantitative attribution it surfaces conserved transcription factor (TF) motifs that shape methylation, supports in silico perturbation experiments to test causal hypotheses, and links noncoding genetic variants to the methylation-active motifs they disrupt — positioning the model as a tool for interpreting the regulatory architecture of the human epigenome rather than only forecasting it.
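The in silico perturbation idea mentioned above can be illustrated with a minimal saturation-mutagenesis sketch. This is not MethylAI's code or API: `toy_predict` is a hypothetical stand-in for a trained sequence-to-methylation model, and `TOY_MOTIF` is an invented motif used only to give the toy something to detect. The point is the procedure: mutate each base, re-predict, and record the change relative to the unmodified sequence.

```python
# Hypothetical stand-in for a trained sequence-to-methylation model.
# It reports low methylation when an invented motif is present.
TOY_MOTIF = "CACGTG"  # illustrative only, not a motif reported for MethylAI


def toy_predict(seq: str) -> float:
    """Return a made-up methylation level in [0, 1] for a DNA sequence."""
    return 0.2 if TOY_MOTIF in seq else 0.9


def saturation_mutagenesis(seq, predict):
    """Score every single-base substitution by its change in prediction."""
    baseline = predict(seq)
    effects = []
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects.append((pos, ref, alt, predict(mutant) - baseline))
    return baseline, effects


def position_importance(seq, effects):
    """Largest absolute prediction change observed at each position."""
    scores = [0.0] * len(seq)
    for pos, _ref, _alt, delta in effects:
        scores[pos] = max(scores[pos], abs(delta))
    return scores


sequence = "TTGACACGTGAATCG"  # toy motif occupies positions 4 to 9 (0-based)
baseline, effects = saturation_mutagenesis(sequence, toy_predict)
importance = position_importance(sequence, effects)
important_positions = [i for i, s in enumerate(importance) if s > 0]
print(important_positions)
```

With a real model, `predict` would wrap a forward pass over the full input context, and the per-position scores would be compared against gradient-based attributions rather than read off a toy rule.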
MethylAI is a multi-scale convolutional neural network that uses exponential activations, chosen to sharpen learned motif representations. The input block applies parallel convolutions with kernel sizes of 3, 9, and 21; it is followed by six multi-scale CNN blocks of increasing width (300 to 600 channels) and a configurable multi-task output head. The model ingests an 18,432 bp (9 × 2^11) DNA context per prediction. Training draws on what the authors describe as the largest cross-species methylation compendium assembled to date: roughly 1,900 single-CpG-resolved WGBS methylomes across 12 species, with the human fine-tuning set comprising 1,574 WGBS samples spanning 52 tissues and 238 cell types. Reported evaluation uses Pearson and Spearman correlation between predicted and measured methylation. On mQTL benchmarks, MethylAI is reported to predict the direction of variant-induced methylation change with greater than 87% accuracy for variants located within active motif sites.
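The 18,432 bp input context can be made concrete with a short sketch of how a window around a CpG might be extracted and one-hot encoded. Several details here are assumptions rather than facts from the preprint: centering the target cytosine in the window, padding with `N` at chromosome ends, the A/C/G/T channel order, and encoding `N` as all zeros. The function names are hypothetical.

```python
CONTEXT_BP = 9 * 2 ** 11  # 18,432 bp, the context length stated in the text

# Channel order A, C, G, T is an assumption of this sketch.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def window_bounds(cpg_pos, context=CONTEXT_BP):
    """Half-open [start, end) window with the CpG cytosine at index context // 2."""
    start = cpg_pos - context // 2
    return start, start + context


def extract_window(chrom_seq, cpg_pos, context=CONTEXT_BP):
    """Slice the window, padding with N where it runs past either end.

    Assumes 0 <= cpg_pos < len(chrom_seq) (0-based coordinates).
    """
    start, end = window_bounds(cpg_pos, context)
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(chrom_seq))
    body = chrom_seq[max(0, start):min(end, len(chrom_seq))]
    return left_pad + body + right_pad


def one_hot(seq):
    """Encode a sequence as a list of 4-tuples; unknown bases become all zeros."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq.upper()]


chrom = "ACGT" * 6000   # toy 24,000 bp "chromosome"
cpg_pos = 12001         # 0-based index of a C that is followed by G
window = extract_window(chrom, cpg_pos)
encoded = one_hot(window)

edge_window = extract_window(chrom, 1)  # CpG near the start, needs left padding
print(len(window), window[CONTEXT_BP // 2], len(edge_window))
```

Keeping the window length fixed, including at chromosome ends, is what lets a convolutional model with a fixed receptive field consume every CpG in the same way.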
MethylAI suits epigenomics and regulatory-genomics researchers who need methylation estimates where direct measurement is impractical, including for variants or engineered sequences. It supports genome-wide methylation prediction across human tissues and cell types, identification of methylation-linked TF binding motifs, in silico perturbation to probe causal regulatory hypotheses, and interpretation of how noncoding GWAS and mQTL variants reshape the epigenetic landscape. Pretrained checkpoints and a sample/metadata portal are provided, so groups can apply the model to their own sequences without retraining.
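The variant-interpretation use case can be sketched as a reference-versus-alternate comparison followed by a direction-of-effect check against observed mQTL effects. Everything below is illustrative: `toy_predict` is a hypothetical stand-in for a trained model, the sequences and observed effects are invented, and skipping variants with a zero predicted or observed effect is a choice made for this sketch, not a documented part of the MethylAI benchmark.

```python
TOY_MOTIF = "CACGTG"  # invented motif for the toy predictor


def toy_predict(seq):
    """Hypothetical model stand-in: motif present means low methylation."""
    return 0.2 if TOY_MOTIF in seq else 0.9


def variant_effect(ref_seq, alt_seq, predict):
    """Predicted methylation change caused by swapping ref for alt."""
    return predict(alt_seq) - predict(ref_seq)


def direction_accuracy(predicted, observed):
    """Fraction of variants whose predicted and observed effects share a sign.

    Variants with a zero predicted or observed effect are skipped
    (an assumption of this sketch).
    """
    pairs = [(p, o) for p, o in zip(predicted, observed) if p != 0 and o != 0]
    if not pairs:
        raise ValueError("no variants with a nonzero predicted and observed effect")
    agree = sum((p > 0) == (o > 0) for p, o in pairs)
    return agree / len(pairs)


# (reference sequence, alternate sequence, invented observed mQTL effect)
variants = [
    ("TTCACGTGAA", "TTCACTTGAA", +0.15),  # alt destroys the toy motif
    ("TTCACTTGAA", "TTCACGTGAA", -0.10),  # alt creates the toy motif
    ("GGCACGTGCC", "GGCACGAGCC", -0.05),  # prediction and observation disagree
    ("AATTAATT", "AATTCATT", +0.02),      # no motif change, predicted effect is 0
]

predicted = [variant_effect(ref, alt, toy_predict) for ref, alt, _ in variants]
observed = [obs for _, _, obs in variants]
accuracy = direction_accuracy(predicted, observed)
print(round(accuracy, 3))
```

In practice the reference and alternate sequences would each span the model's full input context, and the comparison would be restricted to the variant class of interest, such as variants inside attributed motif sites.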
By coupling accurate single-CpG prediction with quantitative attribution, MethylAI advances the broader effort, alongside models such as CpGPT, MethylGPT, and DeepCpG, to treat DNA methylation as a learnable, interpretable function of sequence. Its cross-species pretraining offers a concrete recipe for transferring conserved regulatory signal into human-specialized epigenomic models, and its variant-to-motif linkage adds an interpretive layer for noncoding variant analysis. Two caveats apply. As a 2025 preprint, the work awaits peer review and independent benchmarking. As a CNN over a fixed-length context, it does not capture very long-range interactions the way large transformer genomic models aim to. Nonetheless, the public code (MIT-licensed), pretrained weights, and harmonized compendium lower the barrier for adoption and follow-on study.
Chen, F., et al. (2025) Dissecting sequence determinants of DNA methylation and in silico perturbation. bioRxiv.
DOI: 10.1101/2025.11.20.689274