bio.rodeo
DNA & Gene

MethylAI

Guangzhou National Laboratory

A cross-species-pretrained, human-specialized CNN that predicts single-CpG DNA methylation directly from genomic sequence and interprets the cis-regulatory motifs that govern it.

Released: November 2025

DNA methylation at CpG dinucleotides is a central epigenetic mark that helps govern transcription, genomic imprinting, and chromatin state, yet predicting it from sequence alone — and explaining why a given site is methylated — has remained difficult. Most methylation profiles are measured experimentally with whole-genome bisulfite sequencing (WGBS), an approach that is expensive and cannot be applied to hypothetical or patient-specific sequences. A sequence-to-methylation model that is both accurate and interpretable would let researchers read out the regulatory logic embedded in DNA and ask counterfactual questions that wet-lab assays cannot easily address.

MethylAI, developed by the Yu Lab at Guangzhou National Laboratory and released as a bioRxiv preprint in November 2025, is a deep learning framework that predicts single-CpG methylation states directly from genomic sequence with nucleotide-level interpretability. Its central design choice is cross-species pretraining: the model first learns generalizable sequence-to-methylation rules across a large multi-species compendium, then is fine-tuned to specialize on human data. This strategy transfers conserved regulatory grammar into the human-specific model and improves generalization beyond what CpG composition alone explains.

Beyond accurate prediction, MethylAI is built to dissect the determinants of methylation. Through quantitative attribution it surfaces conserved transcription factor (TF) motifs that shape methylation, supports in silico perturbation experiments to test causal hypotheses, and links noncoding genetic variants to the methylation-active motifs they disrupt — positioning the model as a tool for interpreting the regulatory architecture of the human epigenome rather than only forecasting it.

#Key Features

  • Cross-species pretraining, human specialization: MethylAI is pretrained on a large multi-species methylation compendium (12 species, including human, mouse, rat, macaque, chimpanzee, gorilla, cow, sheep, dog, pig, and giant panda) and then fine-tuned on human data, transferring conserved regulatory grammar into the human model.
  • Single-CpG resolution from sequence alone: It predicts methylation at individual CpG sites directly from DNA, requiring no experimental assay at inference time and enabling predictions for arbitrary or hypothetical sequences.
  • Interpretable motif attribution: DeepSHAP-based attribution reveals the cis-regulatory TF motifs driving methylation, exposing conserved signatures whose influence extends beyond local CpG content.
  • In silico perturbation as a zero-shot use case: Using a fixed pretrained checkpoint with no further training, MethylAI simulates the effect of sequence perturbations; a CTCF knockdown experiment validated the predicted methylation shifts at activated motifs.
  • Noncoding variant interpretation: The model connects genetic variants to methylation-linked active motifs, offering a route to prioritize and interpret noncoding GWAS and mQTL variants.
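
The in silico perturbation idea in the list above can be illustrated with a minimal sketch. It assumes a generic `predict` callable that maps a DNA string to a methylation level between 0 and 1. The function `toy_methylation_score`, its motif string, and its scores are hypothetical placeholders, not MethylAI's model or API. The sketch covers only the comparison step: score the reference sequence, score the edited sequence, and report the difference.

```python
from typing import Callable


def toy_methylation_score(seq: str) -> float:
    """Hypothetical stand-in for a trained predictor (NOT MethylAI).

    Returns a lower methylation level when an illustrative motif is present,
    mimicking a motif whose binding keeps nearby CpGs unmethylated.
    """
    illustrative_motif = "CCCTC"  # made-up example motif, for illustration only
    return 0.4 if illustrative_motif in seq.upper() else 0.9


def substitute(seq: str, pos: int, alt: str) -> str:
    """Return a copy of seq with the base at 0-based index pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError("position outside sequence")
    if len(alt) != 1:
        raise ValueError("alt must be a single base")
    return seq[:pos] + alt + seq[pos + 1:]


def perturbation_effect(
    predict: Callable[[str], float], seq: str, pos: int, alt: str
) -> float:
    """Predicted methylation change (alt minus ref) for a single-base edit."""
    return predict(substitute(seq, pos, alt)) - predict(seq)


reference = "AAGCCCTCTGCGTT"
# Editing a base inside the illustrative motif removes it and raises the toy score.
delta_in_motif = perturbation_effect(toy_methylation_score, reference, 6, "A")
# Editing a base outside the motif leaves the toy score unchanged.
delta_outside = perturbation_effect(toy_methylation_score, reference, 0, "C")
```

With a real checkpoint, `predict` would wrap the model's forward pass on an encoded sequence window; the sign of the returned difference is the quantity that a variant-direction benchmark would compare against measured data.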

#Technical Details

MethylAI is a multi-scale convolutional neural network that uses exponential activations, chosen to sharpen the learned motif representations. The input block applies parallel convolutions with kernel sizes of 3, 9, and 21, followed by six multi-scale CNN blocks of increasing width (300 to 600 channels) and a configurable multi-task output head. Each prediction takes an 18,432 bp (9 × 2^11) DNA context as input. Training draws on the largest cross-species methylation compendium assembled to date (roughly 1,900 single-CpG-resolved WGBS methylomes across 12 species), with the human fine-tuning set comprising 1,574 WGBS samples spanning 52 tissues and 238 cell types. Reported evaluation uses Pearson and Spearman correlation between predicted and measured methylation. On mQTL benchmarks, MethylAI predicts the direction of variant-induced methylation change with greater than 87% accuracy for variants located within active motif sites.
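
The evaluation metrics named above can be written out in a few lines. The following is a minimal sketch in plain Python, not the authors' evaluation code: `pearson`, `average_ranks`, `spearman`, and `direction_accuracy` are illustrative names, the handling of ties (average ranks) and of zero observed effects (skipped) are assumptions, and the numbers are made up.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def direction_accuracy(pred: Sequence[float], obs: Sequence[float]) -> float:
    """Fraction of variants whose predicted change has the observed sign.

    Variants with an observed change of exactly zero are skipped (assumption).
    """
    pairs = [(p, o) for p, o in zip(pred, obs) if o != 0]
    if not pairs:
        raise ValueError("no variants with a nonzero observed change")
    return sum(1 for p, o in pairs if p * o > 0) / len(pairs)


measured = [0.1, 0.4, 0.8, 0.9]     # made-up measured methylation levels
predicted = [0.2, 0.3, 0.7, 0.95]   # made-up predictions, same CpG order
r = pearson(measured, predicted)
rho = spearman(measured, predicted)
acc = direction_accuracy([0.1, -0.2, 0.3, -0.1], [0.2, -0.1, -0.3, -0.4])
```

Spearman correlation depends only on rank order, so it rewards a model that orders CpGs correctly even when the predicted levels are miscalibrated, which is why it is usually reported next to Pearson.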

#Applications

MethylAI suits epigenomics and regulatory-genomics researchers who need methylation estimates where direct measurement is impractical, including for variants or engineered sequences. It supports genome-wide methylation prediction across human tissues and cell types, identification of methylation-linked TF binding motifs, in silico perturbation to probe causal regulatory hypotheses, and interpretation of how noncoding GWAS and mQTL variants reshape the epigenetic landscape. Pretrained checkpoints and a sample/metadata portal are provided, so groups can apply the model to their own sequences without retraining.
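
Applying a sequence-to-methylation model to one's own sequences starts with locating CpG sites and cutting a fixed-length context around each. The sketch below is a hedged illustration of that preprocessing step only: the 18,432 bp length comes from the description above, while centering the window on the CpG cytosine, padding with `N` at sequence edges, and the function names are assumptions rather than documented MethylAI behavior.

```python
from typing import List

CONTEXT_BP = 9 * 2 ** 11  # 18,432 bp, the context length reported for the model


def find_cpg_positions(seq: str) -> List[int]:
    """0-based indices of the C in every CG dinucleotide on the given strand."""
    s = seq.upper()
    return [i for i in range(len(s) - 1) if s[i:i + 2] == "CG"]


def centered_window(
    seq: str, center: int, length: int = CONTEXT_BP, pad: str = "N"
) -> str:
    """Fixed-length window with seq[center] at index length // 2.

    Positions that fall outside the sequence are filled with the pad character
    (an assumption; a real pipeline may instead skip such sites).
    """
    if not 0 <= center < len(seq):
        raise IndexError("center outside sequence")
    start = center - length // 2
    end = start + length
    left_pad = pad * max(0, -start)
    right_pad = pad * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad


toy_sequence = "ATCGGACGTT"
cpg_sites = find_cpg_positions(toy_sequence)
# A short window makes the padding behavior easy to inspect.
small_window = centered_window(toy_sequence, cpg_sites[0], length=8)
# With the default length, every window has the full reported context size.
full_window = centered_window(toy_sequence, cpg_sites[1])
```

Each window would then be encoded (for example, one-hot) and passed to the pretrained checkpoint; the released repository should be consulted for the exact input conventions.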

#Impact

By coupling accurate single-CpG prediction with quantitative attribution, MethylAI advances the broader effort — alongside models such as CpGPT, MethylGPT, and DeepCpG — to treat DNA methylation as a learnable, interpretable function of sequence. Its cross-species pretraining offers a concrete recipe for transferring conserved regulatory signal into human-specialized epigenomic models, and its variant-to-motif linkage adds an interpretive layer for noncoding variant analysis. As a 2025 preprint the work awaits peer review and independent benchmarking, and as a CNN over a fixed-length context it does not capture very long-range interactions the way large transformer genomic models aim to; nonetheless, the public code (MIT-licensed), pretrained weights, and harmonized compendium lower the barrier for adoption and follow-on study.

Citation

Dissecting sequence determinants of DNA methylation and in silico perturbation

Preprint

Chen, F., et al. (2025) Dissecting sequence determinants of DNA methylation and in silico perturbation. bioRxiv.

DOI: 10.1101/2025.11.20.689274

Citations

Total Citations: 0
Influential: 0
References: 8

GitHub

Stars: 6
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 4mo ago
Language: Python
License: MIT

Openness

bio.rodeo openness (Fully open · usable and reproducible): 64, Partial
Usability (can I run it?): 71
Reproducibility (can I retrain it?): 60
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

cnn, dna, epigenetics, methylation_prediction, motif_discovery, self_supervised, transfer_learning, variant_effect_prediction, zero_shot
