A transformer that infers whole-genome DNA methylation landscapes from gene expression, generalizing zero-shot to unmeasured CpG sites and unseen samples.
DNA methylation (DNAm) is a central epigenetic mark: the addition of methyl groups at CpG dinucleotides helps establish and maintain gene-expression programs, cell identity, and disease states. Measuring it genome-wide — by whole-genome bisulfite sequencing or large arrays — is informative but expensive, and many datasets capture only a subset of CpG sites or only some samples. Gene expression, by contrast, is measured ubiquitously. This raises a natural question: how much of the methylation landscape can be inferred from expression alone?
MethylProphet, developed by Huang and colleagues at Columbia University (preprint first posted February 2025, updated February 2026 on bioRxiv), is a transformer-based model that predicts whole-genome DNA methylation from gene-expression input. Framed as a generalized, gene-contextual model, it learns relationships between expression and methylation that let it impute methylation at genomic positions that were not directly measured and generalize to biological samples it did not see during training.
By learning a shared expression-to-methylation mapping rather than memorizing per-site behavior, MethylProphet aims to fill in unmeasured CpGs and extend methylation profiling to samples where only expression is available — a potentially large practical saving for studies that already generate transcriptomic data.
MethylProphet is a transformer trained on large public resources (ENCODE and TCGA) to map gene expression to DNA methylation across the genome. The authors describe training on a very large collection of CpG-by-sample pairs, on the order of 1.6 billion, which gives the model broad coverage of expression–methylation relationships. Its gene-contextual design allows it to infer methylation at unmeasured CpG sites and to generalize zero-shot to previously unseen samples. The work remains a preprint (v2, February 2026). Exact architectural details such as parameter count and context length, along with code and trained weights, have not yet been publicly released, so the reported capabilities await the full release and independent benchmarking.
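To make the two kinds of zero-shot generalization concrete, the following minimal Python sketch partitions a CpG-by-sample grid into training pairs and three held-out regimes: unseen CpGs, unseen samples, and both. It illustrates the general idea only and is not the authors' evaluation protocol; the function names, the 20% hold-out fraction, and the toy grid of 1,000 CpGs by 50 samples are all assumptions made for this example.

```python
import random
from collections import Counter

def zero_shot_split(cpg_ids, sample_ids, holdout_frac=0.2, seed=0):
    """Pick CpGs and samples to withhold entirely from training.

    Hypothetical helper for illustration; holdout_frac is an assumed value.
    """
    rng = random.Random(seed)
    held_cpgs = set(rng.sample(cpg_ids, round(holdout_frac * len(cpg_ids))))
    held_samples = set(rng.sample(sample_ids, round(holdout_frac * len(sample_ids))))
    return held_cpgs, held_samples

def regime(cpg, sample, held_cpgs, held_samples):
    """Label one CpG-by-sample pair by which of its axes was held out."""
    cpg_unseen = cpg in held_cpgs
    sample_unseen = sample in held_samples
    if cpg_unseen and sample_unseen:
        return "unseen_both"
    if cpg_unseen:
        return "unseen_cpg"
    if sample_unseen:
        return "unseen_sample"
    return "train"

# Toy grid: 1,000 CpGs by 50 samples (sizes chosen only for the example).
cpgs = [f"cpg{i}" for i in range(1000)]
samples = [f"sample{j}" for j in range(50)]
held_cpgs, held_samples = zero_shot_split(cpgs, samples)

# Count how many pairs fall into each regime.
counts = Counter(
    regime(c, s, held_cpgs, held_samples) for c in cpgs for s in samples
)
print(dict(counts))
```

Because whole CpGs and whole samples are withheld, no held-out pair shares its CpG (or its sample) with a training pair along the withheld axis, which is what distinguishes zero-shot evaluation from randomly masking individual entries.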
MethylProphet is aimed at epigenomics and cancer-genomics researchers who have abundant transcriptomic data but limited or partial methylation measurements. It can impute missing CpG values to complete sparse methylation arrays, extend methylation profiling to samples where only RNA-seq was collected, and support studies of how expression and methylation co-vary across tissues and tumors. By reducing the need to assay every CpG directly, it could lower the cost of epigenome-scale analyses in large cohorts such as TCGA-style cancer studies.
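The array-completion use case can be sketched as follows, assuming a CpG-by-sample matrix of beta values (methylation fractions between 0 and 1) in which missing entries are stored as None. Since no MethylProphet code or weights have been released, the predictor here is a placeholder: a per-CpG mean of the observed values, which only stands in for a trained model. The names impute_missing and row_mean_predictor are hypothetical.

```python
def impute_missing(beta, predict):
    """Fill only the missing (None) entries of a CpG-by-sample matrix.

    `predict(cpg_index, sample_index)` is any callable returning an estimate;
    measured values are left untouched.
    """
    completed = []
    for i, row in enumerate(beta):
        new_row = []
        for j, value in enumerate(row):
            if value is None:
                # Clamp to [0, 1], the valid range for a methylation fraction.
                value = min(1.0, max(0.0, predict(i, j)))
            new_row.append(value)
        completed.append(new_row)
    return completed

def row_mean_predictor(beta):
    """Placeholder predictor (assumption): mean of observed values per CpG."""
    means = []
    for row in beta:
        observed = [v for v in row if v is not None]
        # Fall back to 0.5 when a CpG has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.5)
    return lambda i, j: means[i]

# Toy matrix: 2 CpGs by 3 samples, with one missing value in each row.
beta = [
    [0.9, None, 0.7],
    [None, 0.1, 0.3],
]
completed = impute_missing(beta, row_mean_predictor(beta))
print(completed)
```

Passing the predictor as a callable keeps the filling logic independent of the model, so an expression-based predictor could replace the baseline without changing impute_missing.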
MethylProphet tests how far a single learned model can reconstruct the methylation landscape from expression, positioning gene expression as a partial proxy for the epigenome. If its zero-shot imputation holds up under peer review, it could make genome-wide methylation estimates accessible for the many datasets that include transcriptomics but not comprehensive bisulfite sequencing. Because it is a bioRxiv preprint without released code or weights, its results require independent validation, but the scale of training and the focus on cross-site and cross-sample generalization make it a notable entry in epigenomic foundation modeling.