bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved. ~

MethylProphet

Columbia University

A transformer that infers whole-genome DNA methylation landscapes from gene expression, generalizing zero-shot to unmeasured CpG sites and unseen samples.

Released: February 2026

DNA methylation (DNAm) is a central epigenetic mark: the addition of methyl groups at CpG dinucleotides helps establish and maintain gene-expression programs, cell identity, and disease states. Measuring it genome-wide — by whole-genome bisulfite sequencing or large arrays — is informative but expensive, and many datasets capture only a subset of CpG sites or only some samples. Gene expression, by contrast, is measured ubiquitously. This raises a natural question: how much of the methylation landscape can be inferred from expression alone?

MethylProphet, developed by Huang and colleagues at Columbia University (bioRxiv preprint first posted February 2025, updated February 2026), is a transformer-based model that predicts whole-genome DNA methylation from gene-expression input. Framed as a generalized, gene-contextual model, it learns relationships between expression and methylation that let it impute methylation at genomic positions that were not directly measured and generalize to biological samples it did not see during training.

By learning a shared expression-to-methylation mapping rather than memorizing per-site behavior, MethylProphet aims to fill in unmeasured CpGs and extend methylation profiling to samples where only expression is available — a potentially large practical saving for studies that already generate transcriptomic data.

#Key Features

  • Expression-to-methylation inference: Predicts genome-wide DNAm using gene-expression input, exploiting the regulatory link between transcription and methylation.
  • Gene-contextual modeling: A generalized, gene-aware formulation lets the model reason about methylation in the context of nearby genes rather than treating CpGs in isolation.
  • Zero-shot to unmeasured sites: Infers methylation at CpG positions not directly assayed, effectively densifying sparse measurements.
  • Generalization to unseen samples: Transfers to biological samples outside the training set, supporting broad applicability across tissues and conditions.
  • Trained at large scale: Learned from ENCODE and TCGA data, with the authors reporting on the order of 1.6 billion CpG-by-sample training pairs.

#Technical Details

MethylProphet is a transformer trained on two large public resources, ENCODE and TCGA, to map gene expression to DNA methylation across the genome. The authors describe training on a very large collection of CpG-by-sample pairs (on the order of 1.6 billion), giving the model broad coverage of expression–methylation relationships. Its gene-contextual design allows it to infer methylation at unmeasured CpG sites and to generalize zero-shot to previously unseen samples. Because the work is still a preprint (v2, February 2026), exact architectural details such as parameter count and context length, along with code and trained weights, have not yet been publicly released; the reported capabilities therefore await the full release and independent benchmarking.
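The two kinds of zero-shot generalization described above can be made concrete with a small data-splitting sketch. This is not the authors' pipeline, which has not been released: the `split_zero_shot` and `partition` helpers, the toy matrix size, and the hold-out fractions are all hypothetical. It only illustrates that a CpG-by-sample methylation matrix yields one pair per (CpG, sample) cell, and that holding out whole CpGs or whole samples tests unseen sites and unseen samples respectively.

```python
import random
from collections import Counter

def split_zero_shot(n_cpgs, n_samples, cpg_frac=0.2, sample_frac=0.2, seed=0):
    """Hypothetical helper: pick whole CpGs and whole samples to hold out."""
    rng = random.Random(seed)
    held_cpgs = set(rng.sample(range(n_cpgs), round(n_cpgs * cpg_frac)))
    held_samples = set(rng.sample(range(n_samples), round(n_samples * sample_frac)))
    return held_cpgs, held_samples

def partition(cpg, sample, held_cpgs, held_samples):
    """Label one (CpG, sample) pair by the evaluation setting it falls into."""
    if cpg in held_cpgs and sample in held_samples:
        return "unseen_both"
    if cpg in held_cpgs:
        return "unseen_cpg"      # site never seen in training
    if sample in held_samples:
        return "unseen_sample"   # sample never seen in training
    return "train"

# Toy sizes (assumed for illustration): 1000 CpGs x 50 samples = 50,000 pairs.
N_CPGS, N_SAMPLES = 1000, 50
held_cpgs, held_samples = split_zero_shot(N_CPGS, N_SAMPLES)
counts = Counter(
    partition(c, s, held_cpgs, held_samples)
    for c in range(N_CPGS)
    for s in range(N_SAMPLES)
)
print(dict(counts))
```

Splitting by whole rows and columns, rather than by random cells, is what makes the held-out settings genuinely zero-shot: no value from a held-out CpG or held-out sample leaks into training.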

#Applications

MethylProphet is aimed at epigenomics and cancer-genomics researchers who have abundant transcriptomic data but limited or partial methylation measurements. It can impute missing CpG values to complete sparse methylation arrays, extend methylation profiling to samples where only RNA-seq was collected, and support studies of how expression and methylation co-vary across tissues and tumors. By reducing the need to assay every CpG directly, it could lower the cost of epigenome-scale analyses in large cohorts such as TCGA-style cancer studies.
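One way to check an imputation claim of this kind on your own data is to hide some measured CpG values, predict them, and compare. The sketch below assumes methylation is stored as beta values between 0 and 1 and scores any set of predictions by mean absolute error and Pearson correlation. The `score_imputation` helper and the toy numbers are hypothetical, and no MethylProphet API is implied, since none has been released.

```python
import math

def score_imputation(measured, predicted):
    """Compare held-out beta values with predictions (equal-length lists in [0, 1])."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(measured)
    mae = sum(abs(m - p) for m, p in zip(measured, predicted)) / n

    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Pearson r is undefined when either side is constant.
    if var_m > 0 and var_p > 0:
        r = cov / math.sqrt(var_m * var_p)
    else:
        r = float("nan")
    return {"mae": mae, "pearson_r": r}

# Made-up beta values for five masked CpGs in one sample.
measured = [0.05, 0.10, 0.80, 0.95, 0.50]
predicted = [0.10, 0.05, 0.70, 0.90, 0.60]
scores = score_imputation(measured, predicted)
print(scores)
```

Because beta values at many CpGs sit near 0 or 1, a high global correlation can be easy to reach; comparing against a simple baseline such as the per-CpG mean across training samples gives a fairer picture.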

#Impact

MethylProphet tests how far a single learned model can reconstruct the methylation landscape from expression, positioning gene expression as a partial proxy for the epigenome. If its zero-shot imputation holds up under peer review, it could make genome-wide methylation estimates accessible for the many datasets that include transcriptomics but not comprehensive bisulfite sequencing. Because it remains a bioRxiv preprint without released code or weights, its results still require independent validation. Even so, the scale of training and the focus on cross-site and cross-sample generalization make it a notable entry in epigenomic foundation modeling.

Tags

methylation_prediction, gene_expression, transformer, foundation_model, zero_shot, dna_methylation, epigenomics