mEthAE

Chromosome-wise explainable autoencoder that compresses DNA methylation array data up to 400-fold while keeping CpG groupings interpretable.

Released: July 2023

mEthAE is an explainable, chromosome-wise autoencoder for dimensionality reduction of DNA methylation array data. It was developed by Sonja Katz, Vitor A.P. Martins dos Santos, Edoardo Saccenti, and Gennady V. Roshchupkin, whose affiliations span the Laboratory of Systems and Synthetic Biology at Wageningen University & Research and the Departments of Radiology and Nuclear Medicine and of Epidemiology at Erasmus MC, Rotterdam, and was first posted as a bioRxiv preprint in July 2023. The model addresses a core analytical challenge in large-scale methylation studies: array data (such as the Illumina EPIC array or the older 450K array) contain measurements at roughly 450,000 to 900,000 CpG sites per sample, which makes conventional statistical and machine learning analyses computationally expensive, prone to multiple-testing problems, and difficult to interpret.

The central contribution of mEthAE is extreme dimensionality reduction: roughly 300,000 CpG sites across the genome are compressed into as few as 1,389 latent features in total, while preserving enough biological signal for accurate supervised prediction of phenotypic variables such as age and sex from the latent embedding. This reduction, reported as up to 400-fold, is achieved with densely connected autoencoders trained chromosome by chromosome: the high-dimensional methylation input is partitioned into manageable per-chromosome blocks, and each block is compressed independently into a small latent representation.

Critically, mEthAE is not merely a compression tool: it is designed for interpretability. The authors developed a perturbation-based interpretability pipeline that identifies groups of CpG sites whose latent representations are most strongly coupled, both within local chromosomal neighborhoods and globally across the full embedding. These CpG groupings are validated against EWAS (epigenome-wide association study) findings, genomic location annotations, biological pathway databases, and correlation patterns, demonstrating that the autoencoder's learned representations capture biologically meaningful epigenomic structure.

Key Features

Chromosome-wise decomposition: Rather than training a single monolithic autoencoder on all CpG sites simultaneously (which would be computationally prohibitive and likely to produce poorly structured latent spaces), mEthAE trains independent autoencoders for each chromosome. Each chromosome's CpG sites are compressed into a small latent vector, and the per-chromosome latent vectors are concatenated to form the final global embedding of approximately 1,389 dimensions.
Up to 400-fold dimensionality reduction: Starting from roughly 300,000 CpG sites across the autosomal genome (using common-variable CpGs from EPIC array data), mEthAE compresses this to approximately 1,389 latent features while maintaining reconstruction accuracy. This compression ratio enables standard machine learning algorithms and statistical tests to be applied to the latent space without the computational burden of genome-wide methylation data.
Dense architecture with PReLU activation: Each per-chromosome autoencoder uses densely connected layers with two hidden layers flanking the bottleneck, PReLU (Parametric Rectified Linear Unit) activations in internal layers, and sigmoid activation at the output to respect the bounded [0, 1] range of methylation beta values. Batch normalization and dropout are applied for regularization.
Perturbation-based interpretability pipeline: mEthAE quantifies the importance of individual CpG sites and groups of CpGs by systematically perturbing input CpG values and latent neurons and measuring the effect on latent activations and reconstruction accuracy. This yields two levels of CpG groupings: global groups (CpGs that contribute most to the latent neurons carrying embedding-wide information) and local groups (CpGs that are coupled, often non-linearly, to the same latent neuron within a chromosome).
Phenotype prediction validation: The biological utility of the latent embedding is validated by training supervised models to predict age and sex directly from the compressed representation. High predictive accuracy for both outcomes demonstrates that the 400-fold compression preserves phenotypically relevant epigenomic variation.
EWAS enrichment analysis: CpG groups identified by the interpretability pipeline are compared to EWAS catalog entries, confirming that globally important CpGs are preferentially enriched for sites with known associations to age, disease, and environmental exposures, providing biological grounding for the model's learned representations.
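
The chromosome-wise decomposition described above can be illustrated with a minimal sketch. It assumes toy chromosome sizes and bottleneck sizes, and it uses a random linear map (`make_linear_encoder`, a hypothetical helper) as a stand-in for each trained per-chromosome encoder; none of these names or numbers come from the mEthAE code base.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy layout: CpG counts and bottleneck sizes per chromosome.
# Real arrays have far more CpGs; these numbers are purely illustrative.
cpgs_per_chrom = {"chr1": 40, "chr2": 30, "chr21": 10}
latent_per_chrom = {"chr1": 4, "chr2": 3, "chr21": 1}

# Toy beta values in [0, 1] for 5 samples, stored per chromosome.
beta = {c: rng.uniform(0.0, 1.0, size=(5, n)) for c, n in cpgs_per_chrom.items()}


def make_linear_encoder(n_in, n_latent, rng):
    """Stand-in for a trained per-chromosome encoder (random linear map)."""
    weights = rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_latent))
    return lambda x: x @ weights


encoders = {
    c: make_linear_encoder(cpgs_per_chrom[c], latent_per_chrom[c], rng)
    for c in cpgs_per_chrom
}

# Encode each chromosome independently, then concatenate along the feature axis
# to obtain one global embedding per sample.
embedding = np.concatenate([encoders[c](beta[c]) for c in cpgs_per_chrom], axis=1)

n_input = sum(cpgs_per_chrom.values())
fold_reduction = n_input / embedding.shape[1]
print(embedding.shape, f"{fold_reduction:.1f}-fold reduction")
# prints: (5, 8) 10.0-fold reduction
```

Because each chromosome has its own encoder, the columns of the concatenated embedding can always be traced back to the chromosome that produced them.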

Technical Details

Each per-chromosome autoencoder in mEthAE follows a symmetric encoder-decoder architecture. The encoder takes the full vector of CpG beta values for a given chromosome (ranging from a few hundred to tens of thousands of CpGs depending on the chromosome) and passes it through two densely connected hidden layers with PReLU activation before the bottleneck layer. The bottleneck dimension was optimized per chromosome to yield the target compression ratio. The decoder mirrors this structure, reconstructing the full beta value vector from the bottleneck representation using sigmoid activation at the output layer. Batch normalization is applied after each hidden layer, and dropout (rate 0.2) is used during training to prevent overfitting.
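
As a rough illustration of this architecture, the following NumPy sketch runs an inference-style forward pass through a symmetric dense autoencoder with PReLU hidden activations and a sigmoid output. The layer sizes, the fixed PReLU slope of 0.25, the linear bottleneck, and the random untrained weights are all assumptions made for the example; batch normalization and dropout are left out for brevity.

```python
import numpy as np


def prelu(x, alpha):
    """Parametric ReLU: identity for positive inputs, slope alpha for negative ones."""
    return np.where(x > 0, x, alpha * x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_layers(sizes, rng):
    """Random (untrained) weights and zero biases for consecutive layer sizes."""
    return [
        (rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x, layers, alpha=0.25):
    """Apply dense layers; PReLU after all but the last, which stays linear."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = prelu(x, alpha)
    return x


rng = np.random.default_rng(0)
n_cpgs, hidden1, hidden2, n_latent = 200, 64, 16, 4  # hypothetical sizes

# Two hidden layers before the bottleneck, mirrored in the decoder.
encoder = init_layers([n_cpgs, hidden1, hidden2, n_latent], rng)
decoder = init_layers([n_latent, hidden2, hidden1, n_cpgs], rng)

beta = rng.uniform(0.0, 1.0, size=(8, n_cpgs))  # toy beta values, 8 samples
latent = forward(beta, encoder)                 # bottleneck activations
recon = sigmoid(forward(latent, decoder))       # sigmoid keeps outputs in [0, 1]

mse = float(np.mean((recon - beta) ** 2))
print(latent.shape, recon.shape, f"reconstruction MSE (untrained): {mse:.3f}")
```

In a trained model the PReLU slope would be a learned parameter and the weights would come from minimizing a reconstruction loss; the sketch only shows how data flows through the layers.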

The perturbation-based interpretability pipeline operates on the trained encoder by systematically masking individual CpG values (replacing them with the dataset mean) and measuring the change in the corresponding latent neuron activations. CpGs that strongly influence a specific latent neuron are grouped with that neuron, forming local CpG groups. Global groups are identified by measuring which latent neurons most strongly influence reconstruction quality across all chromosomes when perturbed, then tracing which input CpGs contribute most to those neurons. Validation experiments showed that globally important CpGs are significantly more likely to appear in the EWAS catalog and are significantly more predictive of chronological age in holdout samples, while local groups show long-range, non-linear interaction patterns rather than simple spatial proximity on the chromosome.
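
The local-grouping step can be sketched as follows. This is a simplified illustration, not the authors' implementation: the "trained" encoder is a hypothetical stand-in in which each latent neuron reads a disjoint block of four CpGs through a tanh non-linearity, and each CpG is assigned to the single neuron it influences most.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_cpgs, n_latent = 50, 12, 3

# Stand-in encoder (hypothetical): neuron j depends only on CpGs 4j .. 4j+3.
weights = np.zeros((n_cpgs, n_latent))
for j in range(n_latent):
    weights[4 * j : 4 * (j + 1), j] = 1.0


def encode(x):
    return np.tanh(x @ weights)


beta = rng.uniform(0.0, 1.0, size=(n_samples, n_cpgs))  # toy beta values
baseline = encode(beta)
cpg_means = beta.mean(axis=0)

# effect[i, j]: mean absolute change in latent neuron j when CpG i is masked.
effect = np.zeros((n_cpgs, n_latent))
for i in range(n_cpgs):
    perturbed = beta.copy()
    perturbed[:, i] = cpg_means[i]  # replace CpG i with its dataset mean
    effect[i] = np.abs(encode(perturbed) - baseline).mean(axis=0)

# Assign each CpG to the latent neuron it influences most.
assigned = effect.argmax(axis=1)
groups = {j: np.flatnonzero(assigned == j).tolist() for j in range(n_latent)}
print(groups)
# prints: {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}
```

With a real trained encoder the effect matrix would not be block-structured, so a threshold or top-k rule (an analysis choice not specified here) would be needed to decide which CpGs belong to a neuron's group.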

Applications

mEthAE is applicable in any large-scale population epigenomics study where high-dimensional methylation array data need to be compressed for downstream analysis. Epigenome-wide association studies, epigenetic clock research, and multi-omics integration projects are the primary beneficiaries. Researchers working with cohort datasets measuring methylation across thousands of participants — such as UK Biobank methylation data or large longitudinal birth cohorts — can use mEthAE to produce compact per-sample embeddings that are directly compatible with genome-wide association analyses, phenotype prediction, or dimensionality-sensitive clustering methods. The interpretability pipeline also supports mechanistic research: by identifying which CpG groups are most predictive of a phenotype, researchers can generate targeted hypotheses about the regulatory regions or biological pathways underlying epigenomic disease associations.
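
To make the downstream use concrete, the sketch below fits a closed-form ridge regression that predicts a continuous phenotype such as age from per-sample embeddings. Everything in it is an assumption for illustration: the embeddings and ages are synthetic, the embedding has 20 rather than 1,389 features, and ridge regression is simply one convenient choice of supervised model, not necessarily the one used by the authors.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, n_latent = 200, 50, 20  # toy sizes, far smaller than a real cohort

# Synthetic embeddings and a synthetic "age" that depends linearly on them.
true_coef = rng.normal(size=n_latent)
z_train = rng.normal(size=(n_train, n_latent))
z_test = rng.normal(size=(n_test, n_latent))
age_train = 50 + z_train @ true_coef + rng.normal(scale=0.5, size=n_train)
age_test = 50 + z_test @ true_coef + rng.normal(scale=0.5, size=n_test)


def fit_ridge(x, y, penalty=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc = x - x_mean
    gram = xc.T @ xc + penalty * np.eye(x.shape[1])
    coef = np.linalg.solve(gram, xc.T @ (y - y_mean))
    intercept = y_mean - x_mean @ coef
    return coef, intercept


coef, intercept = fit_ridge(z_train, age_train)
pred = z_test @ coef + intercept
mae = float(np.mean(np.abs(pred - age_test)))
print(f"held-out mean absolute error: {mae:.2f}")
```

The point of the example is the workflow: once samples are reduced to a compact embedding, standard regression or classification models can be fitted and evaluated on held-out samples without handling hundreds of thousands of CpG columns.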

Impact

mEthAE fills a distinct niche in the epigenomics toolkit by combining aggressive dimensionality reduction with a rigorous interpretability framework, addressing the common criticism that autoencoder-based compression is a "black box" that obscures biological meaning. The chromosome-wise decomposition strategy provides a principled way to partition the high-dimensional methylation input into tractable chunks, a practical design choice that improves both training stability and interpretability compared to a single monolithic encoder. At the time of writing the work was a bioRxiv preprint and formal peer review was still pending; independent replication of the EWAS enrichment findings and benchmarking against alternative compression approaches such as PCA or variational autoencoders would further establish the method's utility. The model is particularly well-timed for the expanding use of large-scale methylation arrays in biobank and epidemiological contexts, where the need for compact, interpretable epigenomic representations is growing rapidly.

Citation

mEthAE: an Explainable AutoEncoder for methylation data

Preprint

Katz, S., et al. (2023) mEthAE: an Explainable AutoEncoder for methylation data. bioRxiv.

DOI: 10.1101/2023.07.18.549496

Recent citations

Papers that recently cited this model.

Fast Fourier transform is a training-free, ultrafast, highly efficient, and fully interpretable approach for epigenomic data compression
Max Ward, Bac Dao, Amittava Datta, et al.
bioRxiv · Sep 2025 · Citations: 0

DNA methylation profile to aid in the diagnosis of pancreatic ductal adenocarcinoma and its role in disease progression
Elena Grafenhorst, Teodor G. Calina, M. Dragomir
Epigenomics · Sep 2025 · Citations: 4

Biomarker integration and biosensor technologies enabling AI-driven insights into biological aging
Jared A Kushner, Mohit Pandey, S. Kohli
Frontiers in Aging · Aug 2025 · Citations: 4

Top citations

The most-cited papers that cite this model.

DNA methylation profile to aid in the diagnosis of pancreatic ductal adenocarcinoma and its role in disease progression
Elena Grafenhorst, Teodor G. Calina, M. Dragomir
Epigenomics · Sep 2025 · Citations: 4

Biomarker integration and biosensor technologies enabling AI-driven insights into biological aging
Jared A Kushner, Mohit Pandey, S. Kohli
Frontiers in Aging · Aug 2025 · Citations: 4

Bridging the gap in precision medicine: TranSYS training programme for next-generation scientists
L. Andreoli, Catalina Berca, Sonja Katz, et al.
Frontiers in Medicine · May 2024 · Citations: 4

Fast Fourier transform is a training-free, ultrafast, highly efficient, and fully interpretable approach for epigenomic data compression
Max Ward, Bac Dao, Amittava Datta, et al.
bioRxiv · Sep 2025 · Citations: 0

Citations

Total Citations: 4
Influential: 0
References: 68

GitHub

Stars: 3
Forks: 1
Open Issues: 1
Contributors: 1
Last Push: 2y ago
Language: Jupyter Notebook
License: MIT

Fields of citing research

Medicine: 100%
Biology: 50%
Computer Science: 50%
Education: 25%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Overall: 47 (Partial)
Usability (can I run it?): 51
Reproducibility (can I retrain it?): 47
Model Openness Framework: Unclassified (restrictive license on core components)

