Lieber Institute for Brain Development
Deep learning model combining CNN and transformer layers to predict DNA methylation regulatory variants in the human brain, enabling fine-mapping of psychiatric disorder risk loci.
INTERACT (Integrative Neural Network for Epigenetic Regulatory Analysis of Chromatin and Transcription) is a deep learning model that predicts the effect of genetic variation on DNA methylation levels at individual CpG sites in the human brain. Developed by Jiyun Zhou, Qiang Chen, Patricia R. Braun, and colleagues primarily at the Lieber Institute for Brain Development at the Johns Hopkins Medical Campus, it was published in Proceedings of the National Academy of Sciences (PNAS) in August 2022. The model addresses a fundamental challenge in human neuroscience genetics: identifying which DNA sequence variants functionally regulate the epigenome, and how these regulatory effects contribute to risk for psychiatric disorders such as schizophrenia.
DNA methylation in the brain is highly dynamic, cell-type-specific, and influenced by genetic variation at nearby and distal sequence positions. Methylation quantitative trait loci (mQTLs) — genetic variants that statistically associate with changes in methylation at specific CpG sites — provide experimental evidence for such regulatory effects, but mQTL studies are limited by sample size, linkage disequilibrium, and restricted coverage of CpG sites on standard arrays. INTERACT addresses these limitations by learning to predict CpG methylation levels from DNA sequence using a hybrid CNN-transformer architecture trained on hippocampal whole-genome bisulfite sequencing, and then using the trained model to score the predicted impact of any sequence variant on methylation at its surrounding CpG sites. This yields a quantitative "methylation regulatory variant" (MRV) score for every variant in the genome, independent of statistical association data.
A key application demonstrated in the PNAS paper is the use of INTERACT-derived MRV scores for fine-mapping schizophrenia GWAS risk loci. By overlaying predicted regulatory effects on genetic association signals, the researchers prioritized candidate causal variants and identified potential novel risk genes for schizophrenia, demonstrating the utility of epigenomic deep learning for translational neurogenetics. An extended version of the model, Cell-type-INTERACT, was subsequently developed to capture cell-type-specific methylation regulatory effects in the brain, and is available in a separate GitHub repository.
INTERACT's architecture consists of three sequential modules. The first module is a CNN encoder consisting of multiple convolutional layers with increasing filter sizes, designed to hierarchically extract local sequence features from the 2 kbp input window. The second module is a transformer encoder with multi-head self-attention, which processes the CNN output sequence to capture inter-element dependencies across the local regulatory landscape. The third module consists of fully connected layers that map the combined representation to predicted CpG methylation beta values (continuous values between 0 and 1).
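As a rough illustration of what the CNN encoder receives, the sketch below extracts a 2 kbp window around a CpG site and one-hot encodes it. The text does not specify how the window is positioned or how unknown bases are handled, so centering the window on the CpG, padding with N, and encoding N as all zeros are assumptions made here; the function names are hypothetical and not taken from the INTERACT code.

```python
from typing import List

BASES = "ACGT"
WINDOW = 2000  # 2 kbp input window, as stated in the text


def extract_window(chrom_seq: str, cpg_pos: int, window: int = WINDOW) -> str:
    """Return `window` bases around a 0-based CpG position.

    Assumption: the window is centered on the CpG and padded with N
    where it runs past either end of the sequence.
    """
    half = window // 2
    start, end = cpg_pos - half, cpg_pos + half
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(chrom_seq))
    core = chrom_seq[max(0, start):min(len(chrom_seq), end)]
    return left_pad + core + right_pad


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a (length x 4) matrix in A, C, G, T order.

    Assumption: any base outside ACGT (for example N) becomes a zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


if __name__ == "__main__":
    toy_chromosome = "ACGT" * 1000          # made-up sequence for illustration
    window_seq = extract_window(toy_chromosome, cpg_pos=2000)
    matrix = one_hot(window_seq)
    print(len(matrix), len(matrix[0]))      # rows = window length, columns = 4
```

A matrix of this shape is the kind of input a 1D convolutional stack would consume, with the transformer and fully connected layers operating on the convolutional output.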
Training proceeded in two phases. Pre-training used WGBS data from hippocampal tissue covering approximately 26 million CpG sites, training the model to predict measured methylation levels from local sequence alone. Fine-tuning used EPIC array methylation data from 21 subjects across four tissue types (brain, blood, saliva, buccal), which provided training signal at the ~850,000 CpG positions covered by the EPIC array.
After training, variant effect scores were generated for all common genetic variants across the genome by running the model twice per variant (once with the reference sequence and once with the alternate allele substituted) and taking the absolute difference in predicted methylation. These scores were validated against experimental mQTL evidence, showing strong enrichment of high-scoring INTERACT variants among known mQTL hits in brain tissue. Polygenic risk score analyses for schizophrenia showed significant improvement in risk prediction when incorporating INTERACT-derived MRV scores as annotations, particularly in non-European ancestry samples.
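The reference-versus-alternate scoring step described above can be sketched in a few lines. This is a minimal illustration for a single-nucleotide substitution and a single CpG window, not the published implementation: `toy_model` is a hypothetical stand-in (GC fraction) for the trained network, and how scores are aggregated when a variant sits near several CpG sites is not stated in the text and is left out.

```python
from typing import Callable


def toy_model(seq: str) -> float:
    """Hypothetical stand-in for a trained predictor.

    Returns the GC fraction of the window, which lies in [0, 1] like a
    methylation beta value. A real model would be the trained network.
    """
    return sum(base in "GC" for base in seq) / len(seq)


def apply_substitution(seq: str, offset: int, ref: str, alt: str) -> str:
    """Return `seq` with the base at `offset` changed from `ref` to `alt`."""
    if seq[offset] != ref:
        raise ValueError(f"expected {ref!r} at offset {offset}, found {seq[offset]!r}")
    return seq[:offset] + alt + seq[offset + 1:]


def mrv_score(model: Callable[[str], float], seq: str,
              offset: int, ref: str, alt: str) -> float:
    """Absolute difference between predictions for reference and alternate alleles."""
    ref_pred = model(seq)                                        # first pass: reference
    alt_pred = model(apply_substitution(seq, offset, ref, alt))  # second pass: alternate
    return abs(alt_pred - ref_pred)


if __name__ == "__main__":
    window = "AACGTTAA"  # made-up window; a real one would span 2 kbp
    print(mrv_score(toy_model, window, offset=0, ref="A", alt="G"))
```

Because the score is an absolute difference, it captures the size of the predicted effect but not its direction; keeping the signed difference alongside it would preserve that information.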
INTERACT's primary application domain is psychiatric and neurological genetics, where researchers need to identify which non-coding GWAS risk variants functionally regulate gene expression or chromatin state in brain tissue. The model enables researchers to prioritize variants for experimental validation among the many candidates in LD blocks that contain GWAS signals. It is also applicable in studies of neurodevelopmental conditions, Alzheimer's disease, bipolar disorder, and other brain-related traits where genome-wide association studies have identified non-coding risk loci. Beyond psychiatric genetics, INTERACT's predicted MRV scores provide a resource for understanding the regulatory grammar of DNA methylation in the brain, identifying which sequence motifs and transcription factor binding sites most strongly influence local CpG methylation. The model's approach generalizes in principle to any tissue where WGBS data are available for training and matched genotype data are available for mQTL-based validation.
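One simple way to picture variant prioritization within an LD block is to keep the variants that pass a GWAS significance cutoff and rank them by predicted MRV score. The sketch below does only that; it is a deliberately simplified illustration and not the fine-mapping procedure used in the paper. The `Candidate` fields, the variant identifiers, the cutoff, and all numeric values are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    variant_id: str    # hypothetical identifier
    gwas_p: float      # GWAS association p-value
    mrv_score: float   # predicted methylation regulatory effect


def prioritize(candidates: List[Candidate],
               p_max: float = 5e-8,
               top_n: int = 3) -> List[Candidate]:
    """Keep variants with p <= p_max, then rank by MRV score, highest first."""
    eligible = [c for c in candidates if c.gwas_p <= p_max]
    return sorted(eligible, key=lambda c: c.mrv_score, reverse=True)[:top_n]


if __name__ == "__main__":
    ld_block = [  # made-up values for illustration only
        Candidate("var_a", 1e-9, 0.02),
        Candidate("var_b", 3e-8, 0.15),
        Candidate("var_c", 1e-3, 0.40),   # large predicted effect, weak association
        Candidate("var_d", 2e-10, 0.07),
    ]
    for c in prioritize(ld_block, top_n=2):
        print(c.variant_id, c.mrv_score)
```

Variants in the same LD block often have similar p-values, which is why a sequence-based score that does not depend on association statistics can help separate them.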
INTERACT demonstrated that deep learning models trained on epigenomic data could generate quantitative regulatory variant scores that both validate against experimental mQTL evidence and improve the resolution of GWAS fine-mapping for psychiatric traits. By explicitly integrating sequence-based predictions with genetic association data, the work established a template for epigenomic deep learning to contribute to causal variant identification — a longstanding bottleneck in human genetics. The PNAS publication attracted attention from psychiatric genetics and functional genomics communities, and the follow-on Cell-type-INTERACT model extended the approach to cell-type-specific methylation prediction. A key limitation is that INTERACT, like all sequence-to-methylation models, predicts the genetically regulated component of methylation variation and cannot account for environmental or stochastic contributions. The model's dependence on brain tissue training data also limits its performance in non-brain tissues without retraining, though the multi-tissue fine-tuning strategy partially addresses this.