bio.rodeo
DNA & Gene

INTERACT

Lieber Institute for Brain Development

Deep learning model combining CNN and transformer layers to predict DNA methylation regulatory variants in the human brain, enabling fine-mapping of psychiatric disorder risk loci.

Released: 2022

Overview

INTERACT (Integrative Neural Network for Epigenetic Regulatory Analysis of Chromatin and Transcription) is a deep learning model that predicts the effect of genetic variation on DNA methylation levels at individual CpG sites in the human brain. Developed by Jiyun Zhou, Qiang Chen, Patricia R. Braun, and colleagues primarily at the Lieber Institute for Brain Development at the Johns Hopkins Medical Campus, it was published in Proceedings of the National Academy of Sciences (PNAS) in August 2022. The model addresses a fundamental challenge in human neuroscience genetics: identifying which DNA sequence variants functionally regulate the epigenome, and how these regulatory effects contribute to risk for psychiatric disorders such as schizophrenia.

DNA methylation in the brain is highly dynamic, cell-type-specific, and influenced by genetic variation at both nearby and distal sequence positions. Methylation quantitative trait loci (mQTLs), genetic variants that are statistically associated with changes in methylation at specific CpG sites, provide empirical evidence for such regulatory effects, but mQTL studies are limited by sample size, confounded by linkage disequilibrium, and restricted to the CpG sites covered by standard arrays. INTERACT addresses these limitations by learning to predict CpG methylation levels from DNA sequence using a hybrid CNN-transformer architecture trained on hippocampal whole-genome bisulfite sequencing data, and then using the trained model to score the predicted impact of any sequence variant on methylation at its surrounding CpG sites. This yields a quantitative "methylation regulatory variant" (MRV) score that can be computed for any variant in the genome without relying on statistical association data.

A key application demonstrated in the PNAS paper is the use of INTERACT-derived MRV scores for fine-mapping schizophrenia GWAS risk loci. By overlaying predicted regulatory effects on genetic association signals, the researchers prioritized candidate causal variants and identified potential novel risk genes for schizophrenia, demonstrating the utility of epigenomic deep learning for translational neurogenetics. An extended version of the model, Cell-type-INTERACT, was subsequently developed to capture cell-type-specific methylation regulatory effects in the brain, and is available in a separate GitHub repository.

Key Features

  • Hybrid CNN-transformer architecture: INTERACT integrates a convolutional neural network module — which captures local DNA sequence motifs such as transcription factor binding sites and CpG context signals — with a transformer module that uses self-attention to capture long-range sequence interactions. Fully connected layers then predict CpG methylation levels from the combined representations.
  • 2 kbp input window centered on CpG sites: The model accepts 2-kilobase DNA sequence windows centered on the CpG site of interest, providing sufficient context to capture proximal regulatory elements including nearby transcription factor binding motifs and co-methylated regions, while remaining computationally efficient.
  • Variant effect scoring by in silico mutagenesis: Regulatory variant scores are computed by comparing INTERACT's methylation predictions for the reference allele and alternate allele at each variant position, generating a continuous score for the predicted methylation-altering effect of any nucleotide change without requiring matched genotype and methylation data.
  • Brain-specific training: The model was pre-trained on approximately 26 million CpG sites from whole-genome bisulfite sequencing (WGBS) of hippocampal tissue and fine-tuned on EPIC array methylation data from brain, blood, saliva, and buccal tissue across 21 donors, ensuring that predictions capture the tissue-specific epigenomic context of the brain.
  • Transcription factor motif discovery: Post-hoc analysis of the model's convolutional filters revealed 37 transcription factors whose DNA binding motifs match INTERACT's learned sequence features in a brain-specific manner, including ZIC2, NR2F1, PPARD, and RXRG — factors with known roles in neurodevelopment and neuropsychiatric disorders.
  • GWAS fine-mapping integration: Predicted MRV scores can be directly overlaid on GWAS summary statistics to perform epigenomic fine-mapping of risk loci. Applied to schizophrenia, this approach identified 124 risk loci where INTERACT predictions prioritized specific candidate regulatory variants and novel target genes.
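The input handling described in the list above (a 2 kbp window centered on a CpG site) can be sketched roughly as follows. This is an illustration only, not the authors' code: the helper names `extract_window` and `one_hot`, the padding of chromosome ends with `N`, and the A/C/G/T channel order are all assumptions.

```python
WINDOW = 2000   # 2 kbp input window, as described above
BASES = "ACGT"  # channel order is an assumption

def extract_window(chrom_seq: str, cpg_pos: int, window: int = WINDOW) -> str:
    """Return a `window`-bp sequence in which the cytosine of the CpG at
    0-based position `cpg_pos` sits at index window // 2.
    Positions that fall outside the chromosome are padded with 'N'."""
    half = window // 2
    start = cpg_pos - half
    end = start + window
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(chrom_seq))
    core = chrom_seq[max(0, start):min(len(chrom_seq), end)]
    return left_pad + core + right_pad

def one_hot(seq: str) -> list[list[float]]:
    """Encode a DNA string as one row per base; unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

# Toy usage with a short made-up sequence and an 8-bp window.
toy_chrom = "A" * 10 + "CG" + "T" * 10
toy_window = extract_window(toy_chrom, cpg_pos=10, window=8)
encoded = one_hot(toy_window)
```

Padding with `N` keeps every window the same length, so sites near chromosome ends can still be encoded; whether the original pipeline pads or discards such sites is not stated here.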

Technical Details

INTERACT's architecture consists of three sequential modules. The first module is a CNN encoder consisting of multiple convolutional layers with increasing filter sizes, designed to hierarchically extract local sequence features from the 2 kbp input window. The second module is a transformer encoder with multi-head self-attention, which processes the CNN output sequence to capture inter-element dependencies across the local regulatory landscape. The third module consists of fully connected layers that map the combined representation to predicted CpG methylation beta values (continuous values between 0 and 1).
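To make the data flow through the three modules concrete, here is a toy, dependency-free sketch: one convolution with ReLU, one single-head self-attention step, and a pooled dense output squashed into the 0 to 1 range. Everything specific in it is an assumption for illustration (a single filter, identity query/key/value projections, mean pooling, a sigmoid output, and the weights themselves); the real model's layer counts and hyperparameters are not given in this description.

```python
import math

def conv1d_relu(x, kernel):
    """Module 1 (toy): valid 1D convolution over positions, then ReLU.
    x: list of L one-hot rows (length 4); kernel: list of K weight rows (length 4).
    Returns L - K + 1 one-element feature vectors."""
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        s = sum(x[i + j][c] * kernel[j][c]
                for j in range(k) for c in range(len(kernel[j])))
        out.append([max(0.0, s)])
    return out

def self_attention(h):
    """Module 2 (toy): single-head scaled dot-product self-attention
    with identity Q/K/V projections (an assumption for brevity)."""
    d = len(h[0])
    out = []
    for q in h:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in h]
        m = max(scores)                       # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[c] for w, v in zip(weights, h))
                    for c in range(d)])
    return out

def dense_sigmoid(h, weight=1.0, bias=0.0):
    """Module 3 (toy): mean-pool over positions, one dense unit, sigmoid,
    so the output lies strictly between 0 and 1 like a beta value."""
    pooled = sum(v[0] for v in h) / len(h)
    return 1.0 / (1.0 + math.exp(-(weight * pooled + bias)))

def predict_methylation(x, kernel, weight=1.0, bias=0.0):
    """Chain the three toy modules: CNN -> self-attention -> dense output."""
    return dense_sigmoid(self_attention(conv1d_relu(x, kernel)), weight, bias)

# Toy usage: one-hot "ACGCG" and a hand-written filter that responds to "CG".
toy_x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
cg_filter = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
toy_beta = predict_methylation(toy_x, cg_filter)
```

The point of the ordering is that attention operates on the convolution's output positions rather than on raw bases, so each attended element already summarizes a short motif-sized stretch of sequence.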

Training proceeded in two phases. Pre-training used WGBS data from hippocampal tissue covering approximately 26 million CpG sites, training the model to predict measured methylation levels from local sequence alone. Fine-tuning used EPIC array methylation data from 21 subjects across four tissue types (brain, blood, saliva, buccal), which provided training signal at the ~850,000 CpG positions covered by the EPIC array. After training, variant effect scores were generated for all common genetic variants across the genome by running the model twice per variant — once with the reference sequence and once with the alternate allele substituted — and taking the absolute difference in predicted methylation. These scores were validated against experimental mQTL evidence, showing strong enrichment of high-scoring INTERACT variants among known mQTL hits in brain tissue. Polygenic risk score analyses for schizophrenia showed significant improvement in risk prediction when incorporating INTERACT-derived MRV scores as annotations, particularly in non-European ancestry samples.
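The two-pass scoring step described above can be sketched as follows, assuming single-nucleotide substitutions and any callable that maps a sequence window to a predicted methylation level. The function names `substitute_allele` and `mrv_score` are hypothetical, and `toy_predict` is a deliberately crude stand-in for the trained model, used only so the example runs.

```python
from typing import Callable

def substitute_allele(window: str, offset: int, ref: str, alt: str) -> str:
    """Return the window with the single base at `offset` changed from ref to alt."""
    if window[offset] != ref:
        raise ValueError(f"expected {ref!r} at offset {offset}, found {window[offset]!r}")
    return window[:offset] + alt + window[offset + 1:]

def mrv_score(predict: Callable[[str], float], window: str,
              offset: int, ref: str, alt: str) -> float:
    """Run the model twice (reference and alternate allele) and return
    the absolute difference in predicted methylation."""
    ref_pred = predict(window)
    alt_pred = predict(substitute_allele(window, offset, ref, alt))
    return abs(alt_pred - ref_pred)

def toy_predict(seq: str) -> float:
    """Stand-in model (NOT INTERACT): methylation rises with CpG count, capped at 1."""
    return min(1.0, seq.count("CG") / 10)

# A G>A change that destroys the only CpG in a made-up 6-bp window.
score = mrv_score(toy_predict, "AACGTT", offset=3, ref="G", alt="A")
```

Checking the reference base before substituting guards against off-by-one coordinate errors, which would otherwise silently produce a score for the wrong position.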

Applications

INTERACT's primary application domain is psychiatric and neurological genetics, where researchers need to identify which non-coding GWAS risk variants functionally regulate gene expression or chromatin state in brain tissue. The model helps prioritize variants for experimental validation among the many candidates in the linkage disequilibrium (LD) blocks that contain GWAS signals. It is also applicable to studies of neurodevelopmental conditions, Alzheimer's disease, bipolar disorder, and other brain-related traits for which genome-wide association studies have identified non-coding risk loci. Beyond psychiatric genetics, INTERACT's predicted MRV scores provide a resource for understanding the regulatory grammar of DNA methylation in the brain, identifying which sequence motifs and transcription factor binding sites most strongly influence local CpG methylation. Because the model learns from sequence and measured methylation alone, the approach should in principle extend to any tissue for which WGBS methylation data are available for training.
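As a rough illustration of variant prioritization within a locus, the sketch below keeps variants that pass a GWAS significance threshold and ranks them by predicted MRV score. This filter-then-rank heuristic is an assumption made for illustration, not the fine-mapping procedure used in the paper; the `Variant` type, the `prioritize` function, and all IDs and values are placeholders. The default threshold of 5e-8 is the conventional genome-wide significance level.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    variant_id: str   # placeholder identifier, not a real rsID
    gwas_p: float     # GWAS association p-value
    mrv_score: float  # predicted methylation regulatory variant score

def prioritize(variants: list[Variant], p_threshold: float = 5e-8,
               top_n: int = 3) -> list[Variant]:
    """Keep variants at or below the p-value threshold, then return the
    top_n ranked by descending MRV score."""
    significant = [v for v in variants if v.gwas_p <= p_threshold]
    return sorted(significant, key=lambda v: v.mrv_score, reverse=True)[:top_n]

# Placeholder values for variants in one LD block.
locus = [
    Variant("var_a", 1e-9, 0.02),
    Variant("var_b", 3e-9, 0.31),
    Variant("var_c", 2e-4, 0.90),   # high score but not GWAS-significant
    Variant("var_d", 4e-8, 0.12),
]
ranked = prioritize(locus)
```

In this toy locus the variants are in tight LD and have similar association strength, so the association signal alone cannot separate them; the sequence-based score supplies an independent ordering.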

Impact

INTERACT demonstrated that deep learning models trained on epigenomic data could generate quantitative regulatory variant scores that both validate against experimental mQTL evidence and improve the resolution of GWAS fine-mapping for psychiatric traits. By explicitly integrating sequence-based predictions with genetic association data, the work established a template for epigenomic deep learning to contribute to causal variant identification — a longstanding bottleneck in human genetics. The PNAS publication attracted attention from psychiatric genetics and functional genomics communities, and the follow-on Cell-type-INTERACT model extended the approach to cell-type-specific methylation prediction. A key limitation is that INTERACT, like all sequence-to-methylation models, predicts the genetically regulated component of methylation variation and cannot account for environmental or stochastic contributions. The model's dependence on brain tissue training data also limits its performance in non-brain tissues without retraining, though the multi-tissue fine-tuning strategy partially addresses this.

Tags

epigenomic prediction, variant effect prediction, transformer, transfer learning, deep learning, DNA methylation, epigenomics

Resources

  • GitHub Repository
  • Research Paper