Shandong University
Multi-scale deep biological language model for interpretable prediction of three DNA methylation types (4mC, 5hmC, 6mA) across multiple species using adaptive multi-scale k-mer BERT encoders.
iDNA-ABF (Interpretable DNA Adaptive Binary Feature) is a multi-scale deep biological language learning model for predicting DNA methylation sites, developed by Junru Jin, Leyi Wei, and colleagues at the School of Software and the Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR) at Shandong University, Jinan, China, in collaboration with researchers from Tianjin University and the University of Tokyo. It was published in Genome Biology in October 2022. The model targets three distinct DNA methylation modifications simultaneously: N4-methylcytosine (4mC), 5-hydroxymethylcytosine (5hmC), and N6-methyladenine (6mA). These modifications play distinct regulatory roles in organisms ranging from bacteria and fungi to plants and mammals.
The central challenge iDNA-ABF addresses is the development of a single generic predictor capable of accurately identifying all three methylation types across multiple species, rather than requiring separate specialized tools for each modification. Prior methods were either designed for a single methylation type (e.g., Deep6mA for 6mA only) or were generic but lacked interpretability or multi-scale representation. iDNA-ABF introduces an adaptive multi-scale architecture that processes input sequences at two k-mer granularities (3-mer and 6-mer) using separate BERT encoders and then fuses the resulting representations, allowing the model to capture both fine-grained local sequence motifs and broader sequence context simultaneously. Adversarial training during the fine-tuning phase further improves the robustness of predictions against input perturbations.
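The adversarial training step can be illustrated with a small sketch. The text above does not say which adversarial method is used, so the example assumes a Fast Gradient Method (FGM) style perturbation of the input embedding, a common choice for BERT fine-tuning. The logistic classifier, the function names (`loss_and_grad`, `fgm_perturb`), and the `epsilon` value are all hypothetical stand-ins, not the authors' implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(embedding, weights, label):
    """Binary cross-entropy of a toy logistic classifier, plus the
    gradient of that loss with respect to the input embedding."""
    z = sum(w * x for w, x in zip(weights, embedding))
    p = sigmoid(z)
    loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    grad = [(p - label) * w for w in weights]
    return loss, grad


def fgm_perturb(embedding, grad, epsilon=0.5):
    """Shift the embedding by epsilon along the L2-normalised gradient
    (assumed FGM-style perturbation; epsilon is an arbitrary toy value)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(embedding)
    return [x + epsilon * g / norm for x, g in zip(embedding, grad)]


# Toy embedding and classifier weights (made-up numbers).
embedding = [0.3, -0.2, 0.5]
weights = [1.0, -2.0, 0.5]
label = 1

clean_loss, grad = loss_and_grad(embedding, weights, label)
adv_embedding = fgm_perturb(embedding, grad)
adv_loss, _ = loss_and_grad(adv_embedding, weights, label)

# In adversarial fine-tuning, the parameter update would use both terms.
total_loss = clean_loss + adv_loss
```

Training on `total_loss` rather than `clean_loss` alone penalizes the model for being sensitive to small shifts in the embedding, which is the robustness property the paragraph describes.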
The model was benchmarked across 17 training/testing dataset combinations spanning different methylation types and species, and was compared to four state-of-the-art predictors: iDNA-ABT, iDNA-MS, BERT6mA, and Deep6mA. iDNA-ABF demonstrated superior or competitive performance in the majority of comparisons, establishing it as a broadly applicable tool for DNA methylation site prediction.
iDNA-ABF's computational workflow consists of four modules. The multi-scale data processing module takes a 41-nucleotide input window centered on the candidate modification site and tokenizes it independently into overlapping 3-mers and 6-mers, producing two token sequences of different lengths (39 and 36 tokens, respectively, at stride 1 and before any special tokens are added). Each token sequence is then processed by its own BERT encoder. Both encoders share the same architectural template (transformer layers with multi-head attention) but use different vocabularies, corresponding to the 3-mer and 6-mer token sets. After encoding, the feature fusion module concatenates the pooled representations from both BERT encoders into a single feature vector. The classification module applies a multi-layer perceptron with softmax activation to produce the final methylation probability.
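The tokenization and fusion steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: stride-1 overlapping k-mers, a vocabulary without special tokens, toy two-dimensional "pooled" vectors in place of real BERT outputs, and a single linear layer standing in for the multi-layer perceptron. The function names (`kmer_tokens`, `kmer_vocab`, `fuse_and_classify`) and all numeric values are hypothetical.

```python
import math
from itertools import product


def kmer_tokens(seq, k):
    """Split a sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_vocab(k, alphabet="ACGT"):
    """Map every possible k-mer to an integer id (special tokens omitted)."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(pooled_3mer, pooled_6mer, weights, bias):
    """Concatenate two pooled vectors, then apply a linear layer + softmax.

    One linear layer is used here as a stand-in for the MLP classifier.
    """
    fused = pooled_3mer + pooled_6mer  # list concatenation = feature fusion
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy 41-nt window; a real input would be centered on the candidate site.
window = "ACGT" * 10 + "A"
tokens_3 = kmer_tokens(window, 3)
tokens_6 = kmer_tokens(window, 6)

vocab_3 = kmer_vocab(3)  # 4**3 possible 3-mers
vocab_6 = kmer_vocab(6)  # 4**6 possible 6-mers
ids_3 = [vocab_3[t] for t in tokens_3]
ids_6 = [vocab_6[t] for t in tokens_6]

# Made-up pooled encoder outputs and classifier parameters.
pooled_3 = [0.2, -0.1]
pooled_6 = [0.4, 0.3]
weights = [[0.5, -0.5, 0.1, 0.2], [-0.3, 0.8, 0.0, 0.1]]
bias = [0.0, 0.0]
probs = fuse_and_classify(pooled_3, pooled_6, weights, bias)
```

The two token lists have different lengths because a window of length n yields n - k + 1 overlapping k-mers, which is why each scale needs its own encoder input and vocabulary.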
The model was trained and evaluated on a benchmark dataset derived from iDNA-MS, comprising 17 independent training/testing splits covering 4mC predictions in multiple organisms (including A. thaliana, C. elegans, D. melanogaster), 5hmC predictions in mouse and human, and 6mA predictions across several species including Arabidopsis, Drosophila, mouse, and rice. Across these 17 tasks, iDNA-ABF achieved the highest performance in the majority of settings when compared against iDNA-ABT, iDNA-MS, BERT6mA, and Deep6mA. Interpretability analysis of 3-mer attention patterns revealed position-specific sequence preferences consistent with known methyltransferase recognition sequences, supporting the authors' claim that the model extracts biologically meaningful features.
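To read attention over 3-mer tokens as position-specific preferences, token-level scores have to be mapped back onto nucleotide positions. The sketch below shows one simple way to do that: each token's score is spread over the k nucleotides it covers and averaged per position. This averaging scheme, the function name `token_scores_to_positions`, and the toy scores are assumptions for illustration; the text does not specify how the attention analysis was performed.

```python
def token_scores_to_positions(token_scores, k):
    """Average per-token scores over the nucleotides each k-mer covers.

    Token i (stride 1) covers positions i .. i + k - 1, so a list of
    n tokens maps back to n + k - 1 nucleotide positions.
    """
    seq_len = len(token_scores) + k - 1
    sums = [0.0] * seq_len
    counts = [0] * seq_len
    for start, score in enumerate(token_scores):
        for pos in range(start, start + k):
            sums[pos] += score
            counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy attention over the 39 3-mer tokens of a 41-nt window: the three
# tokens that overlap the central nucleotide (index 20) get a high score.
token_scores = [0.01] * 39
for i in (18, 19, 20):
    token_scores[i] = 0.5

position_scores = token_scores_to_positions(token_scores, k=3)
peak = max(range(len(position_scores)), key=position_scores.__getitem__)
```

With these toy scores, `peak` falls on the central position of the window, which is the kind of position-specific signal the analysis would look for in real attention maps.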
iDNA-ABF is most useful in comparative epigenomics studies where researchers need to survey 6mA, 4mC, or 5hmC distributions across multiple organisms or genomic contexts without maintaining separate prediction tools. Plant biologists, insect biologists, and microbiologists working with organisms that lack well-curated experimental methylation datasets benefit most from the model's cross-species generalization. The unified multi-type prediction capability also simplifies workflows in studies that need to profile all three methylation marks together, for example in organisms where 6mA and 4mC coexist. The interpretability module supports mechanistic studies of methyltransferase sequence specificity by generating testable predictions about which sequence motifs are recognized by specific methylation enzymes. The adversarial training also makes the model well suited to comparative analysis of closely related sequences, where minor variants may affect methylation status.
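Surveying a longer sequence means first cutting it into 41-nucleotide windows centered on candidate bases. The sketch below is a hypothetical pre-processing helper, not part of iDNA-ABF: it assumes forward-strand scanning only, treats adenine as the candidate base for 6mA and cytosine for 4mC and 5hmC, and skips sites too close to either end to fill a full window.

```python
# Candidate base per modification type (A for 6mA, C for 4mC and 5hmC).
TARGET_BASE = {"6mA": "A", "4mC": "C", "5hmC": "C"}


def candidate_windows(sequence, mod_type, flank=20):
    """Return (center_index, window) pairs of length 2 * flank + 1.

    Only forward-strand sites with a full flank on both sides are kept.
    """
    base = TARGET_BASE[mod_type]
    seq = sequence.upper()
    windows = []
    for i in range(flank, len(seq) - flank):
        if seq[i] == base:
            windows.append((i, seq[i - flank:i + flank + 1]))
    return windows


# Toy sequence with two adenines that each have 20 flanking bases.
toy = "G" * 20 + "A" + "G" * 20 + "A" + "G" * 20
sites_6ma = candidate_windows(toy, "6mA")
sites_4mc = candidate_windows(toy, "4mC")
```

Each returned window could then be tokenized and scored by a trained predictor; the choice to drop edge sites rather than pad them is a design decision made here for simplicity.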
iDNA-ABF advanced the state of the art in multi-type DNA methylation prediction by introducing adaptive multi-scale encoding within a pre-trained BERT framework, demonstrating that a single model architecture could match or exceed the performance of specialized single-modification predictors across diverse organisms. Its publication in Genome Biology signaled growing acceptance of biological language models for fine-grained epigenomic annotation tasks beyond protein sequence analysis. The adversarial training strategy, less commonly applied in epigenomics than in protein modeling or drug discovery, contributed a useful robustness technique to the field. One limitation is that model performance depends on the quality and completeness of the underlying methylation training datasets, which vary substantially across species; for understudied organisms, imprecise experimental labels can propagate into prediction errors regardless of model architecture.