Kyushu Institute of Technology
BERT-based deep learning model for predicting DNA N6-methyladenine (6mA) modification sites across multiple species, using word2vec encoding and cross-species transfer learning.
BERT6mA is a deep learning model for predicting the locations of N6-methyladenine (6mA) DNA modification sites, developed by Sho Tsukiyama, Md Mehedi Hasan, Hong-Wen Deng, and Hiroyuki Kurata at the Department of Bioscience and Bioinformatics, Kyushu Institute of Technology (KIT) in Iizuka, Japan, and published in Briefings in Bioinformatics in March 2022. DNA N6-methyladenine is a modification of adenine bases in which a methyl group is added at the N6 position; it occurs across a wide range of organisms, from bacteria and simple eukaryotes to plants and insects, and, though controversially and at lower abundance, in vertebrates including mammals. The 6mA mark has been linked to important roles in DNA replication, DNA repair, and transcriptional and gene expression regulation, and its dysregulation has been associated with developmental defects and disease phenotypes.
Identifying 6mA sites experimentally is costly and time-consuming, driving demand for accurate computational predictors. BERT6mA adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture, pre-trained via masked language modeling on genomic sequence data, to classifying potential 6mA sites from their local DNA sequence context. The model is evaluated across 11 species with highly variable data availability, from well-studied organisms with thousands of verified 6mA sites to less-studied species with only a few hundred examples. A key contribution is a cross-species transfer learning strategy that substantially improves performance for data-limited species by pre-training on related organisms and fine-tuning on the target species.
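As a rough illustration of the masked language modeling objective, the sketch below masks a random fraction of 4-mer tokens and trains a small transformer encoder to reconstruct them. This is a minimal PyTorch sketch under assumed details (vocabulary construction, 15% masking rate, learnable positional embeddings), not the authors' training code.

```python
import torch
import torch.nn as nn

VOCAB = 4 ** 4 + 1                      # all 4-mers over {A,C,G,T} plus a [MASK] id (assumed)
MASK_ID = VOCAB - 1
D_MODEL, N_HEADS, N_LAYERS, SEQ_LEN = 100, 4, 3, 38

class TinyMLM(nn.Module):
    """Small transformer encoder trained to reconstruct masked 4-mer tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))   # learnable positions
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS,
                                           dim_feedforward=400, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)                    # token reconstruction head

    def forward(self, tokens):
        h = self.embed(tokens) + self.pos
        return self.lm_head(self.encoder(h))

def mlm_loss(model, tokens, mask_rate=0.15):
    """Mask ~15% of tokens at random and score the reconstruction."""
    corrupted = tokens.clone()
    is_masked = torch.rand(tokens.shape) < mask_rate
    corrupted[is_masked] = MASK_ID
    logits = model(corrupted)
    return nn.functional.cross_entropy(logits[is_masked], tokens[is_masked])

model = TinyMLM()
batch = torch.randint(0, 4 ** 4, (8, 38))   # 8 windows of 38 four-mer tokens each
mlm_loss(model, batch).backward()
```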
A systematic comparison of multiple encoding approaches and model architectures, including classical one-hot and k-mer encodings combined with CNNs and LSTMs as well as five BERT model variants, demonstrates that BERT combined with word2vec encoding achieves the highest area under the ROC curve (AUC) in 8 of the 11 species tested, leading the authors to recommend BERT6mA for 6mA prediction across diverse organisms.
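For concreteness, here is a minimal sketch of how 100-dimensional word2vec embeddings over overlapping 4-mers could be trained with gensim. The toy corpus and hyperparameters (skip-gram, context window, epoch count) are illustrative assumptions, not the paper's settings.

```python
from gensim.models import Word2Vec

def to_kmers(seq, k=4):
    """Split a DNA sequence into overlapping k-mers; a 41-nt window yields 41-4+1 = 38 tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy corpus standing in for real training data: each 41-nt window centered
# on a candidate adenine becomes one "sentence" of 4-mer "words".
windows = [
    "ACGTGCTAGCTAGGATCGATAGCTAGGCTAGCATCGATCGA",
    "TTGACGATCGATCGAGGCTAAGCTAGCATGCATCGATCGAT",
]
sentences = [to_kmers(w) for w in windows]

# 100-dimensional skip-gram embeddings; window size and epochs are assumptions.
w2v = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1, epochs=10)
vec = w2v.wv["ACGT"]   # 100-dim embedding vector for one 4-mer
```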
BERT6mA implements a three-layer BERT encoder with 4 attention heads and hidden dimension 100. The input sequence is a 41-nucleotide window centered on the candidate adenine site, tokenized into overlapping 4-mers (yielding 38 tokens) and embedded via pre-trained word2vec vectors of dimension 100, resulting in a 38 × 100 input matrix. The classification head consists of a linear layer with softmax activation applied to the pooled [CLS] token representation. All 7 encoding variants tested use the same 41-nucleotide window but differ in tokenization strategy and embedding dimensionality.
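The described architecture can be sketched in PyTorch as follows. The learnable [CLS] vector prepended to the 38 embedded tokens and the use of nn.TransformerEncoder are assumptions made for illustration; the authors' exact implementation may differ.

```python
import torch
import torch.nn as nn

class BERT6mAHead(nn.Module):
    """Sketch of the described classifier: 3 encoder layers, 4 heads, hidden size 100."""
    def __init__(self, d_model=100, n_heads=4, n_layers=3):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable [CLS] vector (assumed)
        self.pos = nn.Parameter(torch.zeros(1, 39, d_model))  # positions for [CLS] + 38 tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=400, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, 2)               # softmax over {6mA, non-6mA}

    def forward(self, x):
        # x: (batch, 38, 100) word2vec-embedded 4-mer tokens from a 41-nt window
        cls = self.cls.expand(x.size(0), -1, -1)
        h = torch.cat([cls, x], dim=1) + self.pos
        h = self.encoder(h)
        return torch.softmax(self.classifier(h[:, 0]), dim=-1)  # pooled [CLS] -> class probs

model = BERT6mAHead()
probs = model(torch.randn(8, 38, 100))   # 8 windows -> (8, 2) class probabilities
```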
Performance on the independent test sets across 11 species showed BERT6mA with word2vec encoding achieving the highest AUCs in 8 species, outperforming earlier methods including iDNA-MS, iDNA-ABT, Deep6mA, and 6mA-Pred. For species with very small training sets (under 600 samples), the transfer learning approach, pre-trained on a data-rich species and then fine-tuned on the target, outperformed models trained from scratch on the limited target data alone, demonstrating meaningful cross-species generalization. Attention weight analysis revealed that BERT6mA preferentially attends to positions immediately flanking the central adenine, consistent with the known sequence context preferences of the methyltransferases responsible for 6mA deposition.
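The transfer learning recipe reduces to two passes of the same training loop: pre-train on the data-rich source species, then fine-tune on the small target set. The sketch below substitutes a trivial linear classifier and random tensors for the real encoder and datasets; the lower fine-tuning learning rate is an assumption.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def run_epochs(model, loader, epochs, lr):
    """One generic training loop, reused for both pre-training and fine-tuning."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stand-in classifier over pre-embedded (38, 100) windows; outputs logits.
model = nn.Sequential(nn.Flatten(), nn.Linear(38 * 100, 2))

# Toy tensors standing in for real datasets: a large source species, a small target one.
source = DataLoader(TensorDataset(torch.randn(2000, 38, 100),
                                  torch.randint(0, 2, (2000,))), batch_size=64)
target = DataLoader(TensorDataset(torch.randn(300, 38, 100),
                                  torch.randint(0, 2, (300,))), batch_size=32)

run_epochs(model, source, epochs=5, lr=1e-3)   # pre-train on the data-rich species
run_epochs(model, target, epochs=5, lr=1e-4)   # fine-tune on the data-limited target
```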
BERT6mA is most directly applicable to researchers studying DNA N6-methyladenine in non-mammalian organisms, where 6mA is abundant and its functional roles in regulating transcription initiation, DNA replication timing, and epigenetic inheritance are well established. Plant biologists studying Arabidopsis or crop species, insect biologists studying Drosophila or other hexapods, and researchers working on unicellular model eukaryotes such as Chlamydomonas or Tetrahymena can use BERT6mA to generate genome-wide 6mA predictions as a complement to experimental approaches. The model also suits comparative epigenomics studies that track 6mA distribution across phylogenetically diverse organisms. For mammalian researchers, where 6mA abundance is debated, the model provides a tool to computationally evaluate candidate sites identified by 6mA-seq or other experimental methods.
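Genome-wide application amounts to scoring a 41-nucleotide window around every adenine. A minimal sketch of the window extraction step follows; the adenine_windows helper is hypothetical, not part of the released tool.

```python
def adenine_windows(genome, flank=20):
    """Yield (position, window) for every adenine with a full 41-nt context.

    `genome` is a plain DNA string; positions are 0-based. Sites closer than
    `flank` to either end are skipped here; padding them is another option.
    """
    for i, base in enumerate(genome):
        if base == "A" and flank <= i < len(genome) - flank:
            yield i, genome[i - flank:i + flank + 1]

# Each candidate window can then be 4-mer tokenized, embedded, and scored.
seq = "TTGACGATCGATCGAGGCTAAGCTAGCATGCATCGATCGATCCGGTA"
for pos, window in adenine_windows(seq):
    assert len(window) == 41 and window[20] == "A"
```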
BERT6mA demonstrated that BERT-style pre-trained language models could be adapted effectively to the domain of DNA modification site prediction with compact architectures (3 layers, 4 heads) rather than the large models typical of NLP applications, an important practical finding for computational biologists working with limited GPU resources. The cross-species transfer learning strategy addressed a concrete and common challenge in epigenomics: many biologically important species have insufficient experimental data for training reliable predictors from scratch. The web server deployment has extended accessibility to wet-lab researchers across diverse organism systems. A known limitation is that the 41-nucleotide input window captures only local sequence context around candidate sites, potentially missing longer-range regulatory features that influence methyltransferase accessibility; methods that incorporate broader genomic context may improve predictions, particularly in complex eukaryotic genomes where chromatin state strongly modulates methylation.