Kyushu Institute of Technology
BERT-based deep learning model for predicting DNA N6-methyladenine (6mA) modification sites across multiple species, using word2vec encoding and cross-species transfer learning.
BERT6mA is a deep learning model for predicting the locations of N6-methyladenine (6mA) DNA modification sites, developed by Sho Tsukiyama, Md Mehedi Hasan, Hong-Wen Deng, and Hiroyuki Kurata at the Department of Bioscience and Bioinformatics, Kyushu Institute of Technology (KIT) in Iizuka, Japan, and published in Briefings in Bioinformatics in March 2022. DNA N6-methyladenine is a modification of adenine bases in which a methyl group is added at the N6 position; it occurs across a wide range of organisms, from bacteria and simple eukaryotes to plants and insects, and, though controversially and at lower abundance, in vertebrates including mammals. The 6mA mark has been linked to important roles in DNA replication, DNA repair, and transcriptional and gene expression regulation, and its dysregulation has been associated with developmental defects and disease phenotypes.
Identifying 6mA sites experimentally is costly and time-consuming, driving demand for accurate computational predictors. BERT6mA adapts the BERT (Bidirectional Encoder Representations from Transformers) architecture, pre-trained via masked language modeling on genomic sequence data, to classifying potential 6mA sites from their local DNA sequence context. The model is evaluated across 11 species with highly variable data availability, from well-studied organisms with thousands of verified 6mA sites to less-studied species with only a few hundred examples. A key contribution is a cross-species transfer learning strategy that substantially improves performance for data-limited species by pre-training on related organisms and fine-tuning on the target species.
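As a rough illustration of the masked language modeling objective, the sketch below masks a random fraction of 4-mer tokens and trains a small transformer encoder to reconstruct them. This is a minimal PyTorch sketch under assumed details (vocabulary construction, 15% masking rate, learnable positional embeddings), not the authors' training code.

```python
import torch
import torch.nn as nn

VOCAB = 4 ** 4 + 1                      # all 4-mers over {A,C,G,T} plus a [MASK] id (assumed)
MASK_ID = VOCAB - 1
D_MODEL, N_HEADS, N_LAYERS, SEQ_LEN = 100, 4, 3, 38

class TinyMLM(nn.Module):
    """Small transformer encoder trained to reconstruct masked 4-mer tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))   # learnable positions
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEADS,
                                           dim_feedforward=400, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)                    # token reconstruction head

    def forward(self, tokens):
        h = self.embed(tokens) + self.pos
        return self.lm_head(self.encoder(h))

def mlm_loss(model, tokens, mask_rate=0.15):
    """Mask ~15% of tokens at random and score the reconstruction."""
    corrupted = tokens.clone()
    is_masked = torch.rand(tokens.shape) < mask_rate
    corrupted[is_masked] = MASK_ID
    logits = model(corrupted)
    return nn.functional.cross_entropy(logits[is_masked], tokens[is_masked])

model = TinyMLM()
batch = torch.randint(0, 4 ** 4, (8, 38))   # 8 windows of 38 four-mer tokens each
mlm_loss(model, batch).backward()
```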
A systematic comparison of multiple encoding approaches and model architectures, including classical one-hot and k-mer encodings combined with CNNs and LSTMs as well as five BERT model variants, demonstrates that BERT combined with word2vec encoding achieves the highest area under the ROC curve (AUC) in 8 of the 11 species tested, leading the authors to recommend BERT6mA for 6mA prediction across diverse organisms.
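For concreteness, here is a minimal sketch of how 100-dimensional word2vec embeddings over overlapping 4-mers could be trained with gensim. The toy corpus and hyperparameters (skip-gram, context window, epoch count) are illustrative assumptions, not the paper's settings.

```python
from gensim.models import Word2Vec

def to_kmers(seq, k=4):
    """Split a DNA sequence into overlapping k-mers; a 41-nt window yields 41-4+1 = 38 tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy corpus standing in for real training data: each 41-nt window centered
# on a candidate adenine becomes one "sentence" of 4-mer "words".
windows = [
    "ACGTGCTAGCTAGGATCGATAGCTAGGCTAGCATCGATCGA",
    "TTGACGATCGATCGAGGCTAAGCTAGCATGCATCGATCGAT",
]
sentences = [to_kmers(w) for w in windows]

# 100-dimensional skip-gram embeddings; window size and epochs are assumptions.
w2v = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1, epochs=10)
vec = w2v.wv["ACGT"]   # 100-dim embedding vector for one 4-mer
```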
BERT6mA implements a three-layer BERT encoder with 4 attention heads and hidden dimension 100. The input sequence is a 41-nucleotide window centered on the candidate adenine site, tokenized into overlapping 4-mers (yielding 38 tokens) and embedded via pre-trained word2vec vectors of dimension 100, resulting in a 38 × 100 input matrix. The classification head consists of a linear layer with softmax activation applied to the pooled [CLS] token representation. All 7 encoding variants tested use the same 41-nucleotide window but differ in tokenization strategy and embedding dimensionality.
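The described architecture can be sketched in PyTorch as follows. The learnable [CLS] vector prepended to the 38 embedded tokens and the use of nn.TransformerEncoder are assumptions made for illustration; the authors' exact implementation may differ.

```python
import torch
import torch.nn as nn

class BERT6mAHead(nn.Module):
    """Sketch of the described classifier: 3 encoder layers, 4 heads, hidden size 100."""
    def __init__(self, d_model=100, n_heads=4, n_layers=3):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))   # learnable [CLS] vector (assumed)
        self.pos = nn.Parameter(torch.zeros(1, 39, d_model))  # positions for [CLS] + 38 tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=400, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, 2)               # softmax over {6mA, non-6mA}

    def forward(self, x):
        # x: (batch, 38, 100) word2vec-embedded 4-mer tokens from a 41-nt window
        cls = self.cls.expand(x.size(0), -1, -1)
        h = torch.cat([cls, x], dim=1) + self.pos
        h = self.encoder(h)
        return torch.softmax(self.classifier(h[:, 0]), dim=-1)  # pooled [CLS] -> class probs

model = BERT6mAHead()
probs = model(torch.randn(8, 38, 100))   # 8 windows -> (8, 2) class probabilities
```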
Performance on the independent test sets across 11 species showed BERT6mA with word2vec encoding achieving the highest AUCs in 8 species, outperforming earlier methods including iDNA-MS, iDNA-ABT, Deep6mA, and 6mA-Pred. For species with very small training sets (under 600 samples), the transfer learning approach, pre-trained on a data-rich species and then fine-tuned on the target, outperformed models trained from scratch on the limited target data alone, demonstrating meaningful cross-species generalization. Attention weight analysis revealed that BERT6mA preferentially attends to positions immediately flanking the central adenine, consistent with the known sequence context preferences of the methyltransferases responsible for 6mA deposition.
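The transfer learning recipe reduces to two passes of the same training loop: pre-train on the data-rich source species, then fine-tune on the small target set. The sketch below substitutes a trivial linear classifier and random tensors for the real encoder and datasets; the lower fine-tuning learning rate is an assumption.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def run_epochs(model, loader, epochs, lr):
    """One generic training loop, reused for both pre-training and fine-tuning."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stand-in classifier over pre-embedded (38, 100) windows; outputs logits.
model = nn.Sequential(nn.Flatten(), nn.Linear(38 * 100, 2))

# Toy tensors standing in for real datasets: a large source species, a small target one.
source = DataLoader(TensorDataset(torch.randn(2000, 38, 100),
                                  torch.randint(0, 2, (2000,))), batch_size=64)
target = DataLoader(TensorDataset(torch.randn(300, 38, 100),
                                  torch.randint(0, 2, (300,))), batch_size=32)

run_epochs(model, source, epochs=5, lr=1e-3)   # pre-train on the data-rich species
run_epochs(model, target, epochs=5, lr=1e-4)   # fine-tune on the data-limited target
```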
BERT6mA is most directly applicable to researchers studying DNA N6-methyladenine in non-mammalian organisms, where 6mA is abundant and its functional roles in regulating transcription initiation, DNA replication timing, and epigenetic inheritance are well established. Plant biologists studying Arabidopsis or crop species, insect biologists studying Drosophila or other hexapods, and researchers working on unicellular model eukaryotes such as Chlamydomonas or Tetrahymena can use BERT6mA to generate genome-wide 6mA predictions as a complement to experimental approaches. The model also suits comparative epigenomics studies that track 6mA distribution across phylogenetically diverse organisms. For mammalian researchers, where 6mA abundance is debated, the model provides a tool to computationally evaluate candidate sites identified by 6mA-seq or other experimental methods.
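Genome-wide application amounts to scoring a 41-nucleotide window around every adenine. A minimal sketch of the window extraction step follows; the adenine_windows helper is hypothetical, not part of the released tool.

```python
def adenine_windows(genome, flank=20):
    """Yield (position, window) for every adenine with a full 41-nt context.

    `genome` is a plain DNA string; positions are 0-based. Sites closer than
    `flank` to either end are skipped here; padding them is another option.
    """
    for i, base in enumerate(genome):
        if base == "A" and flank <= i < len(genome) - flank:
            yield i, genome[i - flank:i + flank + 1]

# Each candidate window can then be 4-mer tokenized, embedded, and scored.
seq = "TTGACGATCGATCGAGGCTAAGCTAGCATGCATCGATCGATCCGGTA"
for pos, window in adenine_windows(seq):
    assert len(window) == 41 and window[20] == "A"
```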
BERT6mA demonstrated that BERT-style pre-trained language models could be adapted effectively to the domain of DNA modification site prediction with compact architectures (3 layers, 4 heads) rather than the large models typical of NLP applications, an important practical finding for computational biologists working with limited GPU resources. The cross-species transfer learning strategy addressed a concrete and common challenge in epigenomics: many biologically important species have insufficient experimental data for training reliable predictors from scratch. The web server deployment has extended accessibility to wet-lab researchers across diverse organism systems. A known limitation is that the 41-nucleotide input window captures only local sequence context around candidate sites, potentially missing longer-range regulatory features that influence methyltransferase accessibility; methods that incorporate broader genomic context may improve predictions, particularly in complex eukaryotic genomes where chromatin state strongly modulates methylation.