Nanjing University of Science and Technology
A hybrid deep learning model that predicts 12 types of RNA modifications by fusing fine-tuned DNABERT representations with CNN-encoded sequence features.
RNA modifications — chemical alterations to nucleotide residues after transcription — are now recognized as a central layer of gene expression regulation. Over 170 distinct RNA modifications have been documented, with types such as N6-methyladenosine (m6A), 5-methylcytosine (m5C), N1-methyladenosine (m1A), and pseudouridine (Psi) playing critical roles in mRNA stability, translation efficiency, and protein-RNA interactions. Disruption of these marks is linked to cancer, neurological disorders, and developmental defects. However, experimental mapping of modification sites genome-wide is expensive and technically demanding, making computational prediction an important complement to wet-lab approaches.
MRM-BERT is a unified computational framework developed by Ying Zhang and colleagues at Nanjing University of Science and Technology (NJUST) and published in IEEE/ACM Transactions on Computational Biology and Bioinformatics in 2023. The model addresses a persistent gap in the field: most existing predictors target only a single modification type, requiring researchers to run and manage multiple specialized tools. MRM-BERT instead offers a single, shared architecture capable of predicting 12 distinct RNA modification types across multiple species from raw sequence context alone.
The core strategy is a hybrid neural network that combines the contextual sequence representations learned by DNABERT (a pre-trained bidirectional transformer for nucleotide sequences) with convolutional neural network (CNN) modules that operate on traditional, biologically motivated sequence encodings. Fine-tuning DNABERT on each modification task captures long-range nucleotide context, while the CNN branch captures localized sequence motifs. The two branches are fused at the fully connected layer stage to produce a joint classification prediction, as sketched below, and this fusion outperforms single-representation methods across all 12 modification benchmarks.
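The fusion design can be made concrete with a short sketch. The following is a minimal illustration in PyTorch, assuming the HuggingFace transformers library and the public 3-mer DNABERT checkpoint (zhihan1996/DNA_bert_3); the filter counts, layer sizes, and use of the [CLS] vector are illustrative assumptions, not the paper's exact configuration.

    # Sketch of the two-branch fusion described above. Layer sizes, filter
    # counts, and the DNABERT checkpoint name are illustrative assumptions.
    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class HybridModificationPredictor(nn.Module):
        def __init__(self, bert_name="zhihan1996/DNA_bert_3",
                     cnn_channels=64, hidden=256):
            super().__init__()
            # BERT branch: 12-layer DNABERT, 768 hidden dimensions
            self.bert = AutoModel.from_pretrained(bert_name)
            # CNN branch: convolutions over one-hot nucleotides (4 channels)
            self.cnn = nn.Sequential(
                nn.Conv1d(4, cnn_channels, kernel_size=7, padding=3),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),  # global max pool over positions
            )
            # Fusion: concatenate both branch vectors, then classify
            self.classifier = nn.Sequential(
                nn.Linear(768 + cnn_channels, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 2),  # site / non-site
            )

        def forward(self, input_ids, attention_mask, one_hot):
            # input_ids/attention_mask: 3-mer tokens; one_hot: (batch, 4, 101)
            bert_out = self.bert(input_ids=input_ids,
                                 attention_mask=attention_mask)
            bert_vec = bert_out.last_hidden_state[:, 0]  # [CLS] representation
            cnn_vec = self.cnn(one_hot).squeeze(-1)      # (batch, cnn_channels)
            fused = torch.cat([bert_vec, cnn_vec], dim=-1)
            return self.classifier(fused)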
MRM-BERT processes centered sequence windows of 101 nucleotides around candidate modification sites. The BERT branch tokenizes these windows into 99 overlapping trinucleotide (3-mer) tokens using a sliding window and passes them through the 12-layer DNABERT transformer (768 hidden dimensions), which was originally pre-trained on human genomic sequences. Fine-tuning is performed independently for each of the 12 modification types, updating the full DNABERT stack plus a task-specific classification head. The CNN branch independently encodes sequences using selected traditional feature representations (including one-hot nucleotide encoding and k-mer frequency profiles), which are processed by convolutional filters optimized for local sequence motifs. The final hidden representations from both branches are concatenated and passed through fully connected layers to produce binary site/non-site predictions. Across 12 independently curated benchmark datasets, MRM-BERT consistently achieved higher AUC than contemporary single-representation methods.
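The window and token arithmetic is straightforward: a 101-nt window admits 101 - 3 + 1 = 99 overlapping 3-mers. The helpers below are a hypothetical sketch of that preprocessing (not the published code), producing the 3-mer token list for the BERT branch and the one-hot matrix for the CNN branch.

    # Hypothetical input-preparation helpers illustrating the preprocessing
    # described above; the published code may differ in detail.
    import numpy as np

    def to_3mer_tokens(seq):
        """Slide a width-3 window over the sequence: a 101-nt window
        yields 101 - 3 + 1 = 99 overlapping 3-mer tokens."""
        return [seq[i:i + 3] for i in range(len(seq) - 2)]

    def one_hot_encode(seq):
        """Encode A/C/G/U as a 4-channel one-hot matrix of shape
        (4, len(seq)), matching the Conv1d layout in the sketch above."""
        alphabet = "ACGU"
        mat = np.zeros((4, len(seq)), dtype=np.float32)
        for pos, base in enumerate(seq):
            idx = alphabet.find(base)
            if idx >= 0:  # unknown bases (e.g. N) stay all-zero
                mat[idx, pos] = 1.0
        return mat

    window = "AUGC" * 25 + "A"  # toy 101-nt window around a candidate site
    tokens = to_3mer_tokens(window)
    assert len(window) == 101 and len(tokens) == 99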
MRM-BERT is primarily targeted at researchers in epitranscriptomics, RNA biology, and computational genomics who need to annotate potential modification sites across transcriptomes without access to high-throughput experimental profiling. Practical applications include prioritizing candidate sites for validation by m6A-seq, bisulfite sequencing, or other detection methods; screening unannotated transcriptomes in non-model organisms; and supporting mechanistic studies of RNA modification reader and writer proteins. The multi-modification scope also makes MRM-BERT useful as a component in integrative pipelines exploring co-occurrence of different modification types on the same transcript.
MRM-BERT represents a meaningful step toward unified epitranscriptomic site prediction by demonstrating that a fine-tuned nucleotide language model, combined with conventional sequence encoding, can generalize across modification chemistries and species simultaneously. Published in a peer-reviewed IEEE/ACM journal, the work adds to the growing body of evidence that BERT-style transfer learning applies beyond protein sequences to RNA biology. Its open-source availability on GitHub facilitates direct adoption and extension by the community. A limitation worth noting is the reliance on DNABERT, a model pre-trained on DNA rather than RNA sequences, which means it does not capture RNA-specific structural or chemical context; future work incorporating RNA-specific pretrained models (such as RNA-FM or RNAErnie) may improve performance further. Nonetheless, MRM-BERT's unified multi-modification framework set a practical benchmark for subsequent multi-task RNA modification prediction approaches.
Zhang, Y., et al. (2023). Prediction of Multiple Types of RNA Modifications via Biological Language Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
DOI: 10.1109/TCBB.2023.3283985