China University of Mining and Technology
Ensemble multiscale deep learning model for RNA methylation site prediction, combining dilated convolution and BiLSTM with multiple sequence encodings.
EMDLP (Ensemble Multiscale Deep Learning Predictor) is a computational tool developed at the China University of Mining and Technology (CUMT) for identifying RNA methylation sites from sequence data. RNA methylation, particularly N6-methyladenosine (m6A) and N1-methyladenosine (m1A), is among the most prevalent and functionally important post-transcriptional RNA modifications. These chemical marks regulate mRNA stability, translation efficiency, and splicing, and their dysregulation has been linked to cancer, neurological disorders, and viral infection. Accurate computational prediction of methylation sites is therefore valuable for researchers who lack access to high-throughput epitranscriptomic sequencing methods such as MeRIP-seq.
EMDLP addresses this problem by combining multiple sequence encoding strategies with a hybrid neural architecture that captures both local sequence context and long-range dependencies. The key insight driving the model is that different encodings reveal complementary information about the sequence neighborhood around a putative methylation site, and that an ensemble integrating the three encoding streams it uses (one-hot, RNA word embedding, and RGloVe) yields more robust predictions than any single encoding alone.
Published in BMC Bioinformatics in June 2022, EMDLP reported state-of-the-art performance at the time on benchmark datasets for both m1A and m6A site prediction, and is accompanied by a publicly accessible web server that allows users to submit RNA sequences without installing local software.
EMDLP's core architecture, the DCB module, chains a dilated convolutional neural network to a bidirectional LSTM (BiLSTM). When dilation rates grow geometrically from layer to layer, the effective receptive field expands exponentially with depth while each layer keeps the same number of parameters as an undilated convolution with the same kernel size, allowing the model to integrate information from nucleotides far upstream and downstream of the candidate site. The subsequent BiLSTM then reads the convolved feature maps in both directions, capturing sequential dependencies that convolutional layers alone do not model explicitly. This combination is applied independently to features derived from each of the three encoding schemes: one-hot vectors provide a sparse, position-specific representation; RNA word embeddings project k-mer substrings into a dense continuous space; and RGloVe augments the standard GloVe objective with RNA-specific co-occurrence statistics for improved contextual representations.
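The input encodings and the receptive-field claim can be illustrated with a minimal, standard-library sketch. It is not taken from the EMDLP codebase: the function names, the choice of k = 3, the kernel size, and the dilation schedule are all assumptions made for illustration, and the dense embedding and RGloVe steps are omitted.

```python
from typing import List

NUCLEOTIDES = "ACGU"  # assumed alphabet; DNA-style T is mapped to U below


def one_hot(seq: str) -> List[List[int]]:
    """One 4-dim indicator vector per nucleotide; unknown bases become all zeros."""
    seq = seq.upper().replace("T", "U")
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq]


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Overlapping k-mer 'words' (stride 1) that an embedding table would index."""
    seq = seq.upper().replace("T", "U")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Receptive field of stacked stride-1 convolutions with the given dilation rates."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation positions.
        rf += (kernel_size - 1) * d
    return rf


print(kmer_tokens("GGACU"))              # ['GGA', 'GAC', 'ACU']
print(receptive_field(3, [1, 2, 4, 8]))  # 31 (doubling dilation rates)
print(receptive_field(3, [1, 1, 1, 1]))  # 9  (same depth, no dilation)
```

With four layers of kernel size 3, doubling the dilation rate covers 31 positions against 9 for undilated layers, while the number of weights per layer is identical in both cases.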
The three DCB branches produce independent probability scores for each candidate site, which are combined by soft voting (averaging the predicted probabilities) to produce the final classification. On benchmark datasets, EMDLP achieved an AUROC of 95.56% for m1A prediction and 85.24% for m6A prediction, exceeding previously reported state-of-the-art results on both tasks at the time of publication. Training and evaluation followed the standard positive/negative split conventions used in the RNA modification prediction literature, with balanced sampling to mitigate class imbalance.
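Soft voting as described above reduces to a per-site mean over the branch outputs. The sketch below assumes three lists of probabilities, one per encoding branch, and a 0.5 decision threshold; the function name, the example scores, and the threshold are illustrative assumptions rather than values from the published model.

```python
from typing import List, Sequence, Tuple


def soft_vote(
    branch_probs: Sequence[Sequence[float]], threshold: float = 0.5
) -> Tuple[List[float], List[int]]:
    """Average per-site probabilities across branches, then threshold the mean."""
    n_sites = len(branch_probs[0])
    if any(len(p) != n_sites for p in branch_probs):
        raise ValueError("every branch must score the same number of sites")
    # zip(*...) groups the scores that different branches gave to the same site.
    averaged = [sum(scores) / len(branch_probs) for scores in zip(*branch_probs)]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels


# Hypothetical scores for three candidate sites from the three branches.
one_hot_branch = [0.9, 0.4, 0.2]
embedding_branch = [0.8, 0.6, 0.1]
rglove_branch = [0.7, 0.7, 0.3]

probs, labels = soft_vote([one_hot_branch, embedding_branch, rglove_branch])
print(labels)  # [1, 1, 0]
```

The second site shows the intended effect: one branch scores it below 0.5, but the averaged probability (about 0.57) still classifies it as positive.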
EMDLP is designed for researchers studying epitranscriptomics, the chemical modification landscape of RNA, who need to prioritize candidate methylation sites for experimental validation. Molecular biologists investigating the regulatory roles of m6A or m1A in a specific transcript can use the web server to rapidly assess which adenosine positions are most likely to carry modifications, reducing the cost and scale of downstream MeRIP-seq or antibody-based enrichment experiments. The tool is also useful in large-scale transcriptome analyses: given a set of transcript sequences, EMDLP can generate transcriptome-wide methylation site predictions that serve as hypotheses for follow-up functional assays. Cancer researchers and RNA biologists investigating modifications in viral RNAs or non-coding RNA classes can apply the model to any RNA species of interest, provided that training-domain considerations are kept in mind.
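Scoring every adenosine in a transcript requires cutting one fixed-length window per candidate site before any model is applied. The sketch below shows one way to do that under stated assumptions: the flank of 20 nucleotides (a 41-nt window), the "N" padding character, and the function name are hypothetical choices, not the input format documented for the EMDLP web server or codebase.

```python
from typing import Iterator, Tuple


def adenosine_windows(
    transcript: str, flank: int = 20, pad: str = "N"
) -> Iterator[Tuple[int, str]]:
    """Yield (0-based position, window) for every adenosine, centred on the site."""
    seq = transcript.upper().replace("T", "U")
    # Pad both ends so sites near the transcript boundaries still get full windows.
    padded = pad * flank + seq + pad * flank
    for pos, base in enumerate(seq):
        if base == "A":
            # In padded coordinates the site sits at pos + flank,
            # so the window of length 2 * flank + 1 starts at pos.
            yield pos, padded[pos:pos + 2 * flank + 1]


for position, window in adenosine_windows("AUGA", flank=1):
    print(position, window)
# 0 NAU
# 3 GAN
```

Each window can then be encoded and scored, and the resulting probabilities sorted to rank positions for experimental follow-up.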
EMDLP contributes to a growing body of sequence-based tools for epitranscriptomic site prediction, sitting alongside methods such as SRAMP, m6ANet, and DeepM6ASeq. Its primary technical contribution is the systematic integration of three encoding strategies in a single ensemble framework, demonstrating that encoding diversity adds signal beyond what any single representation provides. The accompanying web server lowers the barrier to entry for wet-lab researchers without bioinformatics infrastructure. The model is a focused, task-specific tool rather than a broadly pre-trained foundation model, so its predictions are most reliable within the sequence contexts represented in its training data. Users applying EMDLP to RNA species or organisms substantially different from the training distribution should interpret predictions with caution and consider retraining on domain-specific data using the provided codebase.
Wang, H., et al. (2022) EMDLP: Ensemble multiscale deep learning model for RNA methylation site prediction. BMC Bioinformatics.
DOI: 10.1186/s12859-022-04756-1