University of Tübingen
Multi-language transformer framework using five pre-trained language models to predict DNA methylation (6mA, 4mC, 5hmC) across species.
MuLan-Methyl (Multi-Language Methylation prediction) is a deep learning framework for predicting DNA methylation sites across multiple species and methylation types, developed by Wenhuan Zeng, Anupam Gautam, and Daniel H. Huson at the Algorithms in Bioinformatics group within the Institute for Bioinformatics and Medical Informatics at the University of Tübingen, Germany. The work was first posted as a bioRxiv preprint in January 2023 and published in GigaScience in July 2023 (giad054). MuLan-Methyl targets three functionally important DNA modifications: N6-methyladenine (6mA), N4-methylcytosine (4mC), and 5-hydroxymethylcytosine (5hmC) — modifications found across phylogenetically diverse organisms and associated with transcriptional regulation, DNA repair, and epigenetic inheritance.
The "multi-language" design that gives the model its name refers to its ensemble of five independently trained transformer-based language models — BERT, DistilBERT, ALBERT, XLNet, and ELECTRA — each pre-trained on a shared DNA sequence corpus and fine-tuned on the same methylation classification tasks. Rather than selecting a single "best" architecture, MuLan-Methyl treats the five models as complementary learners whose predicted probabilities are averaged, so that the diversity of architectural inductive biases yields a more robust ensemble. This is analogous to ensemble methods in classical machine learning, applied here at the level of pre-trained language model architectures.
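The averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name, the unweighted mean, and the 0.5 decision threshold are assumptions for the sketch.

```python
import numpy as np

def ensemble_predict(per_model_probs, threshold=0.5):
    """Average per-model methylation probabilities and threshold the result.

    per_model_probs: shape (n_models, n_sites); each row holds one language
    model's predicted probability that each candidate site is methylated.
    (Hypothetical helper; MuLan-Methyl's actual implementation may differ.)
    """
    probs = np.asarray(per_model_probs, dtype=float)
    mean_prob = probs.mean(axis=0)                 # unweighted average over the five models
    labels = (mean_prob >= threshold).astype(int)  # final binary call per site
    return mean_prob, labels

# Five models scoring three candidate sites (illustrative numbers):
scores = [
    [0.91, 0.40, 0.12],  # BERT
    [0.85, 0.55, 0.20],  # DistilBERT
    [0.88, 0.48, 0.05],  # ALBERT
    [0.95, 0.61, 0.15],  # XLNet
    [0.90, 0.35, 0.08],  # ELECTRA
]
mean_prob, labels = ensemble_predict(scores)
# mean_prob ≈ [0.898, 0.478, 0.12]; labels = [1, 0, 0]
```

A site on which the architectures disagree (the second column) ends up near the decision boundary, which is exactly where averaging across diverse inductive biases helps most.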
A distinctive aspect of MuLan-Methyl is its taxonomy-aware training corpus. Rather than treating DNA as raw sequence data alone, the pre-training corpus pairs each training sequence with taxonomic lineage information from NCBI and GTDB, allowing the model to learn how methylation patterns vary across the tree of life and to generalize to species not seen during training. This cross-species generalization is validated experimentally by applying the model to genomes whose taxonomic lineages were excluded from the training set.
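The lineage-exclusion evaluation described above amounts to a taxonomy-aware data split. A minimal sketch, assuming a hypothetical record format of (sequence, lineage, label) with the lineage as a list of taxon names; this is not the authors' pipeline:

```python
def lineage_holdout_split(records, held_out_taxon):
    """Split records so no training sequence shares the held-out taxon.

    records: iterable of (sequence, lineage, label), where lineage is a list
    of taxon names from domain down to species (assumed format).
    """
    train, test = [], []
    for seq, lineage, label in records:
        # Any record whose lineage contains the held-out taxon goes to test.
        (test if held_out_taxon in lineage else train).append((seq, lineage, label))
    return train, test

records = [
    ("ACGT" * 10 + "A", ["Bacteria", "Proteobacteria", "Escherichia coli"], 1),
    ("TGCA" * 10 + "C", ["Eukaryota", "Viridiplantae", "Arabidopsis thaliana"], 0),
]
train, test = lineage_holdout_split(records, "Arabidopsis thaliana")
# train holds only the E. coli record; test holds only the A. thaliana record
```

Holding out at a higher rank (e.g., a genus or family name) tests generalization to progressively more distant, entirely unseen clades.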
MuLan-Methyl's pre-training phase processes a corpus of 2.44 million sentences, each generated from (a) a DNA sequence fragment of 41 nucleotides centered on a candidate modification site, encoded as overlapping 6-mers, concatenated with (b) a taxonomic lineage description (e.g., "Eukaryota; Viridiplantae; Streptophyta; Arabidopsis thaliana") derived from the NCBI taxonomy and GTDB databases for the source organism. All five language model architectures (BERT, DistilBERT, ALBERT, XLNet, ELECTRA) are then pre-trained from scratch on this corpus with self-supervised language modeling (masked language modeling for the BERT-style models; XLNet and ELECTRA use their own permutation and replaced-token-detection objectives), each with its own tokenizer trained on the 25,000-word vocabulary. Fine-tuning then adapts each pre-trained model to the binary classification task of predicting whether a candidate site carries a specific methylation mark, using the 17-dataset benchmark derived from the iDNA-MS collection.
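The sentence-construction step above can be sketched directly. The exact separators and formatting below are assumptions for illustration, not the published specification:

```python
def to_kmers(seq, k=6):
    """Overlapping k-mers with stride 1, e.g. a 41-nt window -> 36 six-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def make_sentence(window, lineage, k=6):
    """Join the k-mer 'words' with the lineage 'phrase' into one sentence.

    window: 41-nt fragment centered on the candidate site.
    lineage: list of taxon names (hypothetical input format).
    """
    assert len(window) == 41, "window must be 41 nt centered on the site"
    return " ".join(to_kmers(window, k)) + " " + "; ".join(lineage)

window = "ACGT" * 10 + "A"  # 41 nt; the central base (index 20) is an adenine
lineage = ["Eukaryota", "Viridiplantae", "Streptophyta", "Arabidopsis thaliana"]
sentence = make_sentence(window, lineage)
# 36 space-separated 6-mers followed by the semicolon-joined lineage
```

Each architecture's tokenizer is then trained on sentences of this form, so the models see DNA "words" and taxonomic "words" in a single shared vocabulary.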
Benchmark evaluation on the iDNA-MS independent test sets showed MuLan-Methyl outperforming competing methods (iDNA-ABF and iDNA-ABT) in 13 of 17 methylation type-species combinations. Performance gains were most pronounced for methylation types with strong sequence-level determinants (6mA in several organisms) and for species with moderate training data where the taxonomic context provided additional discriminative signal. Cross-species generalization experiments showed that withholding an organism's lineage from training and then evaluating on that organism resulted in moderate performance degradation, but the model remained substantially better than random prediction, demonstrating genuine cross-species transfer.
MuLan-Methyl is particularly useful in comparative genomics and metagenomics contexts where researchers need methylation predictions for non-model organisms or for taxonomically diverse sets of genomes. Microbiologists studying epigenetic regulation in bacteria and archaea — where 6mA and 4mC are the dominant methylation types and play important roles in restriction-modification defense and transcriptional regulation — benefit from the model's broad species coverage. Plant biologists studying epigenomic variation across crop species and their wild relatives can apply MuLan-Methyl to predict methylation patterns in species for which experimental data are unavailable. The model is also applicable in environmental metagenomics, where assembled genomic sequences from complex microbial communities can be processed to predict methylation patterns as a proxy for epigenetic state in uncultured organisms. Researchers who need a computationally accessible, broadly applicable methylation prediction tool will find MuLan-Methyl's web server and open-source code straightforward to use. The taxonomy-aware design also makes it a natural starting point for studies that need to connect epigenomic variation to evolutionary diversification, for instance in phylogenomics studies that examine how methylation patterns correlate with speciation events or ecological adaptation.
MuLan-Methyl represents a methodological advance by applying transformer language model ensembles — a strategy common in NLP but relatively novel in epigenomics — to DNA modification prediction, and by demonstrating that incorporating taxonomic lineage information in the pre-training corpus provides a concrete mechanism for cross-species generalization. The GigaScience publication and accompanying web server have made the approach accessible to a broad community. The work also highlights the value of model diversity: despite the extra computation required to train five language models, the ensemble consistently outperforms individual models, providing a practical argument for investing in multi-model workflows for biological sequence analysis. A key limitation is the computational cost of training and maintaining five large pre-trained language models; future work using knowledge distillation or more parameter-efficient architectures could reduce this overhead without sacrificing ensemble accuracy.