DNA & Gene

MuLan-Methyl

University of Tübingen

Multi-language transformer framework using five pre-trained language models to predict DNA methylation (6mA, 4mC, 5hmC) across species.

Released: 2023

Overview

MuLan-Methyl (Multi-Language Methylation prediction) is a deep learning framework for predicting DNA methylation sites across multiple species and methylation types, developed by Wenhuan Zeng, Anupam Gautam, and Daniel H. Huson at the Algorithms in Bioinformatics group within the Institute for Bioinformatics and Medical Informatics at the University of Tübingen, Germany. The work was first posted as a bioRxiv preprint in January 2023 and published in GigaScience in July 2023 (giad054). MuLan-Methyl targets three functionally important DNA modifications: N6-methyladenine (6mA), N4-methylcytosine (4mC), and 5-hydroxymethylcytosine (5hmC) — modifications found across phylogenetically diverse organisms and associated with transcriptional regulation, DNA repair, and epigenetic inheritance.

The "multi-language" design philosophy that gives the model its name is its ensemble of five independently trained transformer-based language models — BERT, DistilBERT, ALBERT, XLNet, and ELECTRA — each pre-trained on a shared DNA sequence corpus and fine-tuned on the same methylation classification tasks. Rather than selecting the "best" language model architecture, MuLan-Methyl treats the five models as complementary learners whose predictions are averaged, leveraging the diversity of architectural inductive biases to produce a more robust ensemble. This approach is analogous to ensemble methods in classical machine learning but applies the principle at the level of pre-trained language model architectures.

A distinctive aspect of MuLan-Methyl is its taxonomy-aware training corpus. Rather than treating DNA sequences as pure sequence data, the model's pre-training corpus incorporates taxonomic lineage information from NCBI and GTDB for each training sequence, allowing the model to learn how methylation patterns vary across the tree of life and to generalize predictions to species not seen during training. This cross-species generalization capability is experimentally validated by applying the model to genomes whose taxonomic lineages were excluded from the training set.
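Corpus construction can be pictured as follows: a minimal sketch based on the paper's description of 41-nucleotide windows encoded as overlapping 6-mers and paired with a taxonomic lineage string. The separator and exact formatting here are illustrative assumptions:

```python
def to_pretraining_sentence(fragment: str, lineage: str, k: int = 6) -> str:
    """Build one pre-training 'sentence' in the spirit of the joint corpus:
    a 41-nt window centered on the candidate site, encoded as overlapping
    k-mers, followed by the source organism's taxonomic lineage."""
    kmers = [fragment[i:i + k] for i in range(len(fragment) - k + 1)]
    taxa = " ".join(t.strip() for t in lineage.split(";"))
    return " ".join(kmers) + " " + taxa

# Example: a 41-nt window with the candidate adenine at the center (position 21)
window = "ACGTACGTACGTACGTACGT" + "A" + "ACGTACGTACGTACGTACGT"
print(to_pretraining_sentence(
    window, "Eukaryota; Viridiplantae; Streptophyta; Arabidopsis thaliana"))
```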

Key Features

  • Ensemble of five transformer language models: MuLan-Methyl trains and evaluates BERT, DistilBERT, ALBERT, XLNet, and ELECTRA independently on the same methylation prediction task. Final predictions are produced by averaging the probability outputs from all five models, producing an ensemble that is more accurate and robust than any single model alone.
  • Taxonomy-aware pre-training corpus: The pre-training corpus combines DNA sequence fragments (encoded as 6-mers) with taxonomic lineage descriptions from the NCBI taxonomy and the GTDB database. This joint corpus of 2.44 million "sentences" — each pairing a DNA fragment with its organism's taxonomic classification — teaches the models how sequence context and evolutionary lineage jointly encode methylation patterns.
  • Custom tokenizers per model: Each of the five language models receives a custom tokenizer trained on the pre-training corpus, adapted to the combined vocabulary of DNA k-mers and taxonomic terms. A shared vocabulary of 25,000 words covers the space of 6-mer DNA tokens and taxonomy terms across the training data (see the tokenizer sketch after this list).
  • Cross-species generalization: By encoding taxonomic lineage as part of the input representation, MuLan-Methyl can generate predictions for species not present in the training data. Validation experiments showed that the model maintains reasonable prediction performance for novel taxonomic lineages, outperforming methods that lack taxonomic context.
  • Three methylation types in a unified framework: A single trained MuLan-Methyl ensemble handles 6mA, 4mC, and 5hmC prediction without requiring separate specialist models for each type. The model is evaluated on 17 benchmark combinations of methylation type and species drawn from the iDNA-MS dataset, the same benchmark used to evaluate competing methods.
  • Web server deployment: A publicly accessible web server at the University of Tübingen (plabase.cs.uni-tuebingen.de/mm/) allows users to submit sequences and receive methylation predictions without local installation, extending the tool's reach to researchers without computational infrastructure.
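The tokenizer training mentioned above can be pictured with the HuggingFace tokenizers library. This is a minimal BERT-style sketch, not the released training script: the WordPiece choice, corpus path, and special tokens are illustrative assumptions, and architectures such as XLNet would use a SentencePiece-style tokenizer instead.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a WordPiece tokenizer on the joint corpus of 6-mer "words" and
# taxonomy terms, capped at the 25,000-word vocabulary from the paper.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=25_000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["pretraining_corpus.txt"], trainer=trainer)  # hypothetical path
tokenizer.save("mulan_methyl_tokenizer.json")
```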

Technical Details

MuLan-Methyl's pre-training phase processes a corpus of 2.44 million sentences generated from (a) DNA sequence fragments of 41 nucleotides centered on candidate modification sites, encoded as 6-mers and concatenated with (b) taxonomic lineage descriptions (e.g., "Eukaryota; Viridiplantae; Streptophyta; Arabidopsis thaliana") derived from the NCBI taxonomy and GTDB databases for the source organism. All five language model architectures (BERT, DistilBERT, ALBERT, XLNet, ELECTRA) are then pre-trained from scratch on this corpus via self-supervised language modeling, with each model using its own tokenizer trained on the 25,000-word vocabulary. Fine-tuning then adapts each pre-trained model to the binary classification task of predicting whether a candidate site carries a specific methylation mark, using the 17-dataset benchmark derived from the iDNA-MS collection.
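As a rough illustration of this two-phase recipe, here is a compressed sketch using the HuggingFace transformers library, with BERT standing in for the five architectures. Paths, configuration sizes, and the elided training loop are placeholder assumptions, not the released code:

```python
from transformers import (BertConfig, BertForMaskedLM,
                          BertForSequenceClassification,
                          PreTrainedTokenizerFast)

# Phase 1: pre-train one of the five architectures (BERT shown) from
# scratch on the 2.44M-sentence corpus.
tok = PreTrainedTokenizerFast(
    tokenizer_file="mulan_methyl_tokenizer.json",  # from the tokenizer step
    unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)
config = BertConfig(vocab_size=tok.vocab_size)  # sizes are placeholders
mlm_model = BertForMaskedLM(config)
# ... pre-train with DataCollatorForLanguageModeling + Trainer ...

# Phase 2: fine-tune the pre-trained encoder as a binary classifier
# (methylated vs. unmethylated) on one of the 17 iDNA-MS benchmark sets.
clf = BertForSequenceClassification.from_pretrained(
    "path/to/pretrained_checkpoint", num_labels=2)  # hypothetical path
```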

Benchmark evaluation on the iDNA-MS independent test sets showed MuLan-Methyl outperforming competing methods (iDNA-ABF and iDNA-ABT) in 13 of 17 methylation type-species combinations. Performance gains were most pronounced for methylation types with strong sequence-level determinants (6mA in several organisms) and for species with moderate training data where the taxonomic context provided additional discriminative signal. Cross-species generalization experiments showed that withholding an organism's lineage from training and then evaluating on that organism resulted in moderate performance degradation, but the model remained substantially better than random prediction, demonstrating genuine cross-species transfer.

Applications

MuLan-Methyl is particularly useful in comparative genomics and metagenomics contexts where researchers need methylation predictions for non-model organisms or for taxonomically diverse sets of genomes. Microbiologists studying epigenetic regulation in bacteria and archaea — where 6mA and 4mC are the dominant methylation types and play important roles in restriction-modification defense and transcriptional regulation — benefit from the model's broad species coverage. Plant biologists studying epigenomic variation across crop species and their wild relatives can apply MuLan-Methyl to predict methylation patterns in species for which experimental data are unavailable. The model is also applicable in environmental metagenomics, where assembled genomic sequences from complex microbial communities can be processed to predict methylation patterns as a proxy for epigenetic state in uncultured organisms.

Researchers who need a computationally accessible, broadly applicable methylation prediction tool will find MuLan-Methyl's web server and open-source code straightforward to use. The taxonomy-aware design also makes it a natural starting point for studies that need to connect epigenomic variation to evolutionary diversification, for instance in phylogenomics studies that examine how methylation patterns correlate with speciation events or ecological adaptation.

Impact

MuLan-Methyl represents a methodological advance by applying transformer language model ensembles — a strategy common in NLP but relatively novel in epigenomics — to DNA modification prediction, and by demonstrating that incorporating taxonomic lineage information in the pre-training corpus provides a concrete mechanism for cross-species generalization. The GigaScience publication and accompanying web server have made the approach accessible to a broad community. The work also highlights the value of model diversity: despite the extra computation required to train five language models, the ensemble consistently outperforms individual models, providing a practical argument for investing in multi-model workflows for biological sequence analysis. A key limitation is the computational cost of training and maintaining five large pre-trained language models; future work using knowledge distillation or more parameter-efficient architectures could reduce this overhead without sacrificing ensemble accuracy.

Tags

epigenomic prediction · variant effect prediction · transformer · transfer learning · self-supervised · foundation model · DNA methylation · epigenomics

Resources

  • GitHub Repository
  • Research Paper
  • Official Website