Motif-oriented DNA pre-training framework using an ELECTRA-style generator-discriminator architecture to learn biologically informed genomic representations.
MoDNA is a self-supervised pre-training framework for DNA sequence representation that incorporates biological prior knowledge — specifically, the concept of sequence motifs — directly into the training objective. Developed by Weizhi An, Yuzhi Guo, and colleagues at the University of Texas at Arlington, MoDNA was introduced at the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '22) in August 2022, with an extended journal version published in BioMedInformatics in 2024.
The core motivation behind MoDNA is that standard masked language modeling, as applied to DNA in models such as DNABERT, treats genomic sequences as plain text and ignores the rich regulatory grammar encoded in functional sequence motifs. These short, recurring patterns — recognized by transcription factors, RNA polymerases, and other molecular machinery — carry biological meaning that uniform token masking does not capture. MoDNA addresses this gap by making motif prediction an explicit pre-training task alongside standard token reconstruction.
To achieve computationally efficient training, MoDNA adopts the ELECTRA framework rather than the more widely used BERT paradigm. This marks one of the first applications of the ELECTRA architecture to genomic sequence modeling, yielding a generator-discriminator system that supervises all input tokens during pre-training rather than only the masked subset, substantially improving sample efficiency relative to masked language models of equivalent size.
MoDNA uses a two-component transformer architecture inspired by ELECTRA. The generator is a smaller transformer that takes 6-mer tokenized DNA sequences (maximum length 512) with a 15% masking rate and produces candidate replacement tokens. The discriminator is a standard BERT-scale transformer that receives the resulting sequences and is trained on two tasks simultaneously: identifying which tokens were replaced by the generator (a binary classification at every position) and predicting whether a functional motif is present at each position, using motif occurrence labels derived from established databases. This joint objective forces the discriminator to develop representations that are sensitive to both sequence authenticity and regulatory function. Pre-training uses the human reference genome as the primary data source. The ELECTRA training strategy provides full-sequence supervision — every token contributes to the loss — rather than the ~15% of tokens supervised in masked language modeling, making MoDNA more parameter-efficient for a given compute budget.
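The data-preparation side of this scheme can be sketched in plain Python. This is an illustrative reconstruction, not the authors' code: the 6-mer tokenizer and 15% masking follow the description above, while the generator's replacement sample, the motif list (a TATA-box-like pattern), and all function names are assumptions for demonstration.

```python
import random

def kmer_tokenize(seq, k=6):
    """Overlapping k-mer tokens, as in DNABERT-style 6-mer tokenization."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, rate=0.15, rng=None):
    """Replace ~rate of tokens with [MASK]; return masked copy + positions."""
    rng = rng or random.Random(0)
    positions = [i for i in range(len(tokens)) if rng.random() < rate]
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions

def replaced_labels(original, corrupted):
    """ELECTRA discriminator target: 1 where the token differs from the
    original, else 0 -- defined at EVERY position, not just masked ones."""
    return [int(o != c) for o, c in zip(original, corrupted)]

def motif_labels(tokens, motifs=("TATAAA",)):
    """Per-position motif-occurrence target: 1 if the 6-mer matches a
    known motif (illustrative list; MoDNA derives these from databases)."""
    return [int(t in motifs) for t in tokens]

seq = "GGCTATAAAGGC"
tokens = kmer_tokenize(seq)                # 7 overlapping 6-mers
masked, positions = mask_tokens(tokens, rate=0.3)
# Stand-in for the generator: fill each [MASK] with some plausible 6-mer.
corrupted = [t if t != "[MASK]" else "ACGTAC" for t in masked]
rtd = replaced_labels(tokens, corrupted)   # supervision at every position
motif = motif_labels(tokens)               # marks the TATAAA-containing token
```

The point of the sketch is the shape of the supervision: `rtd` and `motif` are both full-length label vectors, so the discriminator receives a loss signal at all positions, which is the sample-efficiency argument made above.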
MoDNA is designed for regulatory genomics tasks that require understanding of DNA sequence function. Primary demonstrated applications include promoter region prediction and transcription factor binding site identification across diverse ChIP-seq datasets. The model is suitable for any task where researchers need to determine whether a genomic region of interest is functionally active — for example, prioritizing candidate regulatory elements in GWAS loci, annotating non-coding variants, or studying cis-regulatory grammar in gene expression experiments. The pre-trained model can be fine-tuned with relatively small labeled datasets, making it practical for research groups without access to large-scale labeled genomic annotation data.
MoDNA demonstrated that incorporating domain-specific biological knowledge into genomic pre-training objectives meaningfully improves performance on regulatory prediction tasks compared to treating DNA as plain text. Its adoption of the ELECTRA framework for genomics was an early proof of concept that alternatives to masked language modeling could offer better computational efficiency for DNA models — a theme that has since recurred in subsequent work. The model has been cited in benchmarking studies of DNA language models and in methodological comparisons of genomic pre-training strategies. Limitations include the restriction to sequences of 512 nucleotides (preventing direct modeling of long-range genomic interactions), exclusive pre-training on the human genome (limiting cross-species generalizability without additional fine-tuning), and the use of fixed 6-mer tokenization rather than adaptive or single-nucleotide tokenization approaches explored in more recent models.
An, W., et al. (2022) MoDNA: motif-oriented pre-training for DNA language model. 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '22).
DOI: 10.1145/3535508.3545512
An, W., et al. (2024) Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA. BioMedInformatics.
DOI: 10.3390/biomedinformatics4020085