Motif-oriented DNA pre-training framework using an ELECTRA-style generator-discriminator architecture to learn biologically informed genomic representations.
MoDNA is a self-supervised pre-training framework for DNA sequence representation that incorporates biological prior knowledge — specifically, the concept of sequence motifs — directly into the training objective. Developed by Weizhi An, Yuzhi Guo, and colleagues at the University of Texas at Arlington, MoDNA was introduced at the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '22) in August 2022, with an extended journal version published in BioMedInformatics in 2024.
The core motivation behind MoDNA is that standard masked language modeling, as applied to DNA in models such as DNABERT, treats genomic sequences as plain text and ignores the rich regulatory grammar encoded in functional sequence motifs. These short, recurring patterns — recognized by transcription factors, RNA polymerases, and other molecular machinery — carry biological meaning that uniform token masking does not capture. MoDNA addresses this gap by making motif prediction an explicit pre-training task alongside standard token reconstruction.
To achieve computationally efficient training, MoDNA adopts the ELECTRA framework rather than the more widely used BERT paradigm. This marks one of the first applications of the ELECTRA architecture to genomic sequence modeling, yielding a generator-discriminator system that supervises all input tokens during pre-training rather than only the masked subset, substantially improving sample efficiency relative to masked language models of equivalent size.
MoDNA uses a two-component transformer architecture inspired by ELECTRA. The generator is a smaller transformer that takes 6-mer tokenized DNA sequences (maximum length 512) with a 15% masking rate and produces candidate replacement tokens. The discriminator is a standard BERT-scale transformer that receives the resulting sequences and is trained on two tasks simultaneously: identifying which tokens were replaced by the generator (a binary classification at every position) and predicting whether a functional motif is present at each position, using motif occurrence labels derived from established databases. This joint objective forces the discriminator to develop representations that are sensitive to both sequence authenticity and regulatory function. Pre-training uses the human reference genome as the primary data source. The ELECTRA training strategy provides full-sequence supervision — every token contributes to the loss — rather than the ~15% of tokens supervised in masked language modeling, making MoDNA more parameter-efficient for a given compute budget.
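The data-preparation side of this scheme can be sketched in plain Python. This is an illustrative reconstruction, not the authors' code: the 6-mer tokenizer and 15% masking follow the description above, while the generator's replacement sample, the motif list (a TATA-box-like pattern), and all function names are assumptions for demonstration.

```python
import random

def kmer_tokenize(seq, k=6):
    """Overlapping k-mer tokens, as in DNABERT-style 6-mer tokenization."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def mask_tokens(tokens, rate=0.15, rng=None):
    """Replace ~rate of tokens with [MASK]; return masked copy + positions."""
    rng = rng or random.Random(0)
    positions = [i for i in range(len(tokens)) if rng.random() < rate]
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions

def replaced_labels(original, corrupted):
    """ELECTRA discriminator target: 1 where the token differs from the
    original, else 0 -- defined at EVERY position, not just masked ones."""
    return [int(o != c) for o, c in zip(original, corrupted)]

def motif_labels(tokens, motifs=("TATAAA",)):
    """Per-position motif-occurrence target: 1 if the 6-mer matches a
    known motif (illustrative list; MoDNA derives these from databases)."""
    return [int(t in motifs) for t in tokens]

seq = "GGCTATAAAGGC"
tokens = kmer_tokenize(seq)                # 7 overlapping 6-mers
masked, positions = mask_tokens(tokens, rate=0.3)
# Stand-in for the generator: fill each [MASK] with some plausible 6-mer.
corrupted = [t if t != "[MASK]" else "ACGTAC" for t in masked]
rtd = replaced_labels(tokens, corrupted)   # supervision at every position
motif = motif_labels(tokens)               # marks the TATAAA-containing token
```

The point of the sketch is the shape of the supervision: `rtd` and `motif` are both full-length label vectors, so the discriminator receives a loss signal at all positions, which is the sample-efficiency argument made above.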
MoDNA is designed for regulatory genomics tasks that require understanding of DNA sequence function. Primary demonstrated applications include promoter region prediction and transcription factor binding site identification across diverse ChIP-seq datasets. The model is suitable for any task where researchers need to determine whether a genomic region of interest is functionally active — for example, prioritizing candidate regulatory elements in GWAS loci, annotating non-coding variants, or studying cis-regulatory grammar in gene expression experiments. The pre-trained model can be fine-tuned with relatively small labeled datasets, making it practical for research groups without access to large-scale labeled genomic annotation data.
MoDNA demonstrated that incorporating domain-specific biological knowledge into genomic pre-training objectives meaningfully improves performance on regulatory prediction tasks compared to treating DNA as plain text. Its adoption of the ELECTRA framework for genomics was an early proof of concept that alternatives to masked language modeling could offer better computational efficiency for DNA models — a theme that has since recurred in subsequent work. The model has been cited in benchmarking studies of DNA language models and in methodological comparisons of genomic pre-training strategies. Limitations include the restriction to sequences of 512 nucleotides (preventing direct modeling of long-range genomic interactions), exclusive pre-training on the human genome (limiting cross-species generalizability without additional fine-tuning), and the use of fixed 6-mer tokenization rather than adaptive or single-nucleotide tokenization approaches explored in more recent models.
An, W., et al. (2022) MoDNA: motif-oriented pre-training for DNA language model. 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (BCB '22).
DOI: 10.1145/3535508.3545512
An, W., et al. (2024) Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA. BioMedInformatics.
DOI: 10.3390/biomedinformatics4020085