Technical University of Munich
Masked DNA language model trained on 800+ species spanning 500M years of evolution, using explicit species conditioning to capture conserved regulatory elements.
The Species-Aware DNA Language Model is a masked language model (MLM) for genomic sequences that explicitly incorporates species identity as a conditioning signal during training and inference. Developed by Dennis Gankin, Alexander Karollus, and colleagues at the Technical University of Munich under Julien Gagneur, the model addresses a fundamental challenge in regulatory genomics: how to leverage conservation across distantly related genomes while still accounting for species-specific sequence variation and oligomer composition biases. The preprint appeared in January 2023, with the peer-reviewed version published in Genome Biology in 2024.
Standard DNA language models treat all input sequences as coming from an undifferentiated genomic space. This creates a problem when training on multi-species data: the model conflates true regulatory conservation with phylogenetic background composition differences, weakening its ability to identify functionally meaningful motifs. By conditioning on a discrete species embedding at training time, the species-aware approach allows the model to separate what is conserved because it is functional from what merely reflects the base composition of a given clade.
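The effect of conditioning can be illustrated with a zeroth-order toy model (the clade names, base frequencies, and sequence below are invented purely for illustration, not taken from the paper): a neutral GC-rich sequence looks improbable under a pooled, species-agnostic background, but is fully explained once the model knows which clade it came from.

```python
import math

# Invented background base frequencies for two hypothetical clades.
freqs = {
    "gc_rich_clade": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "at_rich_clade": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}

# A species-agnostic model effectively learns the pooled composition.
pooled = {b: sum(f[b] for f in freqs.values()) / len(freqs) for b in "ACGT"}

def surprisal(seq, base_freqs):
    """Average per-base surprisal (bits) under a 0th-order background model."""
    return -sum(math.log2(base_freqs[b]) for b in seq) / len(seq)

neutral_gc_seq = "GCGGCCGCGG"  # neutral sequence typical of the GC-rich clade
print(surprisal(neutral_gc_seq, pooled))                  # 2.0 bits/base: looks "surprising"
print(surprisal(neutral_gc_seq, freqs["gc_rich_clade"]))  # ~1.51 bits/base: explained by composition
```

A sequence that is merely typical of its clade's composition stops looking like signal once the background is species-conditioned, leaving genuinely constrained positions to stand out.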
The result is a model that can reconstruct and represent non-coding regulatory sequences across more than 800 species spanning over 500 million years of evolutionary divergence — a timescale far exceeding the range over which conventional multiple sequence alignment remains informative. This breadth gives the model access to a rich signal of purifying selection that pure sequence-based methods cannot exploit at such evolutionary distances.
The model is implemented as a masked language model operating on DNA sequence tokens, analogous to BERT-style pretraining but applied to nucleotide vocabularies. Two main variants are benchmarked: a species-agnostic baseline (motif_s4_resnet) and the proposed species-aware architecture (species_dss), which injects a learned species embedding into the sequence representation. Training uses a standard masked nucleotide prediction objective on sequences drawn from more than 800 genomes, with data and pretrained checkpoints deposited on Zenodo (record 7569953) to support reproducibility.
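The source does not specify how species_dss injects the species embedding into the sequence representation; one common mechanism, sketched below in plain Python with made-up dimensions and placeholder species IDs, is to add a learned per-species vector to every token embedding before the encoder.

```python
import random

random.seed(0)
EMB_DIM = 8                            # illustrative embedding width
VOCAB = ["A", "C", "G", "T", "[MASK]"]
SPECIES = ["s_cerevisiae", "s_pombe"]  # placeholder species IDs

def init_table(keys):
    # In a real model these tables are trained; here they are random stand-ins.
    return {k: [random.gauss(0.0, 0.1) for _ in range(EMB_DIM)] for k in keys}

token_emb = init_table(VOCAB)
species_emb = init_table(SPECIES)      # one learned vector per species

def embed(tokens, species):
    """Add the species vector to each token embedding so every downstream
    layer can condition its masked-base predictions on species identity."""
    s = species_emb[species]
    return [[t + u for t, u in zip(token_emb[tok], s)] for tok in tokens]

x = embed(["A", "[MASK]", "G"], "s_pombe")
print(len(x), len(x[0]))  # 3 positions, 8 features each
```

The same masked sequence thus produces different inputs for different species, which is what lets the encoder learn species-specific backgrounds alongside shared motifs.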
Downstream evaluations are conducted primarily on Saccharomyces cerevisiae and Schizosaccharomyces pombe, with motif-level analyses also covering Neurospora crassa. For mRNA half-life prediction, embeddings from 3' UTR windows are used as input features. Reconstruction probability — the model's confidence in predicting each masked nucleotide — is used directly as a motif scoring function without any fine-tuning, demonstrating that the pretrained representations encode biologically meaningful information. Quantitative comparisons show that species-aware training yields consistently higher performance than species-agnostic models and k-mer frequency baselines across gene expression and motif tasks.
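Scoring a motif this way requires no fine-tuning: mask each position in turn, query the pretrained model for its reconstruction probability, and sum the log-probabilities of the true bases. The sketch below wires that logic to an invented stand-in predictor (the sequence, the placement of a Puf3-like element, and all probability values are hypothetical); a real run would call the pretrained checkpoint instead.

```python
import math

def motif_score(seq, start, end, predict_proba):
    """Sum of log-probabilities of the true bases when each position in
    seq[start:end] is masked in turn. High scores mean the model can
    reconstruct the bases from context, as expected for conserved motifs."""
    total = 0.0
    for i in range(start, end):
        masked = seq[:i] + "N" + seq[i + 1:]  # mask a single position
        probs = predict_proba(masked, i)      # model's P(base | context, species)
        total += math.log(probs[seq[i]])
    return total

# Toy sequence with a Puf3-like element (TGTAAATA) embedded at positions 12-19.
MOTIF, MOTIF_START = "TGTAAATA", 12
SEQ = "ACGT" * 3 + MOTIF + "ACGT" * 3

def toy_predict_proba(masked_seq, i):
    """Invented stand-in for a pretrained masked LM: confident inside the
    motif, uniform over background positions."""
    if MOTIF_START <= i < MOTIF_START + len(MOTIF):
        probs = {b: 0.02 for b in "ACGT"}
        probs[MOTIF[i - MOTIF_START]] = 0.94
        return probs
    return {b: 0.25 for b in "ACGT"}

in_motif = motif_score(SEQ, MOTIF_START, MOTIF_START + len(MOTIF), toy_predict_proba)
background = motif_score(SEQ, 0, 8, toy_predict_proba)
print(in_motif > background)  # True: motif positions are far more predictable
```

Ranking candidate windows by this score is enough to surface constrained elements directly from the pretrained model's outputs.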
The model is particularly relevant for researchers working in regulatory genomics who want to identify functional non-coding elements — promoters, UTR regulatory sequences, transcription factor binding sites, and RBP interaction motifs — without relying on curated alignment databases. Computational biologists studying gene regulation across non-model organisms benefit from the model's ability to generalize across large evolutionary distances. It can also serve as a pretrained backbone for fine-tuning on specific prediction tasks, such as predicting mRNA stability, splicing efficiency, or enhancer activity, wherever multi-species conservation is an informative feature.
Published in Genome Biology, this work represents a methodological advance in how multi-species genomic data can be incorporated into foundation models for regulatory sequence analysis. By making species identity an explicit, learnable input rather than an implicit confound, the approach offers a principled solution to a problem that affects any DNA model trained on taxonomically diverse sequence collections. The publicly available pretrained checkpoints and training code lower the barrier for groups without large compute resources to apply multi-species pretraining to their own genomic questions. A current limitation is that the evaluations focus primarily on fungal organisms and yeast UTR biology; broader validation across mammalian regulatory elements and diverse genome sizes remains open for future work.
Karollus, A., et al. (2024). Species-aware DNA language models capture regulatory elements and their evolution. Genome Biology. DOI: 10.1186/s13059-024-03221-x