Technical University of Munich
Masked DNA language model trained on 800+ species spanning 500M years of evolution, using explicit species conditioning to capture conserved regulatory elements.
The Species-Aware DNA Language Model is a masked language model (MLM) for genomic sequences that explicitly incorporates species identity as a conditioning signal during training and inference. Developed by Dennis Gankin, Alexander Karollus, and colleagues at the Technical University of Munich under Julien Gagneur, the model addresses a fundamental challenge in regulatory genomics: how to leverage conservation across distantly related genomes while still accounting for species-specific sequence variation and oligomer composition biases. The preprint appeared in January 2023, with the peer-reviewed version published in Genome Biology in 2024.
Standard DNA language models treat all input sequences as coming from an undifferentiated genomic space. This creates a problem when training on multi-species data: the model conflates true regulatory conservation with phylogenetic background composition differences, weakening its ability to identify functionally meaningful motifs. By conditioning on a discrete species embedding at training time, the species-aware approach allows the model to separate what is conserved because it is functional from what merely reflects the base composition of a given clade.
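The effect of conditioning can be illustrated with a zeroth-order toy model (the clade names, base frequencies, and sequence below are invented purely for illustration, not taken from the paper): a neutral GC-rich sequence looks improbable under a pooled, species-agnostic background, but is fully explained once the model knows which clade it came from.

```python
import math

# Invented background base frequencies for two hypothetical clades.
freqs = {
    "gc_rich_clade": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "at_rich_clade": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}

# A species-agnostic model effectively learns the pooled composition.
pooled = {b: sum(f[b] for f in freqs.values()) / len(freqs) for b in "ACGT"}

def surprisal(seq, base_freqs):
    """Average per-base surprisal (bits) under a 0th-order background model."""
    return -sum(math.log2(base_freqs[b]) for b in seq) / len(seq)

neutral_gc_seq = "GCGGCCGCGG"  # neutral sequence typical of the GC-rich clade
print(surprisal(neutral_gc_seq, pooled))                  # 2.0 bits/base: looks "surprising"
print(surprisal(neutral_gc_seq, freqs["gc_rich_clade"]))  # ~1.51 bits/base: explained by composition
```

A sequence that is merely typical of its clade's composition stops looking like signal once the background is species-conditioned, leaving genuinely constrained positions to stand out.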
The result is a model that can reconstruct and represent non-coding regulatory sequences across more than 800 species spanning over 500 million years of evolutionary divergence — a timescale far exceeding the range over which conventional multiple sequence alignment remains informative. This breadth gives the model access to a rich signal of purifying selection that pure sequence-based methods cannot exploit at such evolutionary distances.
The model is implemented as a masked language model operating on DNA sequence tokens, analogous to BERT-style pretraining but applied to nucleotide vocabularies. Two main variants are benchmarked: a species-agnostic baseline (motif_s4_resnet) and the proposed species-aware architecture (species_dss), which injects a learned species embedding into the sequence representation. Training uses a standard masked nucleotide prediction objective on sequences drawn from more than 800 genomes, with data and pretrained checkpoints deposited on Zenodo (record 7569953) to support reproducibility.
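The source does not specify how species_dss injects the species embedding into the sequence representation; one common mechanism, sketched below in plain Python with made-up dimensions and placeholder species IDs, is to add a learned per-species vector to every token embedding before the encoder.

```python
import random

random.seed(0)
EMB_DIM = 8                            # illustrative embedding width
VOCAB = ["A", "C", "G", "T", "[MASK]"]
SPECIES = ["s_cerevisiae", "s_pombe"]  # placeholder species IDs

def init_table(keys):
    # In a real model these tables are trained; here they are random stand-ins.
    return {k: [random.gauss(0.0, 0.1) for _ in range(EMB_DIM)] for k in keys}

token_emb = init_table(VOCAB)
species_emb = init_table(SPECIES)      # one learned vector per species

def embed(tokens, species):
    """Add the species vector to each token embedding so every downstream
    layer can condition its masked-base predictions on species identity."""
    s = species_emb[species]
    return [[t + u for t, u in zip(token_emb[tok], s)] for tok in tokens]

x = embed(["A", "[MASK]", "G"], "s_pombe")
print(len(x), len(x[0]))  # 3 positions, 8 features each
```

The same masked sequence thus produces different inputs for different species, which is what lets the encoder learn species-specific backgrounds alongside shared motifs.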
Downstream evaluations are conducted primarily on Saccharomyces cerevisiae and Schizosaccharomyces pombe, with motif-level analyses also covering Neurospora crassa. For mRNA half-life prediction, embeddings from 3' UTR windows are used as input features. Reconstruction probability — the model's confidence in predicting each masked nucleotide — is used directly as a motif scoring function without any fine-tuning, demonstrating that the pretrained representations encode biologically meaningful information. Quantitative comparisons show that species-aware training yields consistently higher performance than species-agnostic models and k-mer frequency baselines across gene expression and motif tasks.
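Scoring a motif this way requires no fine-tuning: mask each position in turn, query the pretrained model for its reconstruction probability, and sum the log-probabilities of the true bases. The sketch below wires that logic to an invented stand-in predictor (the sequence, the placement of a Puf3-like element, and all probability values are hypothetical); a real run would call the pretrained checkpoint instead.

```python
import math

def motif_score(seq, start, end, predict_proba):
    """Sum of log-probabilities of the true bases when each position in
    seq[start:end] is masked in turn. High scores mean the model can
    reconstruct the bases from context, as expected for conserved motifs."""
    total = 0.0
    for i in range(start, end):
        masked = seq[:i] + "N" + seq[i + 1:]  # mask a single position
        probs = predict_proba(masked, i)      # model's P(base | context, species)
        total += math.log(probs[seq[i]])
    return total

# Toy sequence with a Puf3-like element (TGTAAATA) embedded at positions 12-19.
MOTIF, MOTIF_START = "TGTAAATA", 12
SEQ = "ACGT" * 3 + MOTIF + "ACGT" * 3

def toy_predict_proba(masked_seq, i):
    """Invented stand-in for a pretrained masked LM: confident inside the
    motif, uniform over background positions."""
    if MOTIF_START <= i < MOTIF_START + len(MOTIF):
        probs = {b: 0.02 for b in "ACGT"}
        probs[MOTIF[i - MOTIF_START]] = 0.94
        return probs
    return {b: 0.25 for b in "ACGT"}

in_motif = motif_score(SEQ, MOTIF_START, MOTIF_START + len(MOTIF), toy_predict_proba)
background = motif_score(SEQ, 0, 8, toy_predict_proba)
print(in_motif > background)  # True: motif positions are far more predictable
```

Ranking candidate windows by this score is enough to surface constrained elements directly from the pretrained model's outputs.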
The model is particularly relevant for researchers working in regulatory genomics who want to identify functional non-coding elements — promoters, UTR regulatory sequences, transcription factor binding sites, and RBP interaction motifs — without relying on curated alignment databases. Computational biologists studying gene regulation across non-model organisms benefit from the model's ability to generalize across large evolutionary distances. It can also serve as a pretrained backbone for fine-tuning on specific prediction tasks, such as predicting mRNA stability, splicing efficiency, or enhancer activity, wherever multi-species conservation is an informative feature.
Published in Genome Biology, this work represents a methodological advance in how multi-species genomic data can be incorporated into foundation models for regulatory sequence analysis. By making species identity an explicit, learnable input rather than an implicit confound, the approach offers a principled solution to a problem that affects any DNA model trained on taxonomically diverse sequence collections. The publicly available pretrained checkpoints and training code lower the barrier for groups without large compute resources to apply multi-species pretraining to their own genomic questions. A current limitation is that the evaluations focus primarily on fungal organisms and yeast UTR biology; broader validation across mammalian regulatory elements and diverse genome sizes remains open for future work.
Karollus, A., et al. (2024). Species-aware DNA language models capture regulatory elements and their evolution. Genome Biology. DOI: 10.1186/s13059-024-03221-x