Technical University of Munich
DNA language model trained on over 800 species spanning 500 million years of evolution, capturing conserved regulatory elements and their evolution beyond the reach of sequence alignment.
The Species-Aware DNA Language Model (referred to as DNA-LM in the literature) is a family of masked language models for DNA sequences, developed by Dennis Gankin, Alexander Karollus, Martin Grosshauser, Kristian Klemon, Johannes Hingerl, and Julien Gagneur at the Technical University of Munich. Initially posted to bioRxiv in January 2023 and published in Genome Biology in 2024, the model introduces a key innovation: explicit conditioning on species identity during training, enabling the model to learn regulatory sequence features that are conserved across hundreds of millions of years of evolution while accounting for the inevitable drift in non-functional sequence.
The central challenge addressed by species-aware DNA language modeling is evolutionary context. Standard DNA language models trained on a single reference genome (such as the human genome) learn to predict masked nucleotides from local sequence context but cannot distinguish conserved functional sequences from neutrally evolving non-functional sequences. By training simultaneously on over 800 vertebrate species — spanning more than 500 million years of divergence — and conditioning on species identity via learned species embeddings, the model can leverage evolutionary conservation as a powerful implicit signal for functional importance. Crucially, this approach captures regulatory elements across evolutionary distances that exceed what traditional pairwise sequence alignment can achieve, opening a new window on deeply conserved but diverged regulatory elements.
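To make the conditioning mechanism concrete, here is a minimal PyTorch sketch of a species-aware input layer in which a learned species embedding is added to the token and position embeddings at every position. Class and parameter names are illustrative and not taken from the authors' code:

```python
import torch
import torch.nn as nn


class SpeciesAwareInput(nn.Module):
    """Sum of token, position, and species embeddings: a BERT-style input
    layer with an extra species channel (illustrative sketch)."""

    def __init__(self, vocab_size: int, n_species: int, max_len: int,
                 d_model: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.species_emb = nn.Embedding(n_species, d_model)

    def forward(self, token_ids: torch.Tensor,
                species_id: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) indices of k-mer tokens
        # species_id: (batch,) index of each sequence's species of origin
        seq_len = token_ids.shape[1]
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        # Broadcast the species embedding across all positions, analogous
        # to BERT's segment embeddings.
        return x + self.species_emb(species_id).unsqueeze(1)
```

The rest of the model is a standard transformer encoder over these summed embeddings; at inference time, swapping the species id changes the model's masked-nucleotide predictions for the same input sequence.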
The Gagneur lab at TUM is known for combining deep learning with statistical genomics, and the species-aware DNA language model fits within a broader research program connecting sequence models to gene expression prediction and regulatory variant interpretation. The model's training corpus of hundreds of annotated genomes spanning vertebrate diversity represented one of the broadest taxonomic ranges used for DNA language model pre-training at the time of publication, and the systematic evaluation of species-conditioned versus species-agnostic training provides valuable empirical guidance for the field of genomic foundation models.
The species-aware DNA language model uses a BERT-style masked language model architecture, with nucleotide k-mers as tokens and standard transformer encoder blocks with multi-head self-attention. Species identity is incorporated via a learned species embedding that is added to the input representation at each position, analogous to the segment embeddings in BERT. Training was conducted on genomic sequences from over 800 vertebrate species spanning mammals, birds, reptiles, amphibians, and fish, assembled from public genome repositories including Ensembl and UCSC. For each genomic region used in training, orthologous sequences across species were included, with the species embedding identifying the genome of origin. The masked language modeling objective randomly masks 15% of nucleotide tokens and trains the model to reconstruct them from the surrounding context; a code sketch of this masking step appears below.

Multiple model variants were trained to isolate the effect of species awareness: species-aware models (with species embeddings), species-agnostic models (a single model trained on multi-species data without conditioning), and single-species models trained on the human genome alone. In benchmarks on regulatory element prediction, species-aware models consistently outperformed both the species-agnostic and single-species baselines. On regulatory activity prediction from massively parallel reporter assay (MPRA) data, species-aware representations improved Pearson correlation by several percentage points over human-only baselines. On motif discovery, species-aware models produced position weight matrices that more closely matched known JASPAR binding site profiles. The version published in Genome Biology in 2024 provides additional validation and refined benchmarks not available in the preprint.
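The following is a minimal sketch of BERT-style masking at a 15% rate, assuming a PyTorch implementation; the function name and the [MASK] token id are illustrative, and the 80/10/10 mask/random/keep refinement used by BERT is omitted for brevity:

```python
import torch

MASK_ID = 4  # hypothetical index of the [MASK] token in the k-mer vocabulary


def mask_tokens(token_ids: torch.Tensor, mask_prob: float = 0.15,
                mask_id: int = MASK_ID):
    """Select ~15% of positions as masked prediction targets."""
    labels = token_ids.clone()
    # Sample the positions to be masked.
    masked = torch.rand(token_ids.shape) < mask_prob
    # Compute the loss only at masked positions; -100 is the default
    # ignore_index of torch.nn.CrossEntropyLoss.
    labels[~masked] = -100
    inputs = token_ids.clone()
    inputs[masked] = mask_id
    return inputs, labels
```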
The species-aware DNA language model has direct applications in regulatory genomics, evolutionary biology, and noncoding variant interpretation. For regulatory element discovery, the model's evolution-aware representations identify functional regulatory sequences that lack strong conservation in pairwise alignments, a class of elements that has historically been invisible to alignment-based regulatory annotation pipelines. Gene expression prediction from sequence is a validated downstream application: models fine-tuned from species-aware representations achieve higher accuracy on MPRA datasets and expression prediction benchmarks than those starting from human-only or species-agnostic pre-training. For noncoding variant interpretation, the model's sensitivity to evolutionarily constrained positions translates into improved scoring of regulatory variants from GWAS and clinical resequencing studies. The model is also valuable for comparative genomics studies of how regulatory sequences have diverged across vertebrate evolution, and for identifying which transcription factor (TF) binding site sequences have been conserved versus rewired during lineage-specific regulatory evolution.
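As a concrete example of variant scoring with a masked language model, the sketch below computes a reference-versus-alternative log-likelihood ratio at a masked position. This is a common scoring recipe for masked DNA language models rather than a scheme documented in the paper, and the model interface (per-position logits) is an assumption:

```python
import torch


def variant_score(model, token_ids, species_id, pos, ref_id, alt_id,
                  mask_id=4):
    """Log-likelihood ratio of reference vs. alternative allele at `pos`.

    Assumes `model(token_ids, species_id)` returns per-position logits of
    shape (batch, seq_len, vocab) and that token_ids has shape (1, seq_len).
    """
    masked = token_ids.clone()
    masked[:, pos] = mask_id  # hide the variant position from the model
    with torch.no_grad():
        logits = model(masked, species_id)
        log_probs = torch.log_softmax(logits[:, pos], dim=-1)
    # Large positive scores mean the model strongly prefers the reference
    # allele, i.e. the position looks evolutionarily constrained.
    return (log_probs[:, ref_id] - log_probs[:, alt_id]).item()
```

Positions under strong evolutionary constraint yield confident reference-allele predictions, so variants that disrupt them receive large positive scores.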
The species-aware DNA language model contributed an important conceptual advance to the genomic foundation model field: explicit modeling of evolutionary context as a form of multi-species self-supervision yields representations that encode functional conservation signals inaccessible to single-species training. The systematic comparison of species-aware versus species-agnostic models provided clear empirical evidence for the benefit of this approach, influencing subsequent multi-species genomic model designs. Publication in Genome Biology in 2024 provided peer-reviewed validation of the key claims. The work connects to a broader trend in genomic deep learning of using evolutionary information, whether through multiple sequence alignments (as in GPN-MSA), conservation scores (as in phyloP features), or multi-species masked language modeling, as a powerful inductive bias for learning biologically meaningful sequence representations. Limitations include the focus on vertebrate genomes (with no coverage of plant or fungal regulatory sequences, which may operate under different evolutionary constraints) and the computational cost of training on over 800 species, which may limit accessibility for groups without large-scale computing infrastructure.