Technical University of Munich
DNA language model trained on over 800 species spanning 500 million years of evolution, capturing conserved regulatory elements and their evolution beyond the reach of sequence alignment.
The Species-Aware DNA Language Model (referred to as DNA-LM in the literature) is a family of masked language models for DNA sequences, developed by Dennis Gankin, Alexander Karollus, Martin Grosshauser, Kristian Klemon, Johannes Hingerl, and Julien Gagneur at the Technical University of Munich. Initially posted to bioRxiv in January 2023 and published in Genome Biology in 2024, the model introduces a key innovation: explicit conditioning on species identity during training, enabling the model to learn regulatory sequence features that are conserved across hundreds of millions of years of evolution while accounting for the inevitable drift in non-functional sequence.
The central challenge addressed by species-aware DNA language modeling is evolutionary context. Standard DNA language models trained on a single reference genome (such as the human genome) learn to predict masked nucleotides from local sequence context but cannot distinguish conserved functional sequences from neutrally evolving non-functional sequences. By training simultaneously on over 800 vertebrate species — spanning more than 500 million years of divergence — and conditioning on species identity via learned species embeddings, the model can leverage evolutionary conservation as a powerful implicit signal for functional importance. Crucially, this approach captures regulatory elements across evolutionary distances that exceed what traditional pairwise sequence alignment can achieve, opening a new window on deeply conserved but diverged regulatory elements.
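To make the conditioning mechanism concrete, here is a minimal PyTorch sketch of a species-aware input layer in which a learned species embedding is added to the token and position embeddings at every position. Class and parameter names are illustrative and not taken from the authors' code:

```python
import torch
import torch.nn as nn


class SpeciesAwareInput(nn.Module):
    """Sum of token, position, and species embeddings: a BERT-style input
    layer with an extra species channel (illustrative sketch)."""

    def __init__(self, vocab_size: int, n_species: int, max_len: int,
                 d_model: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.species_emb = nn.Embedding(n_species, d_model)

    def forward(self, token_ids: torch.Tensor,
                species_id: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) indices of k-mer tokens
        # species_id: (batch,) index of each sequence's species of origin
        seq_len = token_ids.shape[1]
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        # Broadcast the species embedding across all positions, analogous
        # to BERT's segment embeddings.
        return x + self.species_emb(species_id).unsqueeze(1)
```

The rest of the model is a standard transformer encoder over these summed embeddings; at inference time, swapping the species id changes the model's masked-nucleotide predictions for the same input sequence.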
The Gagneur lab at TUM is known for combining deep learning with statistical genomics, and the species-aware DNA language model fits within a broader research program connecting sequence models to gene expression prediction and regulatory variant interpretation. The model's training corpus of hundreds of annotated genomes spanning vertebrate diversity represented one of the broadest taxonomic ranges used for DNA language model pre-training at the time of publication, and the systematic evaluation of species-conditioned versus species-agnostic training provides valuable empirical guidance for the field of genomic foundation models.
The species-aware DNA language model uses a BERT-style masked language model architecture, with nucleotide k-mers as tokens and standard transformer encoder blocks with multi-head self-attention. Species identity is incorporated via a learned species embedding that is added to the input representation at each position, analogous to the segment embeddings in BERT. Training was conducted on genomic sequences from over 800 vertebrate species spanning mammals, birds, reptiles, amphibians, and fish, assembled from public genome repositories including Ensembl and UCSC. For each genomic region used in training, orthologous sequences across species were included, with the species embedding identifying the genome of origin. The masked language modeling objective randomly masks 15% of nucleotide tokens and trains the model to reconstruct them from the surrounding context; a code sketch of this masking step appears below.

Multiple model variants were trained to isolate the effect of species awareness: species-aware models (with species embeddings), species-agnostic models (a single model trained on multi-species data without conditioning), and single-species models trained on the human genome alone. In benchmarks on regulatory element prediction, species-aware models consistently outperformed both the species-agnostic and single-species baselines. On regulatory activity prediction from massively parallel reporter assay (MPRA) data, species-aware representations improved Pearson correlation by several percentage points over human-only baselines. On motif discovery, species-aware models produced position weight matrices that more closely matched known JASPAR binding site profiles. The version published in Genome Biology in 2024 provides additional validation and refined benchmarks not available in the preprint.
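The following is a minimal sketch of BERT-style masking at a 15% rate, assuming a PyTorch implementation; the function name and the [MASK] token id are illustrative, and the 80/10/10 mask/random/keep refinement used by BERT is omitted for brevity:

```python
import torch

MASK_ID = 4  # hypothetical index of the [MASK] token in the k-mer vocabulary


def mask_tokens(token_ids: torch.Tensor, mask_prob: float = 0.15,
                mask_id: int = MASK_ID):
    """Select ~15% of positions as masked prediction targets."""
    labels = token_ids.clone()
    # Sample the positions to be masked.
    masked = torch.rand(token_ids.shape) < mask_prob
    # Compute the loss only at masked positions; -100 is the default
    # ignore_index of torch.nn.CrossEntropyLoss.
    labels[~masked] = -100
    inputs = token_ids.clone()
    inputs[masked] = mask_id
    return inputs, labels
```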
The species-aware DNA language model has direct applications in regulatory genomics, evolutionary biology, and noncoding variant interpretation. For regulatory element discovery, the model's evolution-aware representations identify functional regulatory sequences that lack strong conservation in pairwise alignments, a class of elements that has historically been invisible to alignment-based regulatory annotation pipelines. Gene expression prediction from sequence is a validated downstream application: models fine-tuned from species-aware representations achieve higher accuracy on MPRA datasets and expression prediction benchmarks than those starting from human-only or species-agnostic pre-training. For noncoding variant interpretation, the model's sensitivity to evolutionarily constrained positions translates into improved scoring of regulatory variants from GWAS and clinical resequencing studies. The model is also valuable for comparative genomics studies of how regulatory sequences have diverged across vertebrate evolution, and for identifying which transcription factor (TF) binding site sequences have been conserved versus rewired during lineage-specific regulatory evolution.
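As a concrete example of variant scoring with a masked language model, the sketch below computes a reference-versus-alternative log-likelihood ratio at a masked position. This is a common scoring recipe for masked DNA language models rather than a scheme documented in the paper, and the model interface (per-position logits) is an assumption:

```python
import torch


def variant_score(model, token_ids, species_id, pos, ref_id, alt_id,
                  mask_id=4):
    """Log-likelihood ratio of reference vs. alternative allele at `pos`.

    Assumes `model(token_ids, species_id)` returns per-position logits of
    shape (batch, seq_len, vocab) and that token_ids has shape (1, seq_len).
    """
    masked = token_ids.clone()
    masked[:, pos] = mask_id  # hide the variant position from the model
    with torch.no_grad():
        logits = model(masked, species_id)
        log_probs = torch.log_softmax(logits[:, pos], dim=-1)
    # Large positive scores mean the model strongly prefers the reference
    # allele, i.e. the position looks evolutionarily constrained.
    return (log_probs[:, ref_id] - log_probs[:, alt_id]).item()
```

Positions under strong evolutionary constraint yield confident reference-allele predictions, so variants that disrupt them receive large positive scores.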
The species-aware DNA language model contributed an important conceptual advance to the genomic foundation model field: explicit modeling of evolutionary context as a form of multi-species self-supervision yields representations that encode functional conservation signals inaccessible to single-species training. The systematic comparison of species-aware versus species-agnostic models provided clear empirical evidence for the benefit of this approach, influencing subsequent multi-species genomic model designs. Publication in Genome Biology in 2024 provided peer-reviewed validation of the key claims. The work connects to a broader trend in genomic deep learning of using evolutionary information, whether through multiple sequence alignments (as in GPN-MSA), conservation scores (as in phyloP features), or multi-species masked language modeling, as a powerful inductive bias for learning biologically meaningful sequence representations. Limitations include the focus on vertebrate genomes (with no coverage of plant or fungal regulatory sequences, which may operate under different evolutionary constraints) and the computational cost of training on over 800 species, which may limit accessibility for groups without large-scale computing infrastructure.