A 7-billion-parameter encoder-only DNA foundation model trained on 10.6 billion nucleotides from 796 species for functional genomics and synthetic biology.
AIDO.DNA is a 7-billion-parameter DNA foundation model developed by GenBio AI as part of the AI-Driven Digital Organism (AIDO) platform — a suite of multiscale foundation models designed to simulate and program biology across molecular, cellular, and organismal levels. Released in December 2024, AIDO.DNA addresses a long-standing challenge in genomics: building a single model that achieves accurate, general-purpose representations of DNA sequences across the full diversity of biological function. Prior DNA language models have demonstrated that sequence-level pretraining can transfer to tasks such as promoter prediction, chromatin accessibility modeling, and variant effect scoring, but model scale has been limited relative to what is now routine in protein and text modeling. AIDO.DNA scales the encoder-only transformer architecture to 7 billion parameters — substantially larger than predecessors such as DNABERT-2 or the Nucleotide Transformer — and demonstrates that this increase in scale drives broad performance improvements even without expanding the training dataset.
The model takes a deliberately focused approach to architecture and training data. Unlike contemporaneous genomic models trained on many trillions of base pairs across the full tree of life (such as Evo 2), AIDO.DNA is trained on a carefully curated corpus of 10.6 billion nucleotides drawn from 796 species, covering prokaryotic and eukaryotic organisms with an emphasis on functional sequence diversity. The training corpus was assembled to maximize diversity across lineages, ensuring the model encounters a broad range of regulatory grammar, gene structures, and intergenic sequence contexts. The model uses a single-nucleotide tokenization strategy and a bidirectional masked language modeling (MLM) objective, allowing it to attend to context from both directions — an important property for understanding regulatory sequences where elements upstream and downstream of a functional element both contribute to its activity.
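To make the objective concrete, the sketch below shows masked language modeling on single-nucleotide token ids: a fraction of positions is hidden and the model must recover them from both flanking contexts. The mask token id and the 15% masking rate follow the description later in this article, but the exact corruption scheme (for example, whether some masked positions are replaced with random bases) is an assumption.

```python
# Minimal sketch of the bidirectional MLM objective on nucleotide token ids.
# mask_token_id=4 and rate=0.15 follow this article's description; the exact
# corruption recipe used for AIDO.DNA is assumed, not confirmed.
import torch

def mask_for_mlm(token_ids: torch.Tensor, mask_token_id: int = 4, rate: float = 0.15):
    """Return corrupted inputs and labels; only masked positions contribute to the loss."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < rate
    labels[~masked] = -100                  # ignored by PyTorch cross-entropy
    inputs = token_ids.clone()
    inputs[masked] = mask_token_id          # the model recovers these from both flanks
    return inputs, labels

ids = torch.randint(0, 4, (1, 4000))        # toy batch of A/T/G/C token ids
inputs, labels = mask_for_mlm(ids)
```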
A central finding of the AIDO.DNA paper is that scaling model depth while maintaining a fixed, short context length of 4,000 nucleotides is sufficient to produce substantial downstream task improvements. This challenges the dominant assumption in genomic sequence modeling that longer context windows are necessary for strong performance on functional genomics benchmarks. The authors argue that inaccurate modeling of local sequence statistics — rather than insufficient global context — is the primary bottleneck limiting prior DNA language models, and that compute-optimal scaling for DNA models may follow different principles than those established for text or protein models. Models and fine-tuning code are made fully available through the AIDO.ModelGenerator framework on GitHub and Hugging Face.
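As a rough illustration of how the released checkpoints might be used as a sequence encoder, the snippet below loads a checkpoint through the standard Hugging Face interface. The repository id, the need for trust_remote_code, and the pooling choice are assumptions; the AIDO.ModelGenerator documentation is the authoritative reference for the supported loading and fine-tuning paths.

```python
# Hedged sketch: embedding a DNA sequence with a pretrained AIDO.DNA checkpoint.
# The repo id "genbio-ai/AIDO.DNA-7B" and trust_remote_code=True are assumptions;
# consult the GenBio AI Hugging Face organization for the exact names.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "genbio-ai/AIDO.DNA-7B"   # assumed id; a smaller 300M variant is also released
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

sequence = "ACGTACGTTAGCCGAT"        # toy input; real inputs can span up to 4,000 nt
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = hidden.mean(dim=1)       # mean-pooled representation for downstream tasks
print(embedding.shape)
```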
AIDO.DNA uses an encoder-only transformer architecture in the BERT family, pretrained with a masked language modeling objective on 10.6 billion nucleotides. The model employs single-nucleotide tokenization with a vocabulary of five tokens (A, T, G, C, and a mask token), and operates with a maximum context length of 4,000 nucleotides per forward pass. At 7 billion parameters, it is substantially larger than predecessor models: DNABERT-2 operates at 117 million parameters, and the largest Nucleotide Transformer variant has 2.5 billion parameters. Despite the order-of-magnitude difference in parameter count, the AIDO.DNA training corpus at 10.6 billion nucleotides is deliberately compact compared to models trained on terabase-scale databases, reflecting the authors' hypothesis that model capacity rather than data volume is the primary bottleneck at current scales.
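For concreteness, a toy version of this tokenization scheme is sketched below: each base maps to its own token, and long regions are split into non-overlapping 4,000-nucleotide windows. The specific token ids and the handling of ambiguous bases are illustrative assumptions, not the released tokenizer.

```python
# Illustrative single-nucleotide tokenizer with a five-token vocabulary
# (A, T, G, C, mask) and non-overlapping 4,000-nt context windows.
# Token ids are arbitrary assumptions; the released tokenizer is authoritative.

VOCAB = {"A": 0, "T": 1, "G": 2, "C": 3, "[MASK]": 4}
CONTEXT = 4000  # maximum nucleotides per forward pass

def encode(seq: str) -> list[int]:
    """One token per base; ambiguous bases are dropped here for simplicity."""
    return [VOCAB[b] for b in seq.upper() if b in "ATGC"]

def windows(ids: list[int], size: int = CONTEXT) -> list[list[int]]:
    """Split a long genomic region into non-overlapping context windows."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

chunks = windows(encode("ATGC" * 3000))   # 12,000 nt -> three 4,000-token windows
print([len(c) for c in chunks])           # [4000, 4000, 4000]
```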
The training corpus consists of complete assembled genomes from 796 species selected for phylogenetic diversity, spanning bacteria, archaea, fungi, plants, and animals. Tokenization operates at single-nucleotide resolution without overlapping windows, and the MLM masking rate follows the standard 15% protocol from BERT. The model architecture incorporates rotary position embeddings and SwiGLU activation functions in the feed-forward layers, aligning with best practices from current large language model development.
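As a point of reference for the SwiGLU detail, the block below sketches a standard SwiGLU feed-forward layer of the kind used in recent large language models. The hidden dimensions are placeholders; AIDO.DNA's actual layer widths are not stated in this article.

```python
# Sketch of a SwiGLU feed-forward block, the activation style noted above.
# Dimensions are placeholder assumptions, not AIDO.DNA's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # gating branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)      # value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)    # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: swish-gated value, then projected back to the model dimension
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

block = SwiGLUFeedForward(d_model=1024, d_ff=2688)            # placeholder sizes
out = block(torch.randn(2, 4000, 1024))                       # (batch, length, d_model)
```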
On the Genome Understanding Evaluation (GUE) benchmark, a standard suite for assessing DNA language model performance, AIDO.DNA-7B achieves state-of-the-art results across the majority of tasks, outperforming prior DNA language models including DNABERT-2, HyenaDNA, and the Nucleotide Transformer family. The paper reports improvements on core promoter prediction, transcription factor binding site classification, splice site detection, and epigenomic peak calling tasks. Critically, AIDO.DNA outperforms prior models trained on substantially larger genomic corpora, reinforcing the argument that architectural scale rather than data scale is the current limiting factor for DNA language model performance. The model is also evaluated on sequence generation tasks using fine-tuned variants, demonstrating that the encoder backbone can be adapted for directed mutagenesis and sequence optimization workflows.
A key methodological contribution is the empirical characterization of scaling behavior specific to DNA models. The authors document that performance on functional genomics tasks increases monotonically with parameter count across the model sizes tested, but that the scaling exponents differ from those observed in text language models — suggesting that compute-optimal training recipes developed for NLP may not transfer directly to genomics. This positions AIDO.DNA not only as a practical tool but as a contribution to the theory of biological sequence modeling at scale.
AIDO.DNA is designed for the broad community of functional genomics researchers who need to annotate, predict, or design DNA sequences across regulatory, coding, and non-coding contexts. Computational biologists can use the pretrained model as a sequence encoder for supervised tasks such as promoter classification, splice site detection, chromatin state prediction, and transcription factor binding site profiling — tasks that previously required task-specific architectures or smaller language models. Synthetic biology researchers benefit from the generative and directed mutagenesis capabilities, enabling in silico exploration of sequence space for promoter engineering, regulatory element optimization, and the design of novel synthetic gene circuits. Variant effect prediction workflows can leverage the model's zero-shot scoring capability, applying masked language model log-likelihood differences to prioritize non-coding variants in human genetics studies without requiring labeled training data specific to the locus of interest. The AIDO.ModelGenerator framework makes these applications accessible through a standardized fine-tuning interface, allowing wet-lab researchers and smaller computational groups to adapt AIDO.DNA without requiring custom deep learning infrastructure.
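A minimal sketch of the zero-shot scoring idea mentioned above is given below: the variant position is masked, and the log-probabilities assigned to the reference and alternate alleles are compared. The tokenizer and model are assumed to expose a standard Hugging Face masked-LM interface (mask_token, mask_token_id, logits); exact AIDO.DNA usage may differ, and the helper name variant_llr is hypothetical.

```python
# Hedged sketch of zero-shot variant effect scoring with a masked LM.
# Assumes a Hugging Face-style tokenizer/model; function and variable names
# are illustrative, not part of the AIDO.DNA API.
import torch

def variant_llr(model, tokenizer, sequence: str, pos: int, ref: str, alt: str) -> float:
    """Return log P(alt) - log P(ref) at a masked position; more negative values
    suggest the alternate allele fits its sequence context worse than the reference."""
    assert sequence[pos] == ref, "reference allele should match the input sequence"
    masked = sequence[:pos] + tokenizer.mask_token + sequence[pos + 1:]
    inputs = tokenizer(masked, return_tensors="pt")
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()
```

In practice, scores of this form are computed across many candidate variants and used to rank them for follow-up analysis, without any locus-specific supervised training.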
AIDO.DNA represents a significant scaling milestone for encoder-only DNA foundation models and establishes an important empirical result: that architectural scale can overcome data scale limitations in DNA sequence modeling, at least in the current regime. The demonstration that a 7-billion-parameter model pretrained on 10.6 billion nucleotides outperforms models trained on far larger corpora challenges prevailing assumptions about what drives performance in genomic foundation models and has implications for how the community should allocate compute resources in future model development. As a component of the broader AIDO multiscale platform — which also includes foundation models for RNA (AIDO.RNA), proteins (AIDO.Protein), and single cells (AIDO.Cell) — AIDO.DNA is positioned as a modular building block for the longer-term goal of simulating biology at multiple scales within a coherent computational framework. The open release of model weights in multiple sizes (300M and 7B), coupled with the AIDO.ModelGenerator fine-tuning infrastructure, provides the research community with an immediately practical tool. Limitations to note include the 4,000-nucleotide context window, which precludes modeling long-range regulatory interactions that can span tens of kilobases, and the training corpus size, which remains modest compared to Evo 2's 9.3 trillion base pair dataset — a trade-off that may limit performance on tasks requiring deep cross-species evolutionary context.