A 7-billion-parameter encoder-only DNA foundation model trained on 10.6 billion nucleotides from 796 species for functional genomics and synthetic biology.
AIDO.DNA is a 7-billion-parameter DNA foundation model developed by GenBio AI as part of the AI-Driven Digital Organism (AIDO) platform — a suite of multiscale foundation models designed to simulate and program biology across molecular, cellular, and organismal levels. Released in December 2024, AIDO.DNA addresses a long-standing challenge in genomics: building a single model that achieves accurate, general-purpose representations of DNA sequences across the full diversity of biological function. Prior DNA language models have demonstrated that sequence-level pretraining can transfer to tasks such as promoter prediction, chromatin accessibility modeling, and variant effect scoring, but model scale has been limited relative to what is now routine in protein and text modeling. AIDO.DNA scales the encoder-only transformer architecture to 7 billion parameters — substantially larger than predecessors such as DNABERT-2 or the Nucleotide Transformer — and demonstrates that this increase in scale drives broad performance improvements even without expanding the training dataset.
The model takes a deliberately focused approach to architecture and training data. Unlike contemporaneous genomic models trained on many trillions of base pairs across the full tree of life (such as Evo 2), AIDO.DNA is trained on a carefully curated corpus of 10.6 billion nucleotides drawn from 796 species, covering prokaryotic and eukaryotic organisms with an emphasis on functional sequence diversity. The training corpus was assembled to maximize diversity across lineages, ensuring the model encounters a broad range of regulatory grammar, gene structures, and intergenic sequence contexts. The model uses a single-nucleotide tokenization strategy and a bidirectional masked language modeling (MLM) objective, allowing it to attend to context from both directions — an important property for understanding regulatory sequences where elements upstream and downstream of a functional element both contribute to its activity.
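To make the objective concrete, the sketch below shows masked language modeling on single-nucleotide token ids: a fraction of positions is hidden and the model must recover them from both flanking contexts. The mask token id and the 15% masking rate follow the description later in this article, but the exact corruption scheme (for example, whether some masked positions are replaced with random bases) is an assumption.

```python
# Minimal sketch of the bidirectional MLM objective on nucleotide token ids.
# mask_token_id=4 and rate=0.15 follow this article's description; the exact
# corruption recipe used for AIDO.DNA is assumed, not confirmed.
import torch

def mask_for_mlm(token_ids: torch.Tensor, mask_token_id: int = 4, rate: float = 0.15):
    """Return corrupted inputs and labels; only masked positions contribute to the loss."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < rate
    labels[~masked] = -100                  # ignored by PyTorch cross-entropy
    inputs = token_ids.clone()
    inputs[masked] = mask_token_id          # the model recovers these from both flanks
    return inputs, labels

ids = torch.randint(0, 4, (1, 4000))        # toy batch of A/T/G/C token ids
inputs, labels = mask_for_mlm(ids)
```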
A central finding of the AIDO.DNA paper is that scaling model depth while maintaining a fixed, short context length of 4,000 nucleotides is sufficient to produce substantial downstream task improvements. This challenges the dominant assumption in genomic sequence modeling that longer context windows are necessary for strong performance on functional genomics benchmarks. The authors argue that inaccurate modeling of local sequence statistics — rather than insufficient global context — is the primary bottleneck limiting prior DNA language models, and that compute-optimal scaling for DNA models may follow different principles than those established for text or protein models. Models and fine-tuning code are made fully available through the AIDO.ModelGenerator framework on GitHub and Hugging Face.
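As a rough illustration of how the released checkpoints might be used as a sequence encoder, the snippet below loads a checkpoint through the standard Hugging Face interface. The repository id, the need for trust_remote_code, and the pooling choice are assumptions; the AIDO.ModelGenerator documentation is the authoritative reference for the supported loading and fine-tuning paths.

```python
# Hedged sketch: embedding a DNA sequence with a pretrained AIDO.DNA checkpoint.
# The repo id "genbio-ai/AIDO.DNA-7B" and trust_remote_code=True are assumptions;
# consult the GenBio AI Hugging Face organization for the exact names.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "genbio-ai/AIDO.DNA-7B"   # assumed id; a smaller 300M variant is also released
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

sequence = "ACGTACGTTAGCCGAT"        # toy input; real inputs can span up to 4,000 nt
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = hidden.mean(dim=1)       # mean-pooled representation for downstream tasks
print(embedding.shape)
```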
AIDO.DNA uses an encoder-only transformer architecture in the BERT family, pretrained with a masked language modeling objective on 10.6 billion nucleotides. The model employs single-nucleotide tokenization with a vocabulary of five tokens (A, T, G, C, and a mask token), and operates with a maximum context length of 4,000 nucleotides per forward pass. At 7 billion parameters, it is substantially larger than predecessor models: DNABERT-2 operates at 117 million parameters, and the largest Nucleotide Transformer variant has 2.5 billion parameters. Despite the order-of-magnitude difference in parameter count, the AIDO.DNA training corpus at 10.6 billion nucleotides is deliberately compact compared to models trained on terabase-scale databases, reflecting the authors' hypothesis that model capacity rather than data volume is the primary bottleneck at current scales.
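For concreteness, a toy version of this tokenization scheme is sketched below: each base maps to its own token, and long regions are split into non-overlapping 4,000-nucleotide windows. The specific token ids and the handling of ambiguous bases are illustrative assumptions, not the released tokenizer.

```python
# Illustrative single-nucleotide tokenizer with a five-token vocabulary
# (A, T, G, C, mask) and non-overlapping 4,000-nt context windows.
# Token ids are arbitrary assumptions; the released tokenizer is authoritative.

VOCAB = {"A": 0, "T": 1, "G": 2, "C": 3, "[MASK]": 4}
CONTEXT = 4000  # maximum nucleotides per forward pass

def encode(seq: str) -> list[int]:
    """One token per base; ambiguous bases are dropped here for simplicity."""
    return [VOCAB[b] for b in seq.upper() if b in "ATGC"]

def windows(ids: list[int], size: int = CONTEXT) -> list[list[int]]:
    """Split a long genomic region into non-overlapping context windows."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

chunks = windows(encode("ATGC" * 3000))   # 12,000 nt -> three 4,000-token windows
print([len(c) for c in chunks])           # [4000, 4000, 4000]
```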
The training corpus consists of complete assembled genomes from 796 species selected for phylogenetic diversity, spanning bacteria, archaea, fungi, plants, and animals. Tokenization operates at single-nucleotide resolution without overlapping windows, and the MLM masking rate follows the standard 15% protocol from BERT. The model architecture incorporates rotary position embeddings and SwiGLU activation functions in the feed-forward layers, aligning with best practices from current large language model development.
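As a point of reference for the SwiGLU detail, the block below sketches a standard SwiGLU feed-forward layer of the kind used in recent large language models. The hidden dimensions are placeholders; AIDO.DNA's actual layer widths are not stated in this article.

```python
# Sketch of a SwiGLU feed-forward block, the activation style noted above.
# Dimensions are placeholder assumptions, not AIDO.DNA's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)   # gating branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)      # value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)    # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: swish-gated value, then projected back to the model dimension
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

block = SwiGLUFeedForward(d_model=1024, d_ff=2688)            # placeholder sizes
out = block(torch.randn(2, 4000, 1024))                       # (batch, length, d_model)
```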
On the Genome Understanding Evaluation (GUE) benchmark, a standard suite for assessing DNA language model performance, AIDO.DNA-7B achieves state-of-the-art results across the majority of tasks, outperforming prior DNA language models including DNABERT-2, HyenaDNA, and the Nucleotide Transformer family. The paper reports improvements on core promoter prediction, transcription factor binding site classification, splice site detection, and epigenomic peak calling tasks. Critically, AIDO.DNA outperforms prior models trained on substantially larger genomic corpora, reinforcing the argument that architectural scale rather than data scale is the current limiting factor for DNA language model performance. The model is also evaluated on sequence generation tasks using fine-tuned variants, demonstrating that the encoder backbone can be adapted for directed mutagenesis and sequence optimization workflows.
A key methodological contribution is the empirical characterization of scaling behavior specific to DNA models. The authors document that performance on functional genomics tasks increases monotonically with parameter count across the model sizes tested, but that the scaling exponents differ from those observed in text language models — suggesting that compute-optimal training recipes developed for NLP may not transfer directly to genomics. This positions AIDO.DNA not only as a practical tool but as a contribution to the theory of biological sequence modeling at scale.
AIDO.DNA is designed for the broad community of functional genomics researchers who need to annotate, predict, or design DNA sequences across regulatory, coding, and non-coding contexts. Computational biologists can use the pretrained model as a sequence encoder for supervised tasks such as promoter classification, splice site detection, chromatin state prediction, and transcription factor binding site profiling — tasks that previously required task-specific architectures or smaller language models. Synthetic biology researchers benefit from the generative and directed mutagenesis capabilities, enabling in silico exploration of sequence space for promoter engineering, regulatory element optimization, and the design of novel synthetic gene circuits. Variant effect prediction workflows can leverage the model's zero-shot scoring capability, applying masked language model log-likelihood differences to prioritize non-coding variants in human genetics studies without requiring labeled training data specific to the locus of interest. The AIDO.ModelGenerator framework makes these applications accessible through a standardized fine-tuning interface, allowing wet-lab researchers and smaller computational groups to adapt AIDO.DNA without requiring custom deep learning infrastructure.
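A minimal sketch of the zero-shot scoring idea mentioned above is given below: the variant position is masked, and the log-probabilities assigned to the reference and alternate alleles are compared. The tokenizer and model are assumed to expose a standard Hugging Face masked-LM interface (mask_token, mask_token_id, logits); exact AIDO.DNA usage may differ, and the helper name variant_llr is hypothetical.

```python
# Hedged sketch of zero-shot variant effect scoring with a masked LM.
# Assumes a Hugging Face-style tokenizer/model; function and variable names
# are illustrative, not part of the AIDO.DNA API.
import torch

def variant_llr(model, tokenizer, sequence: str, pos: int, ref: str, alt: str) -> float:
    """Return log P(alt) - log P(ref) at a masked position; more negative values
    suggest the alternate allele fits its sequence context worse than the reference."""
    assert sequence[pos] == ref, "reference allele should match the input sequence"
    masked = sequence[:pos] + tokenizer.mask_token + sequence[pos + 1:]
    inputs = tokenizer(masked, return_tensors="pt")
    mask_idx = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()
```

In practice, scores of this form are computed across many candidate variants and used to rank them for follow-up analysis, without any locus-specific supervised training.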
AIDO.DNA represents a significant scaling milestone for encoder-only DNA foundation models and establishes an important empirical result: that architectural scale can overcome data scale limitations in DNA sequence modeling, at least in the current regime. The demonstration that a 7-billion-parameter model pretrained on 10.6 billion nucleotides outperforms models trained on far larger corpora challenges prevailing assumptions about what drives performance in genomic foundation models and has implications for how the community should allocate compute resources in future model development. As a component of the broader AIDO multiscale platform — which also includes foundation models for RNA (AIDO.RNA), proteins (AIDO.Protein), and single cells (AIDO.Cell) — AIDO.DNA is positioned as a modular building block for the longer-term goal of simulating biology at multiple scales within a coherent computational framework. The open release of model weights in multiple sizes (300M and 7B), coupled with the AIDO.ModelGenerator fine-tuning infrastructure, provides the research community with an immediately practical tool. Limitations to note include the 4,000-nucleotide context window, which precludes modeling long-range regulatory interactions that can span tens of kilobases, and the training corpus size, which remains modest compared to Evo 2's 9.3 trillion base pair dataset — a trade-off that may limit performance on tasks requiring deep cross-species evolutionary context.