A family of DNA foundation models (500M–2.5B parameters) trained on 3,200+ human genomes and 850 species for genomic sequence understanding and variant effect prediction.
The Nucleotide Transformer is a family of DNA foundation models developed by InstaDeep in collaboration with NVIDIA and the Technical University of Munich. First released as a preprint in January 2023 and subsequently published in Nature Methods in November 2024, the work introduced a suite of large-scale transformer models trained via masked language modeling on diverse genomic sequences — from the human reference genome and thousands of individual human genomes to DNA spanning 850 species across diverse phyla.
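The masked-language-modeling objective mentioned above can be illustrated with a toy masking routine. This is a minimal sketch, not the paper's training code: the 15% default mask rate is the standard BERT-style choice, and the `MASK` token string is illustrative.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style masking: hide a fraction of tokens at random; the
    model is trained to recover the original token at each masked
    position from the surrounding context."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok      # training target at this position
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

# Toy input: five 6-mer tokens from a DNA sequence
tokens = ["ATGCGT", "ACGTAC", "TTTTTT", "GGGCCC", "AACCGG"]
masked, targets = mask_tokens(tokens, mask_rate=0.4, seed=1)
```

The loss is computed only at the masked positions, which is what lets the model learn from unlabeled DNA at scale.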
The central challenge the Nucleotide Transformer addresses is the difficulty of learning from biological sequence data in low-data regimes. Regulatory genomics, splice site prediction, and variant effect scoring all rely on labeled datasets that are far smaller than what standard supervised methods require. By pre-training on vast amounts of unlabeled DNA, the Nucleotide Transformer learns generalizable representations that can be adapted to downstream genomic tasks through fine-tuning or zero-shot transfer.
The initial model family (v1) comprises four variants: a 500M-parameter model trained on the human reference genome, a 500M-parameter model and a 2.5B-parameter model both trained on 3,202 genetically diverse human genomes from the 1000 Genomes Project, and a 2.5B-parameter multispecies model trained on DNA from 850 organisms. A second generation (v2) was subsequently released with architectural improvements and models ranging from 50M to 500M parameters, supporting longer context windows and more efficient training.
Both v1 and v2 variants are encoder-only transformers, architecturally related to ESM-1b, pre-trained with masked language modeling. DNA is tokenized into non-overlapping 6-mers — a design choice that balances sequence compression with the granularity needed to represent regulatory signals. The v1 models pair this tokenization with a 6 kb context window; v2 models extend the context to 12 kb and add rotary positional embeddings and SwiGLU activations. The 2.5B multispecies v1 model was trained on 174 billion nucleotides from 850 species using 128 NVIDIA A100 GPUs across 16 nodes, seeing roughly 300 billion tokens over the course of training. The larger v2 models were trained on up to 1 trillion tokens.
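A minimal sketch of the non-overlapping 6-mer splitting described above. The real HuggingFace tokenizer also maintains a fixed vocabulary and adds special tokens; this toy version only shows how a sequence is segmented, including the common convention of falling back to single-nucleotide tokens for a trailing remainder.

```python
def tokenize_6mer(seq, k=6):
    """Split a DNA sequence into non-overlapping k-mer tokens.
    Leftover bases at the end (when the length is not a multiple of k)
    become single-nucleotide tokens, so no sequence is truncated."""
    seq = seq.upper()
    cut = len(seq) - len(seq) % k          # last index covered by full k-mers
    tokens = [seq[i:i + k] for i in range(0, cut, k)]
    tokens += list(seq[cut:])              # remainder as single bases
    return tokens

tokenize_6mer("ATGCGTACGTAC")  # ['ATGCGT', 'ACGTAC']
tokenize_6mer("ATGCGTAC")      # ['ATGCGT', 'A', 'C']
```

At this granularity a 6 kb input compresses to roughly 1,000 tokens, which is what makes the v1 context window tractable for a standard transformer encoder.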
Benchmark evaluation covered 18 downstream genomic tasks including chromatin profile prediction, splice site detection, enhancer activity prediction, and regulatory element classification. On this suite, the multispecies 2.5B model matched or exceeded specialized methods on all 18 tasks after fine-tuning, compared favorably to DNABERT-2, HyenaDNA, and Enformer, and achieved 95% top-k accuracy and 0.98 precision-recall AUC on splice site prediction. The models were also evaluated in zero-shot and few-shot settings, where their representations alone matched specialist baselines on 11 of 18 tasks.
The Nucleotide Transformer is well-suited for researchers working on regulatory genomics, variant prioritization, and genome annotation where labeled data is scarce. Specific use cases include predicting the effects of non-coding variants on gene expression, identifying splice-altering mutations, classifying enhancers and promoters, and annotating novel genomes across species. The multispecies model is particularly useful for comparative genomics and for organisms without large curated training sets. The models integrate readily into fine-tuning pipelines through the HuggingFace transformers library, making them accessible to researchers without deep ML infrastructure experience.
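One common zero-shot recipe for variant effect scoring with a masked language model is to mask the token containing the variant and compare the model's predicted probabilities for the reference and alternate tokens. The sketch below shows only that comparison; `probs` is a hypothetical stand-in for the model's softmax output at the masked position, which a real pipeline would obtain from the MLM head via the HuggingFace transformers library. This is one plausible scoring scheme, not necessarily the paper's exact formulation.

```python
import math

def snv_effect_score(token_probs, ref_token, alt_token):
    """Log-likelihood ratio of the alternate vs. reference token at a
    masked position. More negative scores mean the model finds the
    variant less plausible in context, suggesting a more disruptive
    single-nucleotide variant."""
    return math.log(token_probs[alt_token]) - math.log(token_probs[ref_token])

# Hypothetical softmax output over candidate 6-mer tokens at the
# masked position (a real run would read this from the model).
probs = {"ATGCGT": 0.62, "ATGAGT": 0.05, "ATGTGT": 0.01}
snv_effect_score(probs, ref_token="ATGCGT", alt_token="ATGAGT")  # ≈ -2.52
```

Note that because the model operates on 6-mer tokens, a single-base substitution changes an entire token, so ref and alt differ by one whole vocabulary entry rather than one nucleotide.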
The Nucleotide Transformer established an early and rigorous benchmark for DNA foundation models, demonstrating that large-scale pre-training on genomic sequence data could meaningfully transfer to diverse downstream tasks — a result that was not guaranteed given the compositional and functional differences between DNA and protein sequences. The paper has driven significant downstream adoption and helped establish the emerging field of genomic language models alongside contemporaries such as DNABERT-2, HyenaDNA, and Evo. A key limitation is the 6–12 kb context window of the v1 and v2 models, which precludes modeling long-range genomic interactions over hundreds of kilobases; this was partially addressed by InstaDeep's subsequent NTv3 model (2025), which extends context to 1 Mb. The 6-mer tokenization scheme also means the models operate on non-overlapping k-mers rather than individual nucleotides, which can complicate interpretation of single-nucleotide variant effects, since a one-base change replaces an entire token.
Dalla-Torre, H., et al. (2024). Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods.
DOI: 10.1038/s41592-024-02523-z