A family of DNA foundation models (500M–2.5B parameters) trained on 3,200+ human genomes and 850 species for genomic sequence understanding and variant effect prediction.
The Nucleotide Transformer is a family of DNA foundation models developed by InstaDeep in collaboration with NVIDIA and the Technical University of Munich. First released as a preprint in January 2023 and subsequently published in Nature Methods in November 2024, the work introduced a suite of large-scale transformer models trained via masked language modeling on diverse genomic sequences — from the human reference genome and thousands of individual human genomes to DNA spanning 850 species across diverse phyla.
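The masked-language-modeling objective mentioned above can be illustrated with a toy masking routine. This is a minimal sketch, not the paper's training code: the 15% default mask rate is the standard BERT-style choice, and the `MASK` token string is illustrative.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style masking: hide a fraction of tokens at random; the
    model is trained to recover the original token at each masked
    position from the surrounding context."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok      # training target at this position
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

# Toy input: five 6-mer tokens from a DNA sequence
tokens = ["ATGCGT", "ACGTAC", "TTTTTT", "GGGCCC", "AACCGG"]
masked, targets = mask_tokens(tokens, mask_rate=0.4, seed=1)
```

The loss is computed only at the masked positions, which is what lets the model learn from unlabeled DNA at scale.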
The central challenge the Nucleotide Transformer addresses is the difficulty of learning from biological sequence data in low-data regimes. Regulatory genomics, splice site prediction, and variant effect scoring all rely on labeled datasets that are far smaller than what standard supervised methods require. By pre-training on vast amounts of unlabeled DNA, the Nucleotide Transformer learns generalizable representations that can be adapted to downstream genomic tasks through fine-tuning or zero-shot transfer.
The initial model family (v1) comprises four variants: a 500M-parameter model trained on the human reference genome, a 500M-parameter model and a 2.5B-parameter model both trained on 3,202 genetically diverse human genomes from the 1000 Genomes Project, and a 2.5B-parameter multispecies model trained on DNA from 850 organisms. A second generation (v2) was subsequently released with architectural improvements and models ranging from 50M to 500M parameters, supporting longer context windows and more efficient training.
Both v1 and v2 variants are encoder-only transformers, architecturally related to ESM-1b, pre-trained with masked language modeling. DNA is tokenized into non-overlapping 6-mers — a design choice that balances sequence compression with the granularity needed to represent regulatory signals. The v1 models pair this tokenization with a 6 kb context window; v2 models extend the context to 12 kb and add rotary positional embeddings and SwiGLU activations. The 2.5B multispecies v1 model was trained on 174 billion nucleotides from 850 species using 128 NVIDIA A100 GPUs across 16 nodes, seeing roughly 300 billion tokens over the course of training. The larger v2 models were trained on up to 1 trillion tokens.
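A minimal sketch of the non-overlapping 6-mer splitting described above. The real HuggingFace tokenizer also maintains a fixed vocabulary and adds special tokens; this toy version only shows how a sequence is segmented, including the common convention of falling back to single-nucleotide tokens for a trailing remainder.

```python
def tokenize_6mer(seq, k=6):
    """Split a DNA sequence into non-overlapping k-mer tokens.
    Leftover bases at the end (when the length is not a multiple of k)
    become single-nucleotide tokens, so no sequence is truncated."""
    seq = seq.upper()
    cut = len(seq) - len(seq) % k          # last index covered by full k-mers
    tokens = [seq[i:i + k] for i in range(0, cut, k)]
    tokens += list(seq[cut:])              # remainder as single bases
    return tokens

tokenize_6mer("ATGCGTACGTAC")  # ['ATGCGT', 'ACGTAC']
tokenize_6mer("ATGCGTAC")      # ['ATGCGT', 'A', 'C']
```

At this granularity a 6 kb input compresses to roughly 1,000 tokens, which is what makes the v1 context window tractable for a standard transformer encoder.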
Benchmark evaluation covered 18 downstream genomic tasks including chromatin profile prediction, splice site detection, enhancer activity prediction, and regulatory element classification. On this suite, the multispecies 2.5B model matched or exceeded specialized methods on all 18 tasks after fine-tuning, compared favorably to DNABERT-2, HyenaDNA, and Enformer, and achieved 95% top-k accuracy and 0.98 precision-recall AUC on splice site prediction. The models were also evaluated in zero-shot and few-shot settings, where their representations alone matched specialist baselines on 11 of 18 tasks.
The Nucleotide Transformer is well-suited for researchers working on regulatory genomics, variant prioritization, and genome annotation where labeled data is scarce. Specific use cases include predicting the effects of non-coding variants on gene expression, identifying splice-altering mutations, classifying enhancers and promoters, and annotating novel genomes across species. The multispecies model is particularly useful for comparative genomics and for organisms without large curated training sets. The models integrate readily into fine-tuning pipelines through the HuggingFace transformers library, making them accessible to researchers without deep ML infrastructure experience.
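One common zero-shot recipe for variant effect scoring with a masked language model is to mask the token containing the variant and compare the model's predicted probabilities for the reference and alternate tokens. The sketch below shows only that comparison; `probs` is a hypothetical stand-in for the model's softmax output at the masked position, which a real pipeline would obtain from the MLM head via the HuggingFace transformers library. This is one plausible scoring scheme, not necessarily the paper's exact formulation.

```python
import math

def snv_effect_score(token_probs, ref_token, alt_token):
    """Log-likelihood ratio of the alternate vs. reference token at a
    masked position. More negative scores mean the model finds the
    variant less plausible in context, suggesting a more disruptive
    single-nucleotide variant."""
    return math.log(token_probs[alt_token]) - math.log(token_probs[ref_token])

# Hypothetical softmax output over candidate 6-mer tokens at the
# masked position (a real run would read this from the model).
probs = {"ATGCGT": 0.62, "ATGAGT": 0.05, "ATGTGT": 0.01}
snv_effect_score(probs, ref_token="ATGCGT", alt_token="ATGAGT")  # ≈ -2.52
```

Note that because the model operates on 6-mer tokens, a single-base substitution changes an entire token, so ref and alt differ by one whole vocabulary entry rather than one nucleotide.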
The Nucleotide Transformer established an early and rigorous benchmark for DNA foundation models, demonstrating that large-scale pre-training on genomic sequence data could meaningfully transfer to diverse downstream tasks — a result that was not guaranteed given the compositional and functional differences between DNA and protein sequences. The paper has driven significant downstream adoption and helped establish the emerging field of genomic language models alongside contemporaries such as DNABERT-2, HyenaDNA, and Evo. A key limitation is the 6–12 kb context window of the v1 and v2 models, which precludes modeling long-range genomic interactions over hundreds of kilobases; this was partially addressed by InstaDeep's subsequent NTv3 model (2025), which extends context to 1 Mb. The 6-mer tokenization scheme also means the models operate on non-overlapping k-mers rather than individual nucleotides, which can complicate interpretation of single-nucleotide variant effects, since a one-base change replaces an entire token.
Dalla-Torre, H., et al. (2024). Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nature Methods.
DOI: 10.1038/s41592-024-02523-z