A GPT-based foundation model pre-trained on 200B+ base pairs from mammalian genomes, supporting DNA sequence generation, classification, and regression.
DNAGPT is a generalized pre-trained transformer model developed by Tencent AI Lab Healthcare that adapts the GPT architecture for DNA sequence analysis. Released in July 2023, it addresses a persistent gap in genomic AI: most existing models are designed for a single task and require task-specific architectures, limiting their transferability. DNAGPT instead trains a single foundation model on a massive corpus of mammalian genomic data and fine-tunes it across diverse downstream applications.
The model is pre-trained on over 200 billion base pairs sourced from mammalian genomes. This scale positions DNAGPT among the largest DNA language models trained to date and gives it broad coverage of genomic sequence variation across species. To adapt the GPT architecture for genomics, the authors introduced three complementary pre-training objectives: standard autoregressive next-token prediction, binary classification of DNA sequence ordering, and numerical regression to predict guanine-cytosine (GC) content. Together these objectives encourage the model to learn both local sequence patterns and global sequence-level properties.
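To make these objectives concrete, here is a minimal plain-Python sketch that derives all three kinds of training targets from one raw sequence. The single-nucleotide token map, helper name, and shuffling scheme are illustrative assumptions, not the paper's exact pipeline.

```python
import random

# Hypothetical single-nucleotide vocabulary; DNAGPT's actual tokenizer
# also covers special and numerical tokens (see the paper for details).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def make_pretraining_targets(seq: str, shuffle_prob: float = 0.5):
    """Build the three kinds of targets used in DNAGPT-style pre-training."""
    ids = [VOCAB[nt] for nt in seq]

    # 1) Autoregressive LM: predict token t+1 from tokens up to t.
    lm_inputs, lm_targets = ids[:-1], ids[1:]

    # 2) Sequence-order classification: label whether the sequence
    #    was left intact (1) or randomly shuffled (0).
    ordered = random.random() >= shuffle_prob
    clf_ids = ids if ordered else random.sample(ids, len(ids))
    order_label = int(ordered)

    # 3) GC-content regression: fraction of G/C bases as a float target.
    gc_target = sum(nt in "GC" for nt in seq) / len(seq)

    return lm_inputs, lm_targets, clf_ids, order_label, gc_target

inputs, targets, clf_ids, label, gc = make_pretraining_targets("ATGCGCGATTACA")
print(label, round(gc, 3))  # e.g. 1 0.462
```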
DNAGPT is released in multiple configurations — 0.1B and 3B parameter variants trained on human-only or multi-organism datasets — giving users the flexibility to trade off computational cost against model capacity and cross-species generalization.
DNAGPT is built on the GPT transformer decoder architecture, extended with a custom tokenization scheme that encodes DNA nucleotides alongside numerical genomic features and organism-identity tokens. These organism-identity tokens (e.g., <R> for human) allow the multi-organism variants to condition predictions on species identity, enabling cross-species transfer learning within a single model. Pre-training employs three joint objectives: autoregressive language modeling over nucleotide sequences, a binary classification task for distinguishing correctly ordered versus shuffled DNA subsequences, and a regression task predicting GC content as a continuous value. This multi-objective pre-training was designed to push the model beyond simple n-gram statistics and toward an understanding of sequence composition and ordering at multiple scales.
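The joint objective can be sketched in a few lines of PyTorch. The toy model below stands in for the real architecture: it uses a Transformer encoder stack run with a causal mask in place of DNAGPT's actual decoder, mean pooling for the sequence-level heads, and equal loss weights; all of these choices are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDNAGPT(nn.Module):
    """Illustrative decoder with LM, order-classification, and GC heads."""
    def __init__(self, vocab_size=8, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)   # next-token logits
        self.order_head = nn.Linear(d_model, 2)         # intact vs. shuffled
        self.gc_head = nn.Linear(d_model, 1)            # GC fraction

    def forward(self, ids):
        seq_len = ids.size(1)
        # Additive causal mask: -inf above the diagonal blocks future tokens.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                          diagonal=1)
        h = self.backbone(self.embed(ids), mask=mask)
        pooled = h.mean(dim=1)                          # crude sequence summary
        return self.lm_head(h), self.order_head(pooled), self.gc_head(pooled)

def joint_loss(model, ids, next_ids, order_label, gc_target):
    lm_logits, order_logits, gc_pred = model(ids)
    loss_lm = F.cross_entropy(lm_logits.transpose(1, 2), next_ids)
    loss_order = F.cross_entropy(order_logits, order_label)
    loss_gc = F.mse_loss(gc_pred.squeeze(-1), gc_target)
    return loss_lm + loss_order + loss_gc  # equal weights, an assumption

ids = torch.randint(0, 4, (2, 16))  # toy batch of nucleotide ids
loss = joint_loss(TinyDNAGPT(), ids[:, :-1], ids[:, 1:],
                  torch.tensor([1, 0]), torch.rand(2))
```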
Three primary model configurations are available. DNAGPT-0.1B-Human is trained exclusively on the human genome with a maximum context length of 3,060 base pairs. DNAGPT-0.1B-Multi and DNAGPT-3B-Multi are trained on multi-organism mammalian genomes; the 0.1B multi-organism variant extends the context window to 24,564 base pairs, while the 3B variant retains the 3,060 base pair limit. Benchmarking reported in the preprint shows DNAGPT outperforming task-specific models on genomic signal recognition (including AATAAA polyadenylation signal classification), mRNA abundance regression, and artificial genome generation, validating the effectiveness of the generalized pre-training strategy.
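For quick reference, the released configurations can be restated as data, for example to check which checkpoints can ingest a given sequence length. The numbers restate the figures above; the field and function names below are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DNAGPTConfig:
    name: str
    params_b: float      # parameters, in billions
    corpus: str          # pre-training data
    max_context_bp: int  # maximum input length in base pairs

# Figures restated from the released checkpoints described above.
CONFIGS = [
    DNAGPTConfig("DNAGPT-0.1B-Human", 0.1, "human genome", 3_060),
    DNAGPTConfig("DNAGPT-0.1B-Multi", 0.1, "multi-organism mammalian", 24_564),
    DNAGPTConfig("DNAGPT-3B-Multi", 3.0, "multi-organism mammalian", 3_060),
]

def configs_fitting(seq_len_bp: int):
    """Return the configurations whose context window covers the input."""
    return [c for c in CONFIGS if c.max_context_bp >= seq_len_bp]

print([c.name for c in configs_fitting(10_000)])  # ['DNAGPT-0.1B-Multi']
```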
DNAGPT is applicable to a broad range of genomic analysis workflows. Regulatory genomics researchers can fine-tune the model for classification of functional elements such as transcription factor binding sites, enhancers, and splice sites. Transcriptomics groups can use the regression capabilities to predict mRNA abundance from sequence context, enabling hypothesis generation about expression-modulating variants. The generative mode allows synthetic biologists to produce novel DNA sequences conditioned on species identity and learned sequence statistics, which is useful for designing synthetic regulatory elements or stress-testing annotation pipelines with plausible artificial sequences. Comparative genomics applications benefit from the multi-organism models, which can transfer knowledge from well-characterized genomes to less-studied mammalian species.
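A typical fine-tuning pattern from the workflows above, framing signal classification as a binary task, looks roughly like the following PyTorch sketch. The toy backbone, hidden size, and hyperparameters are placeholders rather than the released DNAGPT API.

```python
import torch
import torch.nn as nn

class SignalClassifier(nn.Module):
    """Wrap a pre-trained backbone with a binary task head
    (e.g., AATAAA polyadenylation signal present vs. absent)."""
    def __init__(self, backbone: nn.Module, d_model: int):
        super().__init__()
        self.backbone = backbone            # pre-trained DNAGPT-style decoder
        self.head = nn.Linear(d_model, 2)   # task-specific classification head

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.backbone(ids)              # (batch, seq_len, d_model)
        return self.head(h[:, -1, :])       # classify from the final position

# Stand-in backbone so the sketch runs end to end; in practice this would
# be a released DNAGPT checkpoint, loaded however its repository documents.
toy_backbone = nn.Sequential(nn.Embedding(8, 64))  # ids -> (batch, seq, 64)
model = SignalClassifier(toy_backbone, d_model=64)

optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()
ids = torch.randint(0, 4, (4, 128))        # toy batch of tokenized sequences
labels = torch.randint(0, 2, (4,))         # 1 = signal present, 0 = absent
optim.zero_grad()
loss_fn(model(ids), labels).backward()
optim.step()
```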
DNAGPT contributes to the growing body of work demonstrating that large autoregressive language models originally developed for natural language can serve as effective foundation models for genomics when adapted with domain-specific pre-training objectives and tokenization. Its release alongside pre-trained weights for multiple model sizes lowers the barrier for groups without large-scale compute to leverage foundation model approaches for DNA analysis. The multi-task framing — where a single model performs generation, classification, and regression — offers a more economical approach than training separate specialist models. A key limitation is that the model operates on raw sequence only and does not incorporate chromatin accessibility, methylation, or other epigenomic context that is often predictive of regulatory activity. Context length also constrains analysis of very long genomic regions for most configurations, a practical limitation shared with many transformer-based genomic models.
Zhang, D., Zhang, W., Zhao, Y., Zhang, J., He, B., Qin, C., & Yao, J. (2023). DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks. arXiv preprint arXiv:2307.05628.
DOI: 10.48550/arXiv.2307.05628