A GPT-based foundation model pre-trained on 200B+ base pairs from mammalian genomes, supporting DNA sequence generation, classification, and regression.
DNAGPT is a generalized pre-trained transformer model developed by Tencent AI Lab Healthcare that adapts the GPT architecture for DNA sequence analysis. Released in July 2023, it addresses a persistent gap in genomic AI: most existing models are designed for a single task and require task-specific architectures, limiting their transferability. DNAGPT instead trains a single foundation model on a massive corpus of mammalian genomic data and fine-tunes it across diverse downstream applications.
The model is pre-trained on over 200 billion base pairs sourced from mammalian genomes. This scale positions DNAGPT among the largest DNA language models trained to date and gives it broad coverage of genomic sequence variation across species. To adapt the GPT architecture for genomics, the authors introduced three complementary pre-training objectives: standard autoregressive next-token prediction, binary classification of DNA sequence ordering, and numerical regression to predict guanine-cytosine (GC) content. Together these objectives encourage the model to learn both local sequence patterns and global sequence-level properties.
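To make these objectives concrete, here is a minimal plain-Python sketch that derives all three kinds of training targets from one raw sequence. The single-nucleotide token map, helper name, and shuffling scheme are illustrative assumptions, not the paper's exact pipeline.

```python
import random

# Hypothetical single-nucleotide vocabulary; DNAGPT's actual tokenizer
# also covers special and numerical tokens (see the paper for details).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def make_pretraining_targets(seq: str, shuffle_prob: float = 0.5):
    """Build the three kinds of targets used in DNAGPT-style pre-training."""
    ids = [VOCAB[nt] for nt in seq]

    # 1) Autoregressive LM: predict token t+1 from tokens up to t.
    lm_inputs, lm_targets = ids[:-1], ids[1:]

    # 2) Sequence-order classification: label whether the sequence
    #    was left intact (1) or randomly shuffled (0).
    ordered = random.random() >= shuffle_prob
    clf_ids = ids if ordered else random.sample(ids, len(ids))
    order_label = int(ordered)

    # 3) GC-content regression: fraction of G/C bases as a float target.
    gc_target = sum(nt in "GC" for nt in seq) / len(seq)

    return lm_inputs, lm_targets, clf_ids, order_label, gc_target

inputs, targets, clf_ids, label, gc = make_pretraining_targets("ATGCGCGATTACA")
print(label, round(gc, 3))  # e.g. 1 0.462
```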
DNAGPT is released in multiple configurations — 0.1B and 3B parameter variants trained on human-only or multi-organism datasets — giving users the flexibility to trade off computational cost against model capacity and cross-species generalization.
DNAGPT is built on the GPT transformer decoder architecture, extended with a custom tokenization scheme that encodes DNA nucleotides alongside numerical genomic features and organism-identity tokens. These organism-identity tokens (e.g., <R> for human) allow the multi-organism variants to condition predictions on species identity, enabling cross-species transfer learning within a single model. Pre-training employs three joint objectives: autoregressive language modeling over nucleotide sequences, a binary classification task for distinguishing correctly ordered versus shuffled DNA subsequences, and a regression task predicting GC content as a continuous value. This multi-objective pre-training was designed to push the model beyond simple n-gram statistics and toward an understanding of sequence composition and ordering at multiple scales.
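The joint objective can be sketched in a few lines of PyTorch. The toy model below stands in for the real architecture: it uses a Transformer encoder stack run with a causal mask in place of DNAGPT's actual decoder, mean pooling for the sequence-level heads, and equal loss weights; all of these choices are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDNAGPT(nn.Module):
    """Illustrative decoder with LM, order-classification, and GC heads."""
    def __init__(self, vocab_size=8, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)   # next-token logits
        self.order_head = nn.Linear(d_model, 2)         # intact vs. shuffled
        self.gc_head = nn.Linear(d_model, 1)            # GC fraction

    def forward(self, ids):
        seq_len = ids.size(1)
        # Additive causal mask: -inf above the diagonal blocks future tokens.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")),
                          diagonal=1)
        h = self.backbone(self.embed(ids), mask=mask)
        pooled = h.mean(dim=1)                          # crude sequence summary
        return self.lm_head(h), self.order_head(pooled), self.gc_head(pooled)

def joint_loss(model, ids, next_ids, order_label, gc_target):
    lm_logits, order_logits, gc_pred = model(ids)
    loss_lm = F.cross_entropy(lm_logits.transpose(1, 2), next_ids)
    loss_order = F.cross_entropy(order_logits, order_label)
    loss_gc = F.mse_loss(gc_pred.squeeze(-1), gc_target)
    return loss_lm + loss_order + loss_gc  # equal weights, an assumption

ids = torch.randint(0, 4, (2, 16))  # toy batch of nucleotide ids
loss = joint_loss(TinyDNAGPT(), ids[:, :-1], ids[:, 1:],
                  torch.tensor([1, 0]), torch.rand(2))
```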
Three primary model configurations are available. DNAGPT-0.1B-Human is trained exclusively on the human genome with a maximum context length of 3,060 base pairs. DNAGPT-0.1B-Multi and DNAGPT-3B-Multi are trained on multi-organism mammalian genomes; the 0.1B multi-organism variant extends the context window to 24,564 base pairs, while the 3B variant retains the 3,060 base pair limit. Benchmarking reported in the preprint shows DNAGPT outperforming task-specific models on genomic signal recognition (including AATAAA polyadenylation signal classification), mRNA abundance regression, and artificial genome generation, validating the effectiveness of the generalized pre-training strategy.
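For quick reference, the released configurations can be restated as data, for example to check which checkpoints can ingest a given sequence length. The numbers restate the figures above; the field and function names below are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DNAGPTConfig:
    name: str
    params_b: float      # parameters, in billions
    corpus: str          # pre-training data
    max_context_bp: int  # maximum input length in base pairs

# Figures restated from the released checkpoints described above.
CONFIGS = [
    DNAGPTConfig("DNAGPT-0.1B-Human", 0.1, "human genome", 3_060),
    DNAGPTConfig("DNAGPT-0.1B-Multi", 0.1, "multi-organism mammalian", 24_564),
    DNAGPTConfig("DNAGPT-3B-Multi", 3.0, "multi-organism mammalian", 3_060),
]

def configs_fitting(seq_len_bp: int):
    """Return the configurations whose context window covers the input."""
    return [c for c in CONFIGS if c.max_context_bp >= seq_len_bp]

print([c.name for c in configs_fitting(10_000)])  # ['DNAGPT-0.1B-Multi']
```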
DNAGPT is applicable to a broad range of genomic analysis workflows. Regulatory genomics researchers can fine-tune the model for classification of functional elements such as transcription factor binding sites, enhancers, and splice sites. Transcriptomics groups can use the regression capabilities to predict mRNA abundance from sequence context, enabling hypothesis generation about expression-modulating variants. The generative mode allows synthetic biologists to produce novel DNA sequences conditioned on species identity and learned sequence statistics, which is useful for designing synthetic regulatory elements or stress-testing annotation pipelines with plausible artificial sequences. Comparative genomics applications benefit from the multi-organism models, which can transfer knowledge from well-characterized genomes to less-studied mammalian species.
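A typical fine-tuning pattern from the workflows above, framing signal classification as a binary task, looks roughly like the following PyTorch sketch. The toy backbone, hidden size, and hyperparameters are placeholders rather than the released DNAGPT API.

```python
import torch
import torch.nn as nn

class SignalClassifier(nn.Module):
    """Wrap a pre-trained backbone with a binary task head
    (e.g., AATAAA polyadenylation signal present vs. absent)."""
    def __init__(self, backbone: nn.Module, d_model: int):
        super().__init__()
        self.backbone = backbone            # pre-trained DNAGPT-style decoder
        self.head = nn.Linear(d_model, 2)   # task-specific classification head

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.backbone(ids)              # (batch, seq_len, d_model)
        return self.head(h[:, -1, :])       # classify from the final position

# Stand-in backbone so the sketch runs end to end; in practice this would
# be a released DNAGPT checkpoint, loaded however its repository documents.
toy_backbone = nn.Sequential(nn.Embedding(8, 64))  # ids -> (batch, seq, 64)
model = SignalClassifier(toy_backbone, d_model=64)

optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()
ids = torch.randint(0, 4, (4, 128))        # toy batch of tokenized sequences
labels = torch.randint(0, 2, (4,))         # 1 = signal present, 0 = absent
optim.zero_grad()
loss_fn(model(ids), labels).backward()
optim.step()
```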
DNAGPT contributes to the growing body of work demonstrating that large autoregressive language models originally developed for natural language can serve as effective foundation models for genomics when adapted with domain-specific pre-training objectives and tokenization. Its release alongside pre-trained weights for multiple model sizes lowers the barrier for groups without large-scale compute to leverage foundation model approaches for DNA analysis. The multi-task framing — where a single model performs generation, classification, and regression — offers a more economical approach than training separate specialist models. A key limitation is that the model operates on raw sequence only and does not incorporate chromatin accessibility, methylation, or other epigenomic context that is often predictive of regulatory activity. Context length also constrains analysis of very long genomic regions for most configurations, a practical limitation shared with many transformer-based genomic models.
Zhang, D., Zhang, W., Zhao, Y., Zhang, J., He, B., Qin, C., & Yao, J. (2023). DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks. arXiv preprint arXiv:2307.05628.
DOI: 10.48550/arXiv.2307.05628