DNAGPT2 is a family of ten compact, decoder-only transformer language models for DNA, developed at CEITEC Masaryk University by Vojtech Macala and Petr Simecek and released as a bioRxiv preprint in June 2026. Each model is a GPT-2-small architecture (roughly 86 to 92 million parameters) trained autoregressively to predict the next token in a genomic sequence. The work's primary framing is lossless DNA compression: because an autoregressive model assigns a probability to every next token, it can drive an arithmetic coder, and the better the model predicts the genome, the fewer bits are needed to store it.
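The link between prediction quality and storage cost can be made concrete with a minimal sketch. It assumes only the standard information-theoretic result that an arithmetic coder spends close to -log2(p) bits on a token the model assigned probability p. The function name and the example probabilities are hypothetical and are not taken from the DNAGPT2 release.

```python
import math

def ideal_code_length_bits(probs):
    """Total bits an ideal entropy coder would spend, given the probability
    the model assigned to each token that actually occurred."""
    return sum(-math.log2(p) for p in probs)

# Hypothetical probabilities assigned to four observed tokens.
confident = [0.5, 0.5, 0.25, 0.25]   # model often guesses well
uniform = [0.25, 0.25, 0.25, 0.25]   # model knows nothing about A/C/G/T

print(ideal_code_length_bits(confident))  # 6.0
print(ideal_code_length_bits(uniform))    # 8.0
```

A model that is no better than uniform over four nucleotides costs 2 bits per symbol, and any probability mass it shifts onto the correct next token lowers the total.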

The defining experimental variable across the family is the byte-pair encoding (BPE) vocabulary size, which takes ten values from 16 to 8192 tokens (16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192). Holding architecture, training data, and compute fixed while sweeping vocabulary size isolates the effect of tokenization granularity on how well a small DNA model compresses and predicts sequence, a question that work on larger genomic foundation models rarely examines in a controlled way. The study finds that small vocabularies, not large ones, give the best compression: the DNAGPT2_32 variant (32-token vocabulary) reaches the strongest result, 1.47 bits per base on the complete T2T human genome.
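Comparing models with different vocabularies requires a common unit, because a larger vocabulary produces fewer but longer tokens and per-token loss is therefore not comparable. The sketch below shows one way to normalize to bits per base: divide the total code length by the number of nucleotides covered. The function name, token strings, and per-token costs are hypothetical illustrations, not values from the paper.

```python
def bits_per_base(token_bits, tokens):
    """Normalize per-token code lengths (in bits) by the number of
    nucleotides those tokens cover."""
    total_bases = sum(len(t) for t in tokens)
    if total_bases == 0:
        raise ValueError("no nucleotides to normalize by")
    return sum(token_bits) / total_bases

# Hypothetical tokenization of the 9-base sequence ACGTGATTA into three tokens,
# with made-up code lengths for each token.
tokens = ["ACG", "T", "GATTA"]
token_bits = [4.0, 1.5, 6.0]

rate = bits_per_base(token_bits, tokens)
print(f"{rate:.3f} bits/base")
```

Under this normalization a model with long tokens may show a higher loss per token and still achieve a lower rate per nucleotide, which is why the sweep is reported in bits per base.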

DNAGPT2 sits at the intersection of genomic language modeling and the long-standing field of reference-free DNA compression. Rather than competing with large multi-species foundation models on downstream classification, it shows that a single-GPU, GPT-2-scale model with carefully chosen tokenization is a competitive, practical DNA compressor.

Key Features

Vocabulary-size sweep: Ten otherwise-identical models span BPE vocabularies from 16 to 8192 tokens, providing a controlled study of how tokenization granularity affects DNA prediction and compression.
Competitive lossless compression: The best variant, DNAGPT2_32, reaches 1.47 bits per base on the complete T2T human genome, placing 4th on the Cobilab DNA compression benchmark and well ahead of general-purpose compressors such as gzip (about 2.0 bits/base).
Arithmetic-coding pipeline: Each model's next-token probabilities feed an arithmetic encoder, turning the language model directly into a lossless compressor.
Small and reproducible: Every model was trained for a single epoch on one NVIDIA A40 GPU, making the entire family inexpensive to reproduce.
Fully open weights: All ten pretrained checkpoints are public on HuggingFace under the Apache-2.0 license.
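To put the bits-per-base figures above in concrete terms, the following sketch converts a rate into an approximate file size and a relative saving. The rates of 1.47 and about 2.0 bits/base come from the text. The genome length is an assumed round figure used only for illustration, and both function names are hypothetical.

```python
def compressed_megabytes(n_bases, bits_per_base):
    """Approximate size in megabytes (10**6 bytes) at a given coding rate."""
    return n_bases * bits_per_base / 8 / 1e6

def relative_saving(rate, baseline_rate):
    """Fraction of storage saved relative to a baseline rate."""
    return 1.0 - rate / baseline_rate

N_BASES = 3_100_000_000  # assumption: round figure for a human-scale assembly

for name, rate in [("about 2.0 bits/base (gzip, per the text)", 2.0),
                   ("DNAGPT2_32 at 1.47 bits/base", 1.47)]:
    print(f"{name}: ~{compressed_megabytes(N_BASES, rate):.0f} MB")

print(f"saving vs 2.0 bits/base: {relative_saving(1.47, 2.0):.1%}")
```

Note that 2 bits per base is also what a plain fixed-width encoding of A, C, G, and T achieves, so the saving reflects what the model has learned about sequence structure.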

Technical Details

Each DNAGPT2 model uses the GPT-2-small configuration: 12 transformer layers, 12 attention heads, 768-dimensional embeddings, and a 1024-token context window. Tokenization is byte-pair encoding via SentencePiece, and vocabulary size (16 to 8192) is the only hyperparameter varied.

The models were pretrained for one epoch on the DNABERT-2 multi-species corpus: 135 genomes spanning Vertebrata, Fungi, Protozoa, Invertebrata, and Bacteria, totaling roughly 32.5 billion nucleotides restricted to A, C, G, and T. Training followed the nanoGPT recipe in PyTorch with the AdamW optimizer (betas 0.9/0.95, weight decay 0.1), a learning rate that warms up linearly to 8e-4 and then decays along a cosine schedule to 8e-5, and a batch size of 2^19 (524,288) tokens per step, all on a single A40.

Evaluated as compressors, smaller vocabularies performed best: DNAGPT2_32 achieved 1.47 bits/base on the T2T human assembly (4th on the Cobilab benchmark), with comparable gains over gzip on bacterial and plant test sequences. Broad downstream-task evaluation (variant effect, regulatory element classification, and similar) is limited so far, consistent with the paper's focus on compression.
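The stated parameter range can be checked against the GPT-2-small configuration described above. The sketch below counts parameters under two assumptions that the source does not confirm: the standard GPT-2 layout with bias terms, and input embeddings tied to the output head. The function name is hypothetical, and the released checkpoints may differ slightly (for example through special tokens or disabled biases).

```python
def gpt2_param_count(vocab_size, n_layer=12, n_embd=768, block_size=1024):
    """Parameter count for a standard GPT-2 layout (assumed: biases on,
    token embedding shared with the output head)."""
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    # Attention: fused QKV projection plus output projection.
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4x width, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)  # two LayerNorms per block, weight + bias
    final_ln = 2 * n_embd
    return tok_emb + pos_emb + n_layer * (attn + mlp + layer_norms) + final_ln

for vocab in (16, 8192):
    print(vocab, gpt2_param_count(vocab))
# 16 85854720
# 8192 92133888
```

Under these assumptions the count runs from about 85.9 million to about 92.1 million, in line with the "roughly 86 to 92 million" quoted for the family. The difference comes entirely from the embedding table, since nothing else depends on vocabulary size.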

Applications

DNAGPT2 is most directly useful for lossless genomic data compression, where storing and transmitting large assemblies efficiently is a practical bottleneck for sequencing centers, biobanks, and bioinformatics pipelines. Because the models are small, openly licensed, and inexpensive to reproduce on a single GPU, they also serve as an accessible testbed for researchers studying tokenization choices in genomic language models, or as lightweight autoregressive baselines for next-token prediction on DNA. Their modest size makes them suitable for teaching and for rapid experimentation where large foundation models would be unwieldy.

Impact

DNAGPT2's main contribution is a clean, controlled demonstration that tokenization granularity is a first-order design choice for DNA language models and that, counter to intuition from natural-language modeling, smaller BPE vocabularies yield better compression for genomes. By placing a single-GPU, GPT-2-scale model 4th on the Cobilab compression benchmark, the work shows that competitive DNA compressors need not be large. The release of all ten sets of weights under Apache-2.0 makes the comparison fully reproducible and gives the community a concrete set of small baselines. The principal limitation is scope: the family is optimized and evaluated chiefly for compression and next-token prediction, and its usefulness on the broader range of genomic downstream tasks has not yet been established. No dedicated public training-code repository was found at release.

Citation

DNA Compression with Genomic Language Models: Tokenization, Benchmarking, and an Information-Content Map

Máčala, V. & Šimeček, P. (2026) DNA Compression with Genomic Language Models: Tokenization, Benchmarking, and an Information-Content Map. bioRxiv.

DOI: 10.64898/2026.06.10.731316

