PatchDNA

DNA language model that replaces fixed tokenization with conservation-guided patching, letting models up to 10x smaller match top genomic benchmarks.

Released: March 2026

Parameters: 19.2 Million

DNA language models inherit a design choice from natural-language processing: tokenization. Whether using single nucleotides, fixed k-mers, or byte-pair encoding, these schemes are decided before training and frozen into the model, often splitting the genome in ways that ignore biology and forcing larger models and longer context windows to compensate. PatchDNA argues that this fixed tokenization is a bottleneck and replaces it with a flexible, biologically informed alternative called patching.
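To make the fixed-tokenization concern concrete, here is a minimal sketch of non-overlapping k-mer tokenization. It is an illustration only, not code from PatchDNA; the function name `kmer_tokenize` and the toy sequence are assumptions chosen for the example. It shows one way a frozen scheme can ignore biology: a single-base insertion shifts every downstream token.

```python
from typing import List


def kmer_tokenize(seq: str, k: int = 3) -> List[str]:
    """Split seq into non-overlapping k-mers (hypothetical helper).

    A trailing remainder shorter than k is kept as its own token.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [seq[i:i + k] for i in range(0, len(seq), k)]


original = "ATGGCCAAGTTT"
shifted = "A" + original  # same sequence with one base inserted at the start

print(kmer_tokenize(original))  # ['ATG', 'GCC', 'AAG', 'TTT']
print(kmer_tokenize(shifted))   # ['AAT', 'GGC', 'CAA', 'GTT', 'T']
```

The two token lists share no tokens even though the sequences differ by one base, because the token boundaries are fixed by position rather than by sequence content.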

Developed at Relation Therapeutics and presented as an ICLR 2026 paper (with a bioRxiv preprint), PatchDNA segments DNA into contiguous, variable-length patches rather than a fixed vocabulary of tokens. During pretraining, patch boundaries are guided by evolutionary conservation scores, concentrating the model's capacity on functionally important, conserved regions while compressing less informative stretches. Crucially, because patching is a preprocessing strategy rather than a learned vocabulary, the boundaries can be changed at inference time without retraining the model, a flexibility that fixed tokenizers cannot offer.

The result is a striking efficiency story: the authors report that PatchDNA models up to an order of magnitude smaller than current systems match or surpass state-of-the-art performance on established DNA benchmarks. The approach builds on the byte-latent-transformer line of work while grounding patch boundaries in genomic conservation.

Key Features

Conservation-guided patching: Patch boundaries are placed using evolutionary conservation scores, focusing model capacity on functionally important regions instead of arbitrary fixed tokens.
Tokenization-free flexibility: Patching replaces a frozen vocabulary, so the segmentation scheme can be altered at inference time without any retraining.
Extreme parameter efficiency: Models up to 10x smaller than prior approaches reach or exceed state-of-the-art accuracy on DNA benchmarks.
Long-range context: A 7.7M-parameter variant operates over a 131 kbp context window, enabling whole-locus modeling at very low parameter cost.

Technical Details

PatchDNA pretrains transformer models that consume variable-length DNA patches whose boundaries are derived from evolutionary conservation scores. The work releases two main configurations: a 19.2M-parameter model with a 16 kbp context window and a 7.7M-parameter model with a 131 kbp context window. The smaller, long-context variant is reported to outperform baseline long-sequence models on 6 of 7 tasks in the Genomics Long Range Benchmark, and across standard DNA benchmarks PatchDNA matches or surpasses larger state-of-the-art models. Because patches are computed rather than learned as a fixed vocabulary, the patching strategy is a post-hoc, adjustable component, allowing the same trained model to be re-segmented for different downstream needs.
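The conservation-guided patching described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the source does not give the exact boundary rule, so the thresholding rule, the function name `conservation_patches`, and the parameters `threshold` and `max_patch_len` are all hypothetical. The sketch only shows the general idea that conserved bases receive short patches while low-conservation stretches are compressed into longer ones.

```python
from typing import List, Sequence, Tuple


def conservation_patches(
    scores: Sequence[float],
    threshold: float = 0.5,
    max_patch_len: int = 8,
) -> List[Tuple[int, int]]:
    """Split positions 0..len(scores) into contiguous half-open patches.

    Hypothetical rule (an assumption, not taken from the paper): every base
    whose conservation score is >= threshold becomes its own single-base
    patch, and runs of lower-scoring bases are merged into patches of at
    most max_patch_len bases.
    """
    if max_patch_len < 1:
        raise ValueError("max_patch_len must be at least 1")
    patches: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(scores)):
        conserved_edge = scores[i] >= threshold or scores[i - 1] >= threshold
        if conserved_edge or i - start >= max_patch_len:
            patches.append((start, i))
            start = i
    if len(scores) > 0:
        patches.append((start, len(scores)))
    return patches


# Toy per-base conservation scores (made up for illustration).
scores = [0.1, 0.2, 0.9, 0.95, 0.1, 0.1, 0.1, 0.1, 0.1, 0.8]

fine = conservation_patches(scores, threshold=0.5, max_patch_len=4)
coarse = conservation_patches(scores, threshold=0.92, max_patch_len=4)

print(fine)    # [(0, 2), (2, 3), (3, 4), (4, 8), (8, 9), (9, 10)]
print(coarse)  # [(0, 3), (3, 4), (4, 8), (8, 10)]
```

Because the boundaries are computed from the scores rather than stored in a vocabulary, calling the function again with a different `threshold` re-segments the same sequence without touching any model weights, which mirrors the post-hoc adjustability the paragraph describes.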

Applications

PatchDNA is aimed at genomics researchers and computational-biology teams who need efficient DNA foundation models for tasks such as regulatory-element annotation, variant effect prediction, and long-range genomic context modeling. Its parameter efficiency and long context make it attractive where compute or memory is constrained, or where modeling large genomic loci end-to-end matters. The ability to change patching at inference time is particularly useful for adapting a single pretrained model to new tasks or resolutions without the cost of retraining.

Impact

PatchDNA challenges the assumption that DNA language models must scale up parameters and vocabularies to improve, showing instead that biologically informed, flexible input representations can deliver state-of-the-art results at a fraction of the size. Its inference-time adjustability reframes tokenization from a fixed architectural commitment into a tunable knob, which could influence how future genomic foundation models are designed. As a recent preprint and conference paper, the reported gains await broader independent replication, and the public availability of code and weights was not confirmed at the time of writing.

Citation

PatchDNA: A Flexible and Biologically-Informed alternative to Tokenization for DNA

Preprint

Vecchio, A. D., et al. (2026) PatchDNA: A Flexible and Biologically-Informed alternative to Tokenization for DNA. bioRxiv.

DOI: 10.1101/2025.11.28.691095

Recent citations

Papers that recently cited this model.

dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Arnav Shah, Junzhe Li, Parsa Idehpour, et al.
arXiv.org · Feb 2026
1
DNACHUNKER: Learnable Tokenization for DNA Language Models
Taewon Kim, Jihwan Shin, Hyomin Kim, et al.
arXiv.org · Jan 2026
1

Top citations

The most-cited papers that cite this model.

dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Arnav Shah, Junzhe Li, Parsa Idehpour, et al.
arXiv.org · Feb 2026
1
DNACHUNKER: Learnable Tokenization for DNA Language Models
Taewon Kim, Jihwan Shin, Hyomin Kim, et al.
arXiv.org · Jan 2026
1

Citations

Total Citations: 12

Influential: 1

References: 49

Fields of citing research

Biology: 100%
Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Overall score: 33 (Closed)

Usability (can I run it?): 33

Reproducibility (can I retrain it?): 26

Model Openness Framework: Unclassified (missing required components)

Resources

Research Paper
