A DNA language model that replaces fixed tokenization with conservation-guided patching, letting models up to 10x smaller match or beat state-of-the-art results on genomic benchmarks.
DNA language models inherit a design choice from natural-language processing: tokenization. Whether using single nucleotides, fixed k-mers, or byte-pair encoding, these schemes are decided before training and frozen into the model, often splitting the genome in ways that ignore biology and forcing larger models and longer context windows to compensate. PatchDNA argues that this fixed tokenization is a bottleneck and replaces it with a flexible, biologically informed alternative called patching.
Developed at Relation Therapeutics and presented as an ICLR 2026 paper (with a bioRxiv preprint), PatchDNA segments DNA into contiguous, variable-length patches rather than a fixed vocabulary of tokens. During pretraining, patch boundaries are guided by evolutionary conservation scores, concentrating the model's capacity on functionally important, conserved regions while compressing less informative stretches. Crucially, because patching is a preprocessing strategy rather than a learned vocabulary, the boundaries can be changed at inference time without retraining the model, a flexibility that fixed tokenizers cannot offer.
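To make the idea of conservation-guided boundaries concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the function name `conservation_patches`, the thresholding rule, the `max_len` cap, and the toy scores are all hypothetical. The assumed rule starts a new patch wherever the conservation score reaches a threshold, so conserved stretches receive more, shorter patches, and it caps patch length so weakly conserved stretches are compressed into longer ones.

```python
from typing import List, Tuple


def conservation_patches(
    scores: List[float],
    threshold: float = 0.5,
    max_len: int = 8,
) -> List[Tuple[int, int]]:
    """Split positions into contiguous half-open patches (start, end).

    Hypothetical rule (an assumption, not taken from the paper):
    a position whose score is >= threshold starts a new patch, and a
    patch is also closed once it has grown to max_len positions.
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    patches: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(scores)):
        if scores[i] >= threshold or i - start >= max_len:
            patches.append((start, i))
            start = i
    if scores:
        # Close the final patch at the end of the sequence.
        patches.append((start, len(scores)))
    return patches


# Toy sequence with made-up per-base conservation scores.
seq = "ACGTACGTACGT"
scores = [0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1]

# Same sequence, two segmentations: only the preprocessing changes.
fine = conservation_patches(scores, threshold=0.5, max_len=4)
coarse = conservation_patches(scores, threshold=0.95, max_len=4)

print([seq[s:e] for s, e in fine])
print([seq[s:e] for s, e in coarse])
```

Because the boundaries come from a function of the scores rather than from a vocabulary table, changing `threshold` or `max_len` re-segments the same sequence without touching any model weights.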
The result is a striking efficiency story: the authors report that models up to an order of magnitude smaller than current state-of-the-art systems match or surpass them on established DNA benchmarks. The approach builds on the Byte Latent Transformer line of work while grounding patch boundaries in genomic conservation.
PatchDNA pretrains transformer models that consume variable-length DNA patches whose boundaries are derived from evolutionary conservation scores. The work describes two main configurations: a 19.2M-parameter model with a 16 kbp context window and a 7.7M-parameter model with a 131 kbp context window. The smaller, long-context variant is reported to outperform baseline long-sequence models on 6 of 7 tasks in the Genomics Long Range Benchmark, and across standard DNA benchmarks PatchDNA is reported to match or surpass larger state-of-the-art models. Because patch boundaries are computed from the input rather than fixed in a learned vocabulary, the patching strategy is an adjustable component that can be changed after training, allowing the same trained model to be re-segmented for different downstream needs.
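One way to see why re-segmentation need not require retraining is to look at how a patch could be turned into a model input. The sketch below is an assumption for illustration only: this summary does not describe PatchDNA's patch encoder, so mean pooling is used as a stand-in, and the name `pool_patches` and the toy vectors are hypothetical. The point is that the pooled vectors have the same dimension whatever the boundaries are; only the number of patches changes.

```python
from typing import List, Tuple


def pool_patches(
    base_vectors: List[List[float]],
    patches: List[Tuple[int, int]],
) -> List[List[float]]:
    """Mean-pool per-base vectors into one vector per half-open patch.

    Mean pooling is a stand-in chosen for illustration; it is not
    claimed to be the encoder used in the paper.
    """
    pooled: List[List[float]] = []
    for start, end in patches:
        span = base_vectors[start:end]
        dim = len(span[0])
        pooled.append([sum(v[d] for v in span) / len(span) for d in range(dim)])
    return pooled


# Four bases, each with a made-up 2-dimensional embedding.
base_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Two different segmentations of the same four bases.
fine_inputs = pool_patches(base_vectors, [(0, 1), (1, 2), (2, 4)])
coarse_inputs = pool_patches(base_vectors, [(0, 2), (2, 4)])

print(len(fine_inputs), len(coarse_inputs))
```

Under this assumption, a downstream transformer sees a shorter or longer sequence of same-sized vectors, which is the sense in which patching acts as a tunable preprocessing step rather than a fixed vocabulary.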
PatchDNA is aimed at genomics researchers and computational-biology teams who need efficient DNA foundation models for tasks such as regulatory-element annotation, variant effect prediction, and long-range genomic context modeling. Its parameter efficiency and long context make it attractive where compute or memory is constrained, or where modeling large genomic loci end-to-end matters. The ability to change patching at inference time is particularly useful for adapting a single pretrained model to new tasks or resolutions without the cost of retraining.
PatchDNA challenges the assumption that DNA language models must scale up parameters and vocabularies to improve, showing instead that biologically informed, flexible input representations can deliver state-of-the-art results at a fraction of the size. Its inference-time adjustability reframes tokenization from a fixed architectural commitment into a tunable knob, which could influence how future genomic foundation models are designed. As a recent preprint and conference paper, the reported gains await broader independent replication, and the public availability of code and weights was not confirmed at the time of writing.