Westlake University / BioMap Research
Hierarchical DNA foundation model that learns a dynamic Token Merging tokenizer jointly with latent Transformers, adapting tokenization to varying genomic information density.
MergeDNA is a hierarchical DNA foundation model introduced in a November 2025 arXiv preprint from Stan Z. Li's lab at Westlake University together with BioMap Research, and accepted as an oral presentation at AAAI 2026. It targets two longstanding and unresolved problems in genomic sequence modeling: information density varies widely across genomic regions, and there is no clearly defined minimum vocabulary unit for DNA. Existing approaches either fall back on the four primitive nucleotide bases or rely on independently designed, fixed DNA tokenizers, and when paired with naive masked language modeling pretraining they often fail to adapt to the differing complexity of genomic regions.
The model's central idea is to make tokenization itself learnable and context-dependent. Borrowing Token Merging techniques from the broader deep learning literature, MergeDNA couples a dynamic genomic tokenizer with latent Transformers in a single jointly optimized, hierarchical architecture. Rather than committing to a static vocabulary, the tokenizer learns to chunk adjacent bases into variable "words" that reflect local sequence structure, while higher-level Transformers reason over the merged representation. This lets the model allocate representational capacity according to the actual information content of each region.
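To make the idea of merging adjacent bases into variable-length words concrete, here is a minimal sketch. It is not MergeDNA's tokenizer, which is learned and differentiable: this toy version uses fixed one-hot base vectors, cosine similarity between neighbours, and a greedy rule that merges the `r` most similar non-overlapping adjacent pairs per round. The names `merge_round` and `tokenize`, and the parameters `rounds` and `r`, are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def one_hot(base):
    # Toy stand-in for a learned base embedding (an assumption of this sketch).
    return [1.0 if base == b else 0.0 for b in BASES]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_round(tokens, r):
    """Merge up to r of the most similar non-overlapping adjacent pairs.

    tokens is a list of (vector, start, end) with end exclusive.
    """
    scores = [(cosine(tokens[i][0], tokens[i + 1][0]), i)
              for i in range(len(tokens) - 1)]
    scores.sort(key=lambda s: (-s[0], s[1]))  # most similar first, leftmost on ties
    chosen, used = set(), set()
    for _, i in scores:
        if len(chosen) == r:
            break
        if i in used or i + 1 in used:
            continue  # a token may take part in only one merge per round
        chosen.add(i)
        used.update((i, i + 1))
    out, i = [], 0
    while i < len(tokens):
        if i in chosen:
            (u, start, mid), (v, _, end) = tokens[i], tokens[i + 1]
            wu, wv = mid - start, end - mid  # weight each side by its span length
            merged = [(wu * a + wv * b) / (wu + wv) for a, b in zip(u, v)]
            out.append((merged, start, end))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def tokenize(seq, rounds=2, r=2):
    tokens = [(one_hot(b), i, i + 1) for i, b in enumerate(seq)]
    for _ in range(rounds):
        tokens = merge_round(tokens, r)
    return [seq[start:end] for _, start, end in tokens]

print(tokenize("AAAAAAAACGTG", rounds=2, r=2))
# ['AAAA', 'AA', 'A', 'A', 'C', 'G', 'T', 'G']
```

In this toy run the repetitive poly-A stretch collapses into longer words while the more varied tail stays at single-base resolution, which is the qualitative behaviour the paragraph above describes. A learned tokenizer would replace the one-hot vectors and the greedy rule with trained components.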
MergeDNA sits within the rapidly growing family of DNA foundation models, which includes the Nucleotide Transformer, DNABERT, HyenaDNA, and the Evo lineage, but distinguishes itself by treating tokenization as a trainable component of the pretraining objective rather than a fixed preprocessing step. The authors report that this design outperforms both conventional tokenization methods and large-scale DNA foundation models on three popular DNA benchmarks and several multi-omics tasks.
MergeDNA is a Transformer-based hierarchical model whose defining component is a differentiable token-merging tokenizer. The tokenizer applies several layers of token-merging blocks constrained to local windows, progressively grouping adjacent nucleotides into merged words; a Latent Encoder then applies full self-attention over these words to model global context, while a Latent Decoder and a Local Decoder restore detail in a symmetric encoder-decoder structure. Pretraining combines two objectives: Merged Token Reconstruction, which jointly trains the dynamic tokenization module and adaptively filters salient tokens, and Adaptive Masked Token Modeling, which learns to predict the filtered tokens. The authors report that MergeDNA outperforms typical tokenization methods and large-scale DNA foundation models across three popular DNA benchmarks and several multi-omics tasks under both fine-tuning and zero-shot settings. The preprint (arXiv:2511.14806) was posted on 17 November 2025; precise parameter counts, context length, and pretraining corpus details are reported in the primary source and are not restated here pending confirmation.
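The shape bookkeeping behind this hierarchy (bases merged into words inside local windows, then expanded back to base resolution on the decoder side) can be sketched as follows. This is a rough illustration under stated assumptions, not the published architecture: features are plain scalars, each window is reduced to a fixed number of words by merging the closest adjacent groups, and "unmerging" simply copies each word value back to the positions it covers. The function names `local_merge` and `unmerge` and the parameters `window` and `keep` are hypothetical; the latent encoder and decoder that would operate on the words are omitted.

```python
def local_merge(features, window=4, keep=2):
    """Merge adjacent positions inside each window until `keep` words remain.

    Returns the word values (group means) and their (start, end) spans,
    with end exclusive. Merging never crosses a window boundary.
    """
    words, spans = [], []
    for w0 in range(0, len(features), window):
        # Each window starts with one group per position.
        groups = [[i] for i in range(w0, min(w0 + window, len(features)))]
        while len(groups) > keep:
            means = [sum(features[i] for i in g) / len(g) for g in groups]
            # Pick the adjacent pair of groups whose means are closest.
            j = min(range(len(groups) - 1),
                    key=lambda k: abs(means[k] - means[k + 1]))
            groups[j:j + 2] = [groups[j] + groups[j + 1]]
        for g in groups:
            words.append(sum(features[i] for i in g) / len(g))
            spans.append((g[0], g[-1] + 1))
    return words, spans

def unmerge(word_values, spans, length):
    """Broadcast each word value back to every base position it covers."""
    out = [0.0] * length
    for value, (start, end) in zip(word_values, spans):
        for i in range(start, end):
            out[i] = value
    return out

features = [1.0, 1.0, 1.0, 5.0, 2.0, 4.0, 6.0, 6.0]
words, spans = local_merge(features, window=4, keep=2)
print(words)   # [1.0, 5.0, 3.0, 6.0]
print(spans)   # [(0, 3), (3, 4), (4, 6), (6, 8)]
print(unmerge(words, spans, len(features)))
# [1.0, 1.0, 1.0, 5.0, 3.0, 3.0, 6.0, 6.0]
```

The sequence of eight positions becomes four words of unequal length, and the recorded spans are enough to return to the original resolution. In a model of the kind described above, global self-attention would run on the shorter word sequence between these two steps.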
MergeDNA is aimed at computational genomics researchers who need general-purpose sequence representations that adapt to heterogeneous genomic content. Because tokenization is learned rather than fixed, the model is positioned to handle regions of differing information density, from dense coding sequence to sparser intergenic or regulatory DNA, without manual vocabulary engineering. The reported benchmark coverage spans standard DNA classification tasks and several multi-omics prediction tasks, which suggests possible use in variant effect prediction, regulatory and functional element analysis, and other downstream genomic prediction problems. Its zero-shot capability is particularly relevant where labeled data is scarce, since the pretrained backbone can be applied without task-specific fine-tuning.
MergeDNA contributes a concrete demonstration that tokenization can be a trainable, context-aware part of DNA foundation models rather than a fixed preprocessing choice, directly addressing the open question of how to define vocabulary units for genomic sequences. Its acceptance as an oral presentation at AAAI 2026 signals interest from the broader machine learning community in adaptive tokenization for biological sequences. As of June 2026, no official code or model weights had been released; the GitHub repositories located were third-party course reproductions rather than authoritative releases. The work also remained a preprint that had not yet appeared in a peer-reviewed proceedings volume. These openness and availability gaps are the primary caveats for prospective users, and independent reproduction or benchmarking awaits an official artifact release.
Li, S., et al. (2025) MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging. Proceedings of the AAAI Conference on Artificial Intelligence.
DOI: 10.48550/arXiv.2511.14806