MergeDNA

Hierarchical DNA foundation model that learns a dynamic Token Merging tokenizer jointly with latent Transformers, adapting tokenization to varying genomic information density.

Released: November 2025

MergeDNA is a hierarchical DNA foundation model introduced in a November 2025 arXiv preprint from Stan Z. Li's lab at Westlake University together with BioMap Research, and accepted as an oral presentation at AAAI 2026. It targets two longstanding and unresolved problems in genomic sequence modeling: information density varies widely across genomic regions, and there is no clearly defined minimum vocabulary unit for DNA. Existing approaches either fall back on the four primitive nucleotide bases or rely on independently designed, fixed DNA tokenizers, and when paired with naive masked language modeling pretraining they often fail to adapt to the differing complexity of genomic regions.

The model's central idea is to make tokenization itself learnable and context-dependent. Borrowing Token Merging techniques from the broader deep learning literature, MergeDNA couples a dynamic genomic tokenizer with latent Transformers in a single jointly optimized, hierarchical architecture. Rather than committing to a static vocabulary, the tokenizer learns to chunk adjacent bases into variable "words" that reflect local sequence structure, while higher-level Transformers reason over the merged representation. This lets the model allocate representational capacity according to the actual information content of each region.

MergeDNA sits within the rapidly growing family of DNA foundation models such as the Nucleotide Transformer, DNABERT, HyenaDNA, and the Evo lineage, but distinguishes itself by treating tokenization as a trainable component of the pretraining objective rather than a fixed preprocessing step. The authors report that this design outperforms both conventional tokenization methods and large-scale DNA foundation models across standard benchmarks.

Key Features

Dynamic Token Merging tokenizer: A tokenization module stacks multiple layers of differentiable token-merging blocks with local-window constraints to automatically chunk adjacent bases into variable-length "words," replacing fixed vocabularies or raw-base tokenization.
Hierarchical latent architecture: After merging, a Latent Encoder captures global context across merged words using full-attention blocks, with a symmetric Latent Decoder and Local Decoder reconstructing fine-grained sequence detail.
Context-aware pretraining tasks: Merged Token Reconstruction trains the dynamic tokenizer while adaptively filtering important tokens, and Adaptive Masked Token Modeling predicts those filtered tokens to capture informative content.
Jointly optimized tokenizer and model: Tokenization and the latent Transformers are trained together rather than in separate stages, allowing the vocabulary to adapt to the varying complexity of genomic sequences.
Strong fine-tuning and zero-shot results: The authors report superior performance on three popular DNA benchmarks and several multi-omics tasks under both fine-tuning and zero-shot evaluation.
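
The local-window merging idea in the first feature can be illustrated with a toy example. The sketch below is not the authors' implementation: it is a non-differentiable simplification that assumes one-hot base embeddings, non-overlapping windows, and a cosine-similarity threshold, none of which are confirmed details of MergeDNA. The names `merge_once` and `tokenize` are hypothetical.

```python
import math

# Assumption: one-hot base embeddings stand in for learned embeddings.
ONE_HOT = {"A": [1.0, 0.0, 0.0, 0.0], "C": [0.0, 1.0, 0.0, 0.0],
           "G": [0.0, 0.0, 1.0, 0.0], "T": [0.0, 0.0, 0.0, 1.0]}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_once(tokens, spans, window=4, threshold=0.5):
    """One merging layer: in each non-overlapping window, merge the most
    similar adjacent pair if its similarity reaches the threshold."""
    out_tokens, out_spans = [], []
    for start in range(0, len(tokens), window):
        tok = tokens[start:start + window]
        sp = spans[start:start + window]
        if len(tok) >= 2:
            # Index of the most similar adjacent pair (first one on ties).
            i = max(range(len(tok) - 1),
                    key=lambda k: cosine(tok[k], tok[k + 1]))
            if cosine(tok[i], tok[i + 1]) >= threshold:
                # Average the pair, weighted by how many bases each covers.
                wa = sp[i][1] - sp[i][0]
                wb = sp[i + 1][1] - sp[i + 1][0]
                merged = [(wa * a + wb * b) / (wa + wb)
                          for a, b in zip(tok[i], tok[i + 1])]
                tok = tok[:i] + [merged] + tok[i + 2:]
                sp = sp[:i] + [(sp[i][0], sp[i + 1][1])] + sp[i + 2:]
        out_tokens.extend(tok)
        out_spans.extend(sp)
    return out_tokens, out_spans

def tokenize(seq, layers=2, window=4, threshold=0.5):
    """Stack several merging layers and return the variable-length words."""
    tokens = [list(ONE_HOT[base]) for base in seq]
    spans = [(i, i + 1) for i in range(len(seq))]
    for _ in range(layers):
        tokens, spans = merge_once(tokens, spans, window, threshold)
    return [seq[a:b] for a, b in spans]

words = tokenize("AAAACGTA")
print(words)
```

In this toy setting the repetitive prefix collapses into a longer word while the more varied suffix stays at single-base resolution, which is the qualitative behaviour the feature list describes. A trainable version would replace the hard threshold and one-hot vectors with learned, differentiable components.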

Technical Details

MergeDNA is a Transformer-based hierarchical model whose defining component is a differentiable token-merging tokenizer. The tokenizer applies several layers of token-merging blocks constrained to local windows, progressively grouping adjacent nucleotides into merged words; a Latent Encoder then applies full self-attention over these words to model global context, while a Latent Decoder and a Local Decoder restore detail in a symmetric encoder-decoder structure. Pretraining combines two objectives: Merged Token Reconstruction, which jointly trains the dynamic tokenization module and adaptively filters salient tokens, and Adaptive Masked Token Modeling, which learns to predict the filtered tokens. The authors report that MergeDNA outperforms typical tokenization methods and large-scale DNA foundation models across three popular DNA benchmarks and several multi-omics tasks under both fine-tuning and zero-shot settings. The preprint (arXiv:2511.14806) was posted on 17 November 2025; precise parameter counts, context length, and pretraining corpus details are reported in the primary source and are not restated here pending confirmation.
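
The relationship between token filtering and masked prediction can be made concrete with a small sketch. It assumes that each merged word already carries an importance score; how MergeDNA actually derives such scores is not described here, so the `scores` values, the function name `adaptive_mask`, and the use of `N` as a mask symbol are all hypothetical choices for illustration.

```python
def adaptive_mask(seq, spans, scores, k, mask_char="N"):
    """Mask the k highest-scoring merged words and return the masked
    sequence plus the original content of each masked span."""
    # Rank merged words by importance (ties keep original order).
    top = sorted(range(len(spans)), key=lambda i: scores[i], reverse=True)[:k]
    masked = list(seq)
    targets = {}
    for i in top:
        a, b = spans[i]
        targets[(a, b)] = seq[a:b]          # what the model must predict
        masked[a:b] = mask_char * (b - a)   # hide every base in the word
    return "".join(masked), targets

# Hypothetical merged words and importance scores for an 8-base sequence.
seq = "AAAACGTA"
spans = [(0, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1]

masked_seq, targets = adaptive_mask(seq, spans, scores, k=2)
print(masked_seq, targets)
```

The contrast with naive masked language modeling is that positions are chosen by score rather than uniformly at random, so the prediction targets concentrate on the words judged most informative.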

Applications

MergeDNA is aimed at computational genomics researchers who need general-purpose sequence representations that adapt to heterogeneous genomic content. Because tokenization is learned rather than fixed, the model is positioned to handle regions of differing information density — from dense coding sequence to sparser intergenic or regulatory DNA — without manual vocabulary engineering. The reported benchmark coverage spans standard DNA classification tasks and several multi-omics prediction tasks, suggesting use in variant effect prediction, regulatory and functional element analysis, and other downstream genomic prediction problems. Its zero-shot capability is particularly relevant for settings where labeled data is scarce, allowing the pretrained backbone to be applied without task-specific fine-tuning.
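
One rough way to see what "differing information density" means in practice is to compute the Shannon entropy of base composition over fixed windows. This is a crude proxy chosen for illustration, not a measure used or endorsed by the MergeDNA authors, and low compositional entropy does not by itself imply low biological importance. The function name `window_entropy` and the window size are assumptions.

```python
import math
from collections import Counter

def window_entropy(seq, window=8):
    """Shannon entropy (bits per base) of base composition in each
    non-overlapping window of the sequence."""
    entropies = []
    for start in range(0, len(seq) - window + 1, window):
        counts = Counter(seq[start:start + window])
        # H = sum over observed bases of p * -log2(p)
        h = sum((c / window) * -math.log2(c / window)
                for c in counts.values())
        entropies.append(h)
    return entropies

# A homopolymer run followed by a window that uses all four bases equally.
profile = window_entropy("AAAAAAAA" + "ACGTACGT")
print(profile)
```

A fixed tokenizer spends the same number of tokens on both windows, whereas a learned, context-dependent tokenizer could in principle assign fewer, longer words to the low-entropy window.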

Impact

MergeDNA contributes a concrete demonstration that tokenization can be a trainable, context-aware part of DNA foundation models rather than a fixed preprocessing choice, directly addressing the open question of how to define vocabulary units for genomic sequences. Its acceptance as an oral presentation at AAAI 2026 signals interest from the broader machine learning community in adaptive tokenization for biological sequences. As of June 2026, no official code or model weights had been released (the only GitHub repositories found were third-party course reproductions, not authoritative releases), and the work is a preprint that has not yet appeared in a peer-reviewed proceedings volume. These openness and availability gaps are the primary caveats for prospective users, and independent reproduction or benchmarking awaits an official artifact release.

Citation

MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

Preprint

Li, S., et al. (2025) MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging. Proceedings of the AAAI Conference on Artificial Intelligence.

DOI: 10.48550/arXiv.2511.14806

Recent citations

Papers that recently cited this model.

Carbon: Decoding the Language of Life
Loubna Ben Allal, Qiuyi Li, Maurizio Fiusco, et al.
bioRxiv · May 2026 · Citations: 0

GENERator-v2: Reconciling Coarse Tokenization with Single-Nucleotide Resolution in Genomic Language Modeling
Qiuyi Li, Zhihao Zhan, Shikun Feng, et al.
bioRxiv · May 2026 · Citations: 1

GeneZip: Region-Aware Compression for Long Context DNA Modeling
Jianan Zhao, Xixian Liu, Zhihao Zhan, et al.
arXiv.org · Feb 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Arnav Shah, Junzhe Li, Parsa Idehpour, et al.
arXiv.org · Feb 2026 · Citations: 1

GENERator-v2: Reconciling Coarse Tokenization with Single-Nucleotide Resolution in Genomic Language Modeling
Qiuyi Li, Zhihao Zhan, Shikun Feng, et al.
bioRxiv · May 2026 · Citations: 1

GeneZip: Region-Aware Compression for Long Context DNA Modeling
Jianan Zhao, Xixian Liu, Zhihao Zhan, et al.
arXiv.org · Feb 2026 · Citations: 0

Carbon: Decoding the Language of Life
Loubna Ben Allal, Qiuyi Li, Maurizio Fiusco, et al.
bioRxiv · May 2026 · Citations: 0

Citations

Total Citations: 4

Influential: 0

References: 82

Fields of citing research

Biology: 100%
Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

5 · Closed

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 0 (not reproducible)

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

arXiv Preprint
