dnaHNet

Arc Institute / Carnegie Mellon University / University of Toronto

Tokenizer-free genomic foundation model that adaptively chunks raw nucleotides, enabling zero-shot variant fitness and gene essentiality prediction.

Released: February 2026

Genomic foundation models face a basic design question: how to break DNA into units. Fixed k-mer or byte-pair tokenizers can fragment biologically meaningful motifs and impose a rigid granularity, while operating directly on single nucleotides is faithful but computationally expensive over genome-scale context. dnaHNet, introduced in a February 2026 arXiv preprint, sidesteps this trade-off by being tokenizer-free: it learns, during training, how to compress raw nucleotides into latent tokens rather than committing to a predefined vocabulary.
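The motif-fragmentation problem with fixed tokenizers can be illustrated with a small sketch. The helper names `kmer_tokenize` and `motif_intact` and the toy sequence are hypothetical, chosen for illustration only; the motif is the bacterial -10 promoter consensus `TATAAT`. Whether a non-overlapping 6-mer tokenizer keeps the motif in one token depends purely on where the frame happens to start.

```python
def kmer_tokenize(seq: str, k: int, offset: int = 0) -> list[str]:
    """Split seq into non-overlapping k-mers starting at offset.

    Incomplete trailing fragments are dropped.
    """
    return [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]


def motif_intact(tokens: list[str], motif: str) -> bool:
    """Return True if at least one token contains the whole motif."""
    return any(motif in token for token in tokens)


# Toy sequence (illustrative): the -10 consensus TATAAT flanked by 3 bases.
seq = "GGC" + "TATAAT" + "GCC"
motif = "TATAAT"

frame0 = kmer_tokenize(seq, k=6)            # frame starts at position 0
frame3 = kmer_tokenize(seq, k=6, offset=3)  # frame starts at position 3

print(frame0, motif_intact(frame0, motif))
print(frame3, motif_intact(frame3, motif))
```

In frame 0 the motif is split across two tokens; in frame 3 it survives as a single token. A learned, content-dependent chunker aims to remove this dependence on an arbitrary frame.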

The model is hierarchical and autoregressive. A differentiable dynamic chunking mechanism adaptively groups nucleotides into latent units, so the model allocates resolution where the sequence warrants it and compresses elsewhere. Pretrained on prokaryotic genomes, dnaHNet is reported to scale more favorably than existing architectures and to recover hierarchical biological structure without explicit supervision, while excelling at zero-shot tasks such as protein-variant fitness and gene-essentiality prediction. The author list spans groups associated with the Arc Institute, Carnegie Mellon University, and the University of Toronto.
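As a rough intuition for dynamic chunking, the sketch below places a chunk boundary wherever adjacent hidden states are dissimilar. This page does not spell out dnaHNet's exact boundary rule, so the cosine-similarity formula, the 0.5 threshold, and the hand-written 2-D "hidden states" are all assumptions borrowed from the general style of dynamic-chunking designs. The hard threshold shown here is the forward pass only; it omits whatever smoothing or gradient estimator makes the real mechanism differentiable.

```python
import math


def cosine(u: tuple[float, ...], v: tuple[float, ...]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def boundary_probs(states: list[tuple[float, ...]]) -> list[float]:
    """Assumed rule: p_t = 0.5 * (1 - cos(h_t, h_{t-1})).

    The first position is always treated as a boundary.
    """
    probs = [1.0]
    for prev, cur in zip(states, states[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs


def chunk(seq: str, probs: list[float], threshold: float = 0.5) -> list[str]:
    """Start a new chunk at every position whose boundary probability
    reaches the threshold; otherwise extend the current chunk."""
    chunks: list[str] = []
    for base, p in zip(seq, probs):
        if p >= threshold or not chunks:
            chunks.append(base)
        else:
            chunks[-1] += base
    return chunks


# Hand-written toy hidden states (illustrative, not model outputs):
# the first three point one way, the last three the opposite way.
states = [(1.0, 0.0), (1.0, 0.1), (0.9, 0.2),
          (-1.0, 0.1), (-1.0, 0.0), (-0.9, -0.1)]
seq = "ATGCCC"

print(chunk(seq, boundary_probs(states)))
```

With these toy states the only large change in direction occurs between positions 2 and 3, so the six nucleotides are grouped into two latent units rather than six.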

Key Features

Tokenizer-free modeling: Operates on raw nucleotides and learns its own compression, avoiding fixed k-mer or byte-pair vocabularies that can split motifs.
Differentiable dynamic chunking: Adaptively compresses nucleotides into latent tokens, balancing motif preservation against computational cost.
Hierarchical structure discovery: Automatically uncovers hierarchical biological organization without explicit supervision.
Efficiency at scale: Reports quadratic FLOP reductions and over 3× inference speedup relative to Transformers, with favorable scaling behavior.
Strong zero-shot transfer: Performs well on zero-shot protein-variant fitness and gene-essentiality prediction.

Technical Details

dnaHNet is an autoregressive genomic foundation model that replaces a fixed tokenizer with a differentiable dynamic chunking mechanism, adaptively compressing raw nucleotides into latent tokens within a hierarchical architecture. This design preserves biological motifs while reducing compute, yielding reported quadratic FLOP reductions and a greater-than-3× inference speedup over Transformer baselines, along with improved scaling. The model is pretrained on prokaryotic genomes and evaluated in a zero-shot setting on protein-variant fitness and gene-essentiality tasks, where it is reported to outperform existing architectures; it also recovers hierarchical biological structure without supervision. The preprint has been revised through v3 (latest April 2026); exact parameter counts, training-corpus size, and code or weight availability should be confirmed against the paper.
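One way to read the "quadratic FLOP reduction" is through back-of-the-envelope attention arithmetic: if pairwise attention cost grows with the square of sequence length, running the main network on latent tokens compressed by a ratio r cuts that term by roughly r². The sketch below works through this under stated assumptions. The sequence length, model width, and compression ratio are illustrative values, not figures from the paper, and the estimate ignores the cost of any nucleotide-resolution encoder and decoder stages.

```python
import math


def attention_pair_cost(seq_len: int, d_model: int) -> int:
    """Rough multiply-accumulate count for the two L x L attention
    matmuls (QK^T and weights times V): about 2 * L^2 * d."""
    return 2 * seq_len * seq_len * d_model


def compressed_cost(seq_len: int, d_model: int, ratio: int) -> int:
    """Same estimate when attention runs on ceil(L / ratio) latent tokens."""
    latent_len = math.ceil(seq_len / ratio)
    return attention_pair_cost(latent_len, d_model)


# Illustrative settings only (assumptions, not reported values).
seq_len, d_model, ratio = 8192, 512, 4

full = attention_pair_cost(seq_len, d_model)
reduced = compressed_cost(seq_len, d_model, ratio)

print(f"full-resolution attention MACs: {full:,}")
print(f"compressed attention MACs:      {reduced:,}")
print(f"reduction factor:               {full // reduced}x")
```

Under these assumptions a 4× compression gives a 16× reduction in the pairwise attention term, which is why the saving is described as quadratic rather than linear in the compression ratio.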

Applications

By predicting variant fitness and gene essentiality zero-shot, dnaHNet is suited to functional genomics workflows: prioritizing candidate variants, flagging essential genes, and screening sequence effects without task-specific labels. Its tokenizer-free, efficient design is attractive for long-context genomic settings, and its prokaryotic pretraining makes it directly relevant to microbial genomics and metagenomics.
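A common convention for zero-shot variant scoring with autoregressive sequence models is the log-likelihood ratio between the mutant and wild-type sequences; this page does not state how dnaHNet computes its scores, so treat the following as a sketch of that general convention rather than the authors' procedure. The first-order Markov table `TRANSITIONS` is a hypothetical stand-in for a trained model, and all function names are illustrative.

```python
import math

BASES = "ACGT"

# Hypothetical stand-in for a trained autoregressive model: a first-order
# Markov chain that favours repeating the previous base.
TRANSITIONS = {
    prev: {cur: (0.4 if cur == prev else 0.2) for cur in BASES}
    for prev in BASES
}


def sequence_log_likelihood(seq: str) -> float:
    """Sum of log p(x_t | x_{t-1}), with a uniform prior on the first base."""
    log_lik = math.log(0.25)
    for prev, cur in zip(seq, seq[1:]):
        log_lik += math.log(TRANSITIONS[prev][cur])
    return log_lik


def variant_score(wild_type: str, pos: int, alt: str) -> float:
    """Log-likelihood ratio of the single-base mutant versus wild type.

    Negative scores mean the model finds the mutant less likely.
    """
    mutant = wild_type[:pos] + alt + wild_type[pos + 1:]
    return sequence_log_likelihood(mutant) - sequence_log_likelihood(wild_type)


wild_type = "AAAA"
for alt in BASES:
    print(alt, round(variant_score(wild_type, pos=1, alt=alt), 3))
```

Candidate variants can then be ranked by score without any task-specific labels, which is what makes the setting zero-shot.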

Impact

dnaHNet advances the case that learned, adaptive sequence compression can outperform fixed tokenization for genomic foundation models, both in efficiency and in capturing biological structure. If its scaling and zero-shot advantages hold under independent evaluation, the dynamic-chunking approach could influence how future DNA models handle granularity. Because it is a recent preprint, its broader adoption and benchmark standing remain to be established.

Citation

dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning

Preprint

Shah, A., et al. (2026) dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning. arXiv.org.

DOI: 10.48550/arXiv.2602.10603

Recent citations

Papers that recently cited this model.

GeneZip: Region-Aware Compression for Long Context DNA Modeling
Jianan Zhao, Xixian Liu, Zhihao Zhan, et al.
arXiv.org · Feb 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

GeneZip: Region-Aware Compression for Long Context DNA Modeling
Jianan Zhao, Xixian Liu, Zhihao Zhan, et al.
arXiv.org · Feb 2026
Citations: 0

Citations

Total Citations: 2

Influential: 0

References: 32

Fields of citing research

Biology: 100%
Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

12 · Closed

Usability (can I run it?): 9

Reproducibility (can I retrain it?): 13

Model Openness Framework

Unclassified

Missing required components

Resources

Research Paper
