Arc Institute / Carnegie Mellon University / University of Toronto
A tokenizer-free, hierarchical autoregressive genomic foundation model that adaptively chunks raw nucleotides, enabling efficient long-context learning and zero-shot variant and gene predictions.
Genomic foundation models face a basic tension in how DNA is broken into units. Fixed k-mer or byte-pair tokenizers can fragment biologically meaningful motifs and impose a rigid granularity, while operating directly on single nucleotides is faithful but computationally expensive over genome-scale context. dnaHNet, introduced in a February 2026 arXiv preprint, sidesteps this trade-off by being tokenizer-free: during training it learns how to compress raw nucleotides into latent tokens rather than committing to a predefined vocabulary.
The model is hierarchical and autoregressive. A differentiable dynamic chunking mechanism adaptively groups nucleotides into latent units, so the model allocates resolution where the sequence warrants it and compresses elsewhere. Pretrained on prokaryotic genomes, dnaHNet is reported to scale more favorably than existing architectures and to recover hierarchical biological structure without explicit supervision, while excelling at zero-shot tasks such as protein-variant fitness and gene-essentiality prediction. The author list spans groups associated with the Arc Institute, Carnegie Mellon University, and the University of Toronto.
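To make the idea of adaptive chunking concrete, here is a minimal sketch in plain Python. It assumes a boundary score based on cosine dissimilarity between adjacent position embeddings, with a hard 0.5 threshold; whether dnaHNet uses exactly this rule should be checked against the preprint. The names `embed`, `boundary_probs`, and `chunk_sequence` are hypothetical, and the one-hot embedding is a toy stand-in for a learned encoder.

```python
import math

BASES = "ACGT"

def embed(base):
    # Toy stand-in for a learned encoder (assumption): one-hot vector per base.
    return [1.0 if base == b else 0.0 for b in BASES]

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def boundary_probs(seq):
    # p_t = (1 - cos(e_{t-1}, e_t)) / 2, so dissimilar neighbours score high.
    # The first position always opens a chunk.
    embs = [embed(b) for b in seq]
    probs = [1.0] if embs else []
    for prev, cur in zip(embs, embs[1:]):
        probs.append(0.5 * (1.0 - cosine(prev, cur)))
    return probs

def chunk_sequence(seq, threshold=0.5):
    # Start a new chunk wherever the boundary score reaches the threshold;
    # otherwise extend the current chunk.
    chunks = []
    for base, p in zip(seq, boundary_probs(seq)):
        if p >= threshold or not chunks:
            chunks.append(base)
        else:
            chunks[-1] += base
    return chunks

print(chunk_sequence("AAACGGT"))  # ['AAA', 'C', 'GG', 'T']
```

With one-hot embeddings this reduces to grouping runs of identical bases, which is far simpler than anything a trained model would learn. A real system would compute embeddings with a learned encoder and would need a differentiable treatment of the hard threshold so that boundaries can be trained end to end.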
dnaHNet is an autoregressive genomic foundation model that replaces a fixed tokenizer with a differentiable dynamic chunking mechanism, adaptively compressing raw nucleotides into latent tokens within a hierarchical architecture. This design preserves biological motifs while reducing compute, yielding reported quadratic FLOP reductions and a greater-than-3× inference speedup over Transformer baselines, along with improved scaling. The model is pretrained on prokaryotic genomes and evaluated in a zero-shot setting on protein-variant fitness and gene-essentiality tasks, where it is reported to outperform existing architectures; it also recovers hierarchical biological structure without supervision. The preprint has been revised through v3 (latest April 2026); exact parameter counts, training-corpus size, and code or weight availability should be confirmed against the paper.
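One plausible reading of the quadratic reduction can be checked with back-of-the-envelope arithmetic. The sketch below assumes the saving comes from running self-attention over fewer latent tokens: attention cost grows with the square of sequence length, so compressing the sequence by a factor c cuts that term by roughly c². The function names, the 4·L²·d cost estimate, and the example sizes are illustrative assumptions, not figures from the preprint.

```python
def attention_flops(seq_len, d_model):
    # Rough per-layer cost of the two L x L matmuls in self-attention
    # (scores and weighted values), about 2 * L^2 * d FLOPs each.
    return 4 * seq_len ** 2 * d_model

def attention_reduction(seq_len, d_model, compression):
    # Ratio of attention cost at nucleotide resolution to the cost after
    # compressing the sequence by `compression` (hypothetical parameter).
    latent_len = seq_len // compression
    return attention_flops(seq_len, d_model) / attention_flops(latent_len, d_model)

# Example: 8192 nucleotides compressed 4x to 2048 latent tokens.
print(attention_reduction(8192, 512, 4))  # 16.0
```

This estimate covers only the attention term in the compressed stage. It ignores the cost of any encoder and decoder layers that still run at nucleotide resolution, so end-to-end savings would be smaller than the c² figure.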
By predicting variant fitness and gene essentiality zero-shot, dnaHNet is suited to functional genomics workflows such as prioritizing candidate variants, flagging essential genes, and screening sequence effects without task-specific labels. Its tokenizer-free, efficient design is attractive for long-context genomic settings, and its prokaryotic pretraining makes it directly relevant to microbial genomics and metagenomics.
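A common way to obtain zero-shot variant scores from an autoregressive sequence model is a log-likelihood ratio between the mutant and the wild-type sequence. The sketch below illustrates that general technique; it is not taken from the preprint, and dnaHNet's exact scoring protocol should be confirmed there. `toy_log_likelihood` is a hypothetical stand-in for a pretrained model, and the base probabilities are made up for illustration.

```python
import math

# Hypothetical stand-in for a pretrained model: independent per-base
# probabilities (illustrative values only, not from any real model).
TOY_BASE_PROBS = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def toy_log_likelihood(seq):
    # Sum of per-base log-probabilities under the toy model.
    return sum(math.log(TOY_BASE_PROBS[base]) for base in seq)

def apply_substitution(seq, pos, alt):
    # Return a copy of seq with the base at 0-based index pos replaced by alt.
    return seq[:pos] + alt + seq[pos + 1:]

def zero_shot_score(wild_type, pos, alt, log_likelihood=toy_log_likelihood):
    # Log-likelihood ratio: values below zero mean the model finds the
    # mutant less likely than the wild type.
    mutant = apply_substitution(wild_type, pos, alt)
    return log_likelihood(mutant) - log_likelihood(wild_type)

score = zero_shot_score("ACGT", 0, "C")
print(score < 0)  # True: under the toy model, C is less likely than A
```

In practice the `log_likelihood` argument would wrap a trained model's sequence scoring, and variants would be ranked by score without fitting anything to labelled fitness data.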
dnaHNet advances the case that learned, adaptive sequence compression can outperform fixed tokenization for genomic foundation models, both in efficiency and in capturing biological structure. If its scaling and zero-shot advantages hold under independent evaluation, the dynamic-chunking approach could influence how future DNA models handle granularity. Because it is a recent preprint, its broader adoption and benchmark standing remain to be established.