A 120M-parameter genomic foundation model that learns adaptive DNA token boundaries via H-Net-style dynamic chunking instead of fixed k-mer or byte-pair tokenization.
How DNA is split into discrete units is one of the central, unresolved design choices in genomic foundation models. Fixed schemes—k-mers, byte-pair encoding, or single nucleotides—impose a rigid granularity that can fragment biologically meaningful elements such as promoter motifs or splice junctions, while nucleotide-level modeling is faithful but expensive over genome-scale context. LDARNet (DNA Adaptive Representation Network with Learnable Tokenization), introduced in a June 2026 arXiv preprint by Daria Ledneva and Denis Kuznetsov, replaces fixed tokenization with token boundaries that the model learns during pretraining.
LDARNet adapts the H-Net-style dynamic chunking mechanism, originally framed for autoregressive modeling, to the masked language modeling (MLM) objective that underpins most bidirectional genomic encoders. A ratio-based regularizer induces adaptive token boundaries without any sequence-level supervision, so the model allocates fine resolution where the sequence warrants it and compresses elsewhere. The 120M-parameter model is pretrained on DNA via MLM, and the resulting fixed checkpoint is then fine-tuned across 27 downstream genomic tasks drawn from the Nucleotide Transformer and Genomic Benchmarks suites. Architecturally it sits closest to dnaHNet, a distinct tokenizer-free genomic model, but targets the encoder/MLM setting rather than autoregressive generation.
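To make the idea of learned boundaries with a ratio-based regularizer more concrete, here is a minimal pure-Python sketch in the style of H-Net's routing module. It is an illustration under stated assumptions, not LDARNet's implementation: the function names, the 0.5 threshold, the use of raw hidden states in place of learned query/key projections, and the exact loss form are all assumptions borrowed from how H-Net-style chunking is usually described.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def boundary_probs(states):
    """Boundary probability per position: p_t = 0.5 * (1 - cos(h_t, h_{t-1})).

    Assumption: identity projections stand in for the learned query/key
    projections of a real routing module. The first position is always
    treated as a boundary.
    """
    probs = [1.0]
    for prev, cur in zip(states, states[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs

def ratio_loss(probs, target_ratio, threshold=0.5):
    """H-Net-style ratio regularizer (assumed form).

    Encourages roughly 1 / target_ratio of positions to become boundaries.
    The loss equals 1.0 when both the hard boundary fraction F and the mean
    probability G equal 1 / target_ratio.
    """
    n = target_ratio
    hard = [1.0 if p >= threshold else 0.0 for p in probs]
    f = sum(hard) / len(hard)    # F: fraction of positions selected as boundaries
    g = sum(probs) / len(probs)  # G: mean boundary probability
    return n / (n - 1) * ((n - 1) * f * g + (1 - f) * (1 - g))
```

In this sketch, adjacent positions with similar representations get a low boundary probability and are merged into one chunk, while a sharp change in representation opens a new chunk. The regularizer only constrains the overall compression rate, which is consistent with boundaries being learned without sequence-level supervision.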
LDARNet is a 120M-parameter genomic foundation model built on BiMamba-2 state-space layers interleaved with local attention and bidirectional routing, pretrained by masked language modeling on DNA. Its core contribution is a learnable adaptive tokenization scheme: an H-Net-style dynamic chunking module, adapted to the MLM objective, together with a ratio-based regularizer that induces adaptive token boundaries without sequence-level supervision. After pretraining, the fixed checkpoint is fine-tuned on 27 downstream tasks from the Nucleotide Transformer and Genomic Benchmarks suites. Among compact models (under 300M parameters) the authors report 11 of 18 wins, with state-of-the-art results on 5 histone-modification tasks—reported to outperform models up to 20× larger and to beat fixed-grid tokenization by as much as 14 percentage points at equivalent compute. Nucleotide-resolution analysis shows the learned boundaries coinciding with promoter motifs and splice junctions.
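For readers unfamiliar with masked language modeling on DNA, the following sketch shows the general shape of the objective's input preparation at nucleotide resolution. It is a generic illustration, not the authors' pipeline: the `[MASK]` symbol, the 15% mask rate (a common BERT-style default), and the helper name `mask_nucleotides` are all hypothetical choices made here.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the LDARNet preprint

def mask_nucleotides(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of nucleotides and return (tokens, labels).

    `labels` maps each masked position to the original nucleotide, which is
    what an MLM-trained encoder would be asked to recover from context on
    both sides of the position.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    # Mask at least one position when the sequence is non-empty.
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    labels = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, labels

tokens, labels = mask_nucleotides("ACGT" * 5)
# Restoring the masked positions from `labels` recovers the original sequence.
restored = "".join(labels.get(i, t) for i, t in enumerate(tokens))
```

Because the encoder sees context on both sides of each masked position, this objective suits bidirectional models, which is the setting the dynamic chunking module is adapted to here.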
LDARNet targets functional genomics workflows where the granularity of sequence tokenization matters: histone-modification prediction, regulatory-element and promoter analysis, splice-site detection, and the broader battery of Nucleotide Transformer and Genomic Benchmarks tasks. Its compact size makes it attractive for groups that need competitive accuracy without the compute footprint of multi-billion-parameter genomic models, and the interpretable, motif-aligned token boundaries can help researchers reason about which sequence elements drive a prediction.
LDARNet strengthens the case that learned, adaptive tokenization can outperform fixed schemes for genomic encoders, extending dynamic chunking from autoregressive models into the masked-language-modeling setting that dominates DNA representation learning. By matching or beating much larger models on histone-modification tasks while staying under 300M parameters, it argues that tokenization, rather than raw scale, is a lever for genomic performance. It is a June 2026 preprint slated for ICML 2026, with code and weights expected to be released by July 2026, so its independent benchmark standing and adoption remain to be established.
Ledneva, D. & Kuznetsov, D. (2026) LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling.
DOI: 10.48550/arXiv.2606.04552