LDARNet

Genomic foundation model with 120M parameters that learns adaptive DNA token boundaries by dynamic chunking, not fixed k-mer or byte-pair tokens.

Released: June 2026

Parameters: 120 Million

How DNA is split into discrete units is one of the central, unresolved design choices in genomic foundation models. Fixed schemes—k-mers, byte-pair encoding, or single nucleotides—impose a rigid granularity that can fragment biologically meaningful elements such as promoter motifs or splice junctions, while nucleotide-level modeling is faithful but expensive over genome-scale context. LDARNet (DNA Adaptive Representation Network with Learnable Tokenization), introduced in a June 2026 arXiv preprint by Daria Ledneva and Denis Kuznetsov, replaces fixed tokenization with token boundaries that the model learns during pretraining.

LDARNet adapts the H-Net-style dynamic chunking mechanism—originally framed for autoregressive modeling—to the masked language modeling (MLM) objective that underpins most bidirectional genomic encoders. A ratio-based regularizer induces adaptive token boundaries without any sequence-level supervision, so the model allocates fine resolution where the sequence warrants it and compresses elsewhere. The 120M-parameter checkpoint is pretrained on DNA via MLM, then frozen and fine-tuned across 27 downstream genomic tasks drawn from the Nucleotide Transformer and Genomic Benchmarks suites. Architecturally it sits closest to dnaHNet, a distinct tokenizer-free genomic model, but targets the encoder/MLM setting rather than autoregressive generation.

Key Features

Learnable adaptive tokenization: Replaces fixed k-mer/BPE/nucleotide vocabularies with token boundaries learned during pretraining, avoiding the fragmentation of motifs that rigid grids impose.
Dynamic chunking adapted for MLM: Brings H-Net-style dynamic chunking, previously used in autoregressive settings, into the masked-language-modeling regime used by bidirectional genomic encoders.
BiMamba-2 backbone with local attention: Combines bidirectional Mamba-2 state-space layers with local attention and bidirectional routing for efficient long-context sequence modeling.
Unsupervised boundary induction: A ratio-based regularizer induces adaptive token boundaries without sequence-level supervision.
Biologically aligned boundaries: Learned boundaries align with canonical promoter motifs and splice junctions, offering interpretability not available from fixed tokenizers.

Technical Details

LDARNet is a 120M-parameter genomic foundation model built on BiMamba-2 state-space layers interleaved with local attention and bidirectional routing, pretrained by masked language modeling on DNA. Its core contribution is a learnable adaptive tokenization scheme: an H-Net-style dynamic chunking module, adapted to the MLM objective, together with a ratio-based regularizer that induces adaptive token boundaries without sequence-level supervision. After pretraining, the fixed checkpoint is fine-tuned on 27 downstream tasks from the Nucleotide Transformer and Genomic Benchmarks suites. Among compact models (under 300M parameters) the authors report 11 of 18 wins, with state-of-the-art results on 5 histone-modification tasks—reported to outperform models up to 20× larger and to beat fixed-grid tokenization by as much as 14 percentage points at equivalent compute. Nucleotide-resolution analysis shows the learned boundaries coinciding with promoter motifs and splice junctions.

Applications

LDARNet targets functional genomics workflows where the granularity of sequence tokenization matters: histone-modification prediction, regulatory-element and promoter analysis, splice-site detection, and the broader battery of Nucleotide Transformer and Genomic Benchmarks tasks. Its compact size makes it attractive for groups that need competitive accuracy without the compute footprint of multi-billion parameter genomic models, and the interpretable, motif-aligned token boundaries can help researchers reason about which sequence elements drive a prediction.

Impact

LDARNet strengthens the case that learned, adaptive tokenization can outperform fixed schemes for genomic encoders, extending dynamic chunking from autoregressive models into the masked-language-modeling setting that dominates DNA representation learning. By matching or beating much larger models on histone-modification tasks while staying under 300M parameters, it argues for tokenization—rather than raw scale—as a lever for genomic performance. As a June 2026 preprint slated for ICML 2026, with code and weights expected to release by July 2026, its independent benchmark standing and adoption remain to be established.

Citation

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Preprint

Ledneva, D. & Kuznetsov, D. (2026) LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling.

DOI: 10.48550/arXiv.2606.04552

Recent citations

Papers that recently cited this model.

GeneZip: Region-Aware Compression for Long Context DNA Modeling
Jianan Zhao, Xixian Liu, Zhihao Zhan, et al.
arXiv.org · Feb 2026
0

Top citations

The most-cited papers that cite this model.

GeneZip: Region-Aware Compression for Long Context DNA Modeling
Jianan Zhao, Xixian Liu, Zhihao Zhan, et al.
arXiv.org · Feb 2026
0

Citations

Total Citations1

Influential0

References20

GitHub

Stars4

Forks0

Open Issues0

Contributors1

Last Push24d ago

LanguageJupyter Notebook

LicenseApache-2.0

Fields of citing research

Biology100%
Computer Science100%

Share of papers citing this model.

Openness

bio.rodeo opennessClosed · low usability and reproducibility

26Closed

Usability — can I run it?15

Reproducibility — can I retrain it?24

Model Openness Framework

Unclassified

Missing required components

Resources

GitHub Repository Research Paper

Key Features

Learnable adaptive tokenization: Replaces fixed k-mer/BPE/nucleotide vocabularies with token boundaries learned during pretraining, avoiding the fragmentation of motifs that rigid grids impose.

Dynamic chunking adapted for MLM: Brings H-Net-style dynamic chunking, previously used in autoregressive settings, into the masked-language-modeling regime used by bidirectional genomic encoders.

BiMamba-2 backbone with local attention: Combines bidirectional Mamba-2 state-space layers with local attention and bidirectional routing for efficient long-context sequence modeling.

Unsupervised boundary induction: A ratio-based regularizer induces adaptive token boundaries without sequence-level supervision.

Biologically aligned boundaries: Learned boundaries align with canonical promoter motifs and splice junctions, offering interpretability not available from fixed tokenizers.

Technical Details

Applications

Impact

LDARNet

Key Features

Technical Details

Applications

Impact

Citation

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Recent citations

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Top citations

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Citations

GitHub

Fields of citing research

Openness

Tags

Resources

LDARNet

Key Features

Technical Details

Applications

Impact

Citation

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Recent citations

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Top citations

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Citations

GitHub

Fields of citing research

Openness

Tags

Resources

LDARNet

#Key Features

#Technical Details

#Applications

#Impact

Citation

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Recent citations

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Top citations

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Related models

Citations

GitHub

Fields of citing research

Openness

Tags

Resources

LDARNet

#Key Features

#Technical Details

#Applications

#Impact

Citation

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Recent citations

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Top citations

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Related models

Citations

GitHub

Fields of citing research

Openness

Tags

Resources

Key Features

Technical Details

Applications

Impact

Key Features

Technical Details

Applications

Impact