bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
DNA & Gene

LDARNet

Independent Researcher

A 120M-parameter genomic foundation model that learns adaptive DNA token boundaries via H-Net-style dynamic chunking instead of fixed k-mer or byte-pair tokenization.

Released: June 2026
Parameters: 120 Million

How DNA is split into discrete units is one of the central, unresolved design choices in genomic foundation models. Fixed multi-nucleotide schemes such as k-mers and byte-pair encoding impose a rigid granularity that can fragment biologically meaningful elements such as promoter motifs or splice junctions, while single-nucleotide modeling is faithful but expensive over genome-scale context. LDARNet (DNA Adaptive Representation Network with Learnable Tokenization), introduced in a June 2026 arXiv preprint by Daria Ledneva and Denis Kuznetsov, replaces fixed tokenization with token boundaries that the model learns during pretraining.
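The fragmentation problem described above can be illustrated with a minimal sketch. It assumes non-overlapping 6-mer tokenization and a toy sequence containing one TATA-box-like motif; the sequence, the function names, and the choice of k are illustrative assumptions, not taken from the LDARNet preprint.

```python
def kmer_tokens(seq, k, offset=0):
    """Split seq into non-overlapping k-mers; returns (start, token) pairs."""
    return [(i, seq[i:i + k]) for i in range(offset, len(seq) - k + 1, k)]


def motif_is_split(seq, motif, k, offset=0):
    """True if the first occurrence of motif does not fit inside one k-mer."""
    start = seq.find(motif)
    if start < 0:
        raise ValueError("motif not found in sequence")
    end = start + len(motif)
    for tok_start, _ in kmer_tokens(seq, k, offset):
        if tok_start <= start and end <= tok_start + k:
            return False
    return True


# Toy sequence (hypothetical): the motif TATAAA starts at index 4.
seq = "GCGCTATAAAGGCTCAGT"
motif = "TATAAA"

print([tok for _, tok in kmer_tokens(seq, 6)])   # ['GCGCTA', 'TAAAGG', 'CTCAGT']
print(motif_is_split(seq, motif, k=6, offset=0))  # True: motif straddles two tokens
print(motif_is_split(seq, motif, k=6, offset=4))  # False: grid happens to line up
```

Whether the motif survives as a single token depends only on where the fixed grid happens to fall, which is the arbitrariness that learned boundaries aim to remove.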

LDARNet adapts the H-Net-style dynamic chunking mechanism, originally framed for autoregressive modeling, to the masked language modeling (MLM) objective that underpins most bidirectional genomic encoders. A ratio-based regularizer induces adaptive token boundaries without any sequence-level supervision, so the model allocates fine resolution where the sequence warrants it and compresses elsewhere. The 120M-parameter model is pretrained on DNA via MLM; the resulting checkpoint then serves as the fixed starting point for fine-tuning on 27 downstream genomic tasks drawn from the Nucleotide Transformer and Genomic Benchmarks suites. Architecturally it sits closest to dnaHNet, a distinct tokenizer-free genomic model, but targets the encoder/MLM setting rather than autoregressive generation.
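As a rough sketch of what dynamic chunking with a ratio-based regularizer can look like, the code below follows the formulation in the original H-Net paper as best understood here: a boundary probability p_t = 0.5 * (1 - cos(h_t, h_{t-1})) and a ratio loss N/(N-1) * ((N-1)·F·G + (1-F)(1-G)), where F is the fraction of hard boundaries, G the mean boundary probability, and N the target compression ratio. LDARNet's exact regularizer and its bidirectional routing are not specified on this page, so everything here is an assumption: the identity projections, the threshold of 0.5, the one-directional neighbor comparison, and all function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def boundary_probs(states):
    """p_t = 0.5 * (1 - cos(h_t, h_{t-1})); the first position always opens a chunk."""
    probs = [1.0]
    for prev, cur in zip(states, states[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs


def ratio_loss(probs, target_ratio, threshold=0.5):
    """H-Net-style ratio loss; equals 1.0 when F = G = 1 / target_ratio."""
    n = target_ratio
    hard = [1.0 if p >= threshold else 0.0 for p in probs]
    f = sum(hard) / len(hard)   # fraction of positions chosen as boundaries
    g = sum(probs) / len(probs)  # mean boundary probability
    return n / (n - 1) * ((n - 1) * f * g + (1 - f) * (1 - g))


def chunk(seq, probs, threshold=0.5):
    """Group symbols into chunks, starting a new chunk at each boundary."""
    chunks = []
    for base, p in zip(seq, probs):
        if p >= threshold or not chunks:
            chunks.append(base)
        else:
            chunks[-1] += base
    return chunks


# Toy hidden states (hypothetical): direction flips between positions 1 and 2.
states = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
probs = boundary_probs(states)
print(probs)
print(chunk("ACGT", probs))
print(ratio_loss(probs, target_ratio=2))            # at the target rate
print(ratio_loss([1.0, 1.0, 1.0, 1.0], target_ratio=2))  # no compression: larger loss
```

In this toy case the loss is lowest when half the positions are boundaries (the target ratio of 2) and grows when every position is a boundary, which is how a regularizer of this kind can steer the compression rate without any labels on where boundaries should go.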

#Key Features

  • Learnable adaptive tokenization: Replaces fixed k-mer/BPE/nucleotide vocabularies with token boundaries learned during pretraining, avoiding the fragmentation of motifs that rigid grids impose.
  • Dynamic chunking adapted for MLM: Brings H-Net-style dynamic chunking, previously used in autoregressive settings, into the masked-language-modeling regime used by bidirectional genomic encoders.
  • BiMamba-2 backbone with local attention: Combines bidirectional Mamba-2 state-space layers with local attention and bidirectional routing for efficient long-context sequence modeling.
  • Unsupervised boundary induction: A ratio-based regularizer induces adaptive token boundaries without sequence-level supervision.
  • Biologically aligned boundaries: Learned boundaries align with canonical promoter motifs and splice junctions, offering interpretability not available from fixed tokenizers.
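One simple way to quantify the "biologically aligned boundaries" claim is to measure how often a known motif starts at, or within a small tolerance of, a token boundary. The sketch below is a hypothetical metric for illustration only; the preprint's actual analysis procedure is not described on this page, and the sequence and both boundary sets are made up.

```python
def find_all(seq, motif):
    """Return the start index of every (possibly overlapping) motif occurrence."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits


def boundary_alignment(seq, motif, boundaries, tol=0):
    """Fraction of motif occurrences with a boundary within tol bases of their start."""
    starts = find_all(seq, motif)
    if not starts:
        return 0.0
    bset = set(boundaries)
    aligned = sum(
        any((s + d) in bset for d in range(-tol, tol + 1)) for s in starts
    )
    return aligned / len(starts)


# Toy data (hypothetical): TATAAA occurs at indices 2 and 11.
seq = "CCTATAAAGGGTATAAACC"
learned = [0, 2, 8, 12, 17]   # made-up "learned" boundaries
fixed_grid = [0, 6, 12, 18]   # boundaries of a non-overlapping 6-mer grid

for name, b in [("learned", learned), ("fixed", fixed_grid)]:
    print(name, boundary_alignment(seq, "TATAAA", b), boundary_alignment(seq, "TATAAA", b, tol=1))
```

A score like this only becomes meaningful when compared against a baseline such as a fixed grid or randomly placed boundaries at the same density.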

#Technical Details

LDARNet is a 120M-parameter genomic foundation model built on BiMamba-2 state-space layers interleaved with local attention and bidirectional routing, pretrained by masked language modeling on DNA. Its core contribution is a learnable adaptive tokenization scheme: an H-Net-style dynamic chunking module, adapted to the MLM objective, together with a ratio-based regularizer that induces adaptive token boundaries without sequence-level supervision. After pretraining, the same fixed checkpoint is fine-tuned on 27 downstream tasks from the Nucleotide Transformer and Genomic Benchmarks suites. Among compact models (under 300M parameters), the authors report wins on 11 of 18 tasks, with state-of-the-art results on 5 histone-modification tasks. The model is reported to outperform models up to 20× larger and to beat fixed-grid tokenization by as much as 14 percentage points at equivalent compute. The authors' nucleotide-resolution analysis shows the learned boundaries coinciding with promoter motifs and splice junctions.
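To put the reported figures in scale, the short calculation below converts them into a win rate and an approximate parameter count for the largest compared models. It uses only the numbers quoted above; reading "20× larger" as 20 times LDARNet's own 120M parameters is an assumption.

```python
LDARNET_PARAMS = 120_000_000   # 120M parameters, as stated above
WINS, TASKS_COMPARED = 11, 18  # reported wins among compact (<300M) models
SIZE_FACTOR = 20               # "up to 20x larger"

win_rate = WINS / TASKS_COMPARED
largest_compared_params = SIZE_FACTOR * LDARNET_PARAMS

print(f"win rate among compact models: {win_rate:.1%}")
print(f"largest compared model: ~{largest_compared_params / 1e9:.1f}B parameters")
```

Under that reading, the comparison reaches models of roughly 2.4B parameters, and the compact-model win rate is a little over 60%.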

#Applications

LDARNet targets functional genomics workflows where the granularity of sequence tokenization matters: histone-modification prediction, regulatory-element and promoter analysis, splice-site detection, and the broader battery of Nucleotide Transformer and Genomic Benchmarks tasks. Its compact size makes it attractive for groups that need competitive accuracy without the compute footprint of multi-billion-parameter genomic models, and the interpretable, motif-aligned token boundaries can help researchers reason about which sequence elements drive a prediction.

#Impact

LDARNet strengthens the case that learned, adaptive tokenization can outperform fixed schemes for genomic encoders, extending dynamic chunking from autoregressive models into the masked-language-modeling setting that dominates DNA representation learning. By matching or beating much larger models on histone-modification tasks while staying under 300M parameters, it argues that tokenization, rather than raw scale, is a lever for genomic performance. It is a June 2026 preprint slated for ICML 2026, with code and weights expected to be released by July 2026, so its independent benchmark standing and adoption remain to be established.

Citation

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Preprint

Ledneva, D. & Kuznetsov, D. (2026) LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling.

DOI: 10.48550/arXiv.2606.04552

Citations

Total Citations: 1
Influential: 0
References: 20

GitHub

Stars: 0
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 7d ago

Openness

bio.rodeo openness: 26 (Closed · low usability and reproducibility)
Usability (can I run it?): 15
Reproducibility (can I retrain it?): 24
Model Openness Framework: Unclassified (missing required components)

Tags

dna, foundation_model, gene_expression, genomics, representation_learning, self_supervised, state_space_model, transformer, variant_effect_prediction

Resources

GitHub Repository
Research Paper