bio.rodeo
DNA & Gene

D3LM

Renmin University of China

A bidirectional masked discrete diffusion language model over DNA, initialized from Nucleotide Transformer v2, that unifies DNA understanding and generation.

Released: March 2026
Parameters: 55.9 Million

D3LM (Discrete DNA Diffusion Language Model) is a DNA foundation model that unifies bidirectional sequence understanding and de novo sequence generation within a single architecture. Most genomic language models fall into one of two camps: bidirectional masked encoders such as the Nucleotide Transformer family, which produce rich representations for classification and variant scoring but cannot generate sequences, and left-to-right autoregressive models, which can sample new DNA but read context in only one direction. D3LM bridges this gap by training a masked discrete diffusion objective in nucleotide space, so the same model both encodes bidirectional context and generates DNA by iteratively denoising masked tokens.

The model was introduced by Zhao Yang, Hengchang Liu, Chuan Cao, and Bing Su of the Gaoling School of Artificial Intelligence at Renmin University of China, in a preprint posted to arXiv in March 2026 and accepted as a workshop paper at MLGenX 2026. Rather than design a new backbone from scratch, the authors build on the proven Nucleotide Transformer v2 encoder and reformulate its training as discrete diffusion, demonstrating that an established bidirectional encoder can be converted into a capable generative model.

The central result is a substantial improvement in unconditional regulatory-element generation: D3LM reports a Sequence-FID (SFID; lower is better) of 10.92, compared with 29.16 for a prior autoregressive baseline and a reference value of 7.85 computed on real genomic DNA.
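For orientation, SFID can be read by analogy with the Fréchet Inception Distance used for images. Assuming it follows the same recipe (the source does not spell out the formula or the feature extractor), generated and real sequences are embedded by a fixed feature model, each set of embeddings is summarized by a Gaussian, and the score is the Fréchet distance between the two Gaussians:

```latex
\mathrm{SFID} \;=\; \lVert \mu_g - \mu_r \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_g + \Sigma_r - 2\,(\Sigma_g \Sigma_r)^{1/2} \right)
```

Here $\mu_g, \Sigma_g$ and $\mu_r, \Sigma_r$ denote the mean and covariance of the generated and real embeddings; these symbols are introduced for illustration only. Under this reading, a smaller value means the generated sequences have feature statistics closer to those of real DNA.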

#Key Features

  • Unified understanding and generation: A single masked discrete diffusion objective supports both bidirectional representation learning and de novo DNA sampling, removing the usual trade-off between encoder-style and decoder-style genomic models.
  • Bidirectional generation via diffusion: Sequences are produced by iteratively unmasking tokens with full bidirectional context at every step, rather than committing to a single left-to-right pass, which better respects the non-causal structure of regulatory DNA.
  • Built on Nucleotide Transformer v2: The released D3LM-from-nt checkpoint is initialized from nucleotide-transformer-v2-50m-multi-species and fine-tuned with the diffusion objective, while a D3LM-scratch variant is trained from random initialization for comparison.
  • Flexible decoding strategies: Generation supports configurable temperature, nucleus sampling, and multiple unmasking schedules (random, entropy, maskgit_plus, topk_margin, p2), giving control over sample diversity and fidelity.
  • Open weights: Both checkpoints are released on HuggingFace under the Apache 2.0 license.
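
The iterative unmasking loop described above can be sketched in a few lines. This is a toy illustration, not D3LM's implementation: `toy_denoiser` is a hypothetical stand-in that returns random probabilities and ignores context, the alphabet is single nucleotides rather than the model's tokenizer, and only two schedules (`random` and `entropy`) are mimicked, with the entropy rule (reveal the most confident positions first) being an assumption about what that option does.

```python
import math
import random

MASK = "<mask>"
VOCAB = ["A", "C", "G", "T"]  # toy alphabet, not the model's real vocabulary


def toy_denoiser(tokens, rng):
    """Hypothetical stand-in for the network: one probability vector per position."""
    probs = []
    for _ in tokens:
        weights = [rng.random() + 1e-3 for _ in VOCAB]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def entropy(p):
    """Shannon entropy of one probability vector (lower = more confident)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def sample_token(p, temperature, rng):
    """Sample from p after temperature scaling; random.choices renormalizes."""
    weights = [x ** (1.0 / temperature) for x in p]
    return rng.choices(VOCAB, weights=weights, k=1)[0]


def generate(length=24, steps=6, temperature=1.0, schedule="entropy", seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length  # start from a fully masked sequence
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        probs = toy_denoiser(tokens, rng)
        # Spread the remaining masked positions evenly over the remaining steps.
        n_reveal = math.ceil(len(masked) / (steps - step))
        if schedule == "entropy":
            masked.sort(key=lambda i: entropy(probs[i]))  # most confident first
        else:
            rng.shuffle(masked)  # "random" schedule
        for i in masked[:n_reveal]:
            tokens[i] = sample_token(probs[i], temperature, rng)
    return "".join(tokens)


sample = generate(length=24, steps=6, temperature=0.8, schedule="entropy")
```

Because every step re-scores all positions given the tokens already revealed, a real denoiser would condition each new token on context from both sides, which is the property the bullet on bidirectional generation refers to.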

#Technical Details

D3LM is a roughly 56M-parameter transformer encoder (about 50M trainable parameters) with 12 layers, a hidden size of 512, 16 attention heads, rotary positional embeddings, and a maximum context of 2,048 tokens over a 4,107-token vocabulary. Training uses a masked diffusion objective on mammalian DNA: tokens are corrupted by masking according to a noise schedule, and the model learns to recover them. At inference time the process is run in reverse, generating sequences from a fully masked state. The D3LM-from-nt checkpoint warm-starts from the Nucleotide Transformer v2 50M multi-species encoder before diffusion fine-tuning; D3LM-scratch trains the same architecture from random initialization. On unconditional regulatory-element generation, D3LM achieves an SFID of 10.92 versus 29.16 for a comparable autoregressive approach, with a real-DNA reference of 7.85, indicating generated sequences whose feature statistics are markedly closer to genuine regulatory DNA.
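
The training objective can be illustrated with a minimal sketch of one masked-diffusion loss evaluation. Several details here are assumptions rather than facts from the source: the noise level `t` is drawn uniformly, each token is masked independently with probability `t`, and the cross-entropy on masked positions is weighted by `1/t`, a weighting used in some masked diffusion language models. The names `MASK_ID`, `corrupt`, `uniform_predictor`, and `masked_diffusion_loss` are hypothetical, and the predictor is a placeholder for the transformer.

```python
import math
import random

MASK_ID = -1      # hypothetical id for the mask token
VOCAB_SIZE = 4    # toy vocabulary; the model card reports 4,107 tokens


def corrupt(token_ids, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in token_ids]


def uniform_predictor(noisy_ids):
    """Placeholder for the network: a uniform distribution at every position."""
    return [[1.0 / VOCAB_SIZE] * VOCAB_SIZE for _ in noisy_ids]


def masked_diffusion_loss(token_ids, predict_proba, rng, t=None):
    """Cross-entropy on masked positions only, weighted by 1/t (assumed weighting)."""
    if t is None:
        # Lower bound keeps the 1/t weight from blowing up (an assumption).
        t = rng.uniform(0.05, 1.0)
    noisy = corrupt(token_ids, t, rng)
    probs = predict_proba(noisy)
    nll = 0.0
    for clean, seen, p in zip(token_ids, noisy, probs):
        if seen == MASK_ID:  # unmasked positions contribute nothing
            nll -= math.log(p[clean])
    return nll / (t * len(token_ids))


rng = random.Random(0)
sequence = [rng.randrange(VOCAB_SIZE) for _ in range(64)]
loss = masked_diffusion_loss(sequence, uniform_predictor, rng)
```

With `t=1.0` every position is masked, and the uniform placeholder gives a loss of exactly `log(VOCAB_SIZE)`, which is a convenient sanity check before swapping in a real model.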

#Applications

D3LM is aimed at researchers in regulatory genomics and synthetic biology who need to both analyze and design DNA. Its generative side supports de novo design of regulatory elements such as promoters and enhancers, where producing sequences whose statistical properties match real genomic DNA is a prerequisite for downstream synthesis and screening. Because the same model retains a bidirectional encoder, its representations can also be applied to standard understanding tasks such as functional annotation and variant analysis, letting a single model serve both design and characterization workflows. The Apache-2.0 HuggingFace checkpoints can be loaded directly and fine-tuned on task- or organism-specific data.

#Impact

D3LM contributes to a growing line of work showing that discrete diffusion is a practical route to generative DNA models that retain the bidirectional context lost by autoregressive approaches. By converting an established bidirectional encoder, the Nucleotide Transformer v2, into a generator rather than designing a bespoke architecture, it offers a reproducible template for upgrading existing genomic encoders with generation capability. The reported gains on regulatory-element generation (SFID 10.92 versus 29.16) are notable. However, as a recent workshop preprint, the model has a limited published benchmark suite, modest scale (~56M parameters), a 2,048-token context, and training restricted to mammalian DNA; broader evaluation and larger-scale variants remain open directions. No standalone GitHub code repository has been confirmed; the released artifacts are the HuggingFace checkpoints.

Tags

sequence_generation, regulatory_element_design, diffusion, transformer, foundation_model, generative, self_supervised, genomics, dna