D3LM

DNA foundation model using masked discrete diffusion to unify bidirectional sequence understanding and de novo generation in one architecture.

Released: March 2026

Parameters: 55.9 Million

D3LM (Discrete DNA Diffusion Language Model) is a DNA foundation model that unifies bidirectional sequence understanding and de novo sequence generation within a single architecture. Most genomic language models fall into one of two camps. Bidirectional masked encoders, such as the Nucleotide Transformer family, produce rich representations for classification and variant scoring but cannot generate sequences. Left-to-right autoregressive models can sample new DNA but read context in only one direction. D3LM bridges this gap by training with a masked discrete diffusion objective in nucleotide space, so the same model both encodes bidirectional context and generates DNA by iteratively denoising masked tokens.

The model was introduced by Zhao Yang, Hengchang Liu, Chuan Cao, and Bing Su of the Gaoling School of Artificial Intelligence at Renmin University of China, in a preprint posted to arXiv in March 2026 and accepted as a workshop paper at MLGenX 2026. Rather than design a new backbone from scratch, the authors build on the proven Nucleotide Transformer v2 encoder and reformulate its training as discrete diffusion, demonstrating that an established bidirectional encoder can be converted into a capable generative model.

The central result is a substantial improvement in unconditional regulatory-element generation: D3LM reports a Sequence-FID (SFID) of 10.92, compared with 29.16 for a prior autoregressive baseline, approaching the 7.85 reference computed on real genomic DNA.
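
For readers unfamiliar with FID-style metrics, the expression below is the standard Fréchet distance between two Gaussians fitted to feature embeddings. It is given here on the assumption that SFID follows the same construction as image FID, applied to sequence embeddings; the source does not state the feature extractor or the exact definition, so the symbols are illustrative. Here \mu_r, \Sigma_r are the mean and covariance of features from real sequences and \mu_g, \Sigma_g those from generated sequences.

```latex
\mathrm{FD}(r, g) = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Under this reading, lower is better, and a score near the real-DNA reference means the generated set is about as close to real sequences as another sample of real sequences would be.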

Key Features

Unified understanding and generation: A single masked discrete diffusion objective supports both bidirectional representation learning and de novo DNA sampling, removing the usual trade-off between encoder-style and decoder-style genomic models.
Bidirectional generation via diffusion: Sequences are produced by iteratively unmasking tokens with full bidirectional context at every step, rather than committing to a single left-to-right pass, which better respects the non-causal structure of regulatory DNA.
Built on Nucleotide Transformer v2: The released D3LM-from-nt checkpoint is initialized from nucleotide-transformer-v2-50m-multi-species and fine-tuned with the diffusion objective, while a D3LM-scratch variant is trained from random initialization for comparison.
Flexible decoding strategies: Generation supports configurable temperature, nucleus sampling, and multiple unmasking schedules (random, entropy, maskgit_plus, topk_margin, p2), giving control over sample diversity and fidelity.
Open weights: Both checkpoints are released on HuggingFace under the Apache 2.0 license.
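
The decoding controls listed above (temperature, nucleus sampling, and an unmasking schedule) can be illustrated with a toy loop. This is a minimal sketch, not the released implementation: `toy_denoiser` is a hypothetical stand-in for the network, the vocabulary is reduced to four nucleotides, and "entropy" is interpreted as "reveal the lowest-entropy positions first", which is an assumption about what that schedule name means.

```python
import math
import random

MASK = "<mask>"
VOCAB = ["A", "C", "G", "T"]  # toy vocabulary (assumption); the real model has 4,107 tokens


def toy_denoiser(tokens):
    """Hypothetical stand-in for the network: one logit vector per position.

    A real model would condition on unmasked tokens on both sides; this toy
    ignores them and only varies its confidence by position.
    """
    logits = []
    for i in range(len(tokens)):
        sharpness = 1.0 + (i % 3)
        logits.append([sharpness if j == i % len(VOCAB) else 0.0
                       for j in range(len(VOCAB))])
    return logits


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature must be positive."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def generate(length, steps, temperature=1.0, top_p=0.9, seed=0):
    """Start fully masked and reveal a share of the masked positions at each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        logits = toy_denoiser(tokens)  # re-run the model on the current state
        dists = {i: nucleus_filter(softmax(logits[i], temperature), top_p)
                 for i in masked}
        # Spread the remaining masks evenly over the remaining steps.
        n_reveal = math.ceil(len(masked) / (steps - step))
        # Entropy schedule (assumed meaning): most confident positions first.
        chosen = sorted(masked, key=lambda i: entropy(dists[i]))[:n_reveal]
        for i in chosen:
            tokens[i] = rng.choices(VOCAB, weights=dists[i])[0]
    return "".join(tokens)


print(generate(length=24, steps=6, seed=1))
```

Swapping the `sorted(...)` key for a random shuffle would give a "random" schedule; the other schedule names in the list are not sketched here because their definitions are not given in the source.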

Technical Details

D3LM is a transformer encoder with roughly 56M parameters in total (about 50M of them trainable): 12 layers, a hidden size of 512, 16 attention heads, rotary positional embeddings, and a maximum context of 2,048 tokens over a 4,107-token vocabulary. Training uses a masked diffusion objective on mammalian DNA: tokens are corrupted by masking according to a noise schedule, and the model learns to recover them. At inference time the process is run in reverse, generating sequences from a fully masked state. The D3LM-from-nt checkpoint warm-starts from the Nucleotide Transformer v2 50M multi-species encoder before diffusion fine-tuning; D3LM-scratch trains the same architecture from random initialization. On unconditional regulatory-element generation, D3LM achieves an SFID of 10.92 versus 29.16 for a comparable autoregressive approach, against a real-DNA reference of 7.85, indicating that the feature statistics of generated sequences are markedly closer to those of genuine regulatory DNA.
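
The corrupt-then-recover training step described above can be sketched as follows. This assumes a common masked-diffusion formulation (a noise level t drawn from (0, 1], each token masked independently with probability t, and cross-entropy on masked positions weighted by 1/t); the source does not specify D3LM's noise schedule or loss weighting, and `MASK_ID`, `corrupt`, and `masked_diffusion_loss` are hypothetical names.

```python
import math
import random

MASK_ID = 0  # hypothetical id of the mask token; clean tokens must use other ids


def corrupt(token_ids, t, rng):
    """Forward process: replace each token with MASK_ID independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in token_ids]


def masked_diffusion_loss(token_ids, noisy_ids, log_probs, t):
    """Cross-entropy on masked positions only, weighted by 1/t and averaged over length.

    log_probs[i][v] is the model's log-probability of token v at position i,
    computed from the corrupted sequence (assumed to come from the network).
    """
    total = 0.0
    for i, (clean, noisy) in enumerate(zip(token_ids, noisy_ids)):
        if noisy == MASK_ID:
            total -= log_probs[i][clean]
    return total / (t * len(token_ids))


rng = random.Random(0)
clean = [5, 6, 7, 8, 9, 5, 6, 7]   # toy token ids
t = 0.5                            # noise level for this example
noisy = corrupt(clean, t, rng)

# A uniform predictor over a toy vocabulary of 10 tokens stands in for the model.
vocab_size = 10
uniform = [[math.log(1.0 / vocab_size)] * vocab_size for _ in clean]
print(noisy, masked_diffusion_loss(clean, noisy, uniform, t))
```

Only masked positions contribute to the loss, so unmasked tokens act purely as bidirectional context, which is what lets the same network serve as an encoder and as a denoiser.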

Applications

D3LM is aimed at researchers in regulatory genomics and synthetic biology who need to both analyze and design DNA. Its generative side supports de novo design of regulatory elements such as promoters and enhancers, where producing sequences whose statistical properties match real genomic DNA is a prerequisite for downstream synthesis and screening. Because the same model retains a bidirectional encoder, its representations can also be applied to standard understanding tasks such as functional annotation and variant analysis, letting a single model serve both design and characterization workflows. The Apache-2.0 HuggingFace checkpoints can be loaded directly and fine-tuned on task- or organism-specific data.

Impact

D3LM contributes to a growing line of work showing that discrete diffusion is a practical route to generative DNA models that retain the bidirectional context lost by autoregressive approaches. By converting an established bidirectional encoder, the Nucleotide Transformer v2, into a generator rather than designing a bespoke architecture, it offers a reproducible template for upgrading existing genomic encoders with generation capability. The reported gains on regulatory-element generation (SFID 10.92 versus 29.16) are notable, though as a recent workshop preprint the model has a limited published benchmark suite, modest scale (~56M parameters), a 2,048-token context, and training restricted to mammalian DNA; broader evaluation and larger-scale variants remain open directions. No standalone GitHub code repository has been confirmed; the released artifacts are the HuggingFace checkpoints.

Citation

D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation

Preprint

Yang, Z., et al. (2026) D3LM: A Discrete DNA Diffusion Language Model for Bidirectional DNA Understanding and Generation.

DOI: 10.48550/arXiv.2603.01780

Recent citations

Papers that recently cited this model.

LPDP: Inference-Time Reward Control for Variable-Length DNA Generation with Edit Flows
Jeongchan Kim, Yunkyung Ko, Jong Chul Ye
May 2026
Citations: 0

Citations

Total Citations: 1
Influential: 0
References: 52

HuggingFace

Downloads: 34
Likes: 3
Last Modified: 4mo ago
Pipeline: text-generation

Fields of citing research

Biology: 100%
Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 58 (Partial)
Usability (can I run it?): 86
Reproducibility (can I retrain it?): 18

Model Openness Framework: Unclassified (missing required components)

Resources

Research Paper HuggingFace Model
