D3 (DNA Discrete Diffusion)

Discrete diffusion model that designs regulatory DNA with tunable cell-type-specific activity and learns activity-predictive representations.

Released: April 2026

D3 (DNA Discrete Diffusion) is a generative framework for designing cis-regulatory DNA sequences with targeted, tunable regulatory activity. Developed by Anirban Sarkar, Peter K. Koo, and colleagues in the Koo Lab at Cold Spring Harbor Laboratory, D3 was first posted to bioRxiv in May 2024 and substantially revised through a third version in April 2026. It addresses a central goal of synthetic biology and functional genomics: generating novel enhancer and promoter sequences that drive a desired level of gene expression in a specific cellular context, rather than merely classifying or scoring existing genomic sequences.

D3 adapts score-entropy discrete diffusion (SEDD) to genomics, generating sequences directly in discrete nucleotide space through an iterative process that refines nucleotide transitions while conditioning on a target activity. This contrasts with earlier DNA diffusion approaches that operate in continuous embedding or one-hot space and often fail to reproduce the combinatorial motif grammar of real regulatory elements. By denoising in nucleotide space, D3 produces sequences that closely recapitulate cell-type-specific activity distributions and realistic transcription-factor motif organization.
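To make "denoising in nucleotide space" concrete, the sketch below shows only the forward (noising) half of a discrete diffusion process over the four-letter DNA alphabet. It assumes a uniform transition rate between nucleotides, one of the noise families used in score-entropy discrete diffusion; the noise family and schedule D3 actually uses are not stated here, and the names `transition_probs` and `corrupt` are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def transition_probs(sigma: float, n_states: int = 4) -> tuple[float, float]:
    """Return (p_stay, p_each_other) after total noise `sigma`.

    Assumes a uniform rate matrix Q = (1/N) * ones - I, for which
    exp(sigma * Q) = exp(-sigma) * I + (1 - exp(-sigma)) / N * ones.
    """
    decay = math.exp(-sigma)
    p_each_other = (1.0 - decay) / n_states
    p_stay = decay + p_each_other
    return p_stay, p_each_other


def corrupt(seq: str, sigma: float, rng: random.Random) -> str:
    """Independently perturb each nucleotide according to the uniform kernel."""
    p_stay, _ = transition_probs(sigma)
    noisy = []
    for base in seq:
        if rng.random() < p_stay:
            noisy.append(base)
        else:
            # The remaining probability mass is split evenly over the other three bases.
            noisy.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
    return "".join(noisy)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "ACGTACGTACGTACGTACGT"
    for sigma in (0.0, 0.5, 2.0, 10.0):
        print(f"sigma={sigma:>4}: {corrupt(clean, sigma, rng)}")
```

At `sigma = 0` the sequence is unchanged, and as `sigma` grows each position approaches a uniform draw over A, C, G, and T. A trained model would learn to reverse this corruption step by step, optionally conditioned on a target activity.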

Beyond generation, D3 doubles as a representation learner. When trained on regulatory sequence collections without any activity labels, its frozen internal features are predictive of regulatory activity, competing with or surpassing genomic language models and supervised models trained on naive one-hot encodings.
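The usual way to test frozen features of this kind is a linear probe: the generative model stays fixed and only a small linear map from its features to measured activity is fitted. The sketch below illustrates that idea with synthetic stand-in features, since how embeddings are extracted from D3 is not described here. The function names and the plain gradient-descent fit are assumptions for illustration, not the authors' evaluation code.

```python
import random


def predict(weights, bias, x):
    """Linear readout: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def fit_linear_probe(features, targets, lr=0.1, epochs=2000, l2=0.0):
    """Fit a linear probe by full-batch gradient descent on squared error.

    `features` stands in for pooled, frozen model embeddings (one vector per
    sequence); `targets` stands in for measured regulatory activity.
    """
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, targets):
            err = predict(weights, bias, x) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        # Average the gradient over the batch and apply optional L2 shrinkage.
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic placeholder data; real use would substitute frozen embeddings.
    X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
    y = [2.0 * x[0] - 1.0 * x[1] + 0.5 for x in X]
    w, b = fit_linear_probe(X, y)
    print("weights:", [round(v, 3) for v in w], "bias:", round(b, 3))
```

Because only the readout is trained, the probe's held-out accuracy reflects how much activity information the frozen features already carry.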

Key Features

Tunable conditional generation: Samples regulatory sequences conditioned on scalar activity targets, full activity profiles, or categorical cell-type labels, enabling design toward a specified expression level.
Discrete diffusion in nucleotide space: Extends score-entropy discrete diffusion (SEDD) to DNA, iteratively refining nucleotide transitions to capture combinatorial cis-regulatory grammars that continuous diffusion baselines miss.
Strong frozen representations: Learns activity-predictive embeddings even without conditioning labels, matching or outperforming genomic language models and one-hot supervised baselines on downstream prediction.
Data augmentation and low-data robustness: Maintains performance in low-data regimes and improves downstream supervised models when its generated sequences are used to augment training data.
Comprehensive benchmarking: Introduces an evaluation framework spanning functional activity, sequence composition, synthetic-versus-real discriminability, and memorization.
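As one illustration of the sequence-composition axis listed above, generated and real sequences can be compared by their k-mer frequency spectra. The sketch below uses the Jensen-Shannon divergence between 3-mer distributions. This is a common choice assumed here for illustration; the specific metrics in D3's benchmark are not spelled out in this summary, and the function names are hypothetical.

```python
import math
from collections import Counter


def kmer_distribution(seqs, k=3):
    """Normalized k-mer frequencies pooled over a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no k-mers found; sequences are shorter than k")
    return {kmer: c / total for kmer, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    divergence = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return divergence


if __name__ == "__main__":
    real = ["ACGTACGTAC", "TTGACGTCAA"]        # placeholder sequences
    generated = ["ACGTTCGTAC", "TTGACGACAA"]   # placeholder sequences
    score = js_divergence(kmer_distribution(real), kmer_distribution(generated))
    print(f"3-mer JS divergence: {score:.4f}")
```

A low divergence indicates that generated sequences share the short-range composition of the reference set. It says nothing about function or memorization, which is why the framework evaluates those axes separately.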

Technical Details

D3 is implemented in PyTorch with both transformer (D3-Tran) and convolutional (D3-Conv) backbones; the transformer variant (12 blocks, 12 heads, hidden size 768) generally performs best. Sampling uses predictor-corrector schemes with Euler and analytic predictors plus a denoiser, and supports motif inpainting.

Models are trained across multiple regulatory datasets: DeepSTARR (249 bp Drosophila developmental and housekeeping enhancers), lentiMPRA (230 bp sequences in K562, HepG2, and WTC11), MPRA (200 bp), and promoter sequences (1024 bp). Activity is scored against task-specific oracle models such as DeepSTARR, MPRALegNet, and Sei.

Critically, D3-designed sequences were validated experimentally using lentiMPRA in K562 cells, where they retained measurable regulatory activity and more closely matched the activity distribution of genomic sequences than matched diffusion baselines.

Pretrained model weights and preprocessed datasets are released on HuggingFace under an MIT license.
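The architecture and dataset figures quoted above can be collected into a small configuration sketch. The class and field names below are hypothetical and do not come from the D3 codebase; only the numeric values (12 blocks, 12 heads, hidden size 768, and the per-dataset sequence lengths) are taken from the description. The four-token vocabulary is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerBackboneConfig:
    """Hypothetical container for the reported D3-Tran hyperparameters."""

    n_blocks: int = 12
    n_heads: int = 12
    hidden_size: int = 768
    vocab_size: int = 4  # assumption: A, C, G, T with no extra special tokens

    @property
    def head_dim(self) -> int:
        """Per-head width; the hidden size must divide evenly across heads."""
        if self.hidden_size % self.n_heads != 0:
            raise ValueError("hidden_size must be divisible by n_heads")
        return self.hidden_size // self.n_heads


# Sequence lengths (bp) for the training datasets named in the text.
SEQUENCE_LENGTH_BP = {
    "DeepSTARR": 249,
    "lentiMPRA": 230,
    "MPRA": 200,
    "promoter": 1024,
}

if __name__ == "__main__":
    cfg = TransformerBackboneConfig()
    print(f"head_dim = {cfg.head_dim}")
    for name, length in SEQUENCE_LENGTH_BP.items():
        print(f"{name}: {length} bp")
```

With these values each attention head is 64-dimensional, and the promoter task involves sequences roughly four to five times longer than the enhancer and MPRA tasks.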

Applications

D3 supports the rational design of synthetic enhancers and promoters for gene therapy, cell engineering, and synthetic biology, where elements must drive expression at a defined level in a particular cell type. Researchers can use it to generate diverse candidate regulatory sequences for massively parallel reporter assay (MPRA) screens, to study the sequence features underlying context-specific activity, and as a representation learner or data-augmentation engine that strengthens downstream activity-prediction models, especially when labeled data are scarce.

Impact

D3 is among the first applications of score-entropy discrete diffusion to regulatory genomics, demonstrating that nucleotide-space diffusion can both design functional regulatory DNA and yield representations competitive with genomic language models. By pairing a principled generative framework with experimental lentiMPRA validation and a reusable benchmarking suite, the work establishes a practical and rigorously evaluated route to designing tunable cis-regulatory elements, and has informed follow-on analyses of the generative dynamics of DNA diffusion models. Its main limitations are short generation lengths and reliance on oracle models for activity scoring.

Citation

Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion

Preprint

Sarkar, A., et al. (2026) Designing DNA With Tunable Regulatory Activity Using Discrete Diffusion. bioRxiv.

DOI: 10.1101/2024.05.23.595630

Citations

Total Citations: 0

Influential: 0

References: 53

GitHub

Stars: 17

Forks: 9

Open Issues: 1

Contributors: 3

Last Push: 1mo ago

Language: Python

Openness

bio.rodeo openness: Fully open · usable and reproducible

49 Partial

Usability (can I run it?): 59

Reproducibility (can I retrain it?): 52

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository · Research Paper
