Generative discrete-diffusion model that designs regulatory DNA with tunable activity and learns activity-predictive representations rivaling genomic language models.
D3 (DNA Discrete Diffusion) is a generative framework for designing cis-regulatory DNA sequences with targeted, tunable regulatory activity. Developed by Anirban Sarkar, Peter K. Koo, and colleagues in the Koo Lab at Cold Spring Harbor Laboratory, D3 was first posted to bioRxiv in May 2024 and has since been substantially revised, reaching a third version in April 2026. It addresses a central goal of synthetic biology and functional genomics: generating novel enhancer and promoter sequences that drive a desired level of gene expression in a specific cellular context, rather than merely classifying or scoring existing genomic sequences.
D3 adapts score-entropy discrete diffusion (SEDD) to genomics, generating sequences directly in discrete nucleotide space through an iterative process that refines nucleotide transitions while conditioning on a target activity. This contrasts with earlier DNA diffusion approaches that operate in continuous embedding or one-hot space and often fail to reproduce the combinatorial motif grammar of real regulatory elements. By denoising in nucleotide space, D3 produces sequences that closely recapitulate cell-type-specific activity distributions and realistic transcription-factor motif organization.
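To make "diffusion in nucleotide space" concrete, the following is a minimal sketch of the forward (noising) half of a discrete diffusion process over the alphabet A, C, G, T. It assumes a uniform transition matrix, which is one of the standard choices in score-entropy discrete diffusion; the text above does not say which noise process D3 uses, so treat this as an illustration rather than D3's implementation. The names `stay_probability` and `corrupt` are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def stay_probability(sigma: float) -> float:
    """Probability that one position is unchanged after cumulative noise sigma.

    Assumes the uniform rate matrix Q = (1/4) * ones - I, for which
    exp(sigma * Q) = exp(-sigma) * I + (1 - exp(-sigma)) / 4 * ones.
    """
    decay = math.exp(-sigma)
    return decay + (1.0 - decay) / len(NUCLEOTIDES)


def corrupt(sequence: str, sigma: float, rng: random.Random) -> str:
    """Sample a noised sequence by perturbing each position independently."""
    decay = math.exp(-sigma)
    noised = []
    for base in sequence:
        # Keep the base with probability exp(-sigma); otherwise redraw it
        # uniformly from all four nucleotides (which may return the same base).
        if rng.random() < decay:
            noised.append(base)
        else:
            noised.append(rng.choice(NUCLEOTIDES))
    return "".join(noised)


if __name__ == "__main__":
    rng = random.Random(0)
    enhancer = "ACGTGACCTA"  # toy sequence, not taken from any dataset
    for sigma in (0.0, 0.5, 5.0):
        print(sigma, round(stay_probability(sigma), 3), corrupt(enhancer, sigma, rng))
```

At `sigma = 0` the sequence is returned unchanged, and as `sigma` grows each position approaches a uniform draw over the four nucleotides. A generative model of this family is trained to reverse that corruption, and the sequence remains a valid string of nucleotides at every noise level.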
Beyond generation, D3 doubles as a representation learner. When trained on regulatory sequence collections without any activity labels, its frozen internal features are predictive of regulatory activity, competing with or surpassing genomic language models and supervised models trained on naive one-hot encodings.
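One common way to test whether frozen features are predictive is a linear probe: pool the per-position features of each sequence into a single vector, then fit only a linear map from that vector to measured activity. The sketch below illustrates the idea with plain Python and toy numbers. The text above does not specify the probe or pooling used for D3, so mean pooling, gradient-descent fitting, and the function names `mean_pool` and `fit_linear_probe` are all assumptions.

```python
from typing import List, Sequence, Tuple


def mean_pool(per_position: Sequence[Sequence[float]]) -> List[float]:
    """Average per-position feature vectors into one fixed-length vector."""
    length = len(per_position)
    width = len(per_position[0])
    return [sum(row[j] for row in per_position) / length for j in range(width)]


def fit_linear_probe(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    lr: float = 0.1,
    steps: int = 5000,
) -> Tuple[List[float], float]:
    """Fit y ~ w.x + b by gradient descent on mean squared error.

    Only w and b are trained; the features are treated as fixed inputs,
    which is what "frozen" means in this setting.
    """
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, targets):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j in range(d):
                grad_w[j] += 2.0 * error * x[j] / n
            grad_b += 2.0 * error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


if __name__ == "__main__":
    # Toy stand-ins for pooled features and activities (not real data).
    pooled = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    activity = [2.0, -1.0, 1.0, 3.0]
    print(fit_linear_probe(pooled, activity))
```

Because the probe has so few parameters, good held-out performance is evidence that the information was already present in the frozen features, which is the sense in which a generative model can double as a representation learner.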
D3 is implemented in PyTorch with both transformer (D3-Tran) and convolutional (D3-Conv) backbones; the transformer variant (12 blocks, 12 heads, hidden size 768) generally performs best. Sampling uses predictor-corrector schemes with Euler and analytic predictors plus a denoiser, and supports motif inpainting.
Models are trained across multiple regulatory datasets: DeepSTARR (249 bp Drosophila developmental and housekeeping enhancers), lentiMPRA (230 bp sequences in K562, HepG2, and WTC11), MPRA (200 bp), and promoter sequences (1024 bp). Activity is scored against task-specific oracle models such as DeepSTARR, MPRALegNet, and Sei.
Critically, D3-designed sequences were validated experimentally using lentiMPRA in K562 cells, where they retained measurable regulatory activity and more closely matched the activity distribution of genomic sequences than matched diffusion baselines. Pretrained model weights and preprocessed datasets are released on Hugging Face under an MIT license.
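The Euler predictor mentioned above can be illustrated with a first-order reverse step for a continuous-time Markov chain over nucleotides. In this sketch, each position jumps to a different nucleotide with probability proportional to the step size, the forward jump rate, and a model-estimated probability ratio. The sketch assumes a uniform forward process and a hypothetical `ratio_fn` standing in for the trained network; it is not D3's sampler, and it omits the corrector, the denoiser, and activity conditioning.

```python
import random
from typing import Callable, List, Sequence

NUCLEOTIDES = "ACGT"


def euler_step_probs(current: int, ratios: Sequence[float], rate: float, dt: float) -> List[float]:
    """Transition probabilities for one position over a reverse step of size dt.

    ratios[y] is a (hypothetical) model estimate of p_t(y) / p_t(current);
    rate is the forward jump rate between any two distinct nucleotides.
    """
    probs = [0.0] * len(ratios)
    for y, ratio in enumerate(ratios):
        if y != current:
            # First-order approximation: P(jump to y) = dt * reverse rate.
            probs[y] = dt * rate * ratio
    jump_mass = sum(probs)
    if jump_mass > 1.0:
        # Step too large for the linearization; rescale to a valid distribution.
        probs = [p / jump_mass for p in probs]
        jump_mass = 1.0
    probs[current] = 1.0 - jump_mass
    return probs


def sample_step(
    sequence: str,
    ratio_fn: Callable[[str, int], Sequence[float]],
    rate: float,
    dt: float,
    rng: random.Random,
) -> str:
    """Update every position at once, using ratios computed from the current sequence."""
    updated = []
    for i, base in enumerate(sequence):
        probs = euler_step_probs(NUCLEOTIDES.index(base), ratio_fn(sequence, i), rate, dt)
        updated.append(rng.choices(NUCLEOTIDES, weights=probs)[0])
    return "".join(updated)


if __name__ == "__main__":
    # Placeholder "model": equal ratios everywhere, so every jump is equally likely.
    flat_ratios = lambda seq, i: [1.0, 1.0, 1.0, 1.0]
    print(sample_step("ACGTGACCTA", flat_ratios, rate=0.25, dt=0.1, rng=random.Random(0)))
```

Repeating such steps from high noise to low noise yields a sample. Smaller values of `dt` make the linear approximation more accurate at the cost of more network evaluations, which is the usual trade-off that corrector steps and analytic predictors are meant to ease.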
D3 supports the rational design of synthetic enhancers and promoters for gene therapy, cell engineering, and synthetic biology, where elements must drive expression at a defined level in a particular cell type. Researchers can use it to generate diverse candidate regulatory sequences for massively parallel reporter assay (MPRA) screens, to study the sequence features underlying context-specific activity, and as a representation learner or data-augmentation engine that strengthens downstream activity-prediction models, especially when labeled data are scarce.
D3 is among the first applications of score-entropy discrete diffusion to regulatory genomics, demonstrating that nucleotide-space diffusion can both design functional regulatory DNA and yield representations competitive with genomic language models. By pairing a principled generative framework with experimental lentiMPRA validation and a reusable benchmarking suite, the work establishes a practical and rigorously evaluated route to designing tunable cis-regulatory elements, and has informed follow-on analyses of the generative dynamics of DNA diffusion models. Its main limitations are short generation lengths and reliance on oracle models for activity scoring.