bio.rodeo
DNA & Gene

D3 (DNA Discrete Diffusion)

Cold Spring Harbor Laboratory

Generative discrete-diffusion model that designs regulatory DNA with tunable activity and learns activity-predictive representations rivaling genomic language models.

Released: April 2026

D3 (DNA Discrete Diffusion) is a generative framework for designing cis-regulatory DNA sequences with targeted, tunable regulatory activity. Developed by Anirban Sarkar, Peter K. Koo, and colleagues in the Koo Lab at Cold Spring Harbor Laboratory, D3 was first posted to bioRxiv in May 2024 and substantially revised through a third version in April 2026. It addresses a central goal of synthetic biology and functional genomics: generating novel enhancer and promoter sequences that drive a desired level of gene expression in a specific cellular context, rather than merely classifying or scoring existing genomic sequences.

D3 adapts score-entropy discrete diffusion (SEDD) to genomics, generating sequences directly in discrete nucleotide space through an iterative process that refines nucleotide transitions while conditioning on a target activity. This contrasts with earlier DNA diffusion approaches that operate in continuous embedding or one-hot space and often fail to reproduce the combinatorial motif grammar of real regulatory elements. By denoising in nucleotide space, D3 produces sequences that closely recapitulate cell-type-specific activity distributions and realistic transcription-factor motif organization.
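To make "diffusion in discrete nucleotide space" concrete, here is a minimal sketch of a forward (noising) process over the four-letter DNA alphabet. It assumes a uniform transition rate matrix, which is one of the standard choices in score-entropy discrete diffusion; the text above does not say which noise process D3 uses, and the names `stay_probability` and `corrupt` are illustrative rather than taken from the D3 code. The learned reverse process, which D3 additionally conditions on a target activity, is not shown.

```python
import math
import random

NUCLEOTIDES = "ACGT"
K = len(NUCLEOTIDES)


def stay_probability(sigma: float) -> float:
    """Probability that a nucleotide is unchanged after cumulative noise sigma.

    Assumes the uniform rate matrix Q = 11^T - K*I, for which exp(sigma * Q)
    has diagonal entries 1/K + (1 - 1/K) * exp(-K * sigma).
    """
    return 1.0 / K + (1.0 - 1.0 / K) * math.exp(-K * sigma)


def corrupt(seq: str, sigma: float, rng: random.Random) -> str:
    """Sample a noised sequence from the forward process, position by position."""
    p_stay = stay_probability(sigma)
    out = []
    for base in seq:
        if rng.random() < p_stay:
            out.append(base)
        else:
            # Jump to one of the other three nucleotides uniformly at random.
            out.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
    return "".join(out)


rng = random.Random(0)
clean = "ACGTACGTAGCTAGCTAGGA"
for sigma in (0.0, 0.1, 1.0, 10.0):
    print(sigma, round(stay_probability(sigma), 3), corrupt(clean, sigma, rng))
```

At `sigma = 0` the sequence is returned unchanged; as `sigma` grows, each position approaches a uniform draw over A, C, G, and T. A denoising model is trained to undo this corruption step by step, which is why generated samples remain valid nucleotide strings at every stage.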

Beyond generation, D3 doubles as a representation learner. When trained on regulatory sequence collections without any activity labels, its frozen internal features are predictive of regulatory activity, competing with or surpassing genomic language models and supervised models trained on naive one-hot encodings.
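The usual way to test whether frozen features are activity-predictive is a linear probe: pool the per-position features into one vector per sequence, then fit a simple linear model to the measured activity. The sketch below illustrates that recipe on synthetic data using only the standard library. `fake_frozen_features` is a hypothetical stand-in for hidden states extracted from a frozen model, and the ridge probe is an assumption about the evaluation setup, not a description of the authors' protocol.

```python
import random


def mean_pool(per_position):
    """Average per-position feature vectors into one fixed-length embedding."""
    n = len(per_position)
    return [sum(col) / n for col in zip(*per_position)]


def predict(w, x):
    """Linear prediction: dot product of probe weights and an embedding."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_ridge(X, y, lam=1e-3):
    """Ridge regression: solve (X^T X + lam*I) w = X^T y by Gaussian elimination."""
    d = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * t for x, t in zip(X, y)) for i in range(d)]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back-substitution on the upper-triangular system.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w


def fake_frozen_features(rng, length=5, dim=4):
    """Hypothetical stand-in for per-position hidden states of a frozen model."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
true_w = [0.5, -1.0, 2.0, 0.0]  # synthetic ground truth, for illustration only
embeddings = [mean_pool(fake_frozen_features(rng)) for _ in range(60)]
activity = [predict(true_w, e) for e in embeddings]

# Fit the probe on 50 sequences and evaluate on the 10 held out.
w = fit_ridge(embeddings[:50], activity[:50], lam=1e-6)
held_out_error = max(
    abs(predict(w, e) - a) for e, a in zip(embeddings[50:], activity[50:])
)
print(held_out_error)
```

Because the model's weights stay frozen and only the probe is trained, any predictive power measured this way is attributable to the representation itself rather than to task-specific fine-tuning.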

#Key Features

  • Tunable conditional generation: Samples regulatory sequences conditioned on scalar activity targets, full activity profiles, or categorical cell-type labels, enabling design toward a specified expression level.
  • Discrete diffusion in nucleotide space: Extends score-entropy discrete diffusion (SEDD) to DNA, iteratively refining nucleotide transitions to capture combinatorial cis-regulatory grammars that continuous diffusion baselines miss.
  • Strong frozen representations: Learns activity-predictive embeddings even when trained without activity labels, matching or outperforming genomic language models and one-hot supervised baselines on downstream prediction.
  • Data augmentation and low-data robustness: Maintains performance in low-data regimes and improves downstream supervised models when its generated sequences are used to augment training data.
  • Comprehensive benchmarking: Introduces an evaluation framework spanning functional activity, sequence composition, synthetic-versus-real discriminability, and memorization.
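Two of the benchmarking axes listed above, sequence composition and memorization, can be illustrated with simple statistics. The sketch below compares k-mer frequency distributions with Jensen-Shannon divergence and flags memorization by the highest positional identity to any training sequence. These are common choices for such checks and are offered as assumptions; the exact metrics in D3's evaluation framework may differ, and all function names here are hypothetical.

```python
import math
from collections import Counter


def kmer_distribution(seqs, k=3):
    """Relative frequency of every k-mer across a collection of sequences."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    m = {key: 0.5 * (p.get(key, 0.0) + q.get(key, 0.0)) for key in keys}

    def kl(a):
        return sum(pa * math.log2(pa / m[key]) for key, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def max_identity(seq, training_set):
    """Highest fraction of matching positions against equal-length training sequences."""
    return max(
        (sum(a == b for a, b in zip(seq, ref)) / len(seq)
         for ref in training_set if len(ref) == len(seq)),
        default=0.0,
    )


real = ["ACGTACGTAC", "TTGACGTCAA", "GGGACGTCCC"]
generated = ["ACGTTCGTAC", "TTGACGTGAA"]

composition_gap = js_divergence(kmer_distribution(real), kmer_distribution(generated))
memorization = [max_identity(g, real) for g in generated]
print(composition_gap, memorization)
```

A low composition gap with identity scores well below 1.0 is the desirable combination: generated sequences resemble real ones statistically without being near copies of the training data.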

#Technical Details

D3 is implemented in PyTorch with both transformer (D3-Tran) and convolutional (D3-Conv) backbones; the transformer variant (12 blocks, 12 heads, hidden size 768) generally performs best. Sampling uses predictor-corrector schemes with Euler and analytic predictors plus a denoiser, and supports motif inpainting.

Models are trained across multiple regulatory datasets: DeepSTARR (249 bp Drosophila developmental and housekeeping enhancers), lentiMPRA (230 bp sequences in K562, HepG2, and WTC11), MPRA (200 bp), and promoter sequences (1024 bp). Activity is scored against task-specific oracle models such as DeepSTARR, MPRALegNet, and Sei.

Critically, D3-designed sequences were validated experimentally using lentiMPRA in K562 cells, where they retained measurable regulatory activity and more closely matched the activity distribution of genomic sequences than matched diffusion baselines. Pretrained model weights and preprocessed datasets are released on HuggingFace under an MIT license.
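For readers who want the reported hyperparameters and sequence lengths in one place, the following sketch collects them in a small configuration object. The block, head, and hidden-size values and the per-dataset lengths come from the description above; the class name `D3TranConfig`, the field names, the four-token vocabulary, and `check_length` are hypothetical and do not mirror the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class D3TranConfig:
    """Reported D3-Tran hyperparameters (field names are illustrative)."""
    n_blocks: int = 12
    n_heads: int = 12
    hidden_size: int = 768
    vocab_size: int = 4  # assumption: one token per nucleotide (A, C, G, T)

    @property
    def head_dim(self) -> int:
        """Per-head width; hidden_size must divide evenly across heads."""
        if self.hidden_size % self.n_heads != 0:
            raise ValueError("hidden_size must be divisible by n_heads")
        return self.hidden_size // self.n_heads


# Sequence lengths in base pairs, as listed for each training dataset.
SEQ_LENGTHS = {
    "DeepSTARR": 249,
    "lentiMPRA": 230,
    "MPRA": 200,
    "promoter": 1024,
}


def check_length(dataset: str, seq: str) -> bool:
    """Return True if seq has the length the named dataset expects."""
    return len(seq) == SEQ_LENGTHS[dataset]


config = D3TranConfig()
print(config.head_dim, check_length("lentiMPRA", "A" * 230))
```

Since each dataset has a fixed length, a length check of this kind is a cheap guard before passing a candidate sequence to a dataset-specific oracle.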

#Applications

D3 supports the rational design of synthetic enhancers and promoters for gene therapy, cell engineering, and synthetic biology, where elements must drive expression at a defined level in a particular cell type. Researchers can use it to generate diverse candidate regulatory sequences for massively parallel reporter assay (MPRA) screens, to study the sequence features underlying context-specific activity, and as a representation learner or data-augmentation engine that strengthens downstream activity-prediction models, especially when labeled data are scarce.

#Impact

D3 is among the first applications of score-entropy discrete diffusion to regulatory genomics, demonstrating that nucleotide-space diffusion can both design functional regulatory DNA and yield representations competitive with genomic language models. By pairing a principled generative framework with experimental lentiMPRA validation and a reusable benchmarking suite, the work establishes a practical and rigorously evaluated route to designing tunable cis-regulatory elements, and has informed follow-on analyses of the generative dynamics of DNA diffusion models. Its main limitations are short generation lengths and reliance on oracle models for activity scoring.

Tags

regulatory_genomics, de_novo_design, gene_expression, diffusion, transformer, generative, representation_learning, genomics