DNA & Gene

Caduceus

Kuleshov Lab

A family of bidirectional, reverse-complement equivariant DNA language models built on Mamba SSMs, outperforming models 10x larger on long-range variant effect prediction.

Released: 2024

Overview

Caduceus is a family of DNA foundation models developed by the Kuleshov Lab at Cornell University and introduced at ICML 2024. It directly addresses two biological properties of genomic sequences that most prior models ignored: the bidirectionality of regulatory information and the reverse-complement (RC) symmetry inherent to double-stranded DNA. A transcription factor binding site, for instance, is functionally identical whether read from the sense or antisense strand — yet the dominant generation of DNA language models, such as HyenaDNA and Nucleotide Transformer, either processed sequences unidirectionally or had no mechanism to enforce this symmetry.

Caduceus is built on the Mamba selective state space model (SSM) rather than transformer attention, which enables linear-time sequence processing and makes context windows up to 131,072 base pairs computationally feasible. The architecture introduces two intermediate building blocks — BiMamba, which adds bidirectional processing to Mamba, and MambaDNA, which further enforces RC equivariance — before assembling the full Caduceus models from these components. The central empirical claim is that encoding these inductive biases into the architecture provides information-theoretic advantages that increased parameter count alone cannot replicate.
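
To make the bidirectional idea concrete, the following is a minimal conceptual sketch of a BiMamba-style block: the same inner sequence module is run left-to-right and right-to-left, and the two passes are combined. The inner module here is a simple causal convolution standing in for a real Mamba block, so this illustrates the wrapper pattern only, not the released architecture.

```python
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    """Stand-in for a Mamba block: a causal 1D convolution over the sequence."""
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(d_model, d_model, kernel_size)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)                      # (batch, d_model, seq_len)
        x = nn.functional.pad(x, (self.pad, 0))    # left-pad so the output stays causal
        return self.conv(x).transpose(1, 2)

class BiDirectionalBlock(nn.Module):
    """Conceptual sketch of BiMamba: run one causal block in both directions."""
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner  # in Caduceus this would be a (weight-tied) Mamba block

    def forward(self, x):
        fwd = self.inner(x)                        # left-to-right context
        bwd = self.inner(x.flip(dims=[1]))         # right-to-left context
        return fwd + bwd.flip(dims=[1])            # re-align and combine both passes

block = BiDirectionalBlock(CausalConvBlock(d_model=16))
out = block(torch.randn(2, 128, 16))               # -> (2, 128, 16)
```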

Two model variants are released: Caduceus-PS applies RC equivariance through parameter sharing at the embedding and language model head level, while Caduceus-Ph applies RC equivariance post-hoc during inference by averaging predictions from both strands. Caduceus-Ph generally achieves the highest accuracy on classification tasks; Caduceus-PS shows the strongest advantage on long-range variant effect prediction.
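
A minimal sketch of the post-hoc (Ph) strategy at inference time: run the same model on a sequence and on its reverse complement, then average the two predictions. The A/C/G/T token ordering and the toy GC-content "model" below are illustrative assumptions, not the actual Caduceus tokenizer or network.

```python
import torch

# Assumed vocabulary: A=0, C=1, G=2, T=3 (the real Caduceus tokenizer may differ).
COMPLEMENT = torch.tensor([3, 2, 1, 0])  # A<->T, C<->G

def reverse_complement(tokens: torch.Tensor) -> torch.Tensor:
    """Reverse each token sequence and complement every base."""
    return COMPLEMENT[tokens].flip(dims=[-1])

def predict_post_hoc(model, tokens: torch.Tensor) -> torch.Tensor:
    """Caduceus-Ph style inference: average predictions from both strands."""
    with torch.no_grad():
        p_fwd = model(tokens)                       # prediction on the given strand
        p_rc = model(reverse_complement(tokens))    # prediction on the other strand
    return 0.5 * (p_fwd + p_rc)

# Toy stand-in "model" that scores G+C content per sequence.
def gc_content(tokens):
    return ((tokens == 1) | (tokens == 2)).float().mean(dim=-1, keepdim=True)

seqs = torch.randint(0, 4, (2, 100))                # two random 100-bp sequences
print(predict_post_hoc(gc_content, seqs))
```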

Key Features

  • Reverse-complement equivariance: Predictions are invariant to which DNA strand is presented as input, reflecting the physical reality of double-stranded DNA and improving generalization from limited training data.
  • Bidirectional context via BiMamba: A bidirectional extension of the Mamba SSM block processes sequences in both the forward and reverse directions simultaneously, capturing upstream and downstream regulatory context within a single pass.
  • Linear-time long-range modeling: Handles sequences up to 131,072 base pairs with linear memory scaling, unlike transformer-based models that scale quadratically and become prohibitive at this context length.
  • Two complementary variants: Caduceus-Ph is preferred for regulatory element classification; Caduceus-PS is preferred for generative and long-range prediction tasks.
  • Efficient model scale: Released checkpoints range from approximately 470K to 7.73M parameters, making fine-tuning feasible on modest GPU hardware without sacrificing benchmark competitiveness.
  • Human genome pre-training: All models are pre-trained on the full human reference genome (HG38) at single-nucleotide resolution using a masked language modeling objective.

Technical Details

Caduceus stacks MambaDNA layers — each combining Mamba's input-dependent selective state updates with bidirectional processing and RC equivariance — and pre-trains using a character-level masked language modeling (MLM) objective over the human reference genome (HG38). The genome is partitioned into 34,021 non-overlapping segments covering roughly 3.5 billion base pairs of training data. Each nucleotide (A, C, G, T, N) is a single token, providing single-base-pair resolution.
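
The character-level MLM setup can be sketched in a few lines: one token per nucleotide, with a random subset of positions masked for reconstruction. The token IDs and 15% mask rate here are illustrative assumptions rather than the exact values used to train the released checkpoints.

```python
import torch

# Illustrative single-character vocabulary (the real tokenizer may use different IDs).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "[MASK]": 5}

def tokenize(seq: str) -> torch.Tensor:
    """One token per nucleotide: single-base-pair resolution."""
    return torch.tensor([VOCAB[base] for base in seq.upper()])

def mask_for_mlm(tokens: torch.Tensor, mask_prob: float = 0.15):
    """Randomly mask positions; only masked positions contribute to the loss."""
    labels = tokens.clone()
    masked = torch.rand(tokens.shape) < mask_prob
    labels[~masked] = -100                      # ignored by the MLM cross-entropy loss
    inputs = tokens.clone()
    inputs[masked] = VOCAB["[MASK]"]
    return inputs, labels

inputs, labels = mask_for_mlm(tokenize("ACGTN" * 20))
print(inputs[:10], labels[:10])
```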

Released checkpoints span a compact parameter range: from a 4-layer, d_model=118 variant at ~470K parameters to a 16-layer, d_model=256 variant at 7.73M parameters with a 131K base pair context window. On GenomicBenchmarks (eight regulatory element classification tasks), Caduceus-Ph achieves top accuracy across all tasks, outperforming HyenaDNA and transformer baselines. On 18 Nucleotide Transformer benchmark datasets — covering histone modification prediction, regulatory annotation, and splice site detection — Caduceus models match or exceed the 500M-parameter Nucleotide Transformer despite being roughly 65x smaller. On long-range variant effect prediction tasks (sequences >100K bp), Caduceus-PS consistently outperforms all prior methods, including those with 10x more parameters, directly validating the value of the architectural symmetry constraints.

Applications

Caduceus is well suited to any task where regulatory context, strand symmetry, or long-range sequence dependencies matter. Its primary demonstrated applications are variant effect prediction — scoring the functional impact of single nucleotide variants on gene expression, splicing, or regulatory activity — and regulatory element classification, including identification of promoters, enhancers, and transcription factor binding sites. The linear-time architecture makes it particularly valuable for long-range tasks where transformer-based models are computationally intractable. Pre-trained HuggingFace checkpoints can be fine-tuned on custom genomic tasks, and because RC equivariance is genome-agnostic, the approach can in principle be extended to any organism's reference genome with additional pre-training.
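
As an illustration, one common zero-shot recipe for variant effect scoring with a masked language model is to mask the variant position and compare the probabilities assigned to the reference and alternate alleles. The sketch below assumes the standard HuggingFace masked-LM interface and one of the released checkpoint names; the tokenizer's special-token handling should be confirmed against the model card.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Assumed checkpoint name; see the HuggingFace link under Resources for released models.
CKPT = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"

tokenizer = AutoTokenizer.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(CKPT, trust_remote_code=True).eval()

def variant_log_odds(context: str, pos: int, ref: str, alt: str) -> float:
    """Score a SNV as log P(alt) - log P(ref) at the masked variant position.

    Assumes one token per nucleotide and a [MASK] token in the vocabulary; the
    mask position is located by searching input_ids, so any special tokens the
    tokenizer prepends do not affect the lookup.
    """
    masked = context[:pos] + tokenizer.mask_token + context[pos + 1:]
    input_ids = tokenizer(masked, return_tensors="pt")["input_ids"]
    mask_idx = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits[0, mask_idx]
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()

# Example: score substituting the middle base of a short context window for "T".
window = "ACGT" * 64
print(variant_log_odds(window, pos=128, ref=window[128], alt="T"))
```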

Impact

Caduceus establishes that encoding biologically motivated symmetries — bidirectionality and reverse-complement equivariance — into DNA model architectures provides consistent, measurable gains over simply increasing parameter count. The work directly influenced subsequent efforts to build more biologically grounded genomic foundation models and demonstrated that state space models are a viable, often superior alternative to transformers for long-range DNA sequence modeling. A notable limitation is that released checkpoints are pre-trained exclusively on the human reference genome; multi-species generalization requires additional fine-tuning. The intentionally compact model scale also means the approach has not been benchmarked at the largest scales of competing systems such as the 500M-parameter Nucleotide Transformer. Finally, the Mamba architecture relies on NVIDIA-specific CUDA kernels, which limits portability to non-NVIDIA hardware.

Citation

DOI: 10.5555/3692070.3693847

Metrics

GitHub

Stars: 232
Forks: 41
Open Issues: 10
Contributors: 3
Last Push: 1 month ago
Language: Python
License: Apache-2.0

Tags

variant effect prediction, state space model, foundation model, language model, genomics

Resources

  • GitHub Repository
  • Research Paper
  • Official Website
  • HuggingFace Model