ARSENAL

Masked DNA language model for regulatory genomics with a motif-discovery regularizer for zero-shot TF motif recovery and variant effect prediction.

Released: February 2026

ARSENAL is a masked DNA language model for regulatory genomics, developed in Anshul Kundaje's lab at Stanford University and posted to bioRxiv in early 2026. Many genomic language models pursue ever-longer context windows trained across entire genomes; ARSENAL takes the opposite stance, using short contexts focused on a functionally curated set of regulatory sequences. The motivation is that regulatory grammar — the arrangement of transcription factor (TF) binding motifs that controls gene expression — is largely a local phenomenon, so a model concentrated on regulatory regions can learn it more effectively than one diluted across whole genomes.

A distinctive ingredient is a motif-discovery regularizer added during training, which encourages the model to organize its internal representations around interpretable sequence motifs. This makes ARSENAL not only predictive but also more amenable to extracting the TF motifs it has learned, without requiring motif annotations as supervision.
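One common way to read motif-like positions out of a masked DNA model is to look at how confident its per-position predictions are. The sketch below illustrates that idea only; it is not ARSENAL's regularizer and not TF-MoDISco. The function names, the threshold of 1.0 bits, the minimum run length, and the toy probabilities are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

def information_content(probs):
    """Information content in bits for one position: log2(4) minus Shannon entropy.

    `probs` maps each base to a predicted probability (values sum to 1).
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return math.log2(len(BASES)) - entropy

def high_information_runs(per_position_probs, threshold=1.0, min_len=4):
    """Return half-open (start, end) intervals of consecutive positions whose
    information content is at least `threshold` bits and that span >= min_len."""
    runs, start = [], None
    for i, probs in enumerate(per_position_probs):
        if information_content(probs) >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(per_position_probs) - start >= min_len:
        runs.append((start, len(per_position_probs)))
    return runs

# Toy predictions (hypothetical, not model output): flat background around a
# confidently predicted 5-bp stretch.
uniform = {b: 0.25 for b in BASES}

def confident(base):
    return {b: (0.91 if b == base else 0.03) for b in BASES}

toy = [uniform] * 3 + [confident(b) for b in "GATAA"] + [uniform] * 2
print(high_information_runs(toy))
```

In practice the per-position distributions would come from masking each position in turn and querying the model; here they are hard-coded so the example runs on its own.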

The model is presented as a versatile regulatory-genomics tool: it recovers TF motifs zero-shot, predicts the functional effects of regulatory variants, improves chromatin accessibility prediction across cell types, and can act generatively to design regulatory sequences with targeted properties.

Key Features

Short-context regulatory focus: Trained on a functionally curated regulatory dataset with short contexts, rather than whole genomes, to concentrate on regulatory grammar.
Motif-discovery regularization: A specialized regularizer steers representations toward interpretable motifs, enabling annotation-free motif recovery.
Zero-shot TF motif recovery: Recovers transcription factor motifs without motif-level supervision.
Variant effect and accessibility: Predicts the impact of regulatory variants on expression and improves chromatin accessibility prediction across cell types.
Generative design: Can be used to design regulatory sequences meeting specified functional requirements.
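Zero-shot variant scoring with a masked language model is often done by masking the variant position and comparing the predicted probabilities of the alternate and reference alleles. The sketch below shows that log-likelihood-ratio pattern under stated assumptions: `toy_masked_predictor` is a hypothetical stand-in, not the ARSENAL API, and the preprint's actual scoring procedure may differ.

```python
import math

BASES = "ACGT"

def toy_masked_predictor(masked_seq, pos):
    """Hypothetical stand-in for a masked DNA language model.

    Returns a base -> probability mapping for the masked position `pos`.
    Toy rule: strongly prefer 'A' when the left neighbour is 'G'.
    """
    left = masked_seq[pos - 1] if pos > 0 else ""
    if left == "G":
        return {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
    return {b: 0.25 for b in BASES}

def variant_llr(seq, pos, alt, predictor):
    """Log-likelihood ratio log P(alt) - log P(ref) at a masked position.

    Negative values mean the model finds the alternate allele less plausible
    than the reference in this sequence context.
    """
    ref = seq[pos]
    if alt == ref:
        raise ValueError("alt allele must differ from the reference base")
    # 'N' marks the masked position in this sketch; a real model would use its own mask token.
    masked = seq[:pos] + "N" + seq[pos + 1:]
    probs = predictor(masked, pos)
    return math.log(probs[alt]) - math.log(probs[ref])

seq = "CCGATAAGC"
score = variant_llr(seq, 3, "C", toy_masked_predictor)  # A>C at index 3
print(round(score, 4))
```

To use a real model, replace `toy_masked_predictor` with a function that runs the model on the masked sequence and returns its predicted distribution at `pos`.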

Technical Details

ARSENAL is a masked language model over DNA trained on a curated regulatory-sequence dataset using short contexts, with an added regularizer that promotes motif discovery. The preprint reports that ARSENAL outperforms existing approaches at recovering TF motifs without annotations, predicting regulatory-variant effects, and enhancing chromatin accessibility predictions across cell types, and that it can generate regulatory sequences with desired functional properties. Pretrained models and data are hosted on Synapse (syn72351987), and training code, variant-scoring and sequence-generation notebooks, TF-MoDISco motif analyses, ChromBPNet integration, and DART-EVAL benchmarking utilities are provided in the GitHub repository. The work is released under a CC BY license. As a recent preprint, exact parameter counts and the repository license should be confirmed against the latest sources.
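For readers unfamiliar with masked language modeling on DNA, the basic objective can be sketched as follows: hide a fraction of bases in a short window, then score the model's predictions only at the hidden positions. This is a generic illustration, not ARSENAL's training code. The 15% mask rate, the use of "N" as the mask symbol, and the function names are assumptions, and the motif-discovery regularizer is not shown.

```python
import math
import random

BASES = "ACGT"
MASK = "N"  # assumed mask symbol for this sketch

def mask_sequence(seq, mask_rate, rng):
    """Replace a random subset of positions with MASK.

    Returns the masked sequence and the sorted list of masked positions.
    """
    n_mask = max(1, round(mask_rate * len(seq)))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chars = list(seq)
    for p in positions:
        chars[p] = MASK
    return "".join(chars), positions

def masked_cross_entropy(seq, positions, predicted):
    """Mean negative log-likelihood of the true bases at masked positions.

    `predicted` maps each masked position to a base -> probability dict.
    """
    losses = [-math.log(predicted[p][seq[p]]) for p in positions]
    return sum(losses) / len(losses)

rng = random.Random(0)
seq = "ACGTGATAAGCTTACGGATC"  # a 20-bp toy window
masked, positions = mask_sequence(seq, 0.15, rng)

# An untrained model is assumed to predict uniformly, giving a loss of ln(4).
uniform = {p: {b: 0.25 for b in BASES} for p in positions}
loss = masked_cross_entropy(seq, positions, uniform)
print(masked, positions, round(loss, 4))
```

Training lowers this loss below the uniform baseline of ln(4) ≈ 1.386 by learning which bases are predictable from their local context.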

Applications

ARSENAL serves regulatory-genomics researchers who need to interpret non-coding DNA: discovering TF motifs without prior annotation, scoring how non-coding variants affect gene regulation, predicting cell-type-specific chromatin accessibility, and designing synthetic regulatory elements. These capabilities are directly relevant to functional genomics, variant interpretation, and synthetic-biology applications.

Impact

ARSENAL argues that, for regulatory genomics, careful data curation and inductive biases toward motif structure can outperform brute-force long-context genome pretraining. By coupling zero-shot interpretability with competitive performance on variant-effect and accessibility tasks, it offers a practical and interpretable alternative within the DNA language model landscape. As a recent preprint, its broader adoption and independent validation remain to be established.

Citation

Short-Context Regulatory DNA Language Models with Motif-Discovery Regularization

Patel, A. & Kundaje, A. B. (2026) Short-Context Regulatory DNA Language Models with Motif-Discovery Regularization. bioRxiv.

DOI: 10.64898/2026.02.05.703637

Citations

Total Citations: 0
Influential: 0
References: 36

GitHub

Stars: 16
Forks: 0
Open Issues: 1
Contributors: 1
Last Push: 19d ago
Language: Python

Openness

bio.rodeo openness: Closed (29), low usability and reproducibility
Usability (can I run it?): 26
Reproducibility (can I retrain it?): 18

Model Openness Framework: Unclassified
Restrictive license on core components

Resources

GitHub Repository, Research Paper, Dataset
