A short-context masked DNA language model trained on curated regulatory sequences with a motif-discovery regularizer for zero-shot TF motif recovery and variant effect prediction.
ARSENAL is a masked DNA language model for regulatory genomics, developed in Anshul Kundaje's lab at Stanford University and posted to bioRxiv in early 2026. Many genomic language models pursue ever-longer context windows trained across entire genomes; ARSENAL takes the opposite stance, using short contexts focused on a functionally curated set of regulatory sequences. The motivation is that regulatory grammar — the arrangement of transcription factor (TF) binding motifs that controls gene expression — is largely a local phenomenon, so a model concentrated on regulatory regions can learn it more effectively than one diluted across whole genomes.
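To make the masked-language-model objective concrete, here is a minimal sketch of masking a short DNA window and scoring predictions at the masked positions. It is an illustration under stated assumptions, not ARSENAL's training code: the single-nucleotide tokenization, the 15% mask rate, the `MASK_ID` token, and the `uniform_model` placeholder are all hypothetical and are not taken from the preprint.

```python
import numpy as np

ALPHABET = "ACGT"
MASK_ID = 4  # hypothetical extra token id marking a masked position


def encode(seq):
    """Map a DNA string to integer token ids (A=0, C=1, G=2, T=3)."""
    return np.array([ALPHABET.index(base) for base in seq.upper()])


def mask_tokens(tokens, mask_rate, rng):
    """Replace a random subset of positions with MASK_ID (at least one)."""
    n_mask = max(1, int(round(mask_rate * len(tokens))))
    positions = np.sort(rng.choice(len(tokens), size=n_mask, replace=False))
    masked = tokens.copy()
    masked[positions] = MASK_ID
    return masked, positions


def masked_lm_loss(probs, tokens, positions):
    """Mean cross-entropy over masked positions; probs has shape (length, 4)."""
    true_probs = probs[positions, tokens[positions]]
    return float(-np.mean(np.log(true_probs)))


def uniform_model(masked_tokens):
    """Placeholder for a trained network: predicts 0.25 for every base."""
    return np.full((len(masked_tokens), 4), 0.25)


rng = np.random.default_rng(0)
tokens = encode("TTGACGTCATCGGATAAGCTAGCTAGGCCA")  # arbitrary short example window
masked, positions = mask_tokens(tokens, mask_rate=0.15, rng=rng)
loss = masked_lm_loss(uniform_model(masked), tokens, positions)
```

With the uniform placeholder the loss equals ln 4, the baseline a model with no knowledge of sequence context would achieve; a trained model lowers it by using the surrounding bases to predict the hidden ones.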
A distinctive ingredient is a motif-discovery regularizer added during training, which encourages the model to organize its internal representations around interpretable sequence motifs. As a result, the TF motifs the model has learned are easier to extract, and no motif annotations are required as supervision.
The model is presented as a versatile regulatory-genomics tool: it recovers TF motifs zero-shot, predicts the functional effects of regulatory variants, improves chromatin accessibility prediction across cell types, and can act generatively to design regulatory sequences with targeted properties.
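A common way to obtain zero-shot variant scores from a masked language model is to mask the variant position and compare the log-probabilities the model assigns to the alternate and reference alleles. The sketch below illustrates that general idea only; whether ARSENAL's variant-scoring notebooks use this exact formulation is an assumption. `toy_masked_probs` is a hypothetical stand-in for a trained model, and `MOTIF` is an arbitrary 8-bp pattern chosen for illustration.

```python
import numpy as np

ALPHABET = "ACGT"
MOTIF = "TGACGTCA"  # arbitrary illustrative pattern, not a learned motif


def toy_masked_probs(seq, pos):
    """Hypothetical stand-in for a trained masked DNA language model.

    Returns probabilities for A, C, G, T at `pos` given the rest of `seq`.
    A base is up-weighted if placing it at `pos` yields a sequence that
    contains MOTIF; otherwise all bases are weighted equally.
    """
    weights = np.ones(4)
    for i, base in enumerate(ALPHABET):
        candidate = seq[:pos] + base + seq[pos + 1:]
        if MOTIF in candidate:
            weights[i] = 9.0
    return weights / weights.sum()


def variant_llr(seq, pos, alt, predict=toy_masked_probs):
    """Log-likelihood ratio log P(alt) - log P(ref) at the variant position."""
    ref = seq[pos]
    probs = predict(seq, pos)  # a real model would see a mask token at pos
    return float(np.log(probs[ALPHABET.index(alt)])
                 - np.log(probs[ALPHABET.index(ref)]))


seq = "GGATTGACGTCATCCA"                   # MOTIF occupies positions 4..11
llr_inside = variant_llr(seq, 7, "T")      # C>T disrupts the motif
llr_outside = variant_llr(seq, 14, "G")    # C>G lies outside the motif
```

In this toy setting the motif-disrupting substitution receives a strongly negative score (-ln 9) while the flanking substitution scores 0, which is the qualitative behavior one hopes for from a model that has learned local regulatory grammar.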
ARSENAL is a masked language model over DNA trained on a curated regulatory-sequence dataset using short contexts, with an added regularizer that promotes motif discovery. The preprint reports that ARSENAL outperforms existing approaches at recovering TF motifs without annotations, predicting regulatory-variant effects, and improving chromatin accessibility prediction across cell types, and that it can generate regulatory sequences with desired functional properties. Pretrained models and data are hosted on Synapse (syn72351987). The GitHub repository provides training code, variant-scoring and sequence-generation notebooks, TF-MoDISco motif analyses, ChromBPNet integration, and DART-EVAL benchmarking utilities. The work is released under a CC BY license. Because the preprint is recent, exact parameter counts and the repository license should be confirmed against the latest sources.
ARSENAL serves regulatory-genomics researchers who need to interpret non-coding DNA: discovering TF motifs without prior annotation, scoring how non-coding variants affect gene regulation, predicting cell-type-specific chromatin accessibility, and designing synthetic regulatory elements. These capabilities are directly relevant to functional genomics, variant interpretation, and synthetic-biology applications.
The ARSENAL preprint argues that, for regulatory genomics, careful data curation and inductive biases toward motif structure can outperform brute-force long-context genome pretraining. By coupling zero-shot motif recovery with competitive performance on variant-effect and accessibility tasks, the model offers a practical, interpretable alternative within the DNA language model landscape. Because the preprint is recent, broader adoption and independent validation remain to be established.