bio.rodeo


© 2026 Pulsatance. All rights reserved.
DNA & Gene

ARSENAL

Stanford University

A short-context masked DNA language model trained on curated regulatory sequences with a motif-discovery regularizer for zero-shot TF motif recovery and variant effect prediction.

Released: February 2026

ARSENAL is a masked DNA language model for regulatory genomics, developed in Anshul Kundaje's lab at Stanford University and posted to bioRxiv in early 2026. Many genomic language models pursue ever-longer context windows trained across entire genomes; ARSENAL takes the opposite stance, using short contexts focused on a functionally curated set of regulatory sequences. The motivation is that regulatory grammar — the arrangement of transcription factor (TF) binding motifs that controls gene expression — is largely a local phenomenon, so a model concentrated on regulatory regions can learn it more effectively than one diluted across whole genomes.

A distinctive ingredient is a motif-discovery regularizer added during training, which encourages the model to organize its internal representations around interpretable sequence motifs. This makes ARSENAL not only predictive but also more amenable to extracting the TF motifs it has learned, without requiring motif annotations as supervision.
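To make the masked-language-model setup concrete, here is a minimal sketch of how a short DNA window might be corrupted for training: some positions are hidden and the model is asked to predict the original bases. The 15% mask rate, the `N` mask symbol, and the function name `mask_sequence` are illustrative assumptions, not details taken from the preprint. The motif-discovery regularizer is not shown, since its exact form is not described here.

```python
import random

MASK = "N"  # hypothetical mask symbol; the real tokenizer may use something else


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of positions in a DNA string.

    Returns (masked_seq, targets), where targets maps each masked
    position to the base the model would be trained to predict.
    """
    rng = rng or random.Random()
    # Always mask at least one position so short windows still give a training signal.
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))

    masked = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = seq[pos]
        masked[pos] = MASK
    return "".join(masked), targets


# A short, made-up regulatory window (illustrative only).
window = "TTGACGTCAATGCGTGACTCA"
masked, targets = mask_sequence(window, rng=random.Random(0))
print(masked)
print(targets)
```

In a training loop, the loss would be computed only at the positions listed in `targets`, with any regularization term added on top of that masked-prediction loss.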

The model is presented as a versatile regulatory-genomics tool: it recovers TF motifs zero-shot, predicts the functional effects of regulatory variants, improves chromatin accessibility prediction across cell types, and can act generatively to design regulatory sequences with targeted properties.

#Key Features

  • Short-context regulatory focus: Trained on a functionally curated regulatory dataset with short contexts, rather than whole genomes, to concentrate on regulatory grammar.
  • Motif-discovery regularization: A specialized regularizer steers representations toward interpretable motifs, enabling annotation-free motif recovery.
  • Zero-shot TF motif recovery: Recovers transcription factor motifs without motif-level supervision.
  • Variant effect and accessibility: Predicts the impact of regulatory variants on expression and improves chromatin accessibility prediction across cell types.
  • Generative design: Can be used to design regulatory sequences meeting specified functional requirements.
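Zero-shot variant scoring with a masked DNA language model is commonly done by masking the variant position and comparing the model's probability for the alternate allele against the reference allele. The sketch below illustrates that general idea with a toy stand-in model. `toy_masked_model`, `variant_llr`, and the motif string are hypothetical and are not ARSENAL's API or necessarily its scoring method; the repository's variant-scoring notebooks are the place to check how ARSENAL actually does this.

```python
import math

BASES = "ACGT"
MOTIF = "TGACTCA"  # arbitrary toy motif, used only to make the stand-in model non-uniform


def toy_masked_model(seq, pos):
    """Stand-in for a masked DNA LM: returns P(base) at `pos` given its context.

    The base currently at `pos` is ignored, as if it had been masked.
    """
    scores = {b: 1.0 for b in BASES}
    # Reward the base that would complete MOTIF in any window overlapping `pos`.
    for start in range(pos - len(MOTIF) + 1, pos + 1):
        if start < 0 or start + len(MOTIF) > len(seq):
            continue
        offset = pos - start
        window = seq[start:start + len(MOTIF)]
        flanks_match = all(
            window[i] == MOTIF[i] for i in range(len(MOTIF)) if i != offset
        )
        if flanks_match:
            scores[MOTIF[offset]] += 20.0
    total = sum(scores.values())
    return {b: s / total for b, s in scores.items()}


def variant_llr(seq, pos, alt, model=toy_masked_model):
    """Log-likelihood ratio log P(alt) - log P(ref) at the masked position."""
    ref = seq[pos]
    probs = model(seq, pos)
    return math.log(probs[alt]) - math.log(probs[ref])


ref_seq = "GGCATGACTCAGCTT"  # made-up sequence containing the toy motif
score_in_motif = variant_llr(ref_seq, 7, "G")  # disrupts the toy motif
score_outside = variant_llr(ref_seq, 1, "A")   # falls outside the toy motif
print(score_in_motif, score_outside)
```

Under this toy model, a negative score means the alternate allele is less expected than the reference in its context, which is the usual reading of such a ratio. A real model would be swapped in through the `model` argument.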

#Technical Details

ARSENAL is a masked language model over DNA trained on a curated regulatory-sequence dataset using short contexts, with an added regularizer that promotes motif discovery. The preprint reports that ARSENAL outperforms existing approaches at recovering TF motifs without annotations, predicting regulatory-variant effects, and improving chromatin accessibility predictions across cell types, and that it can generate regulatory sequences with desired functional properties. Pretrained models and data are hosted on Synapse (syn72351987). The GitHub repository provides training code, variant-scoring and sequence-generation notebooks, TF-MoDISco motif analyses, ChromBPNet integration, and DART-EVAL benchmarking utilities. The work is released under a CC BY license. Because this is a recent preprint, exact parameter counts and the license that applies to the repository should be confirmed against the latest sources.

#Applications

ARSENAL serves regulatory-genomics researchers who need to interpret non-coding DNA: discovering TF motifs without prior annotation, scoring how non-coding variants affect gene regulation, predicting cell-type-specific chromatin accessibility, and designing synthetic regulatory elements. These capabilities are directly relevant to functional genomics, variant interpretation, and synthetic-biology applications.

#Impact

The ARSENAL preprint argues that, for regulatory genomics, careful data curation and inductive biases toward motif structure can outperform brute-force long-context genome pretraining. By coupling zero-shot interpretability with competitive performance on variant-effect and accessibility tasks, the model offers a practical and interpretable alternative within the DNA language model landscape. Because the work is a recent preprint, its broader adoption and independent validation remain to be established.

Tags

variant_effect_prediction, motif_discovery, regulatory_sequence_design, transformer, language_model, self_supervised, zero_shot, regulatory_dna, chromatin