A transformer foundation model for ATAC-seq that learns embeddings of individual cis-regulatory elements and cells from a large single-cell chromatin accessibility atlas.
Atacformer is a transformer-based foundation model for the analysis and interpretation of ATAC-seq data, developed by the Sheffield lab (databio group) at the University of Virginia and posted to bioRxiv in November 2025. Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) measures genome-wide chromatin accessibility, and single-cell ATAC-seq (scATAC-seq) reveals the regulatory state of individual cells. These data are notoriously sparse and high-dimensional, and most existing methods reduce each cell to a single representation. Atacformer instead learns representations at two levels: it produces embeddings for individual cis-regulatory elements as well as for whole cells.
The model's central idea is to treat genomic intervals as discrete tokens—the "words" of the regulatory genome—so that the strengths of transformer architectures developed for natural language can be brought to bear on chromatin accessibility. Atacformer is pretrained self-supervised on a large atlas of scATAC-seq experiments, then fine-tuned for downstream tasks such as clustering, cell-type annotation, and batch correction. The authors also introduce CRAFT (Contrastive RNA-ATAC Fine-Tuning), a dual-encoder contrastive extension that aligns scATAC-seq and scRNA-seq, enabling cross-modal RNA imputation from accessibility data.
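The contrastive alignment idea behind CRAFT can be illustrated with a small sketch. The source says only that CRAFT is a dual-encoder contrastive method; the symmetric, CLIP-style InfoNCE objective, the temperature value, and every function name below are assumptions for illustration, not the authors' implementation. Each row of `atac_emb` and `rna_emb` is taken to be the embedding of the same cell from the two encoders.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(atac_emb, rna_emb, temperature=0.07):
    """Hypothetical CLIP-style loss: paired cells attract, unpaired cells repel.

    The temperature default is an assumed value, not one taken from the paper.
    """
    atac = [l2_normalize(v) for v in atac_emb]
    rna = [l2_normalize(v) for v in rna_emb]
    # Similarity of every ATAC embedding to every RNA embedding.
    logits = [
        [sum(a * r for a, r in zip(av, rv)) / temperature for rv in rna]
        for av in atac
    ]
    logits_t = [list(col) for col in zip(*logits)]  # RNA-to-ATAC direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

Under this objective, a batch whose ATAC and RNA embeddings are correctly paired scores a lower loss than the same batch with the pairing shuffled, which is what pushes the two encoders toward a shared space.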
Atacformer joins a growing class of single-cell foundation models, but is distinguished by its focus on the regulatory genome and its element-level embeddings, which connect cell-level analysis back to the specific accessible regions that drive cell identity.
gtars package, with consensus region universes built via geniml.
databio HuggingFace organization.
Atacformer is an encoder-style transformer of roughly 0.2 billion parameters, pretrained self-supervised on the scatlas dataset, a single-cell atlas of approximately 1.05 million cells assembled from public scATAC-seq experiments. Coverage tracks are converted into consensus region sets using coverage-cutoff and hidden-Markov-model universe-creation methods from the geniml toolkit, and each cell is encoded as a tokenized set of accessible regions over the hg38 reference. On benchmarks, Atacformer matches or exceeds leading scATAC-seq clustering tools in adjusted Rand index while running substantially faster, and when fine-tuned on bulk BED files it recovers cell-type and assay labels with over 80% accuracy. Released checkpoints include the base model (atacformer-base-hg38) plus fine-tuned variants for cell-type prediction and the CRAFT multimodal extension.
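To make the tokenization step concrete, here is a minimal sketch of mapping one cell's accessible intervals onto a consensus universe by overlap. It is written in plain Python and does not use the gtars or geniml APIs; the function names, the half-open BED-style coordinates, and the assumption that universe regions do not overlap within a chromosome are all illustrative choices rather than details from the source.

```python
from bisect import bisect_right
from collections import defaultdict


def build_universe_index(universe):
    """Index universe regions by chromosome; the token ID is the input position.

    `universe` is a list of (chrom, start, end) tuples, half-open as in BED and
    assumed non-overlapping within each chromosome.
    """
    by_chrom = defaultdict(list)
    for token_id, (chrom, start, end) in enumerate(universe):
        by_chrom[chrom].append((start, end, token_id))
    index = {}
    for chrom, regions in by_chrom.items():
        regions.sort()
        index[chrom] = ([r[0] for r in regions], regions)
    return index


def tokenize_cell(intervals, index):
    """Return the sorted token IDs of universe regions overlapping any interval."""
    tokens = set()
    for chrom, start, end in intervals:
        if chrom not in index:
            continue  # interval lies on a chromosome absent from the universe
        starts, regions = index[chrom]
        # Begin at the last region starting at or before the query start.
        j = max(bisect_right(starts, start) - 1, 0)
        while j < len(regions) and regions[j][0] < end:
            if regions[j][1] > start:  # half-open overlap test
                tokens.add(regions[j][2])
            j += 1
    return sorted(tokens)
```

In this sketch a cell becomes an unordered set of integer IDs, so intervals that fall outside every universe region are simply dropped rather than assigned an unknown token; a real tokenizer may handle that case differently.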
Atacformer is aimed at computational biologists and epigenomics researchers working with single-cell or bulk chromatin accessibility data. Typical workflows include clustering and visualization of scATAC-seq experiments, automated cell-type annotation, integration and batch correction across datasets, and—via CRAFT—imputing transcriptomic profiles for cells profiled only by ATAC-seq. Because it operates directly on raw fragment files and emits element-level embeddings, it can also support discovery of the regulatory regions that characterize particular cell types or conditions.
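For the clustering workflow, agreement between predicted clusters and reference cell-type labels is commonly scored with the adjusted Rand index (ARI), the metric cited in the benchmarks. The following is a self-contained sketch of the standard ARI formula, not the evaluation code used by the authors; the function name and the convention of returning 1.0 for degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected agreement between two partitions of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have equal length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # assumed convention for fewer than two cells

    # Count pairs of cells grouped together in both, and in each, partition.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # both partitions are trivial, so they agree by construction
    return (sum_cells - expected) / (max_index - expected)
```

Because ARI compares partitions rather than label names, cluster IDs produced by an embedding-based clustering do not need to be matched to cell-type names before scoring.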
By bringing token-based transformer modeling to the regulatory genome and releasing pretrained weights, Atacformer extends the single-cell foundation model paradigm—dominated by transcriptomic models—into chromatin accessibility, a modality where labeled data are scarce and pretraining is especially valuable. Its element-level representations and large speed advantage make foundation-model analysis practical for ATAC-seq cohorts. Some openness caveats remain at the time of writing: the HuggingFace model repositories lack model cards and the scatlas dataset repository lacks a data card, and licenses for the weights and dataset are not stated (the supporting gtars and geniml code is BSD-2-Clause). As a recent preprint, its results await peer review and broader independent benchmarking.
Leroy, N., et al. (2025) Atacformer: A transformer-based foundation model for analysis and interpretation of ATAC-seq data. bioRxiv.
DOI: 10.1101/2025.11.03.685753