bio.rodeo


© 2026 Pulsatance. All rights reserved.
Single-cell foundation models

Atacformer

University of Virginia

A transformer foundation model for ATAC-seq that learns embeddings of individual cis-regulatory elements and cells from a large single-cell chromatin accessibility atlas.

Released: November 2025
Parameters: 200 Million

Atacformer is a transformer-based foundation model for the analysis and interpretation of ATAC-seq data, developed by the Sheffield lab (databio group) at the University of Virginia and posted to bioRxiv in November 2025. Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) measures genome-wide chromatin accessibility, and single-cell ATAC-seq (scATAC-seq) reveals the regulatory state of individual cells. These data are notoriously sparse and high-dimensional, and most existing methods reduce each cell to a single representation. Atacformer instead learns representations at two levels: it produces embeddings for individual cis-regulatory elements as well as for whole cells.

The model's central idea is to treat genomic intervals as discrete tokens, the "words" of the regulatory genome, so that transformer architectures developed for natural language can be applied to chromatin accessibility. Atacformer is pretrained with a self-supervised objective on a large atlas of scATAC-seq experiments, then fine-tuned for downstream tasks such as clustering, cell-type annotation, and batch correction. The authors also introduce CRAFT (Contrastive RNA-ATAC Fine-Tuning), a dual-encoder contrastive extension that aligns scATAC-seq and scRNA-seq representations, enabling cross-modal imputation of RNA expression from accessibility data.
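The idea of treating genomic intervals as tokens can be illustrated with a minimal sketch. It assumes a small, sorted, non-overlapping "universe" of regions and maps a cell's accessible intervals to the IDs of the universe regions they overlap. The universe, the function names, and the overlap rule (half-open intervals, any overlap counts) are illustrative assumptions, not the gtars tokenizer or its API.

```python
from bisect import bisect_left

# Hypothetical toy universe: sorted, non-overlapping (chrom, start, end) regions.
# A region's position in this list serves as its token ID.
UNIVERSE = [
    ("chr1", 100, 200),
    ("chr1", 500, 650),
    ("chr1", 900, 1000),
    ("chr2", 50, 150),
]


def build_index(universe):
    """Group regions by chromosome as (starts, regions) for binary search."""
    grouped = {}
    for token_id, (chrom, start, end) in enumerate(universe):
        grouped.setdefault(chrom, []).append((start, end, token_id))
    index = {}
    for chrom, regions in grouped.items():
        regions.sort()
        index[chrom] = ([r[0] for r in regions], regions)
    return index


def tokenize(intervals, index):
    """Return sorted token IDs of universe regions overlapping any interval.

    Intervals are half-open: (chrom, start, end) covers start <= pos < end.
    """
    tokens = set()
    for chrom, start, end in intervals:
        starts, regions = index.get(chrom, ([], []))
        # Regions that start at or after `end` cannot overlap the interval.
        i = bisect_left(starts, end) - 1
        # Walk left while regions still end after the interval's start.
        # This is valid because the universe regions do not overlap each other.
        while i >= 0 and regions[i][1] > start:
            tokens.add(regions[i][2])
            i -= 1
    return sorted(tokens)


index = build_index(UNIVERSE)
cell = [("chr1", 150, 160), ("chr1", 640, 920), ("chr2", 0, 50), ("chr3", 1, 2)]
cell_tokens = tokenize(cell, index)
```

In this sketch a cell becomes an unordered set of token IDs, which is one plausible reading of "a tokenized set of accessible regions"; intervals on chromosomes missing from the universe are simply dropped.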

Atacformer joins a growing class of single-cell foundation models, but is distinguished by its focus on the regulatory genome and its element-level embeddings, which connect cell-level analysis back to the specific accessible regions that drive cell identity.

#Key Features

  • Element-level embeddings: Unlike models that only produce cell-level vectors, Atacformer generates embeddings for individual cis-regulatory elements, supporting interpretation of which accessible regions distinguish cell states.
  • Region tokenization: Genomic intervals are represented as discrete tokens. The tokenizers are implemented in Rust and exposed through a HuggingFace-compatible Python API in the gtars package, with consensus region universes built via geniml.
  • End-to-end speed: The model processes raw fragment files end-to-end roughly 80% faster than existing scATAC-seq tools while matching or exceeding their clustering accuracy.
  • Cross-modal alignment (CRAFT): A dual-encoder contrastive fine-tuning scheme aligns scATAC-seq with scRNA-seq, enabling imputation of gene expression from chromatin accessibility.
  • Released weights: Pretrained and fine-tuned checkpoints are distributed through the databio HuggingFace organization.
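The cross-modal alignment behind CRAFT is described only as a dual-encoder contrastive scheme. As a hedged illustration, the sketch below computes a symmetric, CLIP-style InfoNCE loss over paired ATAC and RNA embeddings in plain Python. The choice of this particular loss, the temperature value, and all function names are assumptions for illustration and are not taken from the Atacformer code.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(atac_emb, rna_emb, temperature=0.07):
    """Symmetric InfoNCE: cell i's ATAC embedding should match its RNA embedding.

    atac_emb and rna_emb are equal-length lists of vectors; row i of each
    comes from the same cell. The temperature is an assumed default.
    """
    atac = [l2_normalize(v) for v in atac_emb]
    rna = [l2_normalize(v) for v in rna_emb]
    # Cosine-similarity logits between every ATAC and every RNA embedding.
    logits = [
        [sum(a * r for a, r in zip(a_vec, r_vec)) / temperature for r_vec in rna]
        for a_vec in atac
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the ATAC-to-RNA and RNA-to-ATAC directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each cell's two embeddings point the same way and large when pairs are mismatched, which is the property a contrastive fine-tuning objective relies on.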

#Technical Details

Atacformer is an encoder-style transformer of roughly 200 million parameters, pretrained with a self-supervised objective on the scatlas dataset, a single-cell atlas of approximately 1.05 million cells assembled from public scATAC-seq experiments. Coverage tracks are converted into consensus region sets (universes) using the coverage-cutoff and hidden-Markov-model universe-creation methods from the geniml toolkit, and each cell is encoded as a tokenized set of accessible regions over the hg38 reference. On benchmarks, Atacformer matches or exceeds leading scATAC-seq clustering tools in adjusted Rand index while running roughly 80% faster end to end, and when fine-tuned on bulk BED files it recovers cell-type and assay labels with over 80% accuracy. Released checkpoints include the base model (atacformer-base-hg38) plus fine-tuned variants for cell-type prediction and the CRAFT multimodal extension.
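A coverage-cutoff universe can be pictured as keeping every stretch of the genome that is accessible in at least a given number of samples. The sketch below implements that idea with a sweep over interval endpoints on a single chromosome. It is an illustrative simplification under stated assumptions (half-open intervals, no overlapping intervals within a sample, a hypothetical function name) and is not the geniml implementation.

```python
def coverage_cutoff_universe(samples, cutoff):
    """Return maximal regions covered by at least `cutoff` intervals.

    samples: list of per-sample lists of half-open (start, end) intervals on
    one chromosome. If intervals within a sample do not overlap each other,
    the running depth equals the number of samples covering a position.
    """
    events = []
    for intervals in samples:
        for start, end in intervals:
            events.append((start, 1))   # coverage rises at an interval start
            events.append((end, -1))    # coverage falls at an interval end
    events.sort()

    universe = []
    depth = 0
    region_start = None
    i = 0
    while i < len(events):
        position = events[i][0]
        # Apply every event at this position before checking the depth,
        # so that abutting intervals do not split a region.
        while i < len(events) and events[i][0] == position:
            depth += events[i][1]
            i += 1
        if depth >= cutoff and region_start is None:
            region_start = position
        elif depth < cutoff and region_start is not None:
            universe.append((region_start, position))
            region_start = None
    return universe


samples = [
    [(100, 200), (500, 600)],
    [(150, 250)],
    [(180, 220), (550, 650)],
]
universe_2 = coverage_cutoff_universe(samples, cutoff=2)
```

Raising the cutoff yields fewer, narrower regions, while a cutoff of 1 reduces to the union of all intervals; the resulting region list is the kind of vocabulary a region tokenizer would then index.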

#Applications

Atacformer is aimed at computational biologists and epigenomics researchers working with single-cell or bulk chromatin accessibility data. Typical workflows include clustering and visualization of scATAC-seq experiments, automated cell-type annotation, integration and batch correction across datasets, and—via CRAFT—imputing transcriptomic profiles for cells profiled only by ATAC-seq. Because it operates directly on raw fragment files and emits element-level embeddings, it can also support discovery of the regulatory regions that characterize particular cell types or conditions.
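Clustering workflows like these are commonly scored against reference cell-type labels with the adjusted Rand index, the metric cited in the benchmarks above. The sketch below is a minimal pure-Python version of the standard formula; the function name and the handling of degenerate inputs are choices made for this example, not part of any Atacformer tooling.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells.

    1.0 means identical partitions, values near 0 mean chance-level
    agreement, and negative values mean worse than chance.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: treat as trivially identical

    # Pairs of cells grouped together in both, in truth only, in prediction only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_cells - expected) / (maximum - expected)


perfect = adjusted_rand_index(["a", "a", "b", "b"], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which cells are grouped together, cluster IDs do not need to match the label names, which makes it convenient for comparing unsupervised clusterings with annotated cell types.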

#Impact

By bringing token-based transformer modeling to the regulatory genome and releasing pretrained weights, Atacformer extends the single-cell foundation model paradigm, which is dominated by transcriptomic models, into chromatin accessibility, a modality where labeled data are scarce and pretraining is especially valuable. Its element-level representations and roughly 80% faster end-to-end processing make foundation-model analysis practical for ATAC-seq cohorts. Some openness caveats remain at the time of writing: the HuggingFace model repositories lack model cards, the scatlas dataset repository lacks a data card, and licenses for the weights and dataset are not stated (the supporting gtars and geniml code is BSD-2-Clause). As a recent preprint, its results await peer review and broader independent benchmarking.

Citation

Atacformer: A transformer-based foundation model for analysis and interpretation of ATAC-seq data

Preprint

Leroy, N., et al. (2025) Atacformer: A transformer-based foundation model for analysis and interpretation of ATAC-seq data. bioRxiv.

DOI: 10.1101/2025.11.03.685753

Recent citations

Papers that recently cited this model.

  • Fast, memory-efficient genomic interval tokenizers for modern machine learning

    Nathan Leroy, Donald R. Campbell, Seth Stadick, et al.

    arXiv.org · Nov 2025

    1

Citations

Total Citations: 1
Influential: 0
References: 43

GitHub

Stars: 28
Forks: 4
Open Issues: 37
Contributors: 10
Last Push: 5d ago
Language: Rust
License: BSD-2-Clause

HuggingFace

Downloads: 11
Likes: 1
Last Modified: 1y ago

Fields of citing research

  • Biology: 100%
  • Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Overall score: 32 (Closed)
Usability (can I run it?): 44
Reproducibility (can I retrain it?): 9
Model Openness Framework: Unclassified
Restrictive license on core components

Resources

  • GitHub Repository
  • Research Paper
  • HuggingFace Model
  • Dataset