Bidirectional state-space (Mamba-2) genomic model for ultra-long extrachromosomal circular DNA, scaling linearly with sequence length.
Extrachromosomal circular DNA (eccDNA) is a class of circular genetic elements that form outside chromosomes and play an increasingly recognized role in cancer, where amplified oncogenes on eccDNA drive tumor heterogeneity and drug resistance. Modeling these sequences is difficult: individual eccDNA molecules can span tens of kilobases, and their circular topology has no natural start or end. Existing genomic foundation models either rely on attention mechanisms whose cost grows quadratically with sequence length or truncate molecules into kilobase fragments, breaking sequence continuity.
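To make the scaling gap concrete, the following back-of-the-envelope sketch compares the number of query-key pairs scored by full self-attention with the number of recurrence steps in a single linear-time scan. The sequence lengths and the one-token-per-base assumption are illustrative and are not taken from the eccDNAMamba paper. Constant factors, hidden sizes, and memory are ignored.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return n_tokens * n_tokens


def scan_steps(n_tokens: int) -> int:
    """Number of recurrence steps in one linear-time pass over the sequence."""
    return n_tokens


if __name__ == "__main__":
    # Illustrative lengths only, assuming one token per base.
    for n in (1_000, 10_000, 50_000):
        ratio = attention_pairs(n) // scan_steps(n)
        print(f"{n:>6,} tokens: {attention_pairs(n):>13,} pairs vs {scan_steps(n):>6,} steps ({ratio:,}x)")
```

Doubling the sequence length quadruples the attention count but only doubles the scan count, which is why fragmenting long molecules becomes tempting for attention-based models.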
eccDNAMamba, introduced in late 2025 by researchers at Brown University, is the first bidirectional state-space model purpose-built for eccDNA. It is based on the Mamba-2 architecture, whose selective state-space design scales linearly with input length, allowing the model to ingest ultra-long sequences in a single pass rather than chopping them into pieces. To respect the circular nature of eccDNA, the authors introduce a circular augmentation strategy that preserves the topology of each molecule during pretraining.
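One simple way to respect circularity is to treat every rotation of a sequence as the same molecule. The sketch below illustrates that idea on plain strings. The function names are hypothetical, and the exact augmentation procedure used by the authors may differ (for example in how offsets are sampled or how junctions are handled).

```python
import random


def rotate(seq: str, offset: int) -> str:
    """Return the same circular molecule, read from a different start position."""
    if not seq:
        return seq
    k = offset % len(seq)
    return seq[k:] + seq[:k]


def random_rotation(seq: str, rng: random.Random) -> str:
    """Pick a uniformly random start position on the circle."""
    return rotate(seq, rng.randrange(len(seq))) if seq else seq


def same_circle(a: str, b: str) -> bool:
    """True if b is a rotation of a, i.e. both strings spell the same circle."""
    return len(a) == len(b) and b in a + a


if __name__ == "__main__":
    rng = random.Random(0)
    original = "ACGTTG"
    augmented = random_rotation(original, rng)
    print(augmented, same_circle(original, augmented))
```

Because a rotation only changes where the linear read begins, the augmented string keeps every base and every adjacency of the original circle, including the one that spans the arbitrary start/end junction.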
The model fills a gap left by chromosome-oriented genomic language models, which were not designed for the length and circularity of eccDNA. By pairing linear-time sequence modeling with topology-aware augmentation, eccDNAMamba offers a representation-learning backbone tailored to this emerging class of cancer-relevant genetic elements.
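The idea of a bidirectional linear-time model can be illustrated with a toy recurrence. The sketch below runs a scalar state update once left to right and once right to left, then sums the two passes so that every position receives context from both sides. It is a deliberately simplified stand-in: the fixed `decay` and `gain` values are assumptions chosen for illustration, whereas Mamba-style layers use input-dependent parameters, vector-valued states, and fused GPU kernels.

```python
from typing import List


def scan(xs: List[float], decay: float = 0.5, gain: float = 1.0) -> List[float]:
    """One left-to-right pass of h_t = decay * h_{t-1} + gain * x_t."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + gain * x
        out.append(h)
    return out


def bidirectional_scan(xs: List[float], decay: float = 0.5) -> List[float]:
    """Sum a forward pass and a backward pass over the same input."""
    fwd = scan(xs, decay)
    bwd = scan(xs[::-1], decay)[::-1]  # run on the reversed input, then restore order
    return [f + b for f, b in zip(fwd, bwd)]


if __name__ == "__main__":
    print(bidirectional_scan([0.0, 1.0, 0.0]))
```

Each pass touches every position exactly once, so the total work grows linearly with sequence length regardless of how far apart two interacting positions are.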
eccDNAMamba uses a BiMambaForMaskedLM formulation built on the Mamba-2 state-space
architecture, pretrained with masked-language-style objectives on eccDNA sequences and
released at roughly 0.5B parameters. The circular augmentation strategy rotates sequences
to preserve eccDNA topology during training. The authors evaluate transfer to several
downstream tasks, including cancer versus healthy eccDNA classification, copy-number-level
prediction at multiple thresholds, and real-versus-pseudo eccDNA discrimination across
Homo sapiens, Gallus gallus, and Arabidopsis thaliana datasets. The reported results show
the model outperforming existing genomic foundation models on cancer discrimination and
copy-number prediction. The implementation is in PyTorch with the mamba_ssm and
causal_conv1d kernels, and checkpoints and datasets are distributed on Hugging Face.
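A masked-language-style objective hides some input tokens and asks the model to recover them from the surrounding context. The sketch below shows only the corruption step. The `[MASK]` symbol, the 15% rate (a common convention from BERT-style pretraining), and the single-base tokens are assumptions for illustration. The tokenizer and masking scheme actually used by eccDNAMamba are not described here.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask symbol, not the model's actual vocabulary entry


def mask_tokens(
    tokens: List[str], rate: float, rng: random.Random
) -> Tuple[List[str], List[Optional[str]]]:
    """Replace each token with MASK with probability `rate`.

    Returns the corrupted sequence and per-position labels: the original
    token where a mask was applied, None where the input was left intact.
    """
    corrupted: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < rate:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


if __name__ == "__main__":
    rng = random.Random(0)
    corrupted, labels = mask_tokens(list("ACGTTGCA"), rate=0.15, rng=rng)
    print(corrupted)
    print(labels)
```

During training, a loss would be computed only at positions whose label is not None, so the model is scored on reconstructing the hidden bases rather than on copying visible ones.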
eccDNAMamba is aimed at cancer genomics researchers studying oncogene amplification, tumor heterogeneity, and treatment resistance driven by extrachromosomal DNA. Its ability to classify cancer versus healthy eccDNA and predict copy-number levels makes it useful for analyzing sequencing data where eccDNA content may serve as a biomarker, and its species-spanning real-versus-pseudo classifiers support filtering and validation of detected circular elements. As a pretrained backbone, it can be fine-tuned for new eccDNA classification or regression tasks with relatively small labeled datasets.
eccDNAMamba is an early example of adapting state-space architectures to a specialized genomic problem where sequence length and circular topology defeat standard attention-based genomic language models. By releasing open weights and curated datasets, the authors lower the barrier to eccDNA-focused modeling and provide a reusable foundation for a growing area of cancer research. Because the work is a recent preprint, its benchmark comparisons have not yet been independently validated and its downstream adoption is still unknown. The evaluation also focuses on the specific eccDNA tasks the authors curated rather than on a broad genomic benchmark suite.