Institute of Hydrobiology, Chinese Academy of Sciences
Genomic foundation model for Cypriniformes fish, built on a Mamba-2 state space model with a 32 kb context window for long-range genome modeling.
Genomic foundation models have largely been trained on human, mammalian, or broadly cross-species corpora, leaving many specific taxonomic groups underrepresented. FishMamba-1 targets this gap for Cypriniformes — the order that includes carp, zebrafish, and many other freshwater fishes — by building a dedicated DNA foundation model on genomes from this clade. It is described by its authors as the first genomic foundation model specifically for Cypriniformes.
FishMamba-1 was developed at the Institute of Hydrobiology, Chinese Academy of Sciences, and released as a preprint in March 2026. Rather than adopting a Transformer backbone, it is built on Mamba-2, a selective state space model (SSM) whose linear-time sequence processing scales to long genomic windows far more efficiently than the quadratic attention of Transformers. This lets the model take in a 32,768-base-pair context — roughly 5–8× longer than many standard DNA models — which is valuable for capturing long-range regulatory and structural features, including in polyploid fish genomes.
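The efficiency argument can be made concrete with a little arithmetic. The sketch below compares how a quadratic cost and a linear cost grow when the window is extended to 32,768 positions. The 4,096-position baseline is an assumption chosen for illustration (it matches the 8× end of the range quoted above). The cost functions count only the dominant term in sequence length and ignore constants, model width, and hardware effects.

```python
# Illustrative scaling arithmetic only; not a benchmark of FishMamba-1.

LONG_CONTEXT = 32_768      # context length stated for FishMamba-1
BASELINE_CONTEXT = 4_096   # assumed baseline window, for illustration only


def quadratic_cost(n: int) -> int:
    """Dominant term of self-attention: every position attends to every position."""
    return n * n


def linear_cost(n: int) -> int:
    """Dominant term of a state space scan: one state update per position."""
    return n


length_ratio = LONG_CONTEXT // BASELINE_CONTEXT
attention_growth = quadratic_cost(LONG_CONTEXT) // quadratic_cost(BASELINE_CONTEXT)
ssm_growth = linear_cost(LONG_CONTEXT) // linear_cost(BASELINE_CONTEXT)

print(f"context is {length_ratio}x longer")
print(f"quadratic cost grows {attention_growth}x")
print(f"linear cost grows {ssm_growth}x")
```

Under these assumptions, an 8× longer window multiplies the quadratic term by 64 but the linear term by only 8, which is the gap the choice of architecture is meant to exploit.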
By combining a clade-specific training corpus with a long-context SSM architecture, FishMamba-1 aims to provide accurate sequence understanding for aquatic genomics, a field of considerable importance for aquaculture, fisheries, and evolutionary biology.
FishMamba-1 is a 124-million-parameter model built on the Mamba-2 selective state space architecture, which replaces self-attention with state space recurrence to achieve O(N) rather than O(N²) scaling in sequence length. It was pretrained on the Cypri-24 corpus — 28.8 Gb of genomic sequence from 24 Cypriniformes species, comprising roughly 15 billion tokens, with most species assembled to chromosome level and a subset carrying high-quality gene-structure annotations. Sequences are encoded with byte-pair-encoding tokenization, and the model supports inputs up to 32,768 bp. On the downstream genome-segmentation task, FishMamba-1 reports exon identification precision of 64.6% and overall accuracy near 66.6% across a seven-class functional-element scheme. The codebase is MIT-licensed; the preprint itself is distributed under a CC BY-NC-ND license. Training is designed to fit on a single NVIDIA A100 (80 GB) GPU.
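To show why a state space recurrence scales linearly, here is a toy scalar recurrence in the spirit of selective SSMs. It is a minimal sketch, not the Mamba-2 implementation: real models use multi-dimensional states, learned projections that compute the step size and gates from the input, and a hardware-efficient parallel algorithm. The function name `selective_scan` and all parameter values are hypothetical.

```python
import math
from typing import List, Sequence


def selective_scan(x: Sequence[float],
                   delta: Sequence[float],
                   b: Sequence[float],
                   c: Sequence[float],
                   a: float = -1.0) -> List[float]:
    """Run a scalar, input-dependent state space recurrence in a single pass.

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * b_t * x_t
    y_t = c_t * h_t

    In a selective SSM, delta_t, b_t and c_t would be computed from the input;
    here they are passed in directly to keep the sketch small.
    """
    h = 0.0
    ys: List[float] = []
    for x_t, d_t, b_t, c_t in zip(x, delta, b, c):
        decay = math.exp(d_t * a)        # how much of the old state survives
        h = decay * h + d_t * b_t * x_t  # fold the new input into the state
        ys.append(c_t * h)               # read out from the state
    return ys


# An impulse at position 0 decays geometrically through the state.
ones = [1.0, 1.0, 1.0]
ys = selective_scan(x=[1.0, 0.0, 0.0], delta=ones, b=ones, c=ones)
print(ys)
```

The loop touches each position once and carries a fixed-size state, so work grows as O(N) in sequence length. Self-attention, by contrast, compares every pair of positions.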
FishMamba-1 supports genome annotation and functional-element prediction for Cypriniformes and related aquatic species, including exon, intron, and promoter identification. It is useful to researchers in aquaculture, fisheries genomics, and fish evolutionary biology who need a sequence model attuned to fish genomes rather than a human-centric foundation model. Released weights and a web inference interface lower the barrier for biologists to apply the model without extensive ML infrastructure.
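Chromosomes are far longer than the 32,768 bp input limit, so annotating one means splitting it into pieces. The source does not describe how FishMamba-1 handles longer sequences. The sketch below shows one common approach, overlapping fixed-size windows, under that assumption. The function name `iter_windows` and the 1,024 bp overlap are hypothetical choices for illustration.

```python
from typing import Iterator, Tuple

MAX_CONTEXT_BP = 32_768  # input limit stated for FishMamba-1


def iter_windows(seq_len: int,
                 window: int = MAX_CONTEXT_BP,
                 overlap: int = 1_024) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates that cover [0, seq_len).

    Consecutive windows share `overlap` bases so that features near a
    window edge are also seen with context on both sides.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        yield start, end
        if end == seq_len:  # the last window reached the end of the sequence
            break
        start += step


# Example: coordinates for a hypothetical 70,000 bp scaffold.
windows = list(iter_windows(70_000))
for start, end in windows:
    print(start, end)
```

Per-window predictions would then need to be merged in the overlapping regions, for example by keeping the prediction from the window in which a base lies farthest from the edge. That merging policy is likewise an assumption, not something the source specifies.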
FishMamba-1 illustrates two trends in genomic AI: the move toward clade-specific foundation models that specialize where general models are weak, and the adoption of state space architectures to extend context length efficiently. By providing what its authors describe as the first dedicated foundation model for Cypriniformes, with open weights and a web demo, it offers aquatic-genomics researchers a ready-to-use long-context tool. Its segmentation accuracy of roughly 66.6% is a useful baseline but leaves headroom for improvement, and its specialization to one fish order means generalization beyond Cypriniformes should not be assumed. Because the model is a recent preprint, broader benchmarking and community adoption are still developing.