FishMamba-1

Institute of Hydrobiology, Chinese Academy of Sciences

Genomic foundation model for Cypriniformes fish, built on a Mamba-2 state space model with a 32 kb context window for long-range genome modeling.

Released: March 2026

Parameters: 124 Million

Genomic foundation models have largely been trained on human, mammalian, or broadly cross-species corpora, leaving many specific taxonomic groups underrepresented. FishMamba-1 targets this gap for Cypriniformes — the order that includes carp, zebrafish, and many other freshwater fishes — by building a dedicated DNA foundation model on genomes from this clade. It is described by its authors as the first genomic foundation model specifically for Cypriniformes.

FishMamba-1 was developed at the Institute of Hydrobiology, Chinese Academy of Sciences, and described in a preprint released in March 2026. Rather than adopting a Transformer backbone, it is built on Mamba-2, a selective state space model (SSM) whose cost grows linearly with sequence length, whereas Transformer self-attention grows quadratically. This lets the model take in a 32,768-base-pair context, roughly 5–8× longer than many standard DNA models (that ratio corresponds to contexts of about 4 to 6.5 kb). The longer window is valuable for capturing long-range regulatory and structural features, including in polyploid fish genomes.
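To make the linear versus quadratic contrast concrete, here is a minimal sketch. The recurrence is a toy scalar state space update, not Mamba-2's selective scan, and the parameters `a`, `b`, `c` and the comparison lengths 4,096 and 6,144 are illustrative assumptions chosen to fall inside the 5–8× range quoted above. Positions are counted abstractly; the model's real sequence length is measured in tokens, which the source does not map to base pairs.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Toy scalar state space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over the sequence, so the work grows linearly with its length.
    Illustration only: Mamba-2 uses input-dependent (selective) parameters
    and a hardware-aware scan, neither of which is modeled here.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # update the hidden state once per position
        outputs.append(c * h)  # read out the current state
    return outputs


def attention_pairs(n):
    """Query-key pairs scored by full self-attention over n positions."""
    return n * n


def recurrence_steps(n):
    """State updates performed by a linear recurrence over n positions."""
    return n


CONTEXT = 32_768  # context length stated for FishMamba-1
for shorter in (4_096, 6_144):  # hypothetical shorter contexts for comparison
    ratio = CONTEXT / shorter
    attn_growth = attention_pairs(CONTEXT) / attention_pairs(shorter)
    ssm_growth = recurrence_steps(CONTEXT) / recurrence_steps(shorter)
    print(f"{shorter} -> {CONTEXT}: {ratio:.1f}x longer, "
          f"attention pairs x{attn_growth:.1f}, recurrence steps x{ssm_growth:.1f}")
```

The point of the comparison is that stretching the window by a factor k multiplies the recurrence work by k but the attention pair count by k squared, which is why long contexts are cheaper to reach with an SSM backbone.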

By combining a clade-specific training corpus with a long-context SSM architecture, FishMamba-1 aims to provide accurate sequence understanding for aquatic genomics, a field of considerable importance for aquaculture, fisheries, and evolutionary biology.

Key Features

Mamba-2 state space backbone: A selective SSM provides linear-complexity sequence modeling, enabling efficient processing of very long genomic inputs.
32 kb context window: FishMamba-1 ingests up to 32,768 base pairs at once, 5–8× longer than many Transformer-based DNA models, supporting long-range dependency modeling.
Clade-specific pretraining: Trained on the Cypri-24 corpus spanning 24 Cypriniformes species, the model specializes in fish genomics rather than general-purpose DNA.
Genome segmentation fine-tuning: A downstream FishSegmenter task classifies functional elements (exons, introns, promoters, and other categories) across the genome.
Open weights and web inference: Pretrained weights are released on Hugging Face alongside an MIT-licensed codebase, with an interactive web demo for inference.
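Because the context window is capped at 32,768 bp, a chromosome-scale sequence has to be split before it can be passed to the model. The sketch below shows one plain way to do that with fixed-size, optionally overlapping windows. The function name `iter_windows` and the `overlap` parameter are hypothetical and are not taken from the FishMamba-1 codebase, whose own preprocessing is not described in the source.

```python
def iter_windows(sequence, window=32_768, overlap=0):
    """Yield (start, end, chunk) tuples covering `sequence` in order.

    Coordinates are 0-based and half-open. The last window may be shorter
    than `window`. `overlap` is the number of positions shared by
    consecutive windows (an illustrative option, not a model requirement).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < len(sequence):
        end = min(start + window, len(sequence))
        yield start, end, sequence[start:end]
        if end == len(sequence):
            break  # the sequence is fully covered
        start += step


# Usage: a synthetic 70,000 bp sequence split with the default window.
toy_sequence = "ACGT" * 17_500
for start, end, chunk in iter_windows(toy_sequence):
    print(start, end, len(chunk))
```

An overlap can help when a feature of interest straddles a window boundary, at the cost of scoring some positions twice.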

Technical Details

FishMamba-1 is a 124-million-parameter model built on the Mamba-2 selective state space architecture, which replaces self-attention with state space recurrence to achieve O(N) rather than O(N²) scaling in sequence length. It was pretrained on the Cypri-24 corpus — 28.8 Gb of genomic sequence from 24 Cypriniformes species, comprising roughly 15 billion tokens, with most species assembled to chromosome level and a subset carrying high-quality gene-structure annotations. Sequences are encoded with byte-pair-encoding tokenization, and the model supports inputs up to 32,768 bp. On the downstream genome-segmentation task, FishMamba-1 reports exon identification precision of 64.6% and overall accuracy near 66.6% across a seven-class functional-element scheme. The codebase is MIT-licensed; the preprint itself is distributed under a CC BY-NC-ND license. Training is designed to fit on a single NVIDIA A100 (80 GB) GPU.
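The two segmentation figures above measure different things, and a small sketch may help separate them: precision for one class asks how many of the model's calls for that class were correct, while overall accuracy counts correct calls across all classes. The labels and predictions below are made-up toy data, only three of the seven classes are named because the source names only those, and scoring one label per position is an assumption (the source does not say whether metrics are computed per base or per token).

```python
def class_precision(true_labels, pred_labels, target):
    """Precision for `target`: correct `target` calls / all `target` calls."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    predicted = sum(1 for p in pred_labels if p == target)
    if predicted == 0:
        return float("nan")  # undefined when the class is never predicted
    correct = sum(
        1 for t, p in zip(true_labels, pred_labels) if p == target and t == target
    )
    return correct / predicted


def overall_accuracy(true_labels, pred_labels):
    """Fraction of positions whose predicted label matches the true label."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("cannot score an empty sequence")
    matches = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return matches / len(true_labels)


# Toy example (fabricated for illustration, not model output).
truth = ["exon", "exon", "intron", "intron", "promoter", "intron"]
preds = ["exon", "intron", "intron", "exon", "promoter", "intron"]
print("exon precision:", class_precision(truth, preds, "exon"))
print("overall accuracy:", overall_accuracy(truth, preds))
```

Precision alone says nothing about missed exons; a recall figure would be needed for that, and the source does not report one in this section.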

Applications

FishMamba-1 supports genome annotation and functional-element prediction for Cypriniformes and related aquatic species, including exon, intron, and promoter identification. It is useful to researchers in aquaculture, fisheries genomics, and fish evolutionary biology who need a sequence model attuned to fish genomes rather than a human-centric foundation model. Released weights and a web inference interface lower the barrier for biologists to apply the model without extensive ML infrastructure.
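For annotation work, per-position class calls usually need to be collapsed into intervals before they can be compared with existing gene models. The following is a minimal sketch of that step, assuming the segmentation output is one label per position. The actual output format of FishSegmenter is not described in the source, and the function name `labels_to_intervals` is hypothetical.

```python
from itertools import groupby


def labels_to_intervals(labels, offset=0):
    """Collapse per-position labels into (start, end, label) runs.

    Coordinates are 0-based and half-open. `offset` is the genomic
    coordinate of the first label, e.g. the start of the input window.
    """
    intervals = []
    position = offset
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # size of this run of identical labels
        intervals.append((position, position + length, label))
        position += length
    return intervals


# Toy labels for a 6 bp stretch starting at coordinate 100 (illustrative only).
calls = ["promoter", "promoter", "exon", "intron", "intron", "exon"]
for start, end, label in labels_to_intervals(calls, offset=100):
    print(start, end, label)
```

Half-open, 0-based coordinates were chosen here because they match the BED convention, which makes the intervals easy to write out for a genome browser.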

Impact

FishMamba-1 illustrates two trends in genomic AI: the move toward clade-specific foundation models that specialize where general models are weak, and the adoption of state space architectures to extend context length efficiently. By providing the first dedicated foundation model for Cypriniformes with open weights and a web demo, it offers aquatic-genomics researchers a ready-to-use long-context tool. Its segmentation accuracy (about 66.6% overall) is a useful baseline but leaves headroom for improvement, and because the model is specialized to one fish order, it is not expected to generalize beyond Cypriniformes. Because the preprint is recent, broader benchmarking and community adoption are still developing.

Citation

Lu, S., et al. (2026) FishMamba-1: A Linear-Complexity Foundation Model for Deciphering Polyploid Cyprinid Genomes. bioRxiv.

DOI: 10.64898/2026.03.09.710409

Citations

Total Citations: 0
Influential: 0
References: 46

GitHub

Stars: 0
Forks: 0
Open Issues: 0
Contributors: 0
Last Push: 5 months ago
Language: Python
License: MIT

HuggingFace

Downloads: 12
Likes: 1
Last Modified: 4 months ago

Openness

bio.rodeo openness: 50 (Partial)
Usability (can I run it?): 59
Reproducibility (can I retrain it?): 51
Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository, Research Paper, HuggingFace Model
