Long-context (64 kb) RNA foundation model using the Striped-Hyena architecture for zero-shot prediction of transcriptome architecture from unspliced pre-mRNA sequence.
Mach-1 is a long-context RNA foundation model that maps unspliced pre-mRNA sequence to transcriptome architecture, including isoform abundances, RNA secondary structure, and the effects of splicing variants. The model uses a Striped-Hyena architecture, which hybridizes convolutional state-space layers with sparse attention to scale to 64-kilobase context windows, long enough to process most full-length human pre-mRNAs in a single forward pass.
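To make the single-forward-pass claim concrete, the following is a minimal sketch of how an unspliced pre-mRNA might be tokenized and checked against the context window. It is not taken from the Mach-1 codebase: the vocabulary, the one-token-per-nucleotide scheme, the reading of "64 kb" as 65,536 tokens, and the function names `tokenize`, `fits_in_context`, and `center_crop` are all assumptions made for illustration.

```python
# Assumption: "64 kb" is read as 65,536 tokens, one token per nucleotide.
CONTEXT_LEN = 64 * 1024

# Hypothetical vocabulary; the tokenizer actually used by Mach-1 may differ.
VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "N": 4}


def tokenize(pre_mrna):
    """Map a sequence to integer tokens, one per nucleotide.

    DNA-alphabet input is accepted by converting T to U; any other
    unrecognized symbol is mapped to the N token.
    """
    seq = pre_mrna.upper().replace("T", "U")
    return [VOCAB.get(base, VOCAB["N"]) for base in seq]


def fits_in_context(pre_mrna, context_len=CONTEXT_LEN):
    """Return True if the whole pre-mRNA fits in one context window."""
    return len(pre_mrna) <= context_len


def center_crop(pre_mrna, focus, context_len=CONTEXT_LEN):
    """Return (window, offset) for a window of at most context_len
    nucleotides placed around the 0-based index `focus`.

    Sequences that already fit are returned unchanged with offset 0.
    """
    if len(pre_mrna) <= context_len:
        return pre_mrna, 0
    start = focus - context_len // 2
    # Clamp so the window stays inside the sequence.
    start = min(max(start, 0), len(pre_mrna) - context_len)
    return pre_mrna[start:start + context_len], start
```

Under these assumptions, a transcript longer than the window would need to be cropped around a region of interest (for example, an exon or variant position), which is why the fraction of genes that fit without cropping matters for long-intron genes.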
Mach-1 was first posted to bioRxiv in August 2024 and has since been substantially extended and re-validated. A major v3 update, posted in April 2026, added experimental validation through CRISPR editing and de novo transcript synthesis. The April 2026 version is treated here as the definitive release.
Mach-1 uses a Striped-Hyena state-space backbone trained on a curated corpus of human pre-mRNA sequences with paired transcriptome-level annotations from GTEx and ENCODE. The model is trained autoregressively over nucleotide tokens with auxiliary multitask heads predicting splice-site probabilities and isoform-abundance vectors. Training details, ablations, and benchmarking against SpliceAI and SpliceTransformer are reported in the bioRxiv preprint.
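The training objective described above, an autoregressive loss plus auxiliary heads, can be sketched as a weighted sum of three terms. The specific loss forms (cross-entropy for next-token prediction, binary cross-entropy for splice sites, KL divergence for isoform abundances) and the weights `w_splice` and `w_isoform` are assumptions chosen for illustration; the preprint should be consulted for the losses actually used.

```python
import math

EPS = 1e-12  # numerical guard against log(0)


def next_token_nll(probs, targets):
    """Mean negative log-likelihood of the target nucleotide tokens.

    probs[t] is the predicted distribution over the vocabulary at step t,
    targets[t] is the index of the token that actually follows.
    """
    return -sum(math.log(p[t] + EPS) for p, t in zip(probs, targets)) / len(targets)


def splice_bce(pred, labels):
    """Mean binary cross-entropy between per-position splice-site
    probabilities and 0/1 annotations."""
    total = 0.0
    for p, y in zip(pred, labels):
        total += y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
    return -total / len(labels)


def isoform_kl(pred, observed):
    """KL(observed || pred) between normalized isoform-abundance vectors."""
    return sum(o * math.log((o + EPS) / (p + EPS))
               for p, o in zip(pred, observed) if o > 0)


def multitask_loss(lm_probs, lm_targets, splice_pred, splice_labels,
                   iso_pred, iso_obs, w_splice=1.0, w_isoform=1.0):
    """Weighted sum of the language-modeling and auxiliary losses.
    The weights are hypothetical hyperparameters."""
    return (next_token_nll(lm_probs, lm_targets)
            + w_splice * splice_bce(splice_pred, splice_labels)
            + w_isoform * isoform_kl(iso_pred, iso_obs))
```

In a real training loop these terms would be computed on tensors in a deep-learning framework; plain Python is used here only to keep the arithmetic visible.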
Validation in the April 2026 version includes CRISPR-based perturbation of predicted splicing-regulatory elements and synthesis of designed transcripts to confirm predicted abundance patterns.
Mach-1 is suited to three main uses: variant interpretation in clinical genomics where splicing effects are suspected; transcript engineering for therapeutic mRNA design with intronic regulatory cassettes; and basic RNA-biology research into splicing regulation. Its long-context capability is particularly important for genes with long introns and distally regulated exons.
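For the variant-interpretation use case, a common pattern with sequence-based splicing predictors is to score the reference and alternate sequences and report the largest change near the variant. The sketch below illustrates that pattern only. It does not use a Mach-1 API: `predict` is a hypothetical callable returning one splice-site probability per nucleotide, the 50-nucleotide window is an assumed default, and `toy_donor_predictor` is a trivial stand-in so the example can run.

```python
def apply_snv(seq, pos, alt):
    """Return a copy of seq with the nucleotide at 0-based pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError("variant position outside sequence")
    return seq[:pos] + alt + seq[pos + 1:]


def max_delta(predict, ref_seq, pos, alt, window=50):
    """Largest-magnitude change in predicted splice-site probability
    (alternate minus reference) within `window` nucleotides of the variant.

    `predict` is a hypothetical model wrapper: it takes a sequence and
    returns a list of per-position probabilities of the same length.
    """
    alt_seq = apply_snv(ref_seq, pos, alt)
    ref_p = predict(ref_seq)
    alt_p = predict(alt_seq)
    lo = max(pos - window, 0)
    hi = min(pos + window + 1, len(ref_seq))
    deltas = [alt_p[i] - ref_p[i] for i in range(lo, hi)]
    return max(deltas, key=abs)


def toy_donor_predictor(seq):
    """Toy stand-in, NOT Mach-1: scores 1.0 wherever a GU dinucleotide
    (the canonical donor motif) begins, and 0.0 elsewhere."""
    return [1.0 if seq[i:i + 2] == "GU" else 0.0 for i in range(len(seq))]
```

With the toy predictor, `max_delta(toy_donor_predictor, "AAGUAA", 3, "C")` is negative because the substitution destroys the GU motif, which mirrors how a predicted donor-site loss would appear. A real model would return graded probabilities rather than 0 or 1.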
Mach-1 is among the first RNA foundation models to combine genuinely long context (64 kb) with experimental validation. By demonstrating that pre-mRNA-to-transcriptome modeling can be approached as a zero-shot foundation-model task, it extends the foundation-model paradigm to a problem class previously addressed primarily by purpose-built supervised models such as SpliceAI. Its architecture places it in the family of long-context biological foundation models alongside Evo, which also uses a Striped-Hyena backbone, and Caduceus, which uses a Mamba-based state-space backbone.
Saberi, A., et al. (2026) Learning transcriptome architecture from sequence with a long-context RNA foundation model. bioRxiv.
DOI: 10.1101/2024.08.26.609813