RNA

Orthrus

Bowang Lab

Mamba-based mature RNA foundation model using contrastive learning on splice isoforms and 400+ mammalian species orthologs for mRNA property prediction.

Released: 2024

Overview

Orthrus is a mature RNA foundation model developed by Bo Wang's group at the University of Toronto and the Vector Institute, introduced in October 2024. The model addresses a gap in genomic AI: most existing foundation models treat RNA sequences like text, borrowing pre-training objectives such as masked token prediction from natural language processing, whereas Orthrus is among the first to build biological domain knowledge directly into the self-supervised learning objective. The result is a model whose learned representations reflect the functional and evolutionary organization of the transcriptome rather than superficial sequence similarity.

The core innovation in Orthrus is a contrastive learning strategy built around biological augmentations. Rather than creating training pairs by randomly masking or corrupting sequences, Orthrus pairs transcripts that are known to be biologically related: splice isoforms from the same gene (drawn from 10 model organisms) and orthologous transcripts from 400+ mammalian species catalogued by the Zoonomia Project. By training the model to produce similar embeddings for these biologically matched pairs, Orthrus is guided to cluster sequences according to shared function and evolutionary conservation — properties directly relevant to downstream biological prediction tasks.
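
To make the pairing strategy concrete, here is a minimal sketch of how biologically matched positive pairs could be assembled. The record layout, grouping key, and sampling policy are illustrative assumptions, not Orthrus's actual data pipeline.

```python
import random
from collections import defaultdict

# Hypothetical records: (transcript_id, group_key, sequence). In Orthrus the
# relation comes from splice isoforms of one gene and Zoonomia ortholog sets;
# here a shared group_key stands in for either relation.
transcripts = [
    ("tx_human_1", "geneA", "AUGGCUUAA"),
    ("tx_human_2", "geneA", "AUGGCAUAA"),  # splice isoform of the same gene
    ("tx_mouse_1", "geneA", "AUGGCCUAA"),  # cross-species ortholog, same group
    ("tx_human_3", "geneB", "AUGCCCUGA"),  # no partner: contributes no pair
]

def make_positive_pairs(records):
    """Group transcripts by biological relation and sample one positive pair
    per group; a real pipeline would resample pairs every epoch."""
    groups = defaultdict(list)
    for _, group_key, seq in records:
        groups[group_key].append(seq)
    return [tuple(random.sample(seqs, 2))
            for seqs in groups.values() if len(seqs) >= 2]

pairs = make_positive_pairs(transcripts)  # one (anchor, positive) pair for geneA
```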

Orthrus uses a Mamba backbone rather than a transformer encoder. Mamba is a structured state-space model (SSM) architecture designed for long sequences: it scales memory linearly with sequence length rather than quadratically as in attention-based models, making it well suited to full-length mature RNA transcripts that can exceed 12,000 nucleotides. The model was developed at the intersection of the Bowang Lab, the Frey Lab, and the Morris Lab, with senior authors Brendan Frey, Quaid Morris, Leo Lee, and Bo Wang.
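
As a rough illustration of the backbone, the sketch below runs a single Mamba layer over a transcript-length input and mean-pools the per-nucleotide states into one embedding. It assumes the mamba_ssm package (whose selective-scan kernel requires a CUDA device); the layer hyperparameters and the pooling choice are illustrative, not Orthrus's published configuration.

```python
import torch
from mamba_ssm import Mamba  # pip install mamba-ssm; needs a CUDA device

d_model = 256  # matches the base model's embedding width; other settings are illustrative
layer = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2).cuda()

# Input projected to d_model channels for a full-length mature RNA:
# (batch, length, d_model). Memory grows linearly with length, unlike the
# quadratic attention matrix of a transformer, so 12,000+ nt transcripts fit.
x = torch.randn(1, 12_000, d_model, device="cuda")
states = layer(x)               # (1, 12000, 256) per-nucleotide hidden states
embedding = states.mean(dim=1)  # (1, 256) pooled transcript embedding
```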

Key Features

  • Biologically-grounded contrastive pretraining: Training pairs are formed from splice isoforms and cross-species orthologs rather than random sequence perturbations, embedding evolutionary and functional structure directly into the representation space.
  • Mamba state-space model backbone: Linear memory scaling enables efficient processing of complete mature RNA sequences up to 12,000+ nucleotides without the quadratic cost of transformer self-attention.
  • Two model configurations: A 4-track base model (one-hot nucleotide encoding, 256-dimensional embeddings) and a 6-track large model (adds splice site and coding sequence indicators, 512-dimensional embeddings) to balance speed and biological detail; an input-encoding sketch follows this list.
  • Isoform-level discrimination: Unlike models that operate on genomic DNA windows, Orthrus produces distinct embeddings for individual transcript isoforms of the same gene, capturing divergent functional roles that isoform switching can produce.
  • Data-efficient fine-tuning: Orthrus representations reach competitive performance using a fraction of the labeled data required by existing genomic foundation models, making it practical for tasks where experimental measurements are scarce.
  • Parameter efficiency: Orthrus's best-performing model achieves state-of-the-art results while using over 700 times fewer parameters than the next best self-supervised baseline, demonstrating that biologically informed pretraining is more efficient than scale alone.
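
As referenced in the configurations bullet above, the sketch below builds the 4-track and 6-track input encodings. The track ordering and the indicator conventions are assumptions made for illustration; the Orthrus repository defines the exact layout.

```python
import numpy as np

BASES = "ACGU"

def encode_tracks(seq, splice_sites=None, cds=None):
    """One-hot nucleotides (4 tracks); optionally append binary splice-site
    and coding-sequence indicator tracks (6 tracks). Track order here is an
    assumption, not necessarily the repository's convention."""
    L = len(seq)
    onehot = np.zeros((L, 4), dtype=np.float32)
    for i, nt in enumerate(seq):
        if nt in BASES:
            onehot[i, BASES.index(nt)] = 1.0
    if splice_sites is None and cds is None:
        return onehot                            # (L, 4) base-model input
    splice = np.zeros((L, 1), dtype=np.float32)
    coding = np.zeros((L, 1), dtype=np.float32)
    for pos in splice_sites or []:
        splice[pos] = 1.0                        # mark annotated splice sites
    if cds is not None:
        start, end = cds
        coding[start:end] = 1.0                  # mark the coding region
    return np.concatenate([onehot, splice, coding], axis=1)  # (L, 6) large-model input

x4 = encode_tracks("AUGGCUAAGUGA")                                 # shape (12, 4)
x6 = encode_tracks("AUGGCUAAGUGA", splice_sites=[5], cds=(0, 12))  # shape (12, 6)
```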

Technical Details

Orthrus is built on the Mamba encoder architecture — a selective state-space model that uses input-dependent state transitions to compress long-range context without full pairwise attention. The 4-track base model accepts one-hot encoded RNA sequences and produces 256-dimensional sequence embeddings. The 6-track large model extends the input representation with two additional binary tracks encoding splice site positions and coding sequence boundaries, producing 512-dimensional embeddings. This additional biological annotation substantially improves performance on tasks that depend on transcript structure, such as ribosome load prediction and exon junction detection.

The model is pretrained on approximately 45 million mature RNA transcripts using a contrastive objective. Biologically related transcript pairs are mined from two sources: splice isoforms of the same gene across 10 model organisms (including human, mouse, zebrafish, and Drosophila) and orthologous gene transcripts from over 400 mammalian species from the Zoonomia Project. The contrastive loss maximizes cosine similarity between embeddings of paired sequences while pushing apart unrelated transcripts. Orthrus is evaluated on five mRNA property prediction tasks — including mRNA half-life, ribosome load, exon junction detection, and Gene Ontology classification — using homology-based data splits to avoid train-test leakage. It significantly outperforms prior genomic foundation models on all five tasks. Training configurations and model weights are distributed via Zenodo and HuggingFace. Implementation requires Python 3.10+, PyTorch 2.2+, and PyTorch Lightning 2.4+.
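
The pull-together/push-apart objective described above can be sketched as a standard batch-wise InfoNCE loss over cosine similarities. This is a common formulation consistent with the description here, not necessarily the exact loss variant used in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.1):
    """z_a[i] and z_b[i] embed a biologically matched pair (isoforms or
    orthologs); every other row in the batch serves as a negative."""
    z_a = F.normalize(z_a, dim=-1)  # unit vectors, so dot product = cosine
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature                      # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
```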

Applications

Orthrus is designed for researchers working on post-transcriptional gene regulation, RNA therapeutics, and transcriptome-wide functional annotation. Its embeddings can be fine-tuned for mRNA half-life prediction (relevant to mRNA vaccine and therapeutic RNA design), ribosome load estimation (a proxy for translational efficiency), and functional classification of novel transcripts. The isoform-discriminative representations make it particularly useful for studying alternative splicing, where different isoforms of the same gene can have opposing functional consequences. Because it requires only modest amounts of labeled data for fine-tuning, Orthrus is accessible for tasks where large annotated datasets do not exist.
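
To illustrate the data-efficiency point, one light-weight option is a linear probe on frozen embeddings rather than full fine-tuning. The sketch below fits a ridge regression for a half-life-style task on synthetic stand-in data; note that the paper's evaluations use homology-based splits, whereas the random split here is only for brevity.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-ins: X would hold frozen Orthrus embeddings (n_transcripts, 256)
# and y measured mRNA half-lives; both are synthetic here.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))
y = X[:, :8].sum(axis=1) + rng.normal(scale=0.1, size=500)

# NB: a random split can leak homologous sequences across train/test;
# the paper uses homology-based splits to avoid exactly this.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("held-out R^2:", probe.score(X_te, y_te))
```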

Impact

Orthrus establishes that biologically informed contrastive pretraining is more effective for RNA property prediction than the masked language modeling objectives imported from NLP, and that this efficiency advantage is large enough to matter in practice: it achieves better results with dramatically fewer parameters. The model is among the first to leverage the Zoonomia Project's cross-species sequence data as a structured supervision signal rather than simply as additional pretraining text. As a preprint (last updated July 2025), Orthrus has not yet undergone peer review, and its benchmark comparisons should be interpreted with that caveat. The code and weights are publicly released (the GitHub repository is MIT-licensed), supporting open reuse. A key current limitation is that the model operates on mature RNA sequences only and does not model secondary structure, protein interactions, or co-transcriptional folding, areas where complementary tools remain necessary.

Citation

Orthrus: Towards Evolutionary and Functional RNA Foundation Models

Preprint

Fradkin, P., et al. (2024). Orthrus: Towards Evolutionary and Functional RNA Foundation Models. bioRxiv.

DOI: 10.1101/2024.10.10.617658

Metrics

GitHub

  • Stars: 107
  • Forks: 19
  • Open Issues: 1
  • Contributors: 2
  • Last Push: 4 months ago
  • Language: Python
  • License: MIT

Citations

  • Total Citations: 16
  • Influential: 3
  • References: 94

HuggingFace

  • Downloads: 222
  • Likes: 1
  • Last Modified: 3 months ago

Tags

contrastive learning, foundation model, sequence modeling, RNA structure

Resources

  • GitHub Repository
  • Research Paper
  • HuggingFace Model
  • Dataset