An RNA language model, trained by self-supervised masked-nucleotide prediction on ~50,000 IRES sequences, that predicts secondary-structure features with accuracy reported to rival experimental chemical probing.
Albatross is an RNA language model that predicts secondary-structure features of internal ribosome entry sites (IRESes) directly from sequence, with accuracy reported to be comparable to experimental chemical probing. IRESes are structured RNA elements that recruit the ribosome and initiate translation independently of the canonical 5' cap, making them valuable for mRNA therapeutics, synthetic biology, and the study of viral and cellular gene expression. Because the structure of an IRES is tightly coupled to its activity, the lack of fast, accurate structure prediction is a long-standing bottleneck: experimental probing methods such as SHAPE and DMS-MaPseq are informative but laborious and difficult to scale across large element libraries.
The model was developed in the Rouskin Lab in the Department of Microbiology at Harvard Medical School and described in a bioRxiv preprint posted in May 2026 (Sychla, Bongrand, Yang, Rulison, Wesselhoeft, Bisaria, and Rouskin). Albatross is trained by self-supervised masked-nucleotide prediction on roughly 50,000 IRES sequences, learning the statistical regularities of IRES sequence and folding without explicit structural labels. Once trained, it generalizes to new sequences without re-training, allowing structural features to be inferred across very large sequence collections.
In the landscape of RNA foundation models, alongside RNA-FM, ERNIE-RNA, and 5' UTR-LM, Albatross is distinguished by its narrow specialization on IRES biology and by its emphasis on turning learned representations into a practical, high-throughput structural mapping pipeline rather than serving as a general-purpose RNA encoder. Note that this work is a preprint and has not yet completed peer review.
Albatross is trained with a self-supervised masked-nucleotide prediction objective on a corpus of about 50,000 IRES sequences; this objective lets the model learn sequence and structural regularities without requiring labeled structures. The preprint does not state the underlying base model (that is, whether Albatross is trained from scratch or fine-tuned from a general-purpose RNA language model), nor does it report the parameter count, so these architectural details are currently unspecified. After training, the model was used to produce structural maps for approximately 75,000 full-length IRES elements, of which 96 were experimentally validated, and to surface a high-activity Type V IRES class reported to have roughly twice the activity of the EMCV standard.
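For readers unfamiliar with the objective, the following is a minimal Python sketch of masked-nucleotide prediction in general; it is not Albatross's implementation. The 15% mask rate, the `<mask>` token, and the function names `mask_sequence` and `masked_cross_entropy` are illustrative assumptions, since none of these details are specified in the material summarized here.

```python
import math
import random

NUCLEOTIDES = "ACGU"
MASK_TOKEN = "<mask>"  # hypothetical mask symbol, chosen for illustration


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of nucleotides and remember the originals.

    Returns (tokens, targets): tokens is the sequence as a list with some
    positions replaced by MASK_TOKEN; targets maps each masked position to
    the nucleotide that was there. mask_rate=0.15 is an assumed value.
    """
    if not seq:
        raise ValueError("sequence must be non-empty")
    rng = rng or random.Random()
    # Mask at least one position, and never more than the sequence length.
    n_mask = min(len(seq), max(1, round(len(seq) * mask_rate)))
    positions = rng.sample(range(len(seq)), n_mask)
    tokens = list(seq)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]
        tokens[i] = MASK_TOKEN
    return tokens, targets


def masked_cross_entropy(pred_probs, targets):
    """Average negative log-likelihood of the true nucleotide at masked positions.

    pred_probs maps each masked position to a dict {nucleotide: probability},
    standing in for the output of whatever model is being trained.
    """
    total = -sum(math.log(pred_probs[i][nt]) for i, nt in targets.items())
    return total / len(targets)


def uniform_predictions(targets):
    """A know-nothing baseline: equal probability for A, C, G, and U."""
    return {i: {nt: 1.0 / len(NUCLEOTIDES) for nt in NUCLEOTIDES} for i in targets}
```

The training signal comes only from the sequence itself: the model sees the masked tokens and is scored on how much probability it assigns to the hidden nucleotides. A predictor that guesses uniformly scores ln 4 (about 1.386) per masked position, so any model that has learned sequence or base-pairing regularities should score lower.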
As of this writing, no public code or model weights have been confirmed: the Rouskin Lab GitHub organization does not yet host an Albatross repository. The preprint is released under a CC BY license, but the license that would govern any released model weights is unknown. Researchers seeking to reproduce or build on the work should monitor the lab's repositories for a future release.
Albatross is most directly useful for researchers engineering or characterizing IRES elements. In mRNA therapeutics and synthetic biology, where cap-independent translation can be exploited for multicistronic constructs or circular RNA designs, fast structural prediction helps prioritize candidates before costly experimental validation. The reported Type V IRES class, with roughly double the activity of the common EMCV benchmark, points to immediate utility in maximizing protein output per transcript. More broadly, the high-throughput structural mapping approach offers virologists and RNA biologists a way to triage and annotate large IRES libraries that would be infeasible to probe experimentally in full.
By demonstrating that a self-supervised model trained on IRES sequences can predict structural features at accuracy comparable to chemical probing, and then using that model to map tens of thousands of elements and surface a high-activity IRES class, Albatross illustrates how specialized RNA language models can compress experimental structural workflows into scalable in-silico pipelines. The discovery of Type V IRESes with roughly twice the activity of the EMCV standard is a concrete payoff with relevance to mRNA and gene-therapy design. The work's near-term impact is tempered by its preprint status and by the absence of confirmed public code, weights, a stated base architecture, or a parameter count, all of which limit independent reproduction until the authors release additional artifacts.
Sychla, A., et al. (2026) An RNA Language Model trained on sequence alone reveals the structural logic of Internal Ribosome Entry Sites. bioRxiv.
DOI: 10.64898/2026.05.19.726202