bio.rodeo
RNA

RNA-FM

AI for Science (PKU)

A BERT-based RNA foundation model trained on 23.7 million non-coding RNA sequences, producing embeddings for structure prediction, functional annotation, and RNA design.

Released: 2022
Parameters: 100,000,000

Overview

RNA-FM is a foundation model for non-coding RNA developed by the ml4bio group at Peking University. Released in 2022, it addresses a central challenge in RNA biology: extracting structural and functional information from RNA sequences without relying on costly experimental annotations. By training a BERT-style encoder on 23.7 million non-coding RNA sequences from RNAcentral100 using masked-token prediction, RNA-FM learns general-purpose representations that capture secondary structure, three-dimensional proximity, and evolutionary conservation signals simultaneously.

The model occupies an important position in the RNA modeling landscape as one of the first large-scale foundation models designed exclusively for non-coding RNA. Prior to RNA-FM, computational RNA analysis depended heavily on energy minimization methods or smaller supervised models trained on narrow datasets. RNA-FM demonstrated that self-supervised pretraining on large, unannotated RNA corpora can generate embeddings that substantially outperform prior single-sequence approaches across a range of structure and function benchmarks.

RNA-FM has since anchored a broader ecosystem of downstream tools. Most notably, its embeddings serve as the backbone for RhoFold and RhoFold+, a state-of-the-art RNA 3D structure prediction pipeline published in Nature Methods in 2024. The companion model mRNA-FM extends the same pretraining approach to 45 million messenger RNA coding sequences using codon-based tokenization, broadening coverage beyond non-coding RNA.

Key Features

  • Self-supervised pretraining: Trained on 23.7 million non-coding RNA sequences from RNAcentral100 without experimental labels, using masked-token prediction to learn intrinsic RNA sequence patterns.
  • Multi-scale embeddings: The 640-dimensional per-nucleotide embeddings jointly encode secondary structure base-pairing, 3D spatial proximity, and evolutionary conservation, enabling use across structurally distinct RNA families.
  • Broad task coverage: A single pretrained model generalizes to structure prediction, RNA family clustering, subcellular localization, RNA-protein interaction modeling, and inverse RNA design without task-specific architectural changes.
  • Single-sequence and MSA modes: Supports inference from individual sequences alone or with multiple sequence alignment (MSA) context when homologs are available, allowing flexibility across data-rich and data-sparse settings.
  • Interpretable representations: Embeddings correlate with experimentally known structural and functional features, supporting mechanistic hypothesis generation rather than purely black-box prediction.
  • Extensible ecosystem: Serves as the pretrained backbone for RhoFold+, RiboDiffusion, and RhoDesign, enabling downstream tools to leverage a shared high-quality RNA representation.

Technical Details

RNA-FM implements a 12-layer BERT encoder with approximately 100 million parameters and a hidden dimension of 640. The model uses standard bidirectional self-attention over nucleotide tokens drawn from a 25-token vocabulary covering the four canonical RNA bases, modified nucleotides, and special tokens. Pretraining followed the masked language modeling objective on 23.7 million sequences from RNAcentral release 100, representing diverse non-coding RNA classes including rRNA, tRNA, lncRNA, snoRNA, and viral RNAs across multiple taxonomic domains.
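The masked language modeling setup described above can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the published implementation: the toy vocabulary (a handful of special tokens plus the four canonical bases) stands in for the model's actual 25-token vocabulary, and the 80/10/10 corruption split is the standard BERT convention assumed here.

```python
import random

# Toy nucleotide vocabulary; the real RNA-FM vocabulary has 25 tokens
# (canonical bases, modified/ambiguous nucleotides, special tokens).
# The exact token list below is illustrative, not the published one.
SPECIALS = ["<cls>", "<pad>", "<eos>", "<unk>", "<mask>"]
BASES = ["A", "C", "G", "U"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + BASES)}
MASK_ID = VOCAB["<mask>"]

def tokenize(seq):
    """Map an RNA sequence to token ids with <cls>/<eos> framing."""
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get(ch, VOCAB["<unk>"]) for ch in seq.upper()]
    ids.append(VOCAB["<eos>"])
    return ids

def mask_for_mlm(ids, mask_prob=0.15, rng=None):
    """BERT-style corruption: select ~15% of base positions; of those,
    80% become <mask>, 10% a random base, 10% stay unchanged.
    Returns (corrupted ids, labels), with label -100 at positions
    that do not contribute to the loss."""
    rng = rng or random.Random(0)
    corrupted, labels = list(ids), [-100] * len(ids)
    base_ids = [VOCAB[b] for b in BASES]
    protected = {VOCAB["<cls>"], VOCAB["<eos>"], VOCAB["<pad>"]}
    for i, tok in enumerate(ids):
        if tok in protected:
            continue
        if rng.random() < mask_prob:
            labels[i] = tok  # model must recover the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID
            elif r < 0.9:
                corrupted[i] = rng.choice(base_ids)
            # else: leave the token in place (the 10% "unchanged" case)
    return corrupted, labels
```

During pretraining, the encoder sees the corrupted sequence and is trained to predict the original token at every labeled position, which is what forces the embeddings to absorb base-pairing and conservation signal.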

In benchmark evaluations reported in the original preprint, RNA-FM outperformed all tested single-sequence RNA language models on both structure-related tasks (secondary structure prediction, 3D contact map prediction) and function-related tasks (RNA family classification, subcellular localization, RNA-protein binding). The companion mRNA-FM model uses 12 layers with 1,280 hidden dimensions and codon-level tokenization, trained on 45 million mRNA coding sequences to capture translation-relevant signals absent in the non-coding model. Both models are available as open-source PyTorch implementations with pretrained weights on the ml4bio GitHub repository.
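Codon-level tokenization, as used by mRNA-FM, can be sketched as follows. The special tokens and vocabulary ordering here are assumptions for illustration; only the core idea, treating each nucleotide triplet of a coding sequence as one token from a 64-codon alphabet, reflects the description above.

```python
from itertools import product

# All 64 codons over the RNA alphabet, plus assumed special tokens.
CODONS = ["".join(c) for c in product("ACGU", repeat=3)]
CODON_VOCAB = {tok: i for i, tok in
               enumerate(["<cls>", "<eos>", "<unk>"] + CODONS)}

def codon_tokenize(cds):
    """Split a coding sequence into codons and map each to a token id.
    Assumes a complete CDS, i.e. a length divisible by 3."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3].upper() for i in range(0, len(cds), 3)]
    return ([CODON_VOCAB["<cls>"]]
            + [CODON_VOCAB.get(c, CODON_VOCAB["<unk>"]) for c in codons]
            + [CODON_VOCAB["<eos>"]])
```

Tokenizing at the codon level rather than per nucleotide shortens sequences threefold and aligns the model's vocabulary with the unit of translation, which is why it suits mRNA coding regions but not non-coding RNA.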

Applications

RNA-FM embeddings integrate into research workflows as fixed feature extractors or fine-tuning starting points. Structural biologists use RhoFold+, which builds on RNA-FM, to predict tertiary structures of RNA molecules including riboswitches, ribozymes, and viral elements — bypassing the historically expensive experimental structure determination pipeline. RNA engineers use RiboDiffusion and RhoDesign, both RNA-FM-based tools, for inverse RNA design tasks such as scaffold design and sequence optimization, with reported improvements in sequence recovery rates of 11–50% over prior methods. Computational biologists apply the model's embeddings to predict RNA subcellular localization and RNA-protein interactions as components in larger annotation pipelines. The model has also been applied to SARS-CoV-2 genomic RNA analysis and has inspired domain-specific adaptations such as PlantRNA-FM, a model pretrained on RNA from 1,124 plant species for plant-specific regulatory RNA research.
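The fixed-feature-extractor workflow amounts to collapsing the model's (L, 640) per-nucleotide embedding matrix into one sequence-level vector per RNA, then feeding those vectors to an ordinary classifier. A minimal sketch, assuming embeddings have already been computed (the arrays here are placeholders, and dropping the first/last rows as special-token positions is a common convention rather than something the paper prescribes):

```python
import numpy as np

EMBED_DIM = 640  # RNA-FM per-nucleotide embedding width

def pool_embedding(per_nt_emb, drop_first_last=True):
    """Mean-pool an (L, 640) per-nucleotide embedding matrix into a
    single 640-d sequence-level feature vector. Optionally drop the
    first and last rows, assumed to be special-token positions."""
    emb = per_nt_emb[1:-1] if drop_first_last else per_nt_emb
    return emb.mean(axis=0)

def build_feature_matrix(embeddings):
    """Stack pooled vectors for N sequences into an (N, 640) matrix,
    ready for any downstream model (e.g. a logistic-regression
    classifier for subcellular localization)."""
    return np.stack([pool_embedding(e) for e in embeddings])
```

Because the backbone stays frozen, this route is cheap: embeddings are computed once per sequence and reused across every downstream task in an annotation pipeline.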

Impact

RNA-FM established self-supervised learning on large unannotated RNA corpora as a viable and productive strategy for RNA biology, analogous to what ESM-2 demonstrated for protein sequences. Its release catalyzed a cluster of derivative tools — RhoFold+, RiboDiffusion, RhoDesign, PlantRNA-FM — collectively advancing both prediction and design capabilities for RNA. The work highlighted that foundation model approaches could generalize across structurally diverse RNA families, a non-obvious result given the greater structural heterogeneity of non-coding RNA compared to proteins. Key limitations include the model's focus on non-coding RNA, which limits direct applicability to mRNA-specific tasks without the separate mRNA-FM variant, and the absence of explicit 3D structural supervision during pretraining, meaning downstream structure prediction tools still require dedicated geometric modules. Nonetheless, RNA-FM remains a widely used reference model in RNA computational biology and a benchmark point for subsequent RNA foundation models.

Citation

Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions

Preprint

Chen, J., Hu, Z., Sun, S., Tan, Q., Wang, Y., Yu, Q., Zong, L., Hong, L., Xiao, J., Shen, T., King, I., & Li, Y. (2022). Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions. bioRxiv. https://doi.org/10.1101/2022.08.06.503062

DOI: 10.1101/2022.08.06.503062

Metrics

GitHub

Stars: 362
Forks: 42
Open Issues: 16
Contributors: 4
Last Push: 11 months ago
Language: Jupyter Notebook
License: MIT

Citations

Total Citations: 225
Influential: 41
References: 70

Tags

structure prediction, foundation model, language model

Resources

GitHub Repository
Research Paper