structRFM

University of Science and Technology of China

RNA foundation model pretrained jointly on sequences and secondary structures for structure prediction, homology and splice site classification.

Released: August 2025

Parameters: 86 Million

structRFM is a structure-guided RNA foundation model developed by researchers at the University of Science and Technology of China (USTC), with S. Kevin Zhou as corresponding author, and released as a bioRxiv preprint in August 2025. It addresses a recurring limitation of sequence-only RNA language models: because RNA function is largely dictated by how a molecule folds, models trained on nucleotide sequences alone struggle to internalize the base-pairing interactions that drive structural and functional behavior. structRFM closes this gap by jointly pretraining on RNA sequences and their secondary structures, baking folding information directly into the learned representations.

The model is pretrained from scratch on approximately 21 million RNA sequence–structure pairs drawn from RNAcentral, with secondary structures supplied by an ensemble annotation pipeline. Its central innovation is a structure-guided masked language modeling (SgMLM) objective that incorporates base-pairing interactions through a pair-matching operation, dynamically balancing sequence-level and structure-level masking during training. To mitigate the annotation bias inherent in any single structure predictor, structRFM uses MUSES, a multi-source ensemble that integrates thermodynamics-based, probability-based, and deep-learning-based secondary-structure predictors.
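The pair-matching idea behind SgMLM can be illustrated with a small sketch. This is not the authors' implementation: it assumes structures are given in dot-bracket notation without pseudoknots, and the names `pair_table`, `pair_matched_mask`, the `<mask>` token, and the 15% default ratio are hypothetical choices made for illustration. The dynamic balance between sequence-level and structure-level masking described above is not modeled here.

```python
import random

MASK = "<mask>"  # hypothetical mask token name, not taken from the structRFM code


def pair_table(dot_bracket):
    """Map every paired index to its partner for a pseudoknot-free dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return partner


def pair_matched_mask(seq, dot_bracket, mask_ratio=0.15, rng=None):
    """Choose seed positions at random; if a seed is base-paired, mask its partner too."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    rng = rng or random.Random()
    partner = pair_table(dot_bracket)
    n_seed = max(1, round(mask_ratio * len(seq)))  # assumed ratio, at least one seed
    seeds = rng.sample(range(len(seq)), n_seed)
    masked = set(seeds)
    for i in seeds:
        if i in partner:
            masked.add(partner[i])  # the pair-matching step
    tokens = [MASK if i in masked else nt for i, nt in enumerate(seq)]
    return tokens, sorted(masked)
```

Calling `pair_matched_mask("GGGAAACCC", "(((...)))")` returns the masked token list and the masked indices; whenever one side of a stem is hidden, the other side is hidden as well, so the model cannot recover a base simply by reading its partner.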

In the crowded landscape of RNA language models — alongside RNA-FM, RNAErnie, ERNIE-RNA, RiNALMo, and AIDO.RNA — structRFM is distinguished by being fully open (model weights and the complete training dataset are released) and by deriving a tertiary-structure predictor, Zfold, that is competitive with AlphaFold3 on standard RNA structure benchmarks. The authors position it as a general-purpose backbone spanning zero-shot, structural, and functional RNA inference tasks.

Key Features

Structure-guided pretraining (SgMLM): A pair-matching masked language modeling objective injects base-pairing interactions into pretraining, dynamically balancing sequence-level and structure-level masking so the model learns folding propensity alongside sequence identity.
MUSES ensemble annotation: Training-set secondary structures are derived from a multi-source ensemble combining thermodynamic, probabilistic, and deep-learning predictors, reducing the annotation bias that affects models relying on a single structure tool.
Top-tier zero-shot homology classification: Among the RNA language models evaluated, structRFM ranks at the top for zero-shot homology classification, demonstrating that its representations capture evolutionary relationships without task-specific fine-tuning.
State-of-the-art secondary structure prediction: structRFM sets new benchmarks for RNA secondary structure prediction and supports zero-shot 2D structure inference directly from its pairwise feature outputs.
Zfold tertiary-structure predictor: A derived 3D predictor, Zfold, achieves consistent gains over AlphaFold3 — including roughly a 19% improvement on the RNA-Puzzles dataset — and remains competitive on CASP15 and CASP16 RNA targets.
Strong functional inference: On internal ribosome entry site (IRES) identification, structRFM delivers an F1-score improvement of roughly 48% over prior approaches, and also supports splice site prediction and ncRNA classification.
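To make the MUSES ensemble idea from the list above more concrete, the following sketch combines several secondary-structure predictions by majority vote over base pairs. The source does not state how MUSES merges its thermodynamic, probabilistic, and deep-learning predictors, so the voting rule, the `min_votes` threshold, and all function names here are assumptions chosen only to show how a multi-source consensus can dampen the bias of a single tool.

```python
from collections import Counter


def pairs_from_dot_bracket(dot_bracket):
    """Return the set of (i, j) base pairs in a balanced, pseudoknot-free dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs


def consensus_pairs(predictions, min_votes=2):
    """Keep pairs predicted by at least `min_votes` tools (assumed rule, not MUSES itself)."""
    votes = Counter()
    for dot_bracket in predictions:
        votes.update(pairs_from_dot_bracket(dot_bracket))
    chosen, used = [], set()
    # Visit the best-supported pairs first; break ties by position for determinism.
    for (i, j), count in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_votes and i not in used and j not in used:
            chosen.append((i, j))
            used.update((i, j))
    return sorted(chosen)


def to_dot_bracket(pairs, length):
    """Render pairs as dot-bracket; only meaningful when the pairs do not cross."""
    out = ["."] * length
    for i, j in pairs:
        out[i], out[j] = "(", ")"
    return "".join(out)


# Three hypothetical predictor outputs for the same 9-nt sequence.
predictions = ["(((...)))", "((.....))", ".((...))."]
consensus = consensus_pairs(predictions, min_votes=2)
```

Raising `min_votes` makes the labels more conservative: pairs that only one predictor proposes are dropped, at the cost of leaving more positions unpaired.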

Technical Details

structRFM is a BERT-style encoder transformer with 12 layers, a hidden dimension of 768, and 12 attention heads (approximately 86 million parameters), with a maximum input length of 514 tokens corresponding to RNA sequences up to roughly 512 nucleotides. Longer RNAs (up to ~3,000 nt) are handled through a sliding-window strategy at inference. The training corpus consists of ~21 million sequence–structure pairs assembled from RNAcentral, filtered to sequences of 512 nucleotides or fewer, with secondary-structure labels produced by the MUSES ensemble. The model exposes three feature types — a classification-level feature, a sequence-level feature, and a pairwise matrix feature — that serve as flexible interfaces for downstream tasks.
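Two of the figures above lend themselves to a quick sanity check, sketched below under stated assumptions. `encoder_param_count` estimates the size of a BERT-style encoder; the feed-forward width of 3072 (four times the hidden size) and the small vocabulary of 10 tokens are assumptions, since neither is given in the text. `sliding_windows` shows one plausible way to cover a long RNA with 512-nucleotide windows; the stride of 256 and the right-aligned final window are illustrative choices, and how structRFM merges per-window outputs is not described here.

```python
def encoder_param_count(layers=12, hidden=768, ffn=3072, max_pos=514, vocab=10):
    """Rough parameter count (weights and biases) for a BERT-style encoder.

    The number of attention heads does not appear: it changes how the hidden
    dimension is split, not how many parameters there are.
    """
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # two LayerNorms per layer, scale and shift
    per_layer = attention + feed_forward + layer_norms
    # Token and position embedding tables plus one embedding LayerNorm.
    embeddings = (vocab + max_pos) * hidden + 2 * hidden
    return layers * per_layer + embeddings


def sliding_windows(seq, window=512, stride=256):
    """Return (start, subsequence) windows that together cover every position of seq."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    if len(seq) <= window:
        return [(0, seq)]
    starts = list(range(0, len(seq) - window + 1, stride))
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)  # right-align a final window over the tail
    return [(s, seq[s:s + window]) for s in starts]


estimated = encoder_param_count()
windows = sliding_windows("A" * 3000)
```

With these assumed defaults the estimate falls between 85 and 87 million parameters, which is consistent with the stated figure of approximately 86 million. The windowing helper keeps every chunk within the native input length while guaranteeing that no nucleotide of a 3,000 nt sequence is left uncovered.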

Across benchmarks, structRFM reports top-ranked zero-shot homology classification among the RNA language models compared, state-of-the-art secondary structure prediction, an approximately 48% F1 gain on IRES identification, and tertiary-structure results (via Zfold) that exceed AlphaFold3 by about 19% on RNA-Puzzles while remaining competitive on CASP15 and CASP16. Zfold is implemented as a downstream task module within the structRFM repository rather than as a separately packaged tool, building on the pretrained backbone's pairwise representations.
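As an illustration of how a pairwise matrix feature could be turned into a secondary structure, the sketch below greedily decodes base pairs from an L×L score matrix. It is a generic post-processing recipe, not the procedure used by structRFM or Zfold: the 0.5 threshold, the restriction to canonical and G-U wobble pairs, the minimum hairpin loop of three unpaired bases, and the function name `greedy_pairs` are all assumptions.

```python
CANONICAL = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),  # wobble pairs
}


def greedy_pairs(seq, scores, threshold=0.5, min_loop=3):
    """Pick the highest-scoring compatible pairs so that each base pairs at most once.

    `scores[i][j]` is treated as a pairing score for positions i < j; only the
    upper triangle is read.
    """
    n = len(seq)
    candidates = [
        (scores[i][j], i, j)
        for i in range(n)
        for j in range(i + min_loop + 1, n)  # leave at least `min_loop` bases between
        if scores[i][j] >= threshold and (seq[i], seq[j]) in CANONICAL
    ]
    # Best score first; ties resolved by position so the result is deterministic.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used, pairs = set(), []
    for _, i, j in candidates:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return sorted(pairs)


# Toy example with hand-written scores for a 9-nt hairpin.
seq = "GGGAAACCC"
scores = [[0.0] * len(seq) for _ in seq]
scores[0][8], scores[1][7], scores[2][6] = 0.9, 0.8, 0.7
scores[0][7] = 0.85  # competes with (0, 8) for base 0 and loses
scores[3][5] = 0.95  # rejected: A-A is not canonical and the loop is too short
decoded = greedy_pairs(seq, scores)
```

Greedy selection is simple but can produce crossing pairs on real score matrices; a dynamic-programming decoder would be needed to enforce a strictly nested structure.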

Applications

structRFM serves RNA biologists and computational researchers across structural and functional workflows. Structural biologists can use it to predict secondary structures and, via Zfold, to generate tertiary-structure hypotheses for non-coding RNAs prior to experimental determination. RNA therapeutics and synthetic biology researchers benefit from its functional inference capabilities — IRES identification is directly relevant to designing cap-independent translation elements for mRNA constructs, while splice site prediction supports the study of alternative splicing. Its zero-shot homology classification and ncRNA classification capabilities help annotate novel transcripts and organize RNA families. Because both the weights and the full training dataset are openly released, the model is well suited as a reproducible backbone for custom fine-tuning pipelines.

Impact

structRFM advances the RNA foundation model field by demonstrating that explicitly pairing sequences with ensemble-derived secondary structures during pretraining yields representations that transfer strongly across structural and functional tasks, and that such a model can derive a tertiary-structure predictor competitive with AlphaFold3. Its fully open release — pretrained weights, the ~21M-pair training dataset on Zenodo, a HuggingFace model card, and code under an MIT license — sets a high bar for reproducibility in a field where training data is often withheld. As a preprint, its benchmark claims await peer review, and the model inherits practical constraints: a 512-nucleotide native window relying on sliding-window inference for longer RNAs, and structural supervision that is only as reliable as the MUSES ensemble that generated it. Even so, structRFM offers the community a transparent, structure-aware backbone spanning homology, structure, and functional RNA inference.

Citation

A fully open structure-guided RNA foundation model for robust structural and functional inference

Preprint

Zhu, H., et al. (2025) A fully open structure-guided RNA foundation model for robust structural and functional inference. bioRxiv.

DOI: 10.1101/2025.08.06.668731

Recent citations

Papers that recently cited this model.

Detecting and quantifying overparametrization in RNA language models with REDIAL
Da Teng, Yunrui Qiu, Gokulakannan Sakthivel, et al.
bioRxiv · May 2026 · Citations: 1

Machine learning for RNA secondary structure prediction: a review of current methods and challenges
Giuseppe Sacco, Giovanni Bussi, Guido Sanguinetti
RNA: A publication of the RNA Society · Nov 2025 · Citations: 2

NABench: Large-Scale Benchmarks of Nucleotide Foundation Models for Fitness Prediction
Zhongmin Li, Runze Ma, Jia W. Tan, et al.
arXiv.org · Nov 2025 · Citations: 1

Top citations

The most-cited papers that cite this model.

Machine learning for RNA secondary structure prediction: a review of current methods and challenges
Giuseppe Sacco, Giovanni Bussi, Guido Sanguinetti
RNA: A publication of the RNA Society · Nov 2025 · Citations: 2

Detecting and quantifying overparametrization in RNA language models with REDIAL
Da Teng, Yunrui Qiu, Gokulakannan Sakthivel, et al.
bioRxiv · May 2026 · Citations: 1

NABench: Large-Scale Benchmarks of Nucleotide Foundation Models for Fitness Prediction
Zhongmin Li, Runze Ma, Jia W. Tan, et al.
arXiv.org · Nov 2025 · Citations: 1

Citations

Total Citations: 6

Influential: 0

References: 72

GitHub

Stars: 35

Forks: 3

Open Issues: 0

Contributors: 1

Last Push: 1mo ago

Language: Python

License: MIT

HuggingFace

Downloads: 32

Likes: 1

Last Modified: 6mo ago

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 67%
Physics: 33%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

Openness score: 92 (Open)

Usability (can I run it?): 100

Reproducibility (can I retrain it?): 92

Model Openness Framework

Class II

Open Tooling

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset
