bio.rodeo
RNA

RNABERT

Keio University

A BERT-based model for RNA base embeddings that captures sequence context and secondary structure, enabling fast structural alignment and clustering.

Released: 2022

Overview

RNABERT is a transformer-based language model developed at Keio University that applies BERT-style pre-training to RNA sequences. Published in 2022, it addresses a longstanding bottleneck in RNA bioinformatics: structural alignment. Traditional methods for aligning RNA sequences depend on the Sankoff algorithm, which explicitly models both sequence similarity and secondary structure but is computationally prohibitive at scale. RNABERT takes a fundamentally different approach — it learns rich, context-aware representations of individual RNA bases, then uses those representations with a simple and fast alignment algorithm to achieve state-of-the-art structural alignment accuracy.

What sets RNABERT apart from a straightforward application of BERT to nucleotide sequences is its dual pre-training strategy. Rather than relying solely on masked language modeling, the model additionally incorporates structural information through a Structural Alignment Learning (SAL) objective. This trains the model so that bases occupying structurally equivalent positions across related RNA families are embedded close together in vector space, even when their primary sequences differ. The result is a 120-dimensional embedding for each base that jointly encodes local sequence context and structural propensity.

The model was trained on 72,237 human non-coding RNA sequences, augmented with masking patterns to 722,370 effective training examples, with structural supervision drawn from seed alignments in the Rfam database. Rfam provides curated RNA families with known secondary structures, making it an appropriate source of structural ground truth for the SAL pre-training objective.
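The ratio 722,370 / 72,237 implies ten distinct masking patterns per sequence. A minimal sketch of that augmentation step, assuming BERT's standard 15% masking rate (function names here are hypothetical, not from the RNABERT codebase):

```python
import random

def mask_sequence(seq, rate=0.15, rng=None):
    """Return seq as a token list with ~15% of positions replaced by '<MASK>',
    plus the indices of the masked positions (the MLM prediction targets)."""
    rng = rng or random.Random()
    n_mask = max(1, round(len(seq) * rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    for i in positions:
        tokens[i] = "<MASK>"
    return tokens, positions

def augment(seq, n_patterns=10, seed=0):
    """Generate n_patterns independently masked copies of one RNA sequence."""
    rng = random.Random(seed)
    return [mask_sequence(seq, rng=rng) for _ in range(n_patterns)]

variants = augment("GGGAAACUUUCCC")  # 13-base toy sequence -> 10 masked copies
```

Each 13-base variant masks round(13 × 0.15) = 2 positions; across the corpus this yields the tenfold expansion described above.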

Key Features

  • Dual pre-training with MLM and SAL: Combines masked language modeling — predicting 15% of randomly masked bases from context — with structural alignment learning, which uses Rfam seed alignments to encode secondary structure directly into the embedding space.
  • 120-dimensional base embeddings: Each RNA base (A, C, G, U) is represented as a 120-dimensional vector that captures both positional context from the bidirectional transformer and structural state, forming distinct clusters in embedding space corresponding to known structural motifs.
  • Fast structural alignment: Because structural information is encoded in the embeddings rather than computed on-the-fly, RNABERT can use a computationally simple pairwise alignment algorithm over embeddings rather than the expensive Sankoff algorithm, dramatically reducing inference time.
  • Superior alignment accuracy: Benchmarked against multiple RNA structural alignment tools, RNABERT outperforms existing state-of-the-art methods across multiple accuracy measures, achieving better recall of structurally equivalent positions.
  • Transfer learning for downstream tasks: The pre-trained encoder can be fine-tuned for tasks beyond alignment, including RNA family clustering, classification, and secondary structure prediction, reducing labeled data requirements for specialized applications.
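The "fast structural alignment" feature amounts to running ordinary Needleman–Wunsch dynamic programming, but scoring each base pair by the similarity of its learned embeddings rather than a substitution matrix. A sketch of that idea, with random vectors standing in for RNABERT's 120-dimensional per-base embeddings (the scoring details here are illustrative, not the paper's exact scheme):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def align_score(emb_a, emb_b, gap=-1.0):
    """Global (Needleman-Wunsch) alignment score over per-base embeddings.
    emb_a: (n, d) embeddings of sequence A; emb_b: (m, d) embeddings of B."""
    n, m = len(emb_a), len(emb_b)
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = gap * np.arange(n + 1)   # leading gaps in B
    F[0, :] = gap * np.arange(m + 1)   # leading gaps in A
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = F[i - 1, j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
            F[i, j] = max(match, F[i - 1, j] + gap, F[i, j - 1] + gap)
    return F[n, m]

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 120))              # stand-in embeddings, 8-base RNA
score_self = align_score(a, a)             # identical embeddings align fully
score_rand = align_score(a, rng.normal(size=(8, 120)))
```

Because the structural signal is already baked into the embeddings, this O(nm) recurrence replaces the far costlier Sankoff-style joint sequence–structure alignment.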

Technical Details

RNABERT uses a 6-layer transformer encoder architecture adapted from BERT. Each layer applies multi-head self-attention over the full sequence, enabling bidirectional context modeling that captures dependencies between distant bases. The embedding dimension is 120 throughout, far smaller than that of large protein language models, reflecting the constrained vocabulary (four canonical RNA bases) and the relatively compact training corpus of non-coding RNA sequences.
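A back-of-envelope parameter count shows why a 6-layer, 120-dimensional encoder is lightweight. This assumes BERT's usual conventions (a 4× feed-forward expansion, i.e. 480 hidden units, and bias terms throughout); RNABERT's true hyperparameters may differ:

```python
d = 120          # embedding / hidden dimension
d_ff = 4 * d     # feed-forward width, assuming BERT's usual 4x expansion
layers = 6

attn = 4 * (d * d + d)                       # Q, K, V, output projections + biases
ffn = (d * d_ff + d_ff) + (d_ff * d + d)     # two linear layers of the FFN
norms = 2 * (2 * d)                          # two LayerNorms (scale + shift each)
per_layer = attn + ffn + norms
total = layers * per_layer                   # ~1.0M parameters before embeddings
print(per_layer, total)
```

Roughly a million encoder parameters, orders of magnitude below contemporary protein language models, which is consistent with the fast inference the paper reports.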

Pre-training proceeds in two stages. The MLM task randomly masks 15% of input bases and optimizes the model to reconstruct them from bidirectional context, mirroring the original BERT procedure. The SAL task takes pairs of RNA sequences from the same Rfam family — sequences known to share structural similarity — and applies a contrastive-style objective encouraging bases at aligned positions to have similar embeddings. This directly encodes secondary structure co-variation signals that are absent from sequence alone. At inference, embeddings from the final transformer layer are extracted and used as features for alignment via a standard sequence alignment algorithm, bypassing the need for explicit structure prediction.
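A toy version of the SAL idea: given two sequences from the same Rfam family and the positions their seed alignment pairs up, penalize distance between the embeddings at aligned positions. The sketch below uses a simple mean-squared-distance loss as a stand-in; the paper's exact objective may differ:

```python
import numpy as np

def sal_loss(emb_a, emb_b, aligned_pairs):
    """Mean squared distance between embeddings at structurally aligned
    positions -- an illustrative stand-in for RNABERT's SAL objective.
    aligned_pairs: list of (i, j) with base i of A aligned to base j of B."""
    diffs = [emb_a[i] - emb_b[j] for i, j in aligned_pairs]
    return float(np.mean([d @ d for d in diffs]))

rng = np.random.default_rng(1)
shared = rng.normal(size=(5, 120))                 # common "structural" signal
emb_a = shared + 0.01 * rng.normal(size=(5, 120))  # family member A
emb_b = shared + 0.01 * rng.normal(size=(5, 120))  # family member B
pairs = [(k, k) for k in range(5)]                 # identity alignment

low = sal_loss(emb_a, emb_b, pairs)                # aligned bases: small loss
high = sal_loss(emb_a, rng.normal(size=(5, 120)), pairs)
```

Minimizing such terms across many Rfam family pairs is what pushes structurally equivalent bases together in embedding space, even when their primary sequences diverge.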

Model parameters are modest in absolute terms — the 6-layer, 120-dimensional architecture is intentionally lightweight — which contributes to the fast inference speed. Training was performed on human non-coding RNAs from publicly available databases, with Rfam providing the structural alignment supervision.

Applications

RNABERT is particularly well suited for researchers working with non-coding RNAs, where structural conservation rather than sequence identity is often the biologically meaningful signal. Structural alignment of RNA families is the primary application, enabling high-throughput comparison of sequences across organisms where conventional sequence-based alignment would fail to detect homologs. The clustering capabilities allow unsupervised grouping of RNA sequences into functional families based on structural similarity, which is valuable for annotating novel transcriptomes. The pre-trained embeddings also serve as features for downstream classification tasks such as identifying RNA types (rRNA, tRNA, snRNA, lncRNA) or annotating functional regions within non-coding transcripts.
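The clustering use case can be sketched as k-means over mean-pooled per-sequence embeddings. The vectors below are synthetic stand-ins for pooled RNABERT outputs, and the deterministic initialization (one seed per expected family) is a simplification for the sketch; a production pipeline would use k-means++ or a library implementation:

```python
import numpy as np

def kmeans(X, k, init_idx, iters=10):
    """Minimal k-means over rows of X; init_idx picks the starting centers."""
    centers = X[init_idx].copy()
    for _ in range(iters):
        # assign each row to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Synthetic 120-dim "pooled sequence embeddings" from two well-separated families
rng = np.random.default_rng(2)
fam1 = rng.normal(loc=0.0, size=(10, 120))
fam2 = rng.normal(loc=2.0, size=(10, 120))
labels = kmeans(np.vstack([fam1, fam2]), k=2, init_idx=[0, 10])
```

With embeddings that encode structure, sequences from the same RNA family land in the same cluster even when their primary sequences have diverged.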

Impact

RNABERT established the viability of self-supervised language model pre-training as a route to high-quality RNA structural representations, motivating subsequent larger-scale efforts in RNA foundation modeling. By demonstrating that a relatively compact model trained with a structurally informed objective could outperform dedicated algorithmic alignment tools, it shifted attention toward representation learning as an alternative to explicit structure computation. The model's open-source release on GitHub has made it accessible for reuse and benchmarking. A key limitation is the training corpus's size and scope: 72,237 human non-coding RNA sequences is small compared to the datasets used in contemporary protein language models, and the model's performance on RNA types underrepresented in human databases or in Rfam may be limited. As the field has moved toward larger RNA models trained on broader sequence diversity, RNABERT serves as an important early demonstration of the structural alignment learning paradigm.

Citation

Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning

Akiyama, M., & Sakakibara, Y. (2022). Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genomics and Bioinformatics, 4(1), lqac012.

DOI: 10.1093/nargab/lqac012

Metrics

GitHub

Stars: 56
Forks: 17
Open Issues: 4
Contributors: 0
Last Push: 2 years ago
Language: Python

Citations

Total Citations: 119
Influential: 7
References: 42

Tags

structure prediction, foundation model, language model

Resources

  • GitHub Repository
  • Research Paper
  • Official Website