A BERT-based model for RNA base embeddings that captures sequence context and secondary structure, enabling fast structural alignment and clustering.
RNABERT is a transformer-based language model developed at Keio University that applies BERT-style pre-training to RNA sequences. Published in 2022, it addresses a longstanding bottleneck in RNA bioinformatics: structural alignment. Traditional methods for aligning RNA sequences depend on the Sankoff algorithm, which explicitly models both sequence similarity and secondary structure but is computationally prohibitive at scale. RNABERT takes a fundamentally different approach — it learns rich, context-aware representations of individual RNA bases, then uses those representations with a simple and fast alignment algorithm to achieve state-of-the-art structural alignment accuracy.
What sets RNABERT apart from a straightforward application of BERT to nucleotide sequences is its dual pre-training strategy. Rather than relying solely on masked language modeling, the model additionally incorporates structural information through a Structural Alignment Learning (SAL) objective. This trains the model so that bases occupying structurally equivalent positions across related RNA families are embedded close together in vector space, even when their primary sequences differ. The result is a 120-dimensional embedding for each base that jointly encodes local sequence context and structural propensity.
The model was trained on 72,237 human non-coding RNA sequences, augmented with masking patterns to 722,370 effective training examples, with structural supervision drawn from seed alignments in the Rfam database. Rfam provides curated RNA families with known secondary structures, making it an appropriate source of structural ground truth for the SAL pre-training objective.
RNABERT uses a 6-layer transformer encoder architecture adapted from BERT. Each layer applies multi-head self-attention over the full sequence, enabling bidirectional context modeling that captures dependencies between distant bases. The embedding dimension is 120 throughout, far smaller than that of large protein language models, reflecting the small vocabulary (four canonical RNA bases) and the relatively compact training corpus of non-coding RNA sequences.
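The following is a minimal sketch of an encoder with this shape in PyTorch. The number of attention heads, feed-forward width, maximum sequence length, and special-token handling are illustrative assumptions, not values taken from the RNABERT implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of an RNABERT-like encoder: 6 transformer layers, 120-dim
# per-base embeddings over a 4-letter nucleotide vocabulary plus special
# tokens. Head count, feed-forward size, and max length are assumptions.

VOCAB = {"PAD": 0, "MASK": 1, "A": 2, "C": 3, "G": 4, "U": 5}

class RnaEncoder(nn.Module):
    def __init__(self, d_model=120, n_layers=6, n_heads=12, max_len=440):
        super().__init__()
        self.tok = nn.Embedding(len(VOCAB), d_model, padding_idx=VOCAB["PAD"])
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids):                      # (batch, length)
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok(token_ids) + self.pos(pos_ids)    # (batch, length, 120)
        pad_mask = token_ids.eq(VOCAB["PAD"])          # ignore padding
        return self.encoder(x, src_key_padding_mask=pad_mask)
```

Each position in the output is the 120-dimensional base embedding that downstream alignment and clustering operate on.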
Pre-training proceeds in two stages. The MLM task randomly masks 15% of input bases and optimizes the model to reconstruct them from bidirectional context, mirroring the original BERT procedure. The SAL task takes pairs of RNA sequences from the same Rfam family — sequences known to share structural similarity — and applies a contrastive-style objective encouraging bases at aligned positions to have similar embeddings. This directly encodes secondary structure co-variation signals that are absent from sequence alone. At inference, embeddings from the final transformer layer are extracted and used as features for alignment via a standard sequence alignment algorithm, bypassing the need for explicit structure prediction.
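A compact sketch of the two objectives is shown below, assuming the encoder above plus a hypothetical `lm_head` linear layer that projects 120-dimensional states back to the vocabulary. The exact loss formulations used by RNABERT may differ; the SAL term here is a simplified pull-together objective over aligned positions only.

```python
import torch
import torch.nn.functional as F

# Illustrative sketches of the two pre-training objectives. `model` is the
# encoder above; `lm_head` is a hypothetical nn.Linear(120, len(VOCAB)).

def mlm_loss(model, lm_head, token_ids, mask_prob=0.15):
    """Masked language modeling: hide ~15% of bases, reconstruct from context."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~masked] = -100                       # score only masked positions
    corrupted = token_ids.masked_fill(masked, VOCAB["MASK"])
    logits = lm_head(model(corrupted))           # (batch, length, vocab)
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)

def sal_loss(model, seq_a, seq_b, aligned_pairs):
    """Structural alignment learning: pull aligned bases together.

    `aligned_pairs` holds (i, j) index pairs taken from an Rfam seed
    alignment of the two sequences (a hypothetical input format).
    """
    emb_a, emb_b = model(seq_a), model(seq_b)    # (1, length, 120) each
    idx_a = torch.tensor([i for i, _ in aligned_pairs])
    idx_b = torch.tensor([j for _, j in aligned_pairs])
    sim = F.cosine_similarity(emb_a[0, idx_a], emb_b[0, idx_b], dim=-1)
    return (1.0 - sim).mean()                    # higher similarity, lower loss
```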
The parameter count is modest in absolute terms: the 6-layer, 120-dimensional architecture is intentionally lightweight, which contributes to fast inference. Training was performed on human non-coding RNAs from publicly available databases, with Rfam providing the structural alignment supervision.
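Given per-base embeddings for two sequences, the alignment step reduces to a standard dynamic-programming pass over a pairwise similarity matrix. Below is a minimal Needleman-Wunsch-style sketch that uses cosine similarity as the match score; the gap penalty and scoring details are assumptions for illustration, not the exact scheme described in the paper.

```python
import numpy as np

# Global alignment over embedding similarity: a minimal sketch of how
# per-base embeddings can drive fast structural alignment. Gap penalty
# and cosine scoring are illustrative assumptions.

def cosine_matrix(emb_a, emb_b):
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T                                  # (len_a, len_b)

def align_score(emb_a, emb_b, gap=-1.0):
    S = cosine_matrix(emb_a, emb_b)
    n, m = S.shape
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = gap * np.arange(n + 1)
    dp[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(dp[i - 1, j - 1] + S[i - 1, j - 1],
                           dp[i - 1, j] + gap,
                           dp[i, j - 1] + gap)
    return dp[n, m]          # alignment score; traceback omitted for brevity
```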
RNABERT is particularly well suited for researchers working with non-coding RNAs, where structural conservation rather than sequence identity is often the biologically meaningful signal. Structural alignment of RNA families is the primary application, enabling high-throughput comparison of sequences across organisms where conventional sequence-based alignment would fail to detect homologs. The clustering capabilities allow unsupervised grouping of RNA sequences into functional families based on structural similarity, which is valuable for annotating novel transcriptomes. The pre-trained embeddings also serve as features for downstream classification tasks such as identifying RNA types (rRNA, tRNA, snRNA, lncRNA) or annotating functional regions within non-coding transcripts.
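As an illustration of the clustering use case, per-base embeddings can be mean-pooled into fixed-length sequence vectors and grouped with an off-the-shelf method such as k-means; the pooling choice and cluster count below are user decisions, not part of RNABERT itself.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unsupervised family clustering from per-base embeddings. Each element of
# `per_base_embeddings_list` is assumed to be a (length, 120) array produced
# by the encoder; mean pooling and k-means are illustrative choices.

def sequence_vector(per_base_embeddings):
    """Mean-pool base embeddings into one fixed-length vector per sequence."""
    return per_base_embeddings.mean(axis=0)

def cluster_rnas(per_base_embeddings_list, n_clusters=10, seed=0):
    X = np.stack([sequence_vector(e) for e in per_base_embeddings_list])
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return km.fit_predict(X)          # cluster label per input sequence
```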
RNABERT established the viability of self-supervised language model pre-training as a route to high-quality RNA structural representations, motivating subsequent larger-scale efforts in RNA foundation modeling. By demonstrating that a relatively compact model trained with a structurally informed objective could outperform dedicated algorithmic alignment tools, it shifted attention toward representation learning as an alternative to explicit structure computation. The model's open-source release on GitHub has made it accessible for reuse and benchmarking. A key limitation is the training corpus size and scope: 72,237 human non-coding RNA sequences is small compared to the datasets used in contemporary protein language models, and the model's performance on RNA types underrepresented in human databases or in Rfam may be limited. As the field has moved toward larger RNA models trained on broader sequence diversity, RNABERT serves as an important early demonstration of the structural alignment learning paradigm.
Akiyama, M., & Sakakibara, Y. (2022). Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genomics and Bioinformatics, 4(1), lqac012.
DOI: 10.1093/nargab/lqac012