Peking University / Griffith University
Unsupervised RNA language model using multiple sequence alignments to predict secondary structure and solvent accessibility from evolutionary information.
RNA-MSM is an unsupervised RNA language model developed by researchers at Peking University and Griffith University that addresses a fundamental limitation of single-sequence RNA language models. While BERT-style models have proven effective for proteins, RNA sequences are far less conserved across species, which means single-sequence approaches miss much of the evolutionary signal that encodes structural constraints. RNA-MSM resolves this by operating on multiple sequence alignments (MSAs) of homologous RNA sequences rather than individual sequences, following the architectural logic that the MSA Transformer (ESM-MSA-1b) brought to protein modeling.
The model was trained on MSAs derived from 3,932 Rfam families, with alignments generated automatically through a pipeline called RNAcmap3. RNAcmap3 combines BLAST-N, Infernal, and secondary structure prediction tools such as RNAfold to identify and align homologous sequences without manual curation. Published in Nucleic Acids Research in January 2024, RNA-MSM demonstrated that its unsupervised representations carry structural information that can be decoded with high accuracy — and that fine-tuning on downstream tasks surpasses existing state-of-the-art methods.
A distinctive property of RNA-MSM is that its two output types — two-dimensional attention maps and one-dimensional sequence embeddings — align naturally with two different classes of structural prediction targets. Attention maps encode pairwise residue relationships and correlate directly with base pairing probabilities, while embeddings capture local structural context relevant to solvent accessibility. This makes the model a flexible backbone for multiple downstream tasks without requiring separate pre-training for each.
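To make the attention-map pathway concrete, the sketch below collapses per-layer, per-head attention into a single symmetric pairing-score matrix. This is an illustrative stand-in, not the paper's actual decoder: the published predictor is trained on top of these features, whereas here we simply average and symmetrize. The array shapes (10 layers, 12 heads) follow the architecture described below; the function name and toy data are assumptions.

```python
import numpy as np

def attention_to_pairing_scores(attentions):
    """Collapse per-layer, per-head attention maps into one symmetric
    L x L pairing-score matrix.

    attentions: array of shape (layers, heads, L, L), standing in for the
    attention maps an RNA-MSM-style model produces for the query sequence.
    Averaging and symmetrizing is a simple unsupervised proxy; the paper
    trains a supervised head on such features instead.
    """
    avg = attentions.mean(axis=(0, 1))   # average over layers and heads -> (L, L)
    sym = 0.5 * (avg + avg.T)            # base pairing is symmetric: score(i,j) == score(j,i)
    return sym

# toy example: 10 layers x 12 heads of attention for a length-8 RNA
rng = np.random.default_rng(0)
att = rng.random((10, 12, 8, 8))
scores = attention_to_pairing_scores(att)
```

Thresholding or ranking the entries of `scores` then yields candidate base pairs, while the one-dimensional embeddings feed per-residue predictors such as solvent accessibility.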
RNA-MSM is built around an MSA transformer with 10 stacked blocks, each containing 12 attention heads and an embedding dimension of 768. The architecture includes an initial embedding layer plus two learnable positional embedding layers that encode rows (number of MSA sequences) and columns (sequence position) independently, allowing the model to handle variable-depth alignments.
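The row/column positional scheme can be sketched as follows. This is a minimal NumPy illustration of the idea, not the model's implementation: the real embeddings are learned parameters, the token vocabulary and the maximum row/column counts here are assumptions, and the transformer blocks themselves are omitted.

```python
import numpy as np

D = 768            # embedding dimension, as stated for RNA-MSM
MAX_ROWS = 1024    # assumed caps on MSA depth and alignment length
MAX_COLS = 1024    # (illustrative only; not taken from the paper)

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(5, D))        # toy vocabulary: A, C, G, U, gap
row_pos = rng.normal(size=(MAX_ROWS, D))   # learnable row (sequence-index) embeddings
col_pos = rng.normal(size=(MAX_COLS, D))   # learnable column (position) embeddings

def embed_msa(msa_tokens):
    """msa_tokens: (R, C) integer array for an MSA of R sequences with
    aligned length C. Each position receives token + row + column
    embeddings, so the model can handle variable-depth alignments."""
    R, C = msa_tokens.shape
    x = token_emb[msa_tokens]            # (R, C, D) token embeddings
    x = x + row_pos[:R, None, :]         # add row embedding per MSA sequence
    x = x + col_pos[None, :C, :]         # add column embedding per alignment position
    return x

msa = rng.integers(0, 5, size=(16, 60))  # toy MSA: 16 sequences, length 60
emb = embed_msa(msa)
```

Because rows and columns are embedded independently, an alignment of any depth up to the row cap maps into the same representation space.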
Training used the RNAcmap3-generated MSAs from 3,932 Rfam families, with a median MSA depth of 2,184 sequences per family. Unsupervised training ran for 300 epochs on eight 32 GB V100 GPUs. On the secondary structure benchmark (the TS test set), the attention-based predictor achieves an AUC-PR of 0.610 and an F1 score of 0.707, versus an AUC-PR of 0.56 for SPOT-RNA2. For solvent accessibility, the embedding-based predictor achieves a Pearson correlation coefficient (PCC) of 0.436 and a mean absolute error (MAE) of 31.64, improving on RNAsnap2 by roughly 7% in PCC and 3% in MAE. A practical constraint is MSA generation itself: producing an alignment for a single RNA of length 60 takes roughly 9 hours on average with the RNAcmap3 pipeline.
RNA-MSM is suited for structural RNA research, particularly where evolutionary context is available or can be computed. Researchers studying non-coding RNAs, ribozymes, or riboswitch elements can use the model to predict secondary structures and identify exposed versus buried regions from sequence alone. The solvent accessibility predictions are relevant for understanding RNA-protein interaction interfaces, since accessible regions are more likely to serve as binding sites. Computational drug discovery pipelines targeting RNA — an emerging area given the pharmacological interest in RNA structures — can use RNA-MSM embeddings as features for binding site identification or ligand design. The model is also a candidate backbone for transfer learning workflows where labeled structural data is scarce.
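As a sketch of the transfer-learning pattern described above, the snippet below fits a closed-form ridge regression from per-residue embeddings to solvent accessibility values. This is a deliberately simple stand-in for the paper's supervised head (which is a trained neural predictor); the embeddings and accessibility targets here are synthetic, and the reduced dimension is an assumption made for speed.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam I)^-1 X^T y.
    X: (N, D) per-residue embeddings, y: (N,) accessibility targets."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# synthetic data standing in for RNA-MSM embeddings (D is 768 in the
# paper; reduced to 64 here) and measured solvent accessibilities
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
true_w = rng.normal(size=64)
y = X @ true_w + 0.1 * rng.normal(size=500)

w = fit_ridge(X, y, lam=1e-3)
pred = X @ w
pcc = np.corrcoef(pred, y)[0, 1]   # Pearson correlation of predictions vs. targets
```

In a real workflow, `X` would come from the frozen language model and `y` from experimentally derived or computed accessibility labels, with a stronger regressor substituted when enough labeled data is available.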
RNA-MSM established the value of MSA-based pre-training for RNA, paralleling advances that the protein field had achieved with the MSA Transformer and early AlphaFold work. Its publication in Nucleic Acids Research validated the approach against established RNA structure prediction benchmarks, and the code and model weights are publicly available on GitHub and HuggingFace. The primary limitation is the computational cost of generating input MSAs via RNAcmap3, which restricts throughput for large-scale screening. Additionally, RNA-MSM focuses on secondary structure and solvent accessibility; tertiary structure prediction remains beyond its current scope. Subsequent RNA models have explored single-sequence approaches with larger training corpora, but RNA-MSM's framework remains a relevant reference point for leveraging homology information in RNA structural biology.
Zhang, Y., Lang, M., Jiang, J., et al. (2024). Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Research, 52(1), e3.
DOI: 10.1093/nar/gkad1031