SpliceBERT is a BERT-based RNA language model developed by the Biomed AI group and designed specifically for RNA splicing analysis. Pre-trained on over 2 million precursor messenger RNA (pre-mRNA) sequences drawn from 72 vertebrate species, it learns generalizable representations of RNA sequence that encode evolutionary conservation patterns and the sequence grammar underlying splice site selection. The model was published in Briefings in Bioinformatics in 2024.
RNA splicing is a critical step in gene expression: the spliceosome must precisely identify donor and acceptor splice sites within pre-mRNA to excise introns and join exons into mature mRNA. Errors in splicing — whether caused by mutations at splice sites or in nearby regulatory elements — underlie a substantial fraction of human genetic disease. SpliceBERT addresses the need for a general-purpose sequence encoder that understands splicing signals without requiring task-specific architectures for each downstream prediction problem.
The model adopts masked language modeling (MLM) as its self-supervised pre-training objective, masking 15% of nucleotide positions and learning to predict them from bidirectional context. This approach, borrowed from NLP, allows the model to develop rich contextual embeddings from unlabeled sequence data alone, with no requirement for experimentally annotated splice sites during pre-training.
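The BERT-style corruption step described above can be sketched in a few lines. This is a minimal illustration under the standard BERT protocol (select 15% of positions; of those, 80% become a mask token, 10% a random nucleotide, 10% are left unchanged), not SpliceBERT's actual preprocessing code; the `[MASK]` token string and RNA alphabet are assumptions.

```python
import random

def mlm_mask(tokens, mask_rate=0.15, vocab=("A", "C", "G", "U"), seed=0):
    """BERT-style masking: select ~15% of positions as prediction targets.
    Of the selected positions, 80% -> [MASK], 10% -> random nucleotide,
    10% -> left unchanged. Returns (corrupted sequence, target positions)."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(i)          # the model must predict this position
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token (still a prediction target)
    return corrupted, targets

seq = list("AUGGUAAGUCCAUUUCAGGU")    # toy pre-mRNA fragment with a GU donor motif
corrupted, targets = mlm_mask(seq)
```

The training loss is then computed only at the positions in `targets`, forcing the model to reconstruct nucleotides from bidirectional context.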
SpliceBERT is based on the BERT transformer encoder architecture, adapted for nucleotide sequences. The model processes RNA sequences bidirectionally using multi-head self-attention, allowing each position to attend to all other positions in the input window. This bidirectional context is important for splicing prediction because splice site recognition depends on both upstream and downstream sequence elements, including the polypyrimidine tract, branch point sequence, and exonic splicing enhancers.
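The key architectural point, that every position attends to all others with no causal mask, can be shown with a single scaled dot-product attention head over a short nucleotide window. This is a numpy toy with arbitrary dimensions, not the model's implementation (SpliceBERT uses multi-head attention across many layers):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 8, 16                          # window length, embedding dimension
x = rng.normal(size=(L, d))           # embeddings for 8 nucleotide positions

# One attention head: random projection matrices for illustration
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)         # each position scores every other position
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ V                     # no causal mask: fully bidirectional context
```

Because `weights` is a full L-by-L matrix rather than a lower-triangular one, the representation of a candidate splice site mixes information from both upstream elements (e.g. the branch point) and downstream elements (e.g. exonic enhancers).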
Training data consisted of more than 2 million pre-mRNA sequences from 72 annotated vertebrate genomes, providing broad phylogenetic diversity. The masking strategy follows the standard BERT protocol: 15% of input positions are selected for prediction, and of these, 80% are replaced with a mask token, 10% with a random nucleotide, and 10% left unchanged. The resulting nucleotide-level embeddings encode positional context, evolutionary information, and functional signals in a format suitable for downstream fine-tuning. The model is implemented in PyTorch with Hugging Face Transformers compatibility, and pretrained weights and fine-tuning scripts are distributed through the project's GitHub repository.
SpliceBERT is suited to research tasks that require sequence-based understanding of RNA splicing. Computational genomics teams can fine-tune the model for splice site and branchpoint prediction across vertebrate species, including newly sequenced organisms that lack extensive experimental annotation. In clinical genomics, the zero-shot variant scoring capability is particularly valuable for interpreting variants of uncertain significance that fall near splice sites or in deep intronic regions, a common challenge in rare disease diagnosis. Functional genomics studies can use SpliceBERT embeddings to explore the sequence determinants of tissue-specific and condition-specific alternative splicing. The model also provides a foundation for designing therapeutic splice-switching oligonucleotides by quantifying how target-site sequences influence spliceosome recruitment.
SpliceBERT demonstrated that large-scale cross-species pre-training on RNA sequences can yield sequence encoders competitive with or superior to task-specific models on multiple splicing benchmarks, establishing a strong baseline for the RNA foundation model space. Its zero-shot variant effect prediction capability is particularly notable: without any fine-tuning on labeled variant datasets, the model produces scores that correlate with experimentally measured splicing changes, illustrating how language model pre-training implicitly learns functional sequence constraints. The public release of model weights and code has enabled adoption in both academic and clinical research settings. A current limitation is that SpliceBERT operates on primary sequence only and does not incorporate RNA secondary structure or protein binding information, which are known contributors to splicing regulation and may limit accuracy for highly context-dependent splicing decisions.
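A common way to turn a masked language model into a zero-shot variant scorer is a log-likelihood ratio between the alternate and reference alleles at the masked variant position. The sketch below shows the general technique, not SpliceBERT's exact scoring code; the `probs` vector is fabricated for illustration and would in practice come from the model's softmax output at the masked position.

```python
import numpy as np

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3}

def llr_score(probs, ref, alt):
    """Zero-shot variant score: log P(alt | context) - log P(ref | context),
    where `probs` is the MLM's predicted distribution at the masked variant
    position. Strongly negative scores indicate the alternate allele violates
    the sequence constraints learned during pre-training."""
    return float(np.log(probs[VOCAB[alt]]) - np.log(probs[VOCAB[ref]]))

# Illustrative distribution at a conserved donor-site position (made up):
# the model strongly prefers G, as at the +1 position of a GU donor motif.
probs = np.array([0.02, 0.03, 0.90, 0.05])
score = llr_score(probs, ref="G", alt="A")   # disrupting the G is penalized
```

No labeled variant data enters this computation; the score falls out of the pre-trained distribution alone, which is why it can correlate with measured splicing changes without fine-tuning.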
Chen, K., Zhou, Y., Ding, M., Wang, Y., Ren, Z., & Yang, Y. (2024). Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Briefings in Bioinformatics, 25(3), bbae163.
DOI: 10.1093/bib/bbae163