MOGAM Institute for Biomedical Research
A BART-based foundation model for mRNA coding-sequence analysis, pretrained by denoising across nine taxonomic groups and fine-tunable for expression, stability, and riboswitch tasks.
Messenger RNA has become a central therapeutic modality, but the design of mRNA sequences—optimizing codon usage, stability, and translational output—remains largely heuristic. CDS-BART is a sequence-to-sequence foundation model that brings self-supervised representation learning to the messenger RNA coding sequence (CDS), aiming to give therapeutic and synthetic-biology designers a general-purpose, fine-tunable backbone rather than a single task-specific predictor.
Developed by the MOGAM Institute for Biomedical Research and described in a March 2026 bioRxiv preprint, CDS-BART adapts the BART denoising autoencoder—an encoder–decoder transformer pretrained to reconstruct corrupted text—to nucleotide sequences. The model is pretrained on mRNA from NCBI RefSeq spanning nine taxonomic groups, giving it cross-species exposure rather than a single-organism view. Unlike encoder-only nucleotide language models, its seq2seq design naturally supports both representation extraction and generative reconstruction.
CDS-BART sits alongside a small but growing class of RNA-focused foundation models (such as RNA-FM and CodonBERT), but is distinguished by its denoising seq2seq objective and its explicit focus on coding sequences at therapeutic mRNA lengths of up to approximately 4 kb.
CDS-BART is a roughly 200M-parameter BART model with a SentencePiece tokenizer and a maximum context of 850 subword tokens, sufficient to span mRNA coding sequences of approximately 4 kb (an average of about 4.7 nucleotides per token at that limit). Pretraining applies the standard BART denoising objective to RefSeq mRNA across nine taxonomic groups: input sequences are corrupted, for example by text infilling, in which spans of tokens are replaced with a single mask token, and the model is trained to reconstruct the original sequence. For downstream use, the encoder representations are fine-tuned on six task-specific datasets released alongside the model: E. coli protein expression, fungal gene expression, mRFP production, cross-species mRNA stability, SARS-CoV-2 vaccine mRNA degradation, and tetracycline (Tc) riboswitch activity. These six benchmarks let practitioners evaluate the model in both regression (expression, stability) and classification settings.
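To make the denoising objective concrete, the following is a minimal sketch of BART-style text infilling on a tokenized sequence. It is a simplified illustration, not the CDS-BART training code: the codon-like tokens stand in for SentencePiece subwords, and the function names, `mask_ratio`, `start_prob`, and the Poisson mean of 3 (the value used in the original BART paper) are assumptions chosen for the example.

```python
import math
import random

MASK = "<mask>"  # placeholder mask symbol; the real tokenizer's mask token may differ


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infill(tokens, mask_ratio=0.3, start_prob=0.1, lam=3.0, seed=0):
    """Replace random spans of tokens with a single MASK each.

    At most round(len(tokens) * mask_ratio) tokens are removed in total.
    A zero-length span inserts a MASK without removing any token.
    """
    rng = random.Random(seed)
    budget = int(round(len(tokens) * mask_ratio))
    out = []
    i = 0
    while i < len(tokens):
        if budget > 0 and rng.random() < start_prob:
            span = min(sample_poisson(rng, lam), budget, len(tokens) - i)
            out.append(MASK)
            if span == 0:
                # Pure insertion: keep the current token after the mask.
                out.append(tokens[i])
                i += 1
            else:
                # Drop the whole span; one MASK stands in for it.
                i += span
                budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    # Hypothetical tokenization: codons used here only for readability.
    original = ["ATG", "GCC", "AAA", "CTG", "GAA", "TTT", "CGC", "TAA"] * 3
    corrupted = text_infill(original, seed=42)
    print("encoder input :", " ".join(corrupted))
    print("decoder target:", " ".join(original))
```

During pretraining, the corrupted sequence would be fed to the encoder and the decoder would be trained to emit the original sequence. Because each masked span collapses to one token, the model has to infer how many tokens are missing as well as what they are.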
CDS-BART targets mRNA therapeutic and synthetic-biology workflows where designers must predict how a coding sequence will behave before synthesis. Concrete use cases include screening codon-optimized variants for protein expression, ranking constructs for mRNA stability and vaccine shelf-life, and engineering riboswitch elements. Because the backbone is pretrained and openly available, groups can fine-tune it on their own assay data with relatively modest labeled datasets, making it useful for academic RNA biology labs and industrial mRNA design teams alike.
As one of the first foundation models built specifically for messenger RNA coding sequences at therapeutic length, CDS-BART helps extend the foundation-model paradigm from proteins and DNA into the rapidly expanding mRNA-therapeutics space. Its MIT license, public weights, and packaged benchmark datasets lower the barrier for reproducible comparison and downstream fine-tuning. As a recent preprint, its benchmark advantages over existing RNA language models remain to be independently validated, and the released model card does not yet report comprehensive evaluation metrics—caveats worth noting for prospective adopters.