CDS-BART

Coding-sequence foundation model for mRNA design, pretrained as a BART denoising encoder-decoder on mRNA from nine taxonomic groups.

Released: March 2026

Parameters: ~200 million

Messenger RNA has become a central therapeutic modality, but the design of mRNA sequences—optimizing codon usage, stability, and translational output—remains largely heuristic. CDS-BART is a sequence-to-sequence foundation model that brings self-supervised representation learning to the messenger RNA coding sequence (CDS), aiming to give therapeutic and synthetic-biology designers a general-purpose, fine-tunable backbone rather than a single task-specific predictor.

Developed by the MOGAM Institute for Biomedical Research and described in a March 2026 bioRxiv preprint, CDS-BART adapts the BART denoising autoencoder—an encoder–decoder transformer pretrained to reconstruct corrupted text—to nucleotide sequences. The model is pretrained on mRNA from NCBI RefSeq spanning nine taxonomic groups, giving it cross-species exposure rather than a single-organism view. Unlike encoder-only nucleotide language models, its seq2seq design naturally supports both representation extraction and generative reconstruction.

CDS-BART sits alongside a small but growing class of RNA-focused foundation models (such as RNA-FM and CodonBERT), but is distinguished by its denoising seq2seq objective and its explicit focus on coding sequences at the scale of therapeutic mRNA constructs (~4 kb).

Key Features

Denoising seq2seq pretraining: A BART encoder–decoder is trained to reconstruct corrupted mRNA coding sequences, yielding representations that transfer to downstream regression and classification tasks.
Cross-species coverage: Pretraining on NCBI RefSeq mRNA from nine taxonomic groups exposes the model to broad codon and sequence diversity rather than a single organism.
SentencePiece tokenization: A learned subword vocabulary lets the model handle long coding sequences efficiently, supporting mRNA on the scale of therapeutic constructs (~4 kb).
Open weights and benchmark suite: Code is released under the MIT license, weights are on Hugging Face, and six curated fine-tuning datasets cover expression, stability, and riboswitch tasks.
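
The denoising idea behind the first feature can be illustrated with a small, self-contained sketch of BART-style text infilling: random spans of tokens are collapsed into a single mask token, and the uncorrupted sequence serves as the reconstruction target. This is not the authors' training code. The mask symbol, the 30% masking ratio, the Poisson span length with mean 3 (values from the original BART recipe, not confirmed for CDS-BART), and the use of fixed 3-mers in place of SentencePiece subwords are all assumptions made for illustration.

```python
import math
import random

MASK = "<mask>"  # assumed mask symbol; the released vocabulary may use another


def poisson(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def text_infill(tokens, mask_ratio=0.3, lam=3.0, seed=0):
    """Replace random token spans with a single MASK each.

    Simplification: zero-length spans (pure mask insertions) are skipped,
    so every span hides at least one token.
    """
    rng = random.Random(seed)
    budget = round(len(tokens) * mask_ratio)  # max tokens allowed to be hidden
    start_prob = mask_ratio / lam             # chance a span starts at a position
    out, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < start_prob:
            span = min(max(1, poisson(rng, lam)), budget, len(tokens) - i)
            out.append(MASK)  # the whole span collapses to one mask token
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out


# Toy input: fixed 3-mers stand in for the model's learned subword tokens.
cds = "AUGGCCAAGCUGACCGGCUUCGAAUAA"
tokens = [cds[j:j + 3] for j in range(0, len(cds), 3)]
corrupted = text_infill(tokens, seed=1)  # what the encoder would see
target = tokens                          # what the decoder would reconstruct
```

Because each span becomes one mask regardless of its length, the decoder has to infer both how many tokens are missing and what they are, which is the property that distinguishes text infilling from single-token masking.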

Technical Details

CDS-BART is a BART encoder–decoder with roughly 200M parameters, a SentencePiece tokenizer, and a maximum context of 850 subword tokens, which is enough to span mRNA coding sequences of approximately 4 kb. Pretraining uses the standard BART denoising objective: input sequences are corrupted (for example by text infilling, where spans of tokens are replaced by a mask) and the model is trained to reconstruct the originals. The pretraining corpus is RefSeq mRNA from nine taxonomic groups. For downstream use, the encoder representations are fine-tuned on six task-specific datasets released alongside the model: E. coli protein expression, fungal gene expression, mRFP production, cross-species mRNA stability, SARS-CoV-2 vaccine mRNA degradation, and tetracycline (Tc) riboswitch activity. Together these benchmarks let practitioners evaluate the model in both regression settings (such as expression and stability) and classification settings.
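
The relationship between the 850-token context and the ~4 kb sequence length is simple arithmetic, sketched below. The only figures taken from the text are 850 tokens and roughly 4,000 nucleotides. The average token lengths passed to the check and the two reserved special tokens are hypothetical values chosen for illustration, not measured properties of the released tokenizer.

```python
import math

MAX_TOKENS = 850  # maximum context stated above
TARGET_NT = 4000  # "approximately 4 kb" taken as 4,000 nucleotides


def min_nt_per_token(seq_len_nt, max_tokens=MAX_TOKENS):
    """Smallest average token length (in nt) that lets a sequence fit."""
    return seq_len_nt / max_tokens


def fits_context(seq_len_nt, avg_nt_per_token, max_tokens=MAX_TOKENS, reserved=2):
    """Check whether a sequence fits; `reserved` special tokens is an assumption."""
    needed = math.ceil(seq_len_nt / avg_nt_per_token) + reserved
    return needed <= max_tokens


print(round(min_nt_per_token(TARGET_NT), 2))          # 4.71
print(fits_context(TARGET_NT, avg_nt_per_token=5.0))  # True  (800 + 2 tokens)
print(fits_context(TARGET_NT, avg_nt_per_token=3.0))  # False (1334 + 2 tokens)
```

In other words, a 4 kb coding sequence fits only if subword tokens average at least about 4.7 nucleotides each, which is why a learned subword vocabulary matters here: single-nucleotide or codon-level tokens would exceed the context.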

Applications

CDS-BART targets mRNA therapeutic and synthetic-biology workflows where designers must predict how a coding sequence will behave before synthesis. Concrete use cases include screening codon-optimized variants for protein expression, ranking constructs for mRNA stability and vaccine shelf-life, and engineering riboswitch elements. Because the backbone is pretrained and openly available, groups can fine-tune it on their own assay data with relatively modest labeled datasets, making it useful for academic RNA biology labs and industrial mRNA design teams alike.
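
The screening use case amounts to scoring each candidate sequence and sorting. The sketch below shows that loop with a pluggable scoring function. The GC-fraction scorer is a deliberately trivial placeholder so the example runs without any model; it is not a CDS-BART prediction, and in practice a fine-tuned regression head would take its place. The function names and the candidate sequences are hypothetical.

```python
def gc_fraction(seq):
    """Placeholder scorer: fraction of G/C bases. NOT a CDS-BART prediction."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def rank_constructs(seqs, score_fn, top_k=None):
    """Score each candidate and return (score, sequence) pairs, best first."""
    scored = [(score_fn(s), s) for s in seqs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Three synonymous toy variants (Met-Ala-Ala-stop) that differ only in codon choice.
candidates = ["AUGGCUGCUUAA", "AUGGCCGCCUAA", "AUGGCCGCAUAA"]
ranked = rank_constructs(candidates, gc_fraction)
best_score, best_seq = ranked[0]
```

Keeping the scorer as an argument means the same ranking code can be reused across tasks such as expression or stability by swapping in a different fine-tuned predictor.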

Impact

As one of the first foundation models built specifically for messenger RNA coding sequences at therapeutic length, CDS-BART helps extend the foundation-model paradigm from proteins and DNA into the rapidly expanding mRNA-therapeutics space. Its MIT license, public weights, and packaged benchmark datasets lower the barrier for reproducible comparison and downstream fine-tuning. Because the work is a recent preprint, its reported benchmark advantages over existing RNA language models have yet to be independently validated, and the released model card does not yet report comprehensive evaluation metrics. Prospective adopters should weigh both caveats.

Citation

CDS-BART: A BART-Based Foundation Model for mRNA Sequence Analysis

Jadamba, E., et al. (2026) CDS-BART: A BART-Based Foundation Model for mRNA Sequence Analysis. bioRxiv.

DOI: 10.64898/2026.03.09.710670


Citations

Total Citations: 0
Influential: 0
References: 3

GitHub

Stars: 0
Forks: 0
Open Issues: 0
Contributors: 2
Last Push: 9 months ago
Language: Python
License: MIT

HuggingFace

Downloads: 9
Likes: 0
Last Modified: 10 months ago
Pipeline: feature-extraction


Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 63 (Partial)
Usability (can I run it?): 73
Reproducibility (can I retrain it?): 53
Model Openness Framework: Class III (Open Model)

