Kitasato University / University of Tokyo / National Institute of Advanced Industrial Science and Technology
GPT-2-style generative foundation model pretrained on ~165M mRNA coding sequences across all three domains of life for de novo CDS generation and design.
mRNA-GPT is a generative language foundation model for messenger RNA coding sequences (CDS), developed by researchers at Kitasato University together with collaborators at the University of Tokyo and Japan's National Institute of Advanced Industrial Science and Technology (AIST), and posted to bioRxiv in December 2025. While protein and DNA language models have proliferated, generative language models aimed specifically at de novo mRNA sequence design have remained comparatively underexplored, and mRNA-GPT is positioned as the first large generative language model built explicitly for mRNA coding-sequence generation and design.
This entry describes the Kitasato/Saito-group model and is distinct from the similarly named Sanofi-affiliated "mRNA-GPT" (a separate full-length 5' UTR–CDS–3' UTR design model). The Kitasato mRNA-GPT instead focuses on the coding sequence and is trained per domain on bacterial, eukaryotic, and archaeal sequences, so that the same architecture captures lineage-specific regularities such as codon-usage regimes and compositional preferences.
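To make "codon-usage regimes and compositional preferences" concrete, the following minimal sketch computes relative codon frequencies and GC content for a single coding sequence. It is illustrative only: the function names and the toy sequence are hypothetical and are not taken from the mRNA-GPT code base or its training data.

```python
from collections import Counter


def codon_usage(cds: str) -> dict[str, float]:
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = cds.upper().replace("T", "U")  # accept DNA-style input, report RNA codons
    if not seq or len(seq) % 3 != 0:
        raise ValueError("CDS must be non-empty and a multiple of 3 nucleotides long")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}


def gc_content(cds: str) -> float:
    """Fraction of G and C nucleotides, a simple compositional statistic."""
    seq = cds.upper()
    return sum(base in "GC" for base in seq) / len(seq)


# Hypothetical toy CDS: AUG GCU GCU GCC AAA UAA
toy_cds = "AUGGCUGCUGCCAAAUAA"
usage = codon_usage(toy_cds)
gc = gc_content(toy_cds)
```

Statistics of this kind differ systematically between lineages, which is the sort of signal that per-domain training is meant to capture.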
The model is intended as a reusable backbone: after large-scale autoregressive pretraining, it can be steered toward concrete therapeutic and research objectives through lightweight fine-tuning on property-labeled data, rather than relying on fixed codon-optimization tables.
mRNA-GPT uses a GPT-2-style decoder-only transformer architecture (reported at roughly 24 layers) with approximately 302 million parameters, trained autoregressively on mRNA coding sequences collected and preprocessed from NCBI. Pretraining is carried out separately for each of the three domains of life: ~80 million bacterial CDS from 19,676 species, ~83 million eukaryotic CDS from 4,688 species, and ~2 million archaeal CDS from 702 species, for roughly 165 million coding sequences in aggregate. Training a separate model per domain lets each model in the series encode domain-specific codon and compositional structure.
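As a rough sanity check on the reported size, the sketch below estimates the parameter count of a GPT-2-style decoder from its shape. The formula follows the standard GPT-2 block layout with tied input/output embeddings. The hidden size of 1024, the vocabulary of 100 tokens, and the context length of 1024 used for the 24-layer case are assumptions chosen for illustration; this entry does not state the model's actual hidden size, vocabulary, or context length.

```python
def gpt2_param_estimate(n_layer: int, d_model: int, vocab_size: int, n_positions: int) -> int:
    """Parameter count of a GPT-2-style decoder with tied input/output embeddings."""
    # Per block: attention (4*d^2 + 4*d), MLP with 4x expansion (8*d^2 + 5*d),
    # and two LayerNorms (4*d).
    per_block = 12 * d_model**2 + 13 * d_model
    # Token embeddings plus learned positional embeddings.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Final LayerNorm (scale and bias).
    final_ln = 2 * d_model
    return n_layer * per_block + embeddings + final_ln


# Standard GPT-2 small shape: 12 layers, d_model 768, 50,257-token vocabulary.
gpt2_small = gpt2_param_estimate(n_layer=12, d_model=768, vocab_size=50257, n_positions=1024)

# ASSUMED shape for a 24-layer model with a small sequence vocabulary
# (d_model, vocab_size and n_positions are hypothetical, not from the preprint).
assumed_24_layer = gpt2_param_estimate(n_layer=24, d_model=1024, vocab_size=100, n_positions=1024)

print(f"GPT-2 small: {gpt2_small / 1e6:.1f}M parameters")
print(f"Assumed 24-layer shape: {assumed_24_layer / 1e6:.1f}M parameters")
```

Under these assumptions the 24-layer estimate lands close to 300 million, because with a small vocabulary almost all parameters sit in the transformer blocks rather than in the embedding table. This is only a plausibility check and does not confirm the actual configuration.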
For downstream design, the pretrained model is fine-tuned on property-specific datasets. The preprint reports that fine-tuned mRNA-GPT generates sequences with significantly higher translation-efficiency scores, and that additional fine-tuning on mRNA-stability and mRNA-expression datasets likewise yields high-performing designed sequences. An optional reinforcement-learning stage based on proximal policy optimization (PPO) supports direct property optimization.
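For readers unfamiliar with PPO, the sketch below shows the standard clipped surrogate objective that PPO maximizes. It is a generic illustration rather than the authors' implementation: this entry does not describe the reward function, the advantage estimator, or the hyperparameters used in the preprint, and the clipping value of 0.2 is a commonly used default assumed here.

```python
import math


def ppo_clipped_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over a batch of sampled actions.

    new_logps / old_logps: log-probabilities of the sampled tokens under the
    current policy and under the policy that generated the samples.
    advantages: advantage estimates, e.g. derived from a property score.
    clip_eps: clipping range (0.2 is an assumed, commonly used default).
    """
    terms = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio between policies
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to move the ratio
        # outside the clipping range in the direction that helps.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# Unchanged policy: the objective reduces to the mean advantage.
unchanged = ppo_clipped_objective([0.0, 0.0], [0.0, 0.0], [2.0, 4.0])
# Large ratio with a positive advantage: the gain is capped at (1 + clip_eps).
capped = ppo_clipped_objective([1.0], [0.0], [1.0])
```

In a sequence-design setting the advantage would typically come from a predicted property such as translation efficiency, but how the preprint defines it is not stated here.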
mRNA-GPT targets the design of mRNA coding sequences for therapeutic and research use — including vaccines, protein-replacement therapies, and other mRNA-based modalities — where translation efficiency, stability, and expression level are key product attributes. Because the backbone is fine-tunable, groups can adapt it to specialized objectives such as tissue-specific expression or organism-specific stability using their own labeled data, integrating generative design into existing mRNA-engineering pipelines without building a model from scratch.
mRNA-GPT contributes a generative, foundation-model approach to a part of the mRNA-design problem space that had been dominated by tabular codon-optimization heuristics and discriminative property predictors. The authors release training and inference code via a public GitHub repository, though it ships without a license and is thus effectively all-rights-reserved.
The preprint states that pretrained models are publicly available, but the openly findable checkpoints do not clearly substantiate this. The only weight-bearing Hugging Face repository located is a GPT-2 config of roughly 124 million parameters (12 layers), which does not match the paper's stated ~302M / 24-layer model, and it ships an empty model card with no license; a companion repository contains only a tokenizer. Open availability of weights matching the 302M model therefore could not be confirmed.
As a December 2025 preprint, its empirical claims await peer review and broader independent benchmarking. It nonetheless establishes a clear template for generative mRNA coding-sequence design, complementing related efforts on mRNA representation models (mRNA-FM) and full-length therapeutic mRNA design tools.