Fudan University / Shanghai AI Laboratory / Hunan Normal University
A cross-modal transfer-learning model that adapts the ESM-2 650M protein language model to mRNA analysis by replacing amino-acid tokens with codon tokens, then applies it to mRNA benchmarks without retraining the backbone.
Messenger RNA (mRNA) has become a central modality in modern therapeutics and a rich substrate for computational modeling, yet mRNA-specific foundation models face a chronic shortage of labeled data and the considerable expense of pretraining large language models from scratch. Protein language models, by contrast, have been trained on hundreds of millions of sequences and encode a deep statistical understanding of how biological sequences fold, function, and evolve. ProtmRNA is built on the observation that much of this knowledge is transferable: because coding sequences and the proteins they specify are two views of the same underlying biology, a protein model's representations can be repurposed for the messenger RNA that encodes them.
ProtmRNA, described in a bioRxiv preprint posted on 20 May 2026 by Gang Xu, Xinyu Wu, and Jianpeng Ma (Fudan University, Shanghai AI Laboratory, and Hunan Normal University), implements this idea as a cross-modal transfer-learning recipe. The authors take ESM-2 650M, a 33-layer, 1280-dimensional protein transformer, replace its amino-acid vocabulary with a 78-token codon vocabulary, and then continue pretraining at codon resolution. According to the authors, the reused architecture and inherited protein-derived weights let the model converge with less than half the training compute of a comparable model trained from scratch.
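To make the vocabulary swap concrete, here is a minimal sketch of codon-resolution tokenization. It is an illustration under stated assumptions, not the authors' tokenizer: the preprint's 78-token vocabulary is not itemized in this summary, so the four special tokens below, the RNA alphabet (U rather than T), and the names `VOCAB` and `tokenize_cds` are all hypothetical. The sketch therefore yields 68 tokens rather than 78.

```python
from itertools import product

# Hypothetical special tokens; the real 78-token vocabulary is not itemized here.
SPECIAL_TOKENS = ["<cls>", "<pad>", "<eos>", "<unk>"]

# All 64 nucleotide triplets over the RNA alphabet.
CODONS = ["".join(triplet) for triplet in product("ACGU", repeat=3)]

# Token -> integer id, special tokens first.
VOCAB = {token: idx for idx, token in enumerate(SPECIAL_TOKENS + CODONS)}


def tokenize_cds(seq):
    """Split a coding sequence into codon ids, wrapped in <cls> ... <eos>."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Triplets with ambiguous bases (e.g. N) fall back to <unk>.
    ids = [VOCAB.get(codon, VOCAB["<unk>"]) for codon in codons]
    return [VOCAB["<cls>"]] + ids + [VOCAB["<eos>"]]


token_ids = tokenize_cds("ATGGCCTAA")  # AUG, GCC, UAA plus two special tokens
```

In this scheme the synonymous codons GCC and GCU (both alanine) receive different ids. That distinction is exactly what an amino-acid vocabulary discards, and it is what lets a codon-level model condition on the coding sequence itself.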
The model's most distinctive claim is generality without task-specific retraining: after codon-level pretraining, ProtmRNA is applied directly to eight downstream mRNA benchmarks (spanning mRNA stability, gene expression, transcript abundance, and SARS-CoV-2 vaccine degradation) without further fine-tuning of the backbone. This places ProtmRNA alongside codon-resolution models such as CodonFM and generative designers such as mRNA-GPT, while its transfer-learning emphasis sets it apart from both.
ProtmRNA is a transformer encoder derived from ESM-2 650M, comprising 33 layers and a hidden dimension of 1280 (~650M parameters). The central architectural change is the input vocabulary: the amino-acid tokens are replaced with 78 codon tokens, and the model is pretrained at codon level so its representations are conditioned on the actual coding sequence rather than the translated protein. The remainder of the backbone is preserved so that protein-pretrained weights provide the initialization, which the authors report enables convergence at under half the compute of a from-scratch baseline.
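A back-of-the-envelope count shows how the quoted dimensions relate to the ~650M figure and how small the swapped-in vocabulary is by comparison. This is a rough sketch that assumes the standard transformer layout (four attention projections plus a feed-forward block with a 4x hidden expansion) and ignores biases, layer norms, positional parameters, and the output head. The function name `encoder_weight_count` is illustrative only.

```python
def encoder_weight_count(layers, hidden, ffn_mult=4):
    """Approximate weight count of a transformer encoder stack."""
    attention = 4 * hidden * hidden                 # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn_mult * hidden   # up- and down-projection
    return layers * (attention + feed_forward)


backbone = encoder_weight_count(layers=33, hidden=1280)
codon_embedding = 78 * 1280  # one 1280-dimensional vector per codon token

print(backbone)                    # 648806400
print(codon_embedding)             # 99840
print(codon_embedding / backbone)  # about 0.00015
```

Under these assumptions the codon embedding table is on the order of 0.015% of the backbone weights. This is consistent with the description above: the vocabulary is the only component replaced, while almost all parameters are inherited from the protein model.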
For evaluation, the pretrained model is applied to eight downstream mRNA tasks without retraining the backbone, covering mRNA stability, gene expression, transcript abundance, and SARS-CoV-2 mRNA-vaccine degradation prediction. The preprint reports the architecture, the codon-vocabulary construction, the transfer-learning procedure, and the comparative compute savings. Because the work is a bioRxiv preprint (DOI 10.64898/2026.05.19.726141, CC BY-NC-ND 4.0), these results have not yet undergone peer review.
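One common way to use a backbone without retraining it is a linear probe: sequence features are computed once by the frozen model, and only a small task head is fitted. Whether the preprint uses exactly this protocol is not stated in this summary, so the sketch below is only an illustration of the general idea. `frozen_features` is a hypothetical stand-in for the backbone (it returns GC fractions per codon position, not ProtmRNA embeddings), and the sequences and targets are synthetic.

```python
def frozen_features(cds):
    """Hypothetical stand-in for a frozen backbone: GC fraction per codon position."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return [sum(c[pos] in "GC" for c in codons) / len(codons) for pos in range(3)]


def predict(head, x):
    weights, bias = head
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def mse(head, X, y):
    return sum((predict(head, x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def fit_linear_head(X, y, lr=0.1, epochs=500):
    """Fit only the head with plain SGD; the feature extractor is never updated."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            err = predict((weights, bias), x) - t
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Synthetic toy data for illustration only (not from the preprint).
sequences = ["AUGGCCGCGUAA", "AUGAAAUUUUAA", "AUGGGGCCCUGA", "AUGACUGAAUAG"]
X = [frozen_features(s) for s in sequences]  # computed once, then reused
y = [2.0 * x[2] + 0.5 for x in X]            # made-up linear target

untrained_head = ([0.0, 0.0, 0.0], 0.0)
head = fit_linear_head(X, y)
```

Because the features are computed once and reused, each new task costs only the fitting of a small head. That cost profile is what makes a retraining-free evaluation across several benchmarks cheap.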
ProtmRNA targets mRNA analysis problems where labeled data are scarce and pretraining from scratch is impractical: predicting mRNA stability, gene expression, and transcript abundance, and forecasting degradation of mRNA-vaccine constructs, as illustrated by its SARS-CoV-2 benchmark. Because the model reuses an existing protein backbone and is applied without task-specific backbone retraining, it is attractive to RNA-therapeutics and computational-biology groups that want strong codon-level representations without the cost of training a dedicated mRNA foundation model.
ProtmRNA contributes a concrete demonstration that protein language models can serve as a starting point for messenger RNA modeling, reframing the data and compute bottleneck in mRNA foundation models as a transfer-learning problem rather than a from-scratch pretraining one. Its broad, retraining-free evaluation across eight benchmarks is its main evidence of generality. Three caveats temper adoption. First, the work is an unreviewed preprint. Second, the released code carries no stated license. Third, the pretrained weights are distributed via Google Drive rather than a persistent archive such as Hugging Face or Zenodo, which affects long-term reproducibility and reuse. As released, no formal model card or data card accompanies the repository beyond the README and the preprint.
Xu, G., Wu, X., and Ma, J. (2026). ProtmRNA: Cross-Modal Knowledge Transfer from Proteins to Messenger RNA. bioRxiv.
DOI: 10.64898/2026.05.19.726141