scTranslator

Generative transformer that translates single-cell transcriptomes into proteomes, inferring missing protein abundance from RNA expression alone.

Released: July 2023

The central dogma of molecular biology (DNA is transcribed to RNA, which is translated to protein) does not guarantee that RNA measurements reliably predict protein abundance. Post-transcriptional regulation, protein stability, and translation efficiency decouple the two, and the technical limitations of current single-cell proteomics make protein abundance hard to measure directly. As a result, the relationship between the single-cell transcriptome and proteome is complex, noisy, and incompletely understood. scTranslator (single-cell translator) treats this as a generative modeling problem: a large model is pre-trained on paired transcriptome-proteome measurements to learn the statistical relationship between RNA and protein, and is then applied to infer missing proteome information from transcriptome data alone.

scTranslator was developed by Lihua Liu, Wei Li, Kwan-Yee Wong, Fan Yang, and Jianhua Yao at Tencent AI Lab Healthcare in Shenzhen, China. The preprint was posted on bioRxiv in July 2023, and the work was subsequently published in Nature Biomedical Engineering. The model draws explicit inspiration from machine translation in natural language processing, where a pre-trained encoder-decoder architecture learns to map sequences from a source language to a target language; here the two "languages" are RNA and protein expression at the single-cell level.

A key innovation of scTranslator is that it is align-free. Regression-based approaches require paired measurements at the individual cell level and must match cells across modalities; scTranslator instead learns the RNA-to-protein mapping from bulk and single-cell paired datasets and applies that learned translation to individual cells without explicit cell-level pairing. This matters in practice because measuring RNA and protein in the same cells (as in CITE-seq or REAP-seq) is technically demanding and expensive, whereas paired bulk proteomics and transcriptomics data are far more abundant.

Key Features

Align-free translation: scTranslator does not require matched RNA-protein measurements from the same individual cells; instead, it learns the translation relationship from datasets where bulk or single-cell paired measurements are available and applies the model to impute protein abundance from RNA alone.
Large-scale pre-training: The model was pre-trained on over 2 million human single-cell RNA-seq profiles and approximately 18,000 bulk RNA-protein paired samples, enabling it to learn general RNA-protein co-variation patterns across diverse cell types and conditions.
Multi-platform flexibility: Systematic benchmarking confirmed accuracy and stability across multiple single-cell multi-omics quantification platforms including CITE-seq, spatial CITE-seq, REAP-seq, and NEAT-seq, demonstrating that the model generalizes across measurement technologies.
Downstream task augmentation: By providing predicted protein abundance alongside measured RNA, scTranslator improves performance on downstream analyses including cell clustering, cell origin identification in pan-cancer data, and interaction inference.
Gene pseudo-knockout analysis: The model supports in silico perturbation via gene pseudo-knockout, enabling researchers to predict the proteomic consequences of eliminating a specific gene's expression without experimental intervention.
Batch correction capability: scTranslator's learned representation implicitly handles batch effects across datasets, enabling integration of predicted protein profiles from multiple sources.
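
The gene pseudo-knockout idea listed above can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration: toy_translator is a hypothetical stand-in for a trained RNA-to-protein model (a fixed linear map with made-up weights that carry no biological meaning), and pseudo_knockout is not part of the scTranslator codebase. The sketch only shows the general pattern of zeroing one gene's expression and comparing predictions before and after.

```python
from typing import Callable, Dict

Expression = Dict[str, float]  # feature name -> abundance


def toy_translator(rna: Expression) -> Expression:
    """Hypothetical stand-in for a trained RNA-to-protein model.

    Uses a fixed linear map with invented weights, purely for illustration.
    """
    weights = {
        "CD3": {"CD3E": 0.8, "CD3D": 0.5},
        "CD19": {"CD19": 0.9, "CD79A": 0.3},
    }
    return {
        protein: sum(w * rna.get(gene, 0.0) for gene, w in gene_weights.items())
        for protein, gene_weights in weights.items()
    }


def pseudo_knockout(
    rna: Expression,
    gene: str,
    predict: Callable[[Expression], Expression],
) -> Expression:
    """Return the per-protein change in prediction after zeroing one gene."""
    baseline = predict(rna)
    knocked = dict(rna)   # copy so the caller's profile is left untouched
    knocked[gene] = 0.0   # in silico knockout: remove the gene's expression
    perturbed = predict(knocked)
    return {protein: perturbed[protein] - baseline[protein] for protein in baseline}


# One illustrative cell with made-up log-normalized expression values.
cell = {"CD3E": 2.0, "CD3D": 1.0, "CD19": 0.0, "CD79A": 0.5}
effect = pseudo_knockout(cell, "CD3E", toy_translator)
```

In this toy setup only the protein whose prediction depends on the knocked-out gene changes; with a real model, the same comparison would be made against the model's actual predictions.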

Technical Details

scTranslator uses a pre-trained transformer encoder-decoder architecture inspired by the translation models used in NLP. The encoder processes the input single-cell transcriptome — represented as log-normalized expression values across approximately 18,000 protein-coding genes — and maps it to a latent representation. The decoder then generates the predicted protein abundance profile, modeled over a panel of proteins that varies depending on the multi-omics platform used for validation (CITE-seq typically measures 100–400 surface protein markers). The pre-training objective uses a reconstruction loss comparing predicted protein values against measured protein values in paired training data.
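
The input representation and training objective described above can be sketched in a few lines. The text does not specify the exact normalization or loss, so this sketch assumes a common convention (scale each cell to a fixed total count, then apply log1p) and mean squared error as the reconstruction loss; the function names, the target_sum default, and the example values are all illustrative assumptions rather than the published implementation.

```python
import math
from typing import List


def log_normalize(counts: List[float], target_sum: float = 1e4) -> List[float]:
    """Scale one cell's counts to target_sum, then apply log(1 + x).

    Assumed convention; the source only states that inputs are log-normalized.
    """
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


def mse_loss(predicted: List[float], measured: List[float]) -> float:
    """Mean squared error between predicted and measured protein values.

    Assumed form of the reconstruction loss.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(predicted)


# Made-up raw counts for five genes in one cell.
rna_counts = [0, 3, 7, 0, 10]
rna_input = log_normalize(rna_counts)

# Made-up predicted and measured abundances for three proteins.
loss = mse_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```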

The pre-training dataset comprised over 2 million human single-cell profiles from public repositories combined with approximately 18,000 bulk paired RNA-protein samples. This large and diverse pre-training corpus enables the model to learn general RNA-to-protein translation rules that extend beyond any specific cell type or tissue. For CITE-seq data specifically, the model was evaluated on held-out panels of surface proteins and transcription factors, demonstrating strong correlation between predicted and measured protein abundance across multiple cell types. On the NeurIPS 2021 multimodal challenge dataset (approximately 60,000 paired CITE-seq measurements from PBMCs), scTranslator achieved competitive performance against supervised baselines while using only the RNA modality as input. Additionally, batch-corrected protein predictions from scTranslator were shown to improve downstream clustering accuracy compared to using raw RNA data alone.
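
Agreement between predicted and measured protein abundance is typically summarized per protein. The sketch below assumes Pearson correlation computed across cells as the metric; the text does not name the correlation measure, and the protein names and values are invented for illustration only.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_x * var_y)


# Made-up abundances for two proteins across four cells.
measured: Dict[str, Sequence[float]] = {
    "CD3": [0.1, 2.0, 1.9, 0.2],
    "CD19": [1.5, 0.1, 0.0, 1.7],
}
predicted: Dict[str, Sequence[float]] = {
    "CD3": [0.3, 1.8, 2.1, 0.1],
    "CD19": [1.2, 0.2, 0.1, 1.9],
}

# One correlation value per protein, computed across cells.
per_protein_r = {p: pearson(predicted[p], measured[p]) for p in measured}
```

Reporting one value per protein rather than a single pooled number keeps poorly predicted proteins visible instead of averaging them away.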

Applications

scTranslator is particularly valuable in contexts where only transcriptomics data is available but protein-level information would improve the analysis. The most direct application is augmenting existing scRNA-seq datasets with predicted protein abundance, effectively converting a single-modality experiment into a pseudo-multimodal analysis without additional sequencing. This is useful for studies of surface marker expression and cell type identification in tissues where CITE-seq data is unavailable, and for retrospective analysis of large existing scRNA-seq cohorts. In pan-cancer research, scTranslator was shown to improve cell origin recognition — identifying the tissue of origin for cancer cells — by providing predicted protein features that complement RNA-based classifiers. The gene pseudo-knockout capability enables researchers to predict the proteomic consequences of potential therapeutic interventions targeting specific genes, supporting target prioritization in drug discovery.
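
One simple way to build such a pseudo-multimodal feature set is to standardize each modality and concatenate the columns per cell before clustering or classification. This is a hedged sketch of that general pattern, not the pipeline used in the paper: zscore_columns and augment are hypothetical helpers, and the matrices hold made-up values.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are cells, columns are features


def zscore_columns(matrix: Matrix) -> Matrix:
    """Standardize each feature column to zero mean and unit variance."""
    if not matrix:
        return []
    n_cells = len(matrix)
    n_features = len(matrix[0])
    scaled = [[0.0] * n_features for _ in range(n_cells)]
    for j in range(n_features):
        column = [row[j] for row in matrix]
        mean = sum(column) / n_cells
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n_cells)
        for i, value in enumerate(column):
            # Constant columns carry no information; leave them at zero.
            scaled[i][j] = (value - mean) / std if std > 0 else 0.0
    return scaled


def augment(rna: Matrix, predicted_protein: Matrix) -> Matrix:
    """Concatenate standardized RNA and predicted-protein features per cell."""
    if len(rna) != len(predicted_protein):
        raise ValueError("both matrices must describe the same cells")
    return [r + p for r, p in zip(zscore_columns(rna), zscore_columns(predicted_protein))]


# Three cells, two measured genes, one predicted protein (made-up values).
rna = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
predicted_protein = [[0.2], [0.4], [0.9]]
features = augment(rna, predicted_protein)
```

Standardizing before concatenation keeps either modality from dominating distance-based methods simply because of its scale.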

Impact

scTranslator represents an early and influential demonstration of the "translation" paradigm applied to biological multi-omics: treating the conversion between molecular measurement modalities as a sequence-to-sequence translation problem amenable to large pre-trained transformer models. The work's publication in Nature Biomedical Engineering established it as a notable contribution to the growing field of single-cell multi-omics integration. By working in an align-free fashion, scTranslator sidesteps one of the key practical barriers to learning RNA-protein relationships at the single-cell level, making the approach applicable to a much wider range of existing datasets. The framework also opens a conceptual direction for extending translation-style models to other molecular modalities — chromatin accessibility, metabolomics, spatial transcriptomics — following the same pre-training-then-translation template. A limitation of the current model is that it is best validated on surface protein panels measured by CITE-seq, and its accuracy for predicting intracellular protein abundance from RNA remains less thoroughly characterized.

Citation

A pre-trained large generative model for translating single-cell transcriptomes to proteomes.

Liu, L., et al. (2025) A pre-trained large generative model for translating single-cell transcriptomes to proteomes. Nature Biomedical Engineering.

DOI: 10.1038/s41551-025-01528-z

Citations

Total citations: 8
Influential citations: 0
References: 73

GitHub

Stars: 97
Forks: 13
Open issues: 11
Contributors: 1
Last push: 11 months ago
Language: Jupyter Notebook

Fields of citing research

Share of papers citing this model.

Biology: 100%
Computer Science: 86%
Medicine: 57%

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Openness score: 33 (Closed)
Usability (can I run it?): 34
Reproducibility (can I retrain it?): 20
Model Openness Framework: Unclassified (restrictive license on core components)
