RNABag

HomiGen Intelligence Technology Co., Ltd.

Transcriptome foundation model for precision oncology, generalizing zero-shot across tissue, plasma cfRNA, and tumor-educated platelet modalities.

Released: April 2026

RNABag is a transcriptome foundation model for precision oncology, developed by HomiGen Intelligence Technology Co., Ltd. and released as a bioRxiv preprint in April 2026. It targets a long-standing obstacle in applying RNA-seq to clinical inference: transcriptomic measurements are exquisitely sensitive to cancer state and progression, yet conventional analyses are undermined by technical batch effects and poor generalization across sequencing platforms, cohorts, and sample types.

The central design choice that distinguishes RNABag is its emphasis on robustness over raw capacity. Rather than ingesting the full transcriptome, the model restricts its input to a curated set of highly variable genes to suppress noise, and it is pretrained with extensive data augmentation so that its learned representations are encouraged to be invariant to batch variation. The authors credit this design for the model's portability: they report strong zero-shot generalization to external cohorts and to in-house clinical samples, in addition to internal validation performance.
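The public documentation does not describe which augmentations RNABag uses, so the following is only a generic sketch of what batch-style augmentation of an expression profile can look like. The function name `augment_profile`, its parameters, and the three perturbations (a sample-wide shift, per-gene noise, and random gene dropout) are all illustrative assumptions, not the authors' recipe.

```python
import random


def augment_profile(log_expr, rng, shift_sd=0.1, noise_sd=0.05, drop_prob=0.05):
    """Return a perturbed copy of a log1p-transformed expression vector.

    Hypothetical illustration only: the perturbations below are common
    generic choices, not the augmentations used to pretrain RNABag.
    """
    # One sample-wide additive shift in log space, loosely mimicking a
    # global technical offset between batches.
    shift = rng.gauss(0.0, shift_sd)
    augmented = []
    for value in log_expr:
        if rng.random() < drop_prob:
            # Randomly zero a gene to mimic dropout or missing coverage.
            augmented.append(0.0)
            continue
        noisy = value + shift + rng.gauss(0.0, noise_sd)
        # log1p of a non-negative FPKM is never negative, so clip at zero.
        augmented.append(max(0.0, noisy))
    return augmented


profile = [1.0, 2.0, 0.0, 3.5]
view_a = augment_profile(profile, random.Random(0))
view_b = augment_profile(profile, random.Random(1))
```

Two views such as `view_a` and `view_b` come from the same sample under different simulated technical conditions. A pretraining objective that pulls their representations together is one way a model could be pushed toward batch invariance.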

RNABag sits alongside bulk RNA-seq foundation models for cancer such as GeneBag, but is differentiated by its explicit cross-modality reach. The same pretrained backbone is applied not only to solid-tissue RNA-seq but also to two liquid-biopsy modalities — plasma cell-free RNA (cfRNA) and tumor-educated platelets (TEPs) — positioning it as a unified model for both invasive and non-invasive cancer monitoring.

Key Features

Cross-modality generalization: A single transcriptome model spans tissue RNA-seq, plasma cfRNA, and tumor-educated platelet biopsies, supporting both diagnostic and non-invasive monitoring workflows.
Batch-invariant pretraining: Extensive data augmentation during pretraining encourages representations that are robust to platform and batch effects, a primary cause of failure in transcriptome-based models.
Highly-variable-gene focus: Restricting input to a fixed panel of high-variability genes reduces noise and improves transfer to unseen datasets.
Zero-shot external validation: Reports strong performance on external cohorts and in-house clinical samples without task-specific retraining, plus stronger results after specialized fine-tuning.
Clinical-task breadth: After fine-tuning, supports pan-cancer tissue-of-origin classification, cancer detection, survival stratification, and relapse-risk prediction.
Released checkpoints and inference code: Task-specific .ckpt weights and a unified inference pipeline are provided publicly under an MIT license.

Technical Details

RNABag uses a transformer-based architecture that operates on bulk RNA-seq expression profiles. Inputs are provided as FPKM expression matrices and preprocessed via gene-annotation mapping, filtering, log1p normalization, and reduction to a curated panel of 4,096 highly variable genes (selected using TCGA-derived criteria). Pretraining relies on heavy data augmentation to learn batch-invariant representations, after which the model is fine-tuned for individual downstream tasks. The public repository ships separate checkpoints for tissue cancer detection, tissue origin identification, plasma cancer detection, platelet cancer detection, and platelet tumor localization, each invoked through a unified main.py inference entry point. The authors report superior performance in pan-cancer tissue-of-origin classification and cancer detection on internal validation sets, with high diagnostic accuracy carried over to plasma cfRNA and TEP samples; precise parameter counts and per-task benchmark metrics are not detailed in the public documentation.
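The preprocessing steps named above (gene-annotation mapping, filtering, log1p normalization, and reduction to a fixed gene panel) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the repository's actual pipeline: the function name `preprocess_fpkm`, the rule of summing FPKM values that map to the same symbol, the zero-fill for panel genes missing from a sample, and the three-gene toy panel standing in for the 4,096-gene panel are all hypothetical.

```python
import math


def preprocess_fpkm(sample, id_to_symbol, panel):
    """Turn one sample's FPKM values into a fixed-length log1p vector.

    sample:       dict of raw gene ID -> FPKM value
    id_to_symbol: dict of raw gene ID -> gene symbol (annotation mapping)
    panel:        ordered list of gene symbols (stand-in for the HVG panel)
    """
    by_symbol = {}
    for gene_id, fpkm in sample.items():
        symbol = id_to_symbol.get(gene_id)
        if symbol is None:
            continue  # filtering: drop genes with no annotation
        if math.isnan(fpkm) or fpkm < 0.0:
            continue  # filtering: drop invalid measurements
        # Assumption: IDs that share a symbol are summed.
        by_symbol[symbol] = by_symbol.get(symbol, 0.0) + fpkm
    # log1p normalization, emitted in panel order; panel genes absent
    # from the sample are filled with 0.0 (an assumption).
    return [math.log1p(by_symbol.get(symbol, 0.0)) for symbol in panel]


toy_sample = {"ENSG1": 9.0, "ENSG2": 1.0, "ENSG3": 5.0, "ENSG4": -1.0}
toy_mapping = {"ENSG1": "TP53", "ENSG2": "TP53", "ENSG4": "MYC"}
toy_panel = ["TP53", "MYC", "EGFR"]
vector = preprocess_fpkm(toy_sample, toy_mapping, toy_panel)
```

Fixing the panel order up front means every sample yields a vector of the same length and gene ordering, which is what allows one pretrained backbone to be applied to cohorts with different annotations.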

Applications

RNABag is aimed at computational oncology and liquid-biopsy research. Its tissue-based tasks (pan-cancer tissue-of-origin classification, cancer detection, survival stratification, and relapse-risk prediction) support tumor characterization and prognosis from solid-biopsy RNA-seq, while its plasma cfRNA and tumor-educated platelet capabilities extend toward non-invasive, blood-based cancer detection and monitoring. Because the released checkpoints are reported to generalize zero-shot to external cohorts, the model may suit groups seeking a portable starting point for transcriptome-based diagnostics without assembling large platform-matched training sets. The authors' interpretability analysis additionally points to tumor immune escape as a driver of cancer-induced plasma cfRNA signals, offering biological hypotheses alongside predictions.

Impact

RNABag contributes to a growing line of transcriptome foundation models that treat bulk RNA-seq as a substrate for clinical inference, and it is notable for tackling the generalization gap that has limited the clinical translation of such models. By demonstrating that a batch-invariant, highly-variable-gene model can transfer across tissue, plasma cfRNA, and platelet modalities, it argues for unified rather than modality-specific approaches to cancer monitoring. As a 2026 preprint with publicly released checkpoints and inference code but without reported parameter counts or fully detailed benchmarks, its claims await peer review and broader independent validation; nonetheless, the cross-modality, zero-shot framing and open release make it a useful reference point for researchers building liquid-biopsy diagnostics.

Citation

RNABag: A Generalizable Transcriptome Foundation Model for Precision Oncology across Biopsy Modalities

Luo, P., et al. (2026) RNABag: A Generalizable Transcriptome Foundation Model for Precision Oncology across Biopsy Modalities. bioRxiv.

DOI: 10.64898/2026.04.19.719450


Citations

Total Citations: 0
Influential: 0
References: 48

GitHub

Stars: 0
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 2mo ago
Language: Python


Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 46 (Partial)
Usability (can I run it?): 85
Reproducibility (can I retrain it?): 12

Model Openness Framework: Unclassified (missing required components)

