HomiGen Intelligence Technology Co., Ltd.
A transcriptome foundation model for precision oncology that generalizes zero-shot across tissue, plasma cfRNA, and tumor-educated platelet biopsy modalities.
RNABag is a transcriptome foundation model for precision oncology, developed by HomiGen Intelligence Technology Co., Ltd. and released as a bioRxiv preprint in April 2026. It targets a long-standing obstacle in applying RNA-seq to clinical inference: transcriptomic measurements are exquisitely sensitive to cancer state and progression, yet conventional analyses are undermined by technical batch effects and poor generalization across sequencing platforms, cohorts, and sample types.
The central design choice that distinguishes RNABag is its emphasis on robustness over raw capacity. Rather than ingesting the full transcriptome, the model restricts attention to a curated set of highly variable genes to suppress noise, and it is pretrained with extensive data augmentation with the aim of making its learned representations invariant to batch variation. The intended result is portability: the authors report strong zero-shot generalization to external cohorts and to in-house clinical samples, in addition to internal validation performance.
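To make the idea of restricting input to highly variable genes concrete, the following is a minimal sketch that ranks genes by the variance of their log1p-transformed expression and keeps the top k. Plain variance ranking is a stand-in chosen for illustration; the source states only that RNABag's panel was selected using TCGA-derived criteria, and the function name `top_variable_genes` and its parameters are hypothetical.

```python
import math
from statistics import pvariance


def top_variable_genes(fpkm_rows, gene_ids, k):
    """Return the k gene IDs with the highest variance of log1p expression.

    fpkm_rows: list of samples, each a list of FPKM values ordered like gene_ids.
    Assumption: simple variance ranking, not RNABag's actual selection criteria.
    """
    scored = []
    for j, gene in enumerate(gene_ids):
        # Collect this gene's log1p-transformed expression across all samples.
        column = [math.log1p(row[j]) for row in fpkm_rows]
        scored.append((pvariance(column), gene))
    # Highest variance first; ties are broken by gene ID so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [gene for _, gene in scored[:k]]


if __name__ == "__main__":
    toy = [[1.0, 0.0, 5.0], [1.0, 10.0, 6.0], [1.0, 100.0, 5.0]]
    print(top_variable_genes(toy, ["A", "B", "C"], k=2))
```

In this toy input, gene A is constant across samples and is dropped, which is the noise-suppression effect the paragraph above describes.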
RNABag sits alongside bulk RNA-seq foundation models for cancer such as GeneBag, but is differentiated by its explicit cross-modality reach. The same pretrained backbone is applied not only to solid-tissue RNA-seq but also to two liquid-biopsy modalities — plasma cell-free RNA (cfRNA) and tumor-educated platelets (TEPs) — positioning it as a unified model for both invasive and non-invasive cancer monitoring.
Checkpoint weights (.ckpt) and a unified inference pipeline are publicly available under an MIT license.
RNABag uses a transformer-based architecture that operates on bulk RNA-seq expression profiles. Inputs are provided as FPKM expression matrices and preprocessed via gene-annotation mapping, filtering, log1p normalization, and reduction to a curated panel of 4,096 highly variable genes (selected using TCGA-derived criteria). Pretraining relies on heavy data augmentation to learn batch-invariant representations, after which the model is fine-tuned for individual downstream tasks. The public repository ships separate checkpoints for tissue cancer detection, tissue origin identification, plasma cancer detection, platelet cancer detection, and platelet tumor localization, each invoked through a unified main.py inference entry point.
The authors report superior performance in pan-cancer tissue-of-origin classification and cancer detection on internal validation sets, with high diagnostic accuracy carried over to plasma cfRNA and TEP samples; precise parameter counts and per-task benchmark metrics are not detailed in the public documentation.
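The preprocessing described above (mapping to a fixed gene panel, then log1p normalization) can be sketched as follows. This is an illustration under stated assumptions, not the released pipeline: the function name `preprocess_fpkm`, the zero-fill for panel genes missing from the input, and the first-occurrence rule for duplicated gene IDs are all hypothetical choices made for the example.

```python
import math


def preprocess_fpkm(fpkm_rows, gene_ids, panel):
    """Align FPKM samples to a fixed gene panel and apply log1p.

    fpkm_rows: list of samples, each a list of FPKM values ordered like gene_ids.
    panel: ordered list of gene IDs the model expects (4,096 genes for RNABag).
    Assumption: panel genes absent from the input are filled with 0.0.
    """
    # Map each gene ID to its column; keep the first occurrence of duplicates.
    column_of = {}
    for j, gene in enumerate(gene_ids):
        column_of.setdefault(gene, j)

    processed = []
    for row in fpkm_rows:
        if len(row) != len(gene_ids):
            raise ValueError("each sample must have one value per gene ID")
        if any(value < 0 for value in row):
            raise ValueError("FPKM values must be non-negative")
        # Reorder to the panel and log1p-transform; missing genes become 0.0.
        processed.append([
            math.log1p(row[column_of[gene]]) if gene in column_of else 0.0
            for gene in panel
        ])
    return processed


if __name__ == "__main__":
    samples = [[0.0, 1.0, 3.0], [2.0, 0.0, 7.0]]
    features = preprocess_fpkm(samples, ["A", "B", "C"], panel=["C", "Z", "A"])
    print(features)
```

Fixing the output column order to the panel is what lets a single checkpoint accept samples from cohorts whose annotation files list genes in different orders or omit some genes.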
RNABag is aimed at computational oncology and liquid-biopsy research. Its tissue-based tasks (pan-cancer tissue-of-origin classification, cancer detection, survival stratification, and relapse-risk prediction) support tumor characterization and prognosis from solid-biopsy RNA-seq, while its plasma cfRNA and tumor-educated platelet capabilities extend toward non-invasive, blood-based cancer detection and monitoring. Because the authors report that the released checkpoints generalize zero-shot to external cohorts, the model may suit groups seeking a portable starting point for transcriptome-based diagnostics without assembling large platform-matched training sets. The authors' interpretability analysis additionally points to tumor immune escape as a driver of cancer-induced plasma cfRNA signals, offering biological hypotheses alongside predictions.
RNABag contributes to a growing line of transcriptome foundation models that treat bulk RNA-seq as a substrate for clinical inference, and it is notable for tackling the generalization gap that has limited the clinical translation of such models. By demonstrating that a batch-invariant, highly-variable-gene model can transfer across tissue, plasma cfRNA, and platelet modalities, it argues for unified rather than modality-specific approaches to cancer monitoring. As a 2026 preprint with publicly released checkpoints and inference code but without reported parameter counts or fully detailed benchmarks, its claims await peer review and broader independent validation; nonetheless, the cross-modality, zero-shot framing and open release make it a useful reference point for researchers building liquid-biopsy diagnostics.