Multipurpose generative adversarial network trained once on single-cell and bulk RNA-seq to perform stratification, marker analysis, data generation, and vectorization.
RNAGAN is a generative adversarial network for human RNA sequencing analysis, developed by Zhaozheng Hou, Wei Dai, and colleagues at the University of Hong Kong and posted to bioRxiv in March 2026. The model addresses a recurring inefficiency in transcriptomic machine learning: practitioners typically train a separate model for each downstream task, such as classification, marker discovery, data augmentation, or feature extraction. RNAGAN instead obtains all of these capabilities from a single shared adversarial training procedure, an idea captured in the paper's title, "Train One and Get Four."
The model is trained jointly on single-cell and bulk RNA-seq, drawing on roughly 4.6 million single cells spanning multiple organs and about 5,900 bulk samples covering various cancer types alongside normal references. By learning a common generative representation across both data modalities, RNAGAN can be reused for sample stratification, marker analysis, synthetic ("pseudo") data generation, and vectorization of expression profiles without training a separate architecture for each task. Its emphasis on interpretability and small-sample robustness positions it as a practical tool for laboratories that lack large training cohorts of their own.
RNAGAN follows a classic generator–discriminator GAN structure augmented with a pathway neural layer that maps gene expression into curated or learned pathway activities. The single-cell training data (approximately 4.6 million cells) come from public human atlases across multiple organs, and the bulk data consist of roughly 5,900 cancer and normal samples. The implementation is primarily MATLAB, with Python components, and ships with pretrained weights both as .mat files and as a TensorFlow-exported weights.h5, so the model can be reused outside the MATLAB environment. The code is released under GPL-3.0; documentation is provided as a bundled PDF and as inline comments within each .m file.
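To make the idea of a pathway layer concrete, the following is a minimal sketch of one common way such a layer is built: a linear map from genes to pathways in which a binary membership mask zeroes out every gene that does not belong to a given pathway. This is an illustration under stated assumptions, not RNAGAN's actual implementation. The gene panel, the pathway gene sets, the constant weights, the tanh activation, and the function names build_mask and pathway_layer are all hypothetical.

```python
import math

# Hypothetical gene panel and pathway gene sets, invented for illustration only.
GENES = ["TP53", "MYC", "EGFR", "CDK4", "GAPDH"]
PATHWAYS = {
    "cell_cycle": {"TP53", "MYC", "CDK4"},
    "growth_signaling": {"EGFR", "MYC"},
}


def build_mask(genes, pathways):
    """Return a binary mask where mask[p][g] is 1.0 if gene g belongs to pathway p."""
    return [[1.0 if gene in members else 0.0 for gene in genes]
            for members in pathways.values()]


def pathway_layer(expression, weights, mask, bias):
    """Masked linear map followed by tanh.

    Genes outside a pathway are multiplied by a zero mask entry, so they
    cannot contribute to that pathway's activity.
    """
    activities = []
    for w_row, m_row, b in zip(weights, mask, bias):
        total = sum(w * m * x for w, m, x in zip(w_row, m_row, expression))
        activities.append(math.tanh(total + b))
    return activities


mask = build_mask(GENES, PATHWAYS)

# Placeholder weights and biases; in a trained model these would be learned.
weights = [[0.5] * len(GENES) for _ in PATHWAYS]
bias = [0.0] * len(PATHWAYS)

# Made-up normalized expression values for one sample, ordered as in GENES.
sample = [1.0, 2.0, 0.0, 1.0, 5.0]

activities = pathway_layer(sample, weights, mask, bias)
print(dict(zip(PATHWAYS, activities)))
```

Because each output unit corresponds to a named pathway, its activation can be read directly as a pathway-level feature. In this toy example the high GAPDH value has no effect on either activity, since GAPDH is in neither gene set.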
RNAGAN is aimed at researchers analyzing human transcriptomes who want a single reusable model rather than a stack of task-specific tools. Concrete uses include stratifying tumor or tissue samples, identifying marker genes and pathway activities, generating synthetic expression profiles to augment small datasets, and producing fixed-length vector embeddings of cells or samples for downstream clustering and classification. The pathway layer makes it particularly suited to studies that need biologically interpretable features rather than opaque embeddings.
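As an illustration of how fixed-length embeddings can feed a downstream stratification step, the sketch below assigns a new sample to the group whose centroid is most similar by cosine similarity. It is a generic nearest-centroid example, not RNAGAN's own stratification method or API. The three-dimensional vectors, the group labels, and the function names are hypothetical stand-ins for embeddings that a trained model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_group(embedding, centroids):
    """Return the name of the centroid most similar to the embedding."""
    return max(centroids,
               key=lambda name: cosine_similarity(embedding, centroids[name]))


# Made-up embeddings standing in for model output; real vectors would be longer.
labeled = {
    "tumor": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "normal": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = {name: centroid(vectors) for name, vectors in labeled.items()}

new_sample = [0.7, 0.3, 0.1]
predicted = assign_group(new_sample, centroids)
print(predicted)
```

A nearest-centroid rule needs only a handful of labeled samples per group, which fits the small-dataset setting described above. Any standard clustering or classification method could be substituted once the embeddings are available.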
By consolidating four common transcriptomic workflows into one trained model, RNAGAN offers a pragmatic alternative to the proliferation of single-purpose tools in single-cell and bulk RNA-seq analysis. Its interpretability-focused pathway layer and small-data emphasis are aimed at groups without the resources to train large bespoke models. Because it is a recent preprint, its real-world adoption and its performance in head-to-head benchmarks against established single-cell foundation models remain to be seen, and its MATLAB-centric implementation may limit integration into Python-dominant pipelines.