Disentangled VAE framework for joint batch correction, condition-associated key gene detection, and perturbation prediction in multi-batch, multi-condition scRNA-seq data.
Single-cell RNA sequencing studies increasingly involve samples collected across multiple experimental batches and biological conditions, for example patients at different disease severity levels or cells treated with different perturbations. A persistent challenge in analyzing such datasets is that the observed differences in gene expression reflect both unwanted technical batch effects and meaningful biological condition effects. Standard batch correction methods do not distinguish between these two sources of variation, so removing technical noise can inadvertently erase biologically informative signal as well.
scDisInFact (single-cell Disentangled Integration preserving condition-specific Factors) addresses this problem through a disentangled variational autoencoder (VAE) framework that explicitly separates batch effects from condition-specific biological variation. Developed by Xiuwei Zhang's lab at Georgia Tech and published in Nature Communications in January 2024, the method is described by its authors as the first to perform three interdependent tasks simultaneously on multi-batch, multi-condition scRNA-seq data: batch effect removal, identification of condition-associated key genes (CKGs), and perturbation prediction.
By modeling these tasks within a unified probabilistic framework, scDisInFact enables researchers to cleanly integrate data across heterogeneous experimental designs without sacrificing the condition-specific signals needed for downstream biological interpretation.
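scDisInFact is built on a variational autoencoder with several encoders and a single decoder. A shared encoder processes all cells to extract a shared biological latent factor (dimension Ks) that captures cell-type identity and other condition-independent variation. In parallel, a set of unshared encoders, one per condition type, extract unshared biological latent factors (Ku dimensions each) that encode condition-driven variation. Batch effects are not learned by a dedicated encoder: as described in the paper, they are represented by batch factors derived from each cell's batch ID and supplied to the networks as covariates. Disentanglement is enforced through the training objective rather than through an adversarial discriminator. A maximum mean discrepancy (MMD) term aligns the distribution of shared factors across batches, classifiers attached to the unshared factors tie each one to the labels of its condition type, and a group lasso penalty on the input layer of each unshared encoder provides the gene scores used for CKG detection. Together these terms discourage the model from encoding batch or condition information in the shared biological space.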
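The encoder layout described above can be sketched at the level of array shapes. The following is a minimal NumPy illustration, not the authors' implementation: the weights are random and untrained, each encoder is reduced to a single affine layer, the values Ks = 8 and Ku = 2 and the condition-type names are placeholders, and the batch representation is assumed to be a one-hot batch ID that is simply concatenated to the latent factors before decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(n_in, n_out):
    """Random, untrained weights for one affine layer (illustration only)."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def gaussian_encoder(x, mean_layer, logvar_layer):
    """Map cells to a Gaussian posterior and draw one reparameterized sample."""
    mu = x @ mean_layer[0] + mean_layer[1]
    logvar = x @ logvar_layer[0] + logvar_layer[1]
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    return z, mu, logvar

# Hypothetical sizes; Ks and Ku follow the notation in the text.
n_cells, n_genes, n_batches = 5, 200, 3
Ks, Ku = 8, 2
condition_types = ["treatment", "severity"]  # placeholder names

# Toy log-transformed counts and batch labels.
x = np.log1p(rng.poisson(1.0, size=(n_cells, n_genes)))
batch_ids = rng.integers(0, n_batches, size=n_cells)
batch_onehot = np.eye(n_batches)[batch_ids]  # assumed batch representation

# One shared encoder, plus one unshared encoder per condition type.
shared_enc = (linear(n_genes, Ks), linear(n_genes, Ks))
unshared_enc = {
    c: (linear(n_genes, Ku), linear(n_genes, Ku)) for c in condition_types
}

z_shared, _, _ = gaussian_encoder(x, *shared_enc)
z_unshared = {
    c: gaussian_encoder(x, *unshared_enc[c])[0] for c in condition_types
}

# The decoder sees all factors side by side: shared, unshared, and batch.
latent = np.concatenate(
    [z_shared, *z_unshared.values(), batch_onehot], axis=1
)
dec_w, dec_b = linear(latent.shape[1], n_genes)
x_recon = latent @ dec_w + dec_b

print(z_shared.shape, latent.shape, x_recon.shape)
```

Keeping the factors in separate blocks of the latent vector is what makes it possible to later change one block (for example a condition factor) while holding the others fixed.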
Benchmarking was performed on simulated datasets (nine configurations varying batch count, cell type number, and perturbation strength), a glioblastoma (GBM) dataset comprising 21 expression matrices from six patient batches (GSE148842), and a COVID-19 multi-study dataset with age and severity as condition variables. Across these benchmarks, the authors report that scDisInFact outperformed the alternatives it was compared against, including scINSIGHT, scGen, and scPreGAN, on each of the three tasks: higher adjusted Rand index (ARI) scores for cell type clustering, higher area under the precision-recall curve (AUPRC) for CKG detection, and lower mean squared error (MSE) with higher Pearson correlation for perturbation prediction.
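For readers unfamiliar with the evaluation metrics named above, the sketch below computes each of them on toy inputs using plain NumPy. The function names and toy data are illustrative assumptions and are unrelated to the paper's benchmarking code. AUPRC is approximated here by average precision, and tied scores are not handled specially.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from the contingency table of two labelings."""
    _, t = np.unique(np.asarray(labels_true), return_inverse=True)
    _, p = np.unique(np.asarray(labels_pred), return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)

    def pairs(counts):
        return sum(comb(int(c), 2) for c in counts)

    sum_cells = pairs(table.ravel())
    sum_rows = pairs(table.sum(axis=1))
    sum_cols = pairs(table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(t), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def average_precision(is_key_gene, scores):
    """Mean precision at the rank of each true key gene (AUPRC estimate)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(is_key_gene, dtype=bool)[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_k[hits].mean())

def mse(x_true, x_pred):
    """Mean squared error over all cells and genes."""
    diff = np.asarray(x_true, dtype=float) - np.asarray(x_pred, dtype=float)
    return float(np.mean(diff ** 2))

def pearson(x_true, x_pred):
    """Pearson correlation between flattened expression matrices."""
    a = np.ravel(np.asarray(x_true, dtype=float))
    b = np.ravel(np.asarray(x_pred, dtype=float))
    return float(np.corrcoef(a, b)[0, 1])

# Clustering: identical partitions with swapped label names.
ari = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # 1.0

# Key-gene detection: true key genes ranked 1st and 3rd by score.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])  # about 0.833

# Perturbation prediction: predicted vs. measured expression.
err = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])                # 4/3
r = pearson([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])              # about 1.0

print(ari, ap, err, r)
```

In practice, library implementations such as scikit-learn's `adjusted_rand_score` and `average_precision_score` would be the usual choice; the hand-written versions are shown only to make the definitions explicit.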
scDisInFact is well-suited for disease studies where patient cohorts span multiple collection sites or timepoints and where understanding condition-specific transcriptional programs is central to the scientific question. Concrete applications include integrating multi-site patient cohorts for disease subtyping (e.g., stratifying cancer or COVID-19 severity classes), identifying genes that drive disease progression or treatment response while controlling for batch, and predicting how cell populations will respond to drug perturbations or disease states not yet measured experimentally. The framework is particularly valuable in translational research pipelines where confounded batch and condition effects have historically led to spurious findings or discarded data.
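The perturbation-prediction use case can be illustrated with a generic latent-swapping sketch: keep each cell's shared factor, move its condition-specific factor toward the target condition, fix the batch to a reference, and decode. This is an assumption-laden toy in NumPy, not scDisInFact's actual procedure or API. The decoder is a random linear map standing in for a trained network, and the mean-shift rule, dimensions, and variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, Ks, Ku, n_batches = 50, 4, 2, 2  # hypothetical sizes

# Stand-in for a trained decoder: a fixed random linear map.
W = rng.normal(size=(Ks + Ku + n_batches, n_genes))

def decode(z_shared, z_unshared, batch_onehot):
    """Reconstruct expression from concatenated factors (toy decoder)."""
    return np.concatenate([z_shared, z_unshared, batch_onehot], axis=1) @ W

# Pretend these factors came from encoders applied to measured cells.
z_s_ctrl = rng.normal(size=(6, Ks))              # shared factors, control cells
z_u_ctrl = rng.normal(loc=-1.0, size=(6, Ku))    # condition factors, control
z_u_treated = rng.normal(loc=1.0, size=(8, Ku))  # condition factors, treated

# Counterfactual: shift the control cells' condition factors by the
# difference between the two condition means, leaving shared factors as is.
shift = z_u_treated.mean(axis=0) - z_u_ctrl.mean(axis=0)
z_u_pred = z_u_ctrl + shift

# Decode every cell in the same reference batch (batch 0) so the prediction
# carries no batch-specific signal.
ref_batch = np.tile(np.eye(n_batches)[0], (z_s_ctrl.shape[0], 1))
x_pred = decode(z_s_ctrl, z_u_pred, ref_batch)

print(x_pred.shape)
```

The point of the sketch is that prediction quality depends on disentanglement: if batch or cell-type information leaked into the condition factors, shifting them would change more than the condition.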
Published in Nature Communications (volume 15, article 912, 2024), scDisInFact addresses a methodological gap that has limited the interpretability of large-scale multi-condition scRNA-seq atlases. By providing a principled probabilistic decomposition of batch and condition effects, the method improves the reliability of downstream analyses — from differential expression to perturbation modeling — in exactly the complex experimental settings that characterize modern disease genomics. A key limitation is that the method requires explicit condition labels for all cells; it is not designed for unsupervised discovery of condition structure. Additionally, like most VAE-based approaches, interpretability of the latent space requires care, and very large datasets may demand substantial GPU memory. Nonetheless, scDisInFact represents a meaningful advance in the growing toolkit for harmonizing heterogeneous single-cell datasets while preserving biologically meaningful variation.
Sources:
Zhang, Z., et al. (2024). scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data. Nature Communications 15, 912. DOI: 10.1038/s41467-024-45227-w