bio.rodeo

Single-cell

scDisInFact

Zhang Lab

Disentangled VAE framework for joint batch correction, condition-key-gene detection, and perturbation prediction in multi-batch multi-condition scRNA-seq data.

Released: 2024

Overview

Single-cell RNA sequencing studies increasingly involve samples collected across multiple experimental batches and biological conditions, for example patients at different disease severity levels or cells treated with different perturbations. A persistent challenge in analyzing such datasets is that the observed differences in gene expression reflect both unwanted technical batch effects and meaningful biological condition effects. Standard batch correction methods do not distinguish between these two sources of variation, so they can erase biologically informative signal along with technical noise.

scDisInFact (single-cell Disentangled Integration preserving condition-specific Factors) addresses this problem through a disentangled variational autoencoder (VAE) framework that explicitly separates batch effects from condition-specific biological variation. Developed by Xiuwei Zhang's lab at Georgia Tech and published in Nature Communications in January 2024, the method is described by its authors as the first to simultaneously perform three interdependent tasks on multi-batch, multi-condition scRNA-seq data: batch effect removal, identification of condition-associated key genes (CKGs), and perturbation prediction.

By modeling these tasks within a unified probabilistic framework, scDisInFact enables researchers to cleanly integrate data across heterogeneous experimental designs without sacrificing the condition-specific signals needed for downstream biological interpretation.

Key Features

  • Simultaneous multi-task learning: Performs batch correction, condition-key-gene detection, and perturbation prediction in a single unified framework, rather than requiring separate specialized tools run sequentially.
  • Disentangled latent representation: Decomposes cell state into three separate factor groups (shared, condition-independent biological variation; condition-specific biological variation; and technical batch effects) so that batch correction does not erase condition signals.
  • Multi-condition support: Handles datasets with multiple distinct condition types simultaneously (e.g., disease severity and treatment status measured in the same experiment), each represented by its own set of unshared latent factors.
  • Condition-associated key gene detection: Identifies genes whose expression is specifically driven by each condition type, providing a built-in, batch-aware analog of differential expression testing that the authors report to be more accurate than standard Wilcoxon rank-sum tests in multi-batch settings.
  • Perturbation prediction: Predicts the transcriptional response of cells to unseen conditions by manipulating the condition-specific latent factors, enabling in silico exploration of perturbation effects.
  • Flexible PyTorch implementation: Implemented in PyTorch with support for GPU acceleration; accepts standard count matrices and cell metadata (batch ID, condition labels) as input, with a demo notebook included for rapid onboarding.
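The perturbation prediction idea in the list above (keep a cell's shared factor, change only its condition-specific factor) can be illustrated with a toy example. This is a minimal NumPy sketch under stated assumptions, not the scDisInFact API: the linear `decode` function, the weight matrices, and `predict_condition` are all hypothetical stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (illustrative only).
n_genes, k_shared, k_cond = 6, 3, 2

# Hypothetical linear "decoder" weights; a real model would use a trained network.
W_shared = rng.normal(size=(n_genes, k_shared))
W_cond = rng.normal(size=(n_genes, k_cond))

def decode(z_shared, z_cond):
    """Map shared and condition-specific latent factors to toy expression values."""
    return z_shared @ W_shared.T + z_cond @ W_cond.T

def predict_condition(z_shared, z_cond_target_cells):
    """Keep each cell's shared factor and replace its condition factor with
    the mean condition factor of reference cells from the target condition."""
    target = z_cond_target_cells.mean(axis=0)
    z_cond_new = np.tile(target, (z_shared.shape[0], 1))
    return decode(z_shared, z_cond_new)

# Toy latent factors: 4 control cells and 5 treated reference cells.
z_shared_ctrl = rng.normal(size=(4, k_shared))
z_cond_ctrl = rng.normal(loc=-1.0, size=(4, k_cond))
z_cond_treated = rng.normal(loc=1.0, size=(5, k_cond))

x_ctrl = decode(z_shared_ctrl, z_cond_ctrl)                 # observed condition
x_pred = predict_condition(z_shared_ctrl, z_cond_treated)   # predicted "treated" profile
print(x_pred.shape)
```

Because only the condition factor changes, each predicted cell retains its own shared (cell-identity) component while receiving the same condition-driven shift.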

Technical Details

scDisInFact is built on a variational autoencoder architecture with a carefully structured encoder-decoder design. A shared encoder processes all cells to extract a shared biological latent factor (dimension Ks) that captures cell-type identity and other condition-independent variation. In parallel, a set of condition-specific encoders — one per condition type — extract unshared biological latent factors (Ku dimensions each) that encode condition-driven variation. A separate batch encoder captures technical variation as a batch latent factor. A discriminator network enforces disentanglement between shared biological and batch factors during training, preventing the model from encoding batch information into the shared biological space.
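The encoder layout described above can be sketched as a forward pass. This is a rough NumPy illustration that follows the description in this section rather than the published code: the layer sizes, the condition type names, the random linear maps standing in for trained networks, and the choice of feeding expression to every encoder are all assumptions. The discriminator and training losses are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; every name and number here is illustrative.
n_cells, n_genes = 8, 20
k_shared, k_unshared, k_batch = 4, 2, 2
condition_types = ["severity", "treatment"]  # hypothetical condition types

def make_encoder(out_dim):
    """Random linear maps standing in for a trained encoder network."""
    return {
        "mu": rng.normal(scale=0.1, size=(n_genes, out_dim)),
        "logvar": rng.normal(scale=0.1, size=(n_genes, out_dim)),
    }

def encode(x, enc):
    """Gaussian encoder: sample z = mu + sigma * eps (reparameterization trick)."""
    mu = x @ enc["mu"]
    logvar = x @ enc["logvar"]
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Simulated log-transformed counts for a handful of cells.
x = np.log1p(rng.poisson(2.0, size=(n_cells, n_genes)).astype(float))

shared_encoder = make_encoder(k_shared)
condition_encoders = {c: make_encoder(k_unshared) for c in condition_types}
batch_encoder = make_encoder(k_batch)

z_shared = encode(x, shared_encoder)                        # K_s dims per cell
z_unshared = {c: encode(x, enc) for c, enc in condition_encoders.items()}  # K_u dims per condition type
z_batch = encode(x, batch_encoder)

# The decoder sees all factor groups concatenated together.
z_all = np.concatenate([z_shared, *z_unshared.values(), z_batch], axis=1)
decoder = rng.normal(scale=0.1, size=(z_all.shape[1], n_genes))
x_recon = z_all @ decoder
print(z_all.shape, x_recon.shape)
```

Keeping the factor groups as separate arrays is what makes it possible to analyze or modify one group (for example, a single condition type) without touching the others.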

Benchmarking was performed on simulated datasets (nine configurations varying batch count, cell type number, and perturbation strength), a glioblastoma (GBM) dataset comprising 21 expression matrices from six patient batches (GSE148842), and a COVID-19 multi-study dataset with age and severity as condition variables. Across these benchmarks, the authors report that scDisInFact outperformed the compared methods, including scINSIGHT, scGen, and scPreGAN, on all three tasks, with higher adjusted Rand index (ARI) scores for cell type clustering, higher AUPRC for condition-key-gene detection, and lower mean squared error (MSE) with higher Pearson correlation for perturbation prediction.
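For readers unfamiliar with the metrics named above, the following sketch shows how ARI, MSE, and Pearson correlation are commonly computed. It is a generic illustration on made-up toy inputs, not the benchmark code used in the paper; the function names are our own, and AUPRC is left out for brevity.

```python
from collections import Counter
from math import comb

import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts: agreement between two labelings, corrected for chance."""
    n = len(labels_true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def mse(x_true, x_pred):
    """Mean squared error over all cells and genes."""
    return float(np.mean((np.asarray(x_true) - np.asarray(x_pred)) ** 2))

def pearson(x_true, x_pred):
    """Pearson correlation between flattened true and predicted expression."""
    return float(np.corrcoef(np.ravel(x_true), np.ravel(x_pred))[0, 1])

# Toy inputs: cluster labels for six cells and a 2x2 expression matrix.
cell_types = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 2]  # same partition, different label names
x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
x_pred = x_true + 0.5          # constant offset: perfectly correlated, nonzero error

print(adjusted_rand_index(cell_types, clusters), mse(x_true, x_pred), pearson(x_true, x_pred))
```

ARI depends only on how cells are grouped, not on the label names, which is why it suits comparing clusterings against annotated cell types. MSE and Pearson correlation are complementary: a prediction can correlate perfectly with the truth while still being offset from it.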

Applications

scDisInFact is well-suited for disease studies where patient cohorts span multiple collection sites or timepoints and where understanding condition-specific transcriptional programs is central to the scientific question. Concrete applications include integrating multi-site patient cohorts for disease subtyping (e.g., stratifying cancer or COVID-19 severity classes), identifying genes that drive disease progression or treatment response while controlling for batch, and predicting how cell populations will respond to drug perturbations or disease states not yet measured experimentally. The framework is particularly valuable in translational research pipelines where confounded batch and condition effects have historically led to spurious findings or discarded data.

Impact

Published in Nature Communications (volume 15, article 912, 2024), scDisInFact addresses a methodological gap that has limited the interpretability of large-scale multi-condition scRNA-seq atlases. By providing a principled probabilistic decomposition of batch and condition effects, the method improves the reliability of downstream analyses, from differential expression to perturbation modeling, in exactly the complex experimental settings that characterize modern disease genomics. A key limitation is that the method requires explicit condition labels for all cells; it is not designed for unsupervised discovery of condition structure. Additionally, as with most VAE-based approaches, interpreting the latent space requires care, and very large datasets may demand substantial GPU memory. Nonetheless, scDisInFact represents a meaningful advance in the growing toolkit for harmonizing heterogeneous single-cell datasets while preserving biologically meaningful variation.

Sources:

  • scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data | Nature Communications
  • GitHub - ZhangLabGT/scDisInFact
  • PubMed entry (PMID 38291052)

Citation

scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data

Zhang, Z., et al. (2024) scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data. Nature Communications 15, 912.

DOI: 10.1038/s41467-024-45227-w

Metrics

GitHub

Stars: 13
Forks: 3
Open Issues: 2
Contributors: 3
Last Push: 1y ago
Language: Jupyter Notebook
License: GPL-3.0

Citations

Total Citations: 20
Influential: 3
References: 48

Tags

batch correction, gene expression, variational autoencoder, perturbation, transcriptomics

Resources

  • GitHub Repository
  • Research Paper