Technical University of Denmark / University of Copenhagen
Variational autoencoder for single-cell RNA-seq that models raw count distributions directly, producing latent cell representations without normalization preprocessing.
Single-cell RNA sequencing (scRNA-seq) captures gene expression across thousands of individual cells simultaneously, but analyzing the resulting data is technically demanding. The measurements are high-dimensional, extremely sparse, and contaminated by dropout events and sequencing depth variation. Most analysis pipelines address this by applying a sequence of normalization and preprocessing transformations before fitting any model — but these steps can distort the underlying count statistics and obscure genuine biological signals.
scVAE, developed by researchers at the Technical University of Denmark and the University of Copenhagen and published in Bioinformatics in 2020, takes a different approach: it models raw integer count data directly using a variational autoencoder (VAE) framework. Rather than normalizing away technical variation, scVAE explicitly parameterizes count distributions — Poisson, negative binomial, or zero-inflated variants — in the decoder, allowing the model to account for noise in a statistically principled way. The result is a low-dimensional latent representation of each cell that captures biological variation while remaining robust to the idiosyncrasies of single-cell count data.
The framework also includes a Gaussian Mixture VAE (GMVAE) extension that incorporates a mixture-of-Gaussians prior in the latent space, encouraging the model to discover discrete cell populations without requiring labeled training data.
scVAE implements the standard VAE framework adapted for discrete count observations. The encoder is a multi-layer fully connected network that maps a raw gene expression vector — dimensionality equal to the number of measured genes, typically 10,000–30,000 — to the parameters of a variational posterior over a low-dimensional Gaussian latent variable. The decoder maps samples from this latent space back to the parameters of a count distribution over all genes; the reconstruction loss is the negative log-likelihood under that distribution rather than mean-squared error.
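For reference, the objective described above can be written in standard VAE notation. This is a sketch under stated assumptions: the symbols are chosen here for illustration and are not taken from the paper. Here x is one cell's count vector over G genes, z is the latent variable, φ and θ are the encoder and decoder parameters, and λ_g(z) stands for the decoder's distribution parameters for gene g. The sketch also assumes that counts are conditionally independent across genes given z and that the prior is a standard normal.

```latex
\begin{aligned}
\mathcal{L}(\theta, \phi; x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
   - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big), \\
\log p_\theta(x \mid z)
  &= \sum_{g=1}^{G} \log p\big(x_g \mid \lambda_g(z)\big),
  \qquad p(z) = \mathcal{N}(0, I).
\end{aligned}
```

The first term is the negative of the reconstruction loss mentioned above; maximizing this bound trades reconstruction quality against how far the posterior moves from the prior.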
Three count likelihood families are supported: Poisson (one parameter per gene per cell), negative binomial (adds a dispersion parameter to model overdispersion common in scRNA-seq), and zero-inflated variants of both (adds a dropout probability to model excess zeros from capture inefficiency). The GMVAE variant replaces the isotropic Gaussian prior with a mixture of K Gaussian components; the inference network simultaneously infers a soft cluster assignment and a within-cluster latent coordinate, and the ELBO is modified to include a KL term over the discrete assignment variable.
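The count likelihoods above can be illustrated with a minimal sketch of their log-probability mass functions for a single gene in a single cell. The function names are hypothetical, and the mean and inverse-dispersion parameterization of the negative binomial is an assumption made here for readability; it is not necessarily the parameterization used in the scVAE code.

```python
import math


def poisson_logpmf(k, rate):
    """log P(k) for a Poisson distribution with mean `rate` (> 0)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def nb_logpmf(k, mean, r):
    """log P(k) for a negative binomial with the given mean and
    inverse-dispersion r (assumed parameterization).
    Variance is mean + mean**2 / r, so large r approaches the Poisson."""
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mean))
        + k * math.log(mean / (r + mean))
    )


def zero_inflated_logpmf(k, base_logpmf, dropout_prob):
    """Mixture of a point mass at zero (weight dropout_prob, in [0, 1))
    and a base count distribution given as a function k -> log P(k)."""
    log_base = base_logpmf(k)
    if k == 0:
        # A zero can come from dropout or from the base distribution itself.
        return math.log(dropout_prob + (1.0 - dropout_prob) * math.exp(log_base))
    return math.log(1.0 - dropout_prob) + log_base


# Example: a zero-inflated negative binomial built from the pieces above.
def zinb_logpmf(k, mean=3.0, r=2.0, dropout_prob=0.3):
    return zero_inflated_logpmf(k, lambda j: nb_logpmf(j, mean, r), dropout_prob)
```

Summing the negated log-probabilities over all genes gives the reconstruction loss for one cell. Zero inflation raises the probability of observing a zero relative to the base distribution, which is the behavior the dropout parameter is meant to capture.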
Training uses the reparameterization trick for gradient estimation through the continuous latent variables. The framework was evaluated on multiple benchmark datasets including MNIST-style synthetic data and real scRNA-seq collections, demonstrating competitive clustering performance relative to specialized dimensionality reduction methods.
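The reparameterization trick mentioned above can be sketched in a few lines for a diagonal Gaussian posterior. This is an illustrative example, not the scVAE implementation: the function names and the example encoder outputs are hypothetical. The key point is that all randomness is isolated in a noise vector eps, so the sample z is a deterministic, differentiable function of the encoder outputs.

```python
import math
import random


def reparameterize(mu, log_var, eps):
    """Map noise eps ~ N(0, I) to a sample z ~ N(mu, diag(exp(log_var)))."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


rng = random.Random(0)                     # fixed seed, for reproducibility only
mu, log_var = [0.5, -1.0], [0.0, -2.0]     # hypothetical encoder outputs for one cell
eps = [rng.gauss(0.0, 1.0) for _ in mu]    # all randomness lives in eps
z = reparameterize(mu, log_var, eps)

# With eps held fixed, z is a smooth function of mu and log_var:
# dz/dmu = 1 and dz/dlog_var = 0.5 * exp(0.5 * log_var) * eps.
kl = kl_to_standard_normal(mu, log_var)
```

In an automatic differentiation framework the same construction lets gradients of the reconstruction term flow back through z into the encoder parameters, while the KL term for a standard normal prior is available in closed form and needs no sampling.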
scVAE is suited to scRNA-seq analysis tasks that benefit from an interpretable probabilistic representation. Cell-type identification is the primary use case: clusters in the GMVAE latent space often correspond to biologically coherent populations that can be annotated using marker gene expression. The continuous latent coordinates from the standard VAE support trajectory and pseudotime analyses, making the tool applicable to developmental biology studies where cells span a continuum of differentiation states. Researchers have also used the reconstruction likelihood as a per-cell quality score to identify low-quality or doublet cells prior to downstream analysis. Because the model operates on raw counts without normalization, it can be applied across datasets generated with different sequencing protocols or depths without protocol-specific preprocessing choices, which can reduce a common source of batch confounding.
scVAE contributed to a broader shift in single-cell analysis toward generative probabilistic models that treat technical noise as a structured component of the data rather than something to be removed. The work helped establish negative binomial and zero-inflated likelihoods as appropriate defaults for scRNA-seq decoders, a convention shared by influential frameworks including scVI and its successors. The paper has been cited by researchers across developmental biology, immunology, and cancer genomics, reflecting the cross-domain relevance of the approach. A practical limitation is that the fully connected architecture does not scale as efficiently as more recent attention-based models to atlases containing millions of cells, and the framework predates dedicated support for multi-modal single-cell data (e.g., joint RNA and protein measurements). Nonetheless, scVAE remains a methodologically clear reference implementation for count-based variational inference in single-cell analysis.
Grønbech, C. H., Vording, M. F., Timshel, P. N., Sønderby, C. K., Pers, T. H., & Winther, O. (2020). scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics, 36(16), 4415-4422.
DOI: 10.1093/bioinformatics/btaa293