Chan Zuckerberg Initiative
A variational autoencoder pretrained on 74 million human single-cell transcriptomes from the CELLxGENE Census for scalable batch correction, cell typing, and data integration.
Single-cell variational inference (scVI) is one of the foundational methods in modern single-cell genomics — a probabilistic deep generative model that learns a low-dimensional latent representation of gene expression while simultaneously modeling technical noise and batch effects. Published in Nature Methods in 2018 by Romain Lopez, Jeffrey Regier, Michael Cole, Michael Jordan, and Nir Yosef, the original scVI paper established the variational autoencoder (VAE) framework as a principled and scalable approach to single-cell RNA-seq analysis, replacing ad hoc normalization and dimensionality reduction pipelines with a statistically coherent generative model of count data.
The version documented here is a substantially scaled-up deployment of scVI: a model pretrained by the Chan Zuckerberg Initiative on the CZ CELLxGENE Discover Census, a curated and harmonized repository of publicly available human single-cell transcriptomics data. The Census 2024-07-01 release encompasses 74.3 million human single-cell RNA-seq profiles drawn from thousands of studies across dozens of tissues, making it by far the largest publicly available human single-cell transcriptomics dataset. Pretraining scVI on this corpus — rather than on individual study-specific datasets as was common practice — produces a general-purpose latent space that captures the full diversity of human cell types and states, enabling the pretrained model to be applied to new datasets through lightweight query mapping rather than full retraining.
This pretrained scVI model is distributed through the scVI-hub platform (described in a Nature Methods 2025 paper), the CELLxGENE Census documentation, and the CZI Virtual Cell Platform. It provides immediate, compute-efficient access to a reference-scale latent space for standard single-cell analysis tasks including batch correction, cell type prediction, dimensionality reduction, and data integration — tasks that previously required training a bespoke model from scratch for each new dataset.
Variational autoencoder with count-appropriate likelihoods: scVI models raw integer count data using a zero-inflated negative binomial (ZINB) or negative binomial likelihood in the decoder, explicitly accounting for the overdispersion and dropout typical of scRNA-seq data without requiring log-normalization or other preprocessing that distorts count statistics. This probabilistic framework provides principled uncertainty estimates and enables model comparison through the evidence lower bound (ELBO).
Explicit batch effect modeling: Batch identity (sequencing technology, laboratory of origin, suspension type, dataset of origin) is provided as a conditional input to the decoder, allowing the encoder to learn a batch-corrected latent representation while the decoder reconstructs batch-specific technical effects. This approach separates biological signal from technical variation in a principled way, unlike post-hoc correction methods.
Pretrained on 74.3 million human cells from CELLxGENE Census: The model available through the Virtual Cell Platform was pretrained on the 2024-07-01 release of the CZ CELLxGENE Discover Census using cells filtered to primary data with a minimum non-zero count threshold, with the top 8,000 highly variable genes selected stratified by batch variables. This large-scale pretraining produces a reference latent space spanning the full breadth of human cell biology.
Query-to-reference mapping for new datasets: Rather than training a new scVI model from scratch for every new dataset — which requires hours to days of GPU computation — the pretrained Census model enables query mapping: a new scRNA-seq dataset is rapidly mapped onto the existing latent space through a lightweight online learning step, producing batch-corrected embeddings in a fraction of the time. This enables immediate analysis of new data against the full Census reference.
Cell type prediction through label transfer: Because the pretrained latent space encodes well-separated cell type clusters spanning the diversity of human tissues, it supports cell type annotation of new datasets through nearest-neighbor label transfer from the Census reference. Cells in the query dataset are mapped to latent positions and assigned the cell type of their nearest Census neighbors, providing automatic annotation without manual marker gene curation.
Approximately 7.1 million parameters: The scVI architecture is deliberately compact, with approximately 7.1 million parameters in the Census pretrained model. This compactness, combined with stochastic mini-batch optimization, makes the model trainable and deployable on standard single GPU hardware, democratizing access to probabilistic single-cell modeling even for resource-constrained research groups.
Integration with the scvi-tools and scverse ecosystem: The pretrained model is distributed through scVI-tools (the software package implementing scVI and related methods) and is compatible with AnnData and Scanpy, the standard data formats and analysis tools used across the single-cell biology community. This integration means that researchers already working in the Python single-cell ecosystem can adopt the pretrained Census model with minimal workflow changes.
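The nearest-neighbor label transfer described above can be sketched in a few lines of plain NumPy. This is an illustrative toy, not the Census pipeline's exact procedure: the function name `transfer_labels` and the simple k-NN majority vote are assumptions for the example.

```python
import numpy as np
from collections import Counter

def transfer_labels(ref_latent, ref_labels, query_latent, k=5):
    """Assign each query cell the majority cell-type label of its
    k nearest reference cells in the latent space (Euclidean distance).

    ref_latent:   (n_ref, d) reference embeddings
    ref_labels:   length-n_ref list of cell-type strings
    query_latent: (n_query, d) query embeddings
    """
    ref_latent = np.asarray(ref_latent, dtype=float)
    query_latent = np.asarray(query_latent, dtype=float)
    predictions = []
    for q in query_latent:
        # squared Euclidean distance to every reference cell
        d2 = np.sum((ref_latent - q) ** 2, axis=1)
        nearest = np.argsort(d2)[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# toy example: two well-separated "cell type" clusters in a 2-D latent space
ref = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
query = np.array([[0.05, 0.05], [5.05, 5.05]])
print(transfer_labels(ref, labels, query, k=3))  # ['T cell', 'B cell']
```

In practice the reference embeddings come from the pretrained Census latent space and the query embeddings from the mapping step, with an approximate nearest-neighbor index replacing the brute-force distance computation at Census scale.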
scVI's architecture consists of two neural networks: an encoder that maps a raw gene expression vector (plus batch covariates) to the parameters of a variational posterior over a low-dimensional continuous latent variable z, and a decoder that maps samples of z (plus batch covariates) to the parameters of a gene expression likelihood distribution. The encoder is a multi-layer fully connected network; the decoder is similarly structured. The KL divergence term in the evidence lower bound (ELBO) regularizes the variational posterior over z toward a standard Gaussian prior, encouraging a smooth, interpretable latent space in which nearby cells share biological properties.
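In standard VAE notation, with encoder parameters φ, decoder parameters θ, expression vector x, and batch covariate s, the training objective sketched above is the evidence lower bound:

```latex
\mathcal{L}(x, s)
  = \mathbb{E}_{q_\phi(z \mid x, s)}\big[\log p_\theta(x \mid z, s)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x, s) \,\big\|\, p(z)\big),
  \qquad p(z) = \mathcal{N}(0, I)
```

The first term rewards faithful reconstruction of the observed counts under the count likelihood; the second pulls the per-cell posteriors toward the shared standard Gaussian prior, which is what keeps the latent space smooth.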
The count likelihood in the decoder is a zero-inflated negative binomial distribution (ZINB) for the original scVI or a negative binomial (NB) distribution in later configurations, with gene-specific dispersion parameters that capture the overdispersion characteristic of scRNA-seq data. Batch covariates are injected into the decoder as one-hot encodings or embeddings, enabling the decoder to learn batch-specific scaling and dispersion while keeping the encoder's latent representation batch-independent. The library size (total UMI count per cell) is also modeled explicitly as a latent variable or treated as an observed covariate, ensuring that differences in sequencing depth do not confound the biological latent representation.
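The NB and ZINB log-likelihoods described above can be written out directly in the mean/inverse-dispersion parameterization used in the scVI literature. This is a standalone numerical sketch (the function names are assumptions for the example), not scVI's actual implementation, which evaluates these terms in batched tensor form:

```python
from math import exp, lgamma, log

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu
    and inverse-dispersion theta: larger theta approaches Poisson,
    smaller theta means stronger overdispersion."""
    return (lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
            + theta * log(theta / (theta + mu))
            + x * log(mu / (theta + mu)))

def zinb_log_pmf(x, mu, theta, pi):
    """Zero-inflated NB: with probability pi the observation is an
    excess (technical) zero, otherwise it is drawn from NB(mu, theta)."""
    if x == 0:
        return log(pi + (1.0 - pi) * exp(nb_log_pmf(0, mu, theta)))
    return log(1.0 - pi) + nb_log_pmf(x, mu, theta)

# sanity checks with mu=4, theta=2: NB probabilities over a wide support
# sum to ~1, and zero inflation raises the probability of a zero count
total = sum(exp(nb_log_pmf(k, 4.0, 2.0)) for k in range(200))
p0_nb = exp(nb_log_pmf(0, 4.0, 2.0))           # (theta/(theta+mu))^theta = 1/9
p0_zinb = exp(zinb_log_pmf(0, 4.0, 2.0, 0.3))  # 0.3 + 0.7 * 1/9
print(round(total, 6), round(p0_nb, 4), round(p0_zinb, 4))
```

The zero-inflation mixture is what lets the model attribute some zeros to dropout rather than to genuinely absent expression; with pi = 0 the ZINB reduces exactly to the NB.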
For the CELLxGENE Census pretraining, cells were filtered to include only primary data (is_primary_data == True) with minimum non-zero count thresholds to exclude poor-quality profiles. Gene selection was performed by identifying the top 8,000 highly variable genes stratified by suspension type and assay type — an important step that ensures the selected genes vary biologically across cell types rather than merely varying due to technical platform differences. Batch variables used during training include suspension type (cell vs. nucleus), assay type (10x Chromium, Smart-seq, etc.), and dataset of origin, covering the major sources of technical variation in the Census corpus.
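The point of stratifying gene selection by batch can be illustrated with a deliberately simplified NumPy sketch. The Census pipeline uses a scanpy-style highly-variable-gene procedure; the ranking scheme below (per-batch dispersion ranks, averaged across batches) is an assumption chosen only to show why a pure batch-effect gene gets excluded:

```python
import numpy as np

def batch_stratified_hvg(counts, batches, n_top):
    """Select highly variable genes while stratifying by batch.

    counts:  (n_cells, n_genes) count matrix
    batches: length-n_cells array of batch identifiers
    n_top:   number of genes to keep

    Genes are ranked by dispersion (variance / mean) within each batch
    separately, and per-batch ranks are summed, so a gene must vary
    within batches, not merely between them, to be selected.
    """
    counts = np.asarray(counts, dtype=float)
    batches = np.asarray(batches)
    rank_sum = np.zeros(counts.shape[1])
    for b in np.unique(batches):
        sub = counts[batches == b]
        mean = sub.mean(axis=0)
        disp = sub.var(axis=0) / np.maximum(mean, 1e-12)
        rank_sum += np.argsort(np.argsort(-disp))  # rank 0 = most dispersed
    return np.argsort(rank_sum)[:n_top]

rng = np.random.default_rng(0)
batches = np.array([0] * 50 + [1] * 50)
counts = np.zeros((100, 3))
counts[:, 0] = rng.poisson(20, 100)           # varies within every batch
counts[:, 1] = np.where(batches == 0, 5, 50)  # pure batch effect, flat within batch
counts[:, 2] = rng.poisson(5, 100)            # varies within every batch
selected = batch_stratified_hvg(counts, batches, n_top=2)
print(sorted(int(g) for g in selected))  # -> [0, 2]; the batch-effect gene is excluded
```

An unstratified variance ranking on the same matrix would put gene 1 first, since its between-batch jump from 5 to 50 dominates the pooled variance; stratification removes exactly that failure mode.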
The resulting pretrained model with approximately 7.1 million parameters was validated on benchmark tasks across multiple tissue types. Batch correction metrics evaluated on Census cells from adipose tissue and spinal cord using assay type, dataset, and suspension type batch labels show competitive performance relative to other integration methods. Cell type prediction benchmarks demonstrate high accuracy for annotation of held-out cells through nearest-neighbor transfer in the pretrained latent space.
The Census-pretrained scVI model serves as a general-purpose analysis backbone for the human single-cell transcriptomics community. Its most immediate application is data integration and batch correction: researchers who have generated scRNA-seq data from a new cohort, tissue, or experimental condition can map their data onto the Census-scale latent space to obtain batch-corrected embeddings for visualization, clustering, and downstream analysis, without spending days training a model on their specific dataset.

This is particularly valuable for rare disease research, where patient cohorts are often too small to train robust single-cell models from scratch; by mapping rare-disease cells onto the Census reference, researchers can contextualize their data against the full breadth of healthy human transcriptomes and identify how disease-state cells diverge from healthy reference populations. The cell type prediction capability enables automatic annotation of new datasets, saving substantial manual curation effort in large-scale atlas projects, and researchers building multi-study integrated atlases for specific tissues (brain, lung, gut, etc.) can use the pretrained model to harmonize data across studies that used different protocols, cell isolation methods, or sequencing platforms.

The probabilistic nature of scVI also makes it applicable to differential expression analysis, since the learned latent space provides a principled framework for comparing expression distributions between conditions while accounting for technical confounders, and to imputation of genes not captured by a given assay technology.
scVI, from its original 2018 publication to its Census-scale deployment in 2024, represents one of the most influential contributions to the computational single-cell biology toolkit. The original Nature Methods paper established the VAE framework as a standard approach for single-cell analysis, accumulating thousands of citations and inspiring a broad family of related models (scANVI for semi-supervised annotation, totalVI for CITE-seq, scArches for reference mapping, and many others) that are now part of the scverse ecosystem. The decision by CZI to pretrain scVI at Census scale, on 74.3 million cells from the largest human single-cell reference corpus available, reflects the model's maturity and the community's confidence in it as a practical foundation for cell biology AI. Distribution through scVI-hub (described in Nature Methods 2025) and the CELLxGENE Census platform makes Census-scale pretrained models immediately accessible to researchers without specialized infrastructure, demonstrating how large precomputed single-cell references can be made broadly usable.

A key limitation of scVI relative to newer transformer-based foundation models (Geneformer, scGPT, scFoundation) is its comparatively simple fully connected encoder, which may capture less of the complex gene-gene dependency structure than attention-based architectures. scVI also models each gene independently in the likelihood, without explicitly modeling the joint distribution over gene pairs, so it may miss higher-order statistical dependencies. These architectural limitations are offset, however, by scVI's exceptional computational efficiency, theoretical clarity, and long track record of reliable performance across diverse datasets, properties that newer and more complex models have not yet fully matched in practical deployments.