bio.rodeo

Single-cell

XA4C

University of Calgary

Explainable autoencoder for transcriptome analysis that identifies critical genes using SHAP-based attribution of neural network latent representations.

Released: 2023

Overview

XA4C (eXplainable Autoencoder for Critical genes) is a computational framework for interpretable transcriptome analysis developed by Qing Li, Yang Yu, Pathum Kossinna, Theodore Lun, Wenyuan Liao, and Qingrun Zhang at the University of Calgary. Published as a bioRxiv preprint in July 2023, XA4C addresses a longstanding gap between the power of representation learning for transcriptomics and the need for biologically interpretable gene-level insights. While autoencoders and other deep learning models have proven effective at learning compact representations of high-dimensional gene expression data, the resulting latent variables are difficult to interpret, and identifying which specific genes drive the learned representations — and therefore which genes are biologically important — has remained an open challenge.

The central contribution of XA4C is the concept of "Critical genes": genes that contribute disproportionately to the learned latent variables of an autoencoder, as quantified by SHAP (SHapley Additive exPlanations) values. XA4C applies SHAP, an XAI (eXplainable Artificial Intelligence) technique built on Shapley values from cooperative game theory, to the latent activations of an autoencoder trained on RNA-seq data, and so assigns each gene a quantitative importance score for each latent dimension. This score reflects how much that gene's expression level influences the model's compressed representation of a sample. It goes beyond simple correlation or variance-based gene selection by capturing the complex, nonlinear gene-gene interactions encoded in the network.
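To make the attribution idea concrete, the sketch below computes exact Shapley values for a single latent variable of a toy encoder. It is an illustration under stated assumptions, not the XA4C implementation: the three-gene encoder `latent`, its weights, the sample, and the all-zero baseline are all hypothetical, and exact enumeration is only feasible for a handful of inputs.

```python
from itertools import combinations
from math import factorial, tanh

# Hypothetical toy encoder weights for three "genes" (not from XA4C).
WEIGHTS = [0.8, -0.5, 0.3]

def latent(expr):
    """Map an expression vector to one latent value (nonlinear)."""
    linear = sum(w * x for w, x in zip(WEIGHTS, expr))
    # The product term makes genes 0 and 1 interact.
    return tanh(linear + 0.4 * expr[0] * expr[1])

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent genes take baseline values."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude gene i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

sample = [1.2, 0.4, 2.0]    # hypothetical expression values
baseline = [0.0, 0.0, 0.0]  # hypothetical reference sample
phi = shapley_values(latent, sample, baseline)
```

The per-gene values in `phi` sum to `latent(sample) - latent(baseline)`, which is the efficiency property that lets a latent value be split across genes without remainder.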

The biological motivation is that traditional gene selection methods — differentially expressed (DiffEx) genes, differentially co-expressed (DiffCoEx) genes, and hub genes from co-expression networks — each capture a specific statistical property of the transcriptome but may miss important genes whose role emerges through complex interaction patterns that are beyond marginal effects or pairwise correlations. The autoencoder, by learning a compressed representation of the full transcriptome, potentially captures higher-order gene interaction patterns; XA4C makes these patterns interpretable by attributing the learned representation back to individual genes.

Key Features

  • SHAP-based gene attribution in autoencoders: Applies SHapley Additive exPlanations (SHAP) to compute each gene's quantitative contribution to each latent variable in a trained autoencoder, providing principled attribution of neural network decisions to input features.
  • Critical gene concept: Introduces "Critical genes" as a new category of biologically important genes distinct from differentially expressed genes and hub genes, capturing genes whose importance emerges through complex interaction patterns encoded in learned representations.
  • Discovery of novel disease-relevant genes: Demonstrates that Critical genes have higher enrichment in the DisGeNET comprehensive disease gene database than differentially expressed or hub genes, suggesting they capture biologically relevant variation missed by traditional methods.
  • Application across six cancer types: Applied to RNA-seq data from The Cancer Genome Atlas (TCGA) spanning six different cancer types, identifying Critical genes with cancer-specific interaction patterns and pathway enrichments.
  • Complementarity with traditional methods: Shows empirically that Critical genes have minimal overlap with differentially expressed genes and hub genes, suggesting that they carry biological information not captured by existing methods rather than repackaging them.
  • Pathway-level interpretation: Critical genes identified by XA4C were found to cluster in specific metabolic and regulatory pathways — such as the Lysine degradation pathway (hsa00310) — with distinct interaction patterns in tumor versus normal tissues, enabling mechanistic biological hypotheses.
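The overlap and enrichment comparisons in the list above can be sketched with two small set-based helpers. This is a minimal illustration, not the paper's procedure: the source does not say which overlap measure or enrichment test was used, so the Jaccard index and the one-sided hypergeometric test are assumptions, and all gene identifiers are hypothetical placeholders.

```python
from math import comb

def jaccard(a, b):
    """Jaccard index of two gene sets (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def hypergeom_enrichment_p(selected, annotated, universe):
    """P(overlap >= observed) if the selected genes were drawn at random."""
    universe = set(universe)
    selected = set(selected) & universe
    annotated = set(annotated) & universe
    N, K, n = len(universe), len(annotated), len(selected)
    k = len(selected & annotated)
    upper_tail = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    )
    return upper_tail / comb(N, n)

# Hypothetical data: 20 genes, 5 of them annotated as disease genes.
universe = [f"g{i}" for i in range(20)]
disease_genes = {"g0", "g1", "g2", "g3", "g4"}
critical = {"g0", "g1", "g2", "g10"}
diff_expressed = {"g0", "g10", "g11", "g12"}

overlap = jaccard(critical, diff_expressed)
p_critical = hypergeom_enrichment_p(critical, disease_genes, universe)
p_diffex = hypergeom_enrichment_p(diff_expressed, disease_genes, universe)
```

A low `overlap` together with `p_critical` smaller than `p_diffex` would correspond to the pattern the bullets describe: a gene set that is largely distinct yet more enriched for annotated disease genes.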

Technical Details

XA4C's computational workflow consists of three stages: autoencoder training, SHAP attribution, and Critical gene prioritization. In the first stage, a standard fully connected autoencoder is trained on a gene expression matrix (samples × genes) with a bottleneck latent dimension substantially smaller than the input gene count, learning a compressed representation that captures the major axes of transcriptional variation in the dataset. The encoder maps samples to a latent space and the decoder reconstructs the input, with training minimizing mean squared error reconstruction loss.

In the second stage, SHAP values are computed for each gene with respect to each latent variable by treating the encoder as a function from gene expression space to latent space and applying the KernelSHAP or DeepSHAP algorithms to attribute each latent variable's value to individual input genes. This produces a gene × latent-variable SHAP attribution matrix that quantifies how much each gene's expression value influences each learned latent dimension.

In the third stage, genes are ranked by their aggregated SHAP importance across latent variables, and the top-ranked genes are designated Critical genes for downstream biological analysis.

The framework was validated on TCGA RNA-seq data from six cancer types: breast cancer (BRCA), colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). In each cancer, Critical genes showed significantly higher enrichment in DisGeNET disease gene annotations compared to differentially expressed genes identified by DESeq2 and hub genes from WGCNA co-expression networks.
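The third stage, turning per-sample SHAP values into a gene ranking, might look like the sketch below. The source only says that SHAP importance is "aggregated" across latent variables, so the mean absolute value used here is an assumption; `rank_critical_genes`, the nested-list layout, and the gene names are hypothetical.

```python
def rank_critical_genes(shap_values, gene_names, top_k):
    """Rank genes by mean |SHAP| over samples and latent variables.

    shap_values[s][g][z] is the attribution of gene g to latent z in sample s.
    """
    n_genes = len(gene_names)
    totals = [0.0] * n_genes
    for sample in shap_values:
        for g in range(n_genes):
            totals[g] += sum(abs(v) for v in sample[g])
    n_terms = len(shap_values) * len(shap_values[0][0])
    scores = [t / n_terms for t in totals]
    order = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return [(gene_names[g], scores[g]) for g in order[:top_k]]

# Hypothetical attributions: 2 samples x 3 genes x 2 latent variables.
shap_values = [
    [[0.5, -0.1], [0.0, 0.05], [-0.3, 0.4]],
    [[-0.4, 0.2], [0.1, 0.0], [0.2, -0.5]],
]
genes = ["GENE_A", "GENE_B", "GENE_C"]
top = rank_critical_genes(shap_values, genes, top_k=2)
```

Taking absolute values before averaging keeps positive and negative attributions from cancelling, so a gene that pushes a latent variable in different directions across samples still ranks highly.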

Applications

XA4C is applicable to any transcriptomic study where the goal is gene prioritization for functional follow-up. Cancer genomics is a primary use case: by applying XA4C to TCGA or similar datasets, researchers can identify genes that are not prominently differentially expressed but nonetheless play important roles in the transcriptional programs distinguishing tumor from normal tissue, potentially revealing drug targets or prognostic biomarkers that evade traditional analysis. Drug response prediction is another application: autoencoders trained on drug treatment expression data could be analyzed with XA4C to identify genes whose expression changes drive the transcriptional response to a drug, providing mechanistic hypotheses for mode-of-action studies. Single-cell RNA-seq is a natural extension of the framework: applying XA4C to autoencoders trained on scRNA-seq data could identify critical genes that define cell-type-specific latent structure, potentially complementing existing cell-type annotation and trajectory analysis workflows. The explainability framework is also applicable to variational autoencoders (VAEs), which are widely used in single-cell biology, by analyzing the attribution of gene expression to latent dimensions in a VAE's encoder.

Impact

XA4C contributes to a growing effort in computational biology to combine the representational power of deep learning with the interpretability requirements of biological discovery. The demonstration that SHAP attributions of autoencoder latent variables identify a distinct and complementary set of disease-relevant genes, with higher DisGeNET enrichment than both differentially expressed genes and co-expression network hub genes, makes a substantive empirical case for the biological value of representation-learning-derived gene prioritization. The framework addresses a practical barrier to adoption of deep learning in genomics: the "black box" criticism that latent representations are uninterpretable. By providing gene-level attribution scores grounded in the Shapley value axioms from cooperative game theory, XA4C offers a transparent accounting of what the network has learned. Key limitations include the computational cost of SHAP attribution for large gene sets (the full transcriptome may require approximations), the dependence on autoencoder architecture choices that may influence which genes are identified as Critical, and the need for biological validation of Critical gene predictions through wet-lab experiments. The work remains a preprint, and peer-reviewed benchmarking against alternative explainability methods, such as gradient-based approaches like saliency maps and integrated gradients, would strengthen confidence in the approach.
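The cost limitation can be illustrated with a sampling approximation. Exact Shapley values need on the order of 2^n model evaluations for n genes, while the sketch below uses random gene orderings and needs only `n_permutations * n`. This is generic permutation sampling, not KernelSHAP or DeepSHAP and not the XA4C code; `toy_latent` and all values are hypothetical.

```python
import random

def sampled_shapley(f, x, baseline, n_permutations=200, seed=0):
    """Monte Carlo Shapley estimate from random gene orderings."""
    rng = random.Random(seed)
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]     # switch gene i from baseline to observed
            new = f(current)
            phi[i] += new - prev  # marginal contribution in this ordering
            prev = new
    return [p / n_permutations for p in phi]

def toy_latent(expr):
    """Hypothetical stand-in for one latent output of an encoder."""
    linear = sum(0.1 * (j + 1) * v for j, v in enumerate(expr))
    return linear + 0.05 * expr[0] * expr[-1]  # first and last genes interact

x = [1.0, 2.0, 0.5, 1.5, 3.0, 0.2]  # hypothetical expression values
baseline = [0.0] * len(x)           # hypothetical reference sample
estimates = sampled_shapley(toy_latent, x, baseline)
```

Each ordering's contributions telescope to `toy_latent(x) - toy_latent(baseline)`, so the estimates always sum to the right total; only the split between interacting genes carries sampling noise.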

Tags

gene expression, cell type annotation, autoencoder, self-supervised, representation learning, genomics, cancer biology

Resources

GitHub Repository
Research Paper