Max Delbrück Center for Molecular Medicine
A supervised variational autoencoder that learns a unified, tissue-aware latent representation of bulk RNA-seq, compressing 16,115 genes into 121 dimensions across 42 tissues.
The Flexynesis Tissue VAE is a supervised variational autoencoder that learns a unified, tissue-aware representation of bulk RNA-seq data. Despite the maturity of bulk transcriptomics, the field has lacked a single shared latent space that generalizes across the major public expression compendia. This model addresses that gap by training one encoder on harmonized data drawn from TCGA, GTEx, and ARCHS4, yielding compact embeddings that capture tissue identity while remaining robust to the technical heterogeneity that separates those resources.
Developed by Aakriti Pande, Bora Uyar, and Altuna Akalin at the Berlin Institute for Medical Systems Biology (BIMSB) of the Max Delbrück Center for Molecular Medicine, the work was posted to bioRxiv in June 2026. The model is presented in the paper "An atlas-scale generative model for unified representation learning of bulk RNA-seq data."
The name reflects the model's origin within the broader Flexynesis ecosystem, a deep-learning toolkit from the same lab for bulk multi-omics integration in precision oncology. The memory-safe HDF5 data loader built for this study was contributed upstream to Flexynesis. The tissue VAE itself, however, is a standalone, single-modality model focused on bulk RNA-seq representation, and is cataloged here as such.
The architecture is a supervised variational autoencoder, with a denoising variant for added robustness, trained on an HDF5 compendium of 118,263 training and 28,274 test samples. Inputs are 16,115 genes shared across the source datasets; the encoder produces a 121-dimensional latent vector that is jointly optimized for reconstruction and for predicting one of 42 tissue categories.
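To make the joint objective concrete, here is a minimal standard-library sketch of a supervised VAE loss: a reconstruction term, a KL term that regularizes the latent vector, and a cross-entropy term for the tissue label. The mean-squared-error reconstruction loss, the weights `beta` and `gamma`, and all function names are illustrative assumptions; the source does not state which loss terms or weights the authors use.

```python
import math
import random

# Sizes quoted in the text; everything else in this sketch is an assumption.
N_GENES, N_LATENT, N_TISSUES = 16115, 121, 42


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one draw per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_divergence(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def mse(x, x_hat):
    """Mean squared reconstruction error over genes (assumed loss choice)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def supervised_vae_loss(x, x_hat, mu, logvar, logits, label, beta=1.0, gamma=1.0):
    """Reconstruction + beta * KL + gamma * classification (hypothetical weights)."""
    return (
        mse(x, x_hat)
        + beta * kl_divergence(mu, logvar)
        + gamma * cross_entropy(logits, label)
    )


# Toy call with random stand-ins for the encoder, decoder, and classifier outputs.
rng = random.Random(0)
x = [rng.random() for _ in range(N_GENES)]
x_hat = [v + rng.gauss(0.0, 0.1) for v in x]
mu = [rng.gauss(0.0, 0.5) for _ in range(N_LATENT)]
logvar = [rng.gauss(0.0, 0.1) for _ in range(N_LATENT)]
z = reparameterize(mu, logvar, rng)
logits = [rng.gauss(0.0, 1.0) for _ in range(N_TISSUES)]
loss = supervised_vae_loss(x, x_hat, mu, logvar, logits, label=7)
```

In a real implementation these terms would be computed on tensors by a deep-learning framework and backpropagated through the encoder, decoder, and classifier head; the pure-Python version only shows how the three terms combine.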
On held-out data the model attains 94.9% balanced accuracy and 96.2% weighted F1 for tissue-of-origin classification. External validation on 734 pediatric tumor samples from the TARGET project yields 84.6% agreement, indicating that the representation transfers to an out-of-distribution cohort distinct from the adult tissues that dominate training.
The code is MIT-licensed; trained weights (vae_tissue.final_model.pth), the HDF5 compendium, and precomputed embeddings are released on Zenodo under CC-BY-4.0.
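The two classification metrics reported above differ in how they treat class imbalance. The sketch below computes both from label lists using the standard definitions: balanced accuracy is the unweighted mean of per-class recall, and weighted F1 averages per-class F1 with weights proportional to class support. The toy labels are invented for illustration and are not drawn from the paper's evaluation.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare tissues count as much as common ones."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / n for c, n in support.items()) / len(support)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to true-class support."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for c, n in support.items():
        tp = hits[c]
        precision = tp / predicted[c] if predicted[c] else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n * f1
    return total / len(y_true)


# Invented toy labels: four liver samples, two lung samples, one liver misclassified.
y_true = ["liver", "liver", "liver", "liver", "lung", "lung"]
y_pred = ["liver", "liver", "liver", "lung", "lung", "lung"]
bal_acc = balanced_accuracy(y_true, y_pred)  # (0.75 + 1.0) / 2 = 0.875
w_f1 = weighted_f1(y_true, y_pred)
```

Because balanced accuracy gives every tissue equal weight, it is the more demanding number when some of the 42 categories have few samples, while weighted F1 tracks performance on the tissues that dominate the test set.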
The model provides ready-to-use embeddings for bulk RNA-seq, supporting tissue-of-origin prediction, sample quality control, and the placement of new samples within a shared reference space. Because the latent representation is compact and tissue-aware, it can serve as a feature extractor for downstream classifiers or as a screening tool for cancers of unknown primary, where identifying the likely originating tissue informs diagnosis. An interactive web application lets researchers embed their own expression profiles without configuring the training pipeline, lowering the barrier for wet-lab groups.
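One simple way to place a new sample within a reference space, sketched here under stated assumptions, is to compare its embedding with per-tissue centroids computed from reference embeddings. This nearest-centroid approach is an illustration only, not the method used by the authors or the web application; the function names are hypothetical, and the 2-dimensional toy vectors stand in for 121-dimensional latent vectors.

```python
import math
from collections import defaultdict


def tissue_centroids(embeddings, labels):
    """Average the reference embeddings of each tissue into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def nearest_tissue(query, centroids):
    """Return (tissue, distance) for the centroid closest to the query embedding."""
    best = min(centroids, key=lambda lab: math.dist(query, centroids[lab]))
    return best, math.dist(query, centroids[best])


# Toy 2-D stand-ins for latent vectors; real embeddings would have 121 dimensions.
reference = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
labels = ["liver", "liver", "lung", "lung"]
centroids = tissue_centroids(reference, labels)
tissue, distance = nearest_tissue([11.0, 9.0], centroids)
```

The returned distance can double as a rough quality-control signal: a sample far from every centroid may be mislabeled, degraded, or from a tissue absent from the reference. Any threshold for "far" would need to be calibrated on real embeddings.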
By offering a single, validated latent space spanning the three most widely used bulk expression atlases, the Flexynesis Tissue VAE reduces the need for ad hoc, dataset-specific embeddings and gives the bulk RNA-seq community a reusable representation analogous to those that foundation models have provided for single-cell data. Its transfer to pediatric tumors suggests utility beyond the adult-dominated training distribution, though performance on rarer tissues and on heavily perturbed disease states remains to be characterized more fully. As a recent preprint, its long-term adoption is still emerging, but the public checkpoint, permissive code license, and live demo position it for immediate practical use.
Pande, A., et al. (2026) An atlas-scale generative model for unified representation learning of bulk RNA-seq data. openRxiv.
DOI: 10.64898/2026.06.18.733198