The Flexynesis Tissue VAE is a supervised variational autoencoder that learns a unified, tissue-aware representation of bulk RNA-seq data. Despite the maturity of bulk transcriptomics, the field has lacked a single shared latent space that generalizes across the major public expression compendia. This model addresses that gap by training one encoder on harmonized data drawn from TCGA, GTEx, and ARCHS4, yielding compact embeddings that capture tissue identity while remaining robust to the technical heterogeneity that separates those resources.

The model was developed by Aakriti Pande, Bora Uyar, and Altuna Akalin at the Berlin Institute for Medical Systems Biology (BIMSB) of the Max Delbrück Center for Molecular Medicine. It is described in the paper "An atlas-scale generative model for unified representation learning of bulk RNA-seq data," posted to bioRxiv in June 2026.

The name reflects the model's origin within the broader Flexynesis ecosystem, a deep-learning toolkit from the same lab for bulk multi-omics integration in precision oncology. The memory-safe HDF5 data loader built for this study was contributed upstream to Flexynesis. The tissue VAE itself, however, is a standalone, single-modality model focused on bulk RNA-seq representation, and is cataloged here as such.

Key Features

Atlas-scale training: Learns from 118,263 bulk RNA-seq samples spanning TCGA, GTEx, and ARCHS4, harmonized to a common gene space and tissue ontology.
Compact latent space: Compresses 16,115 genes into a 121-dimensional representation, mapped to 42 UBERON tissue categories.
Supervised, denoising design: Tissue labels guide the encoder, and a denoising VAE variant improves robustness to technical noise across cohorts.
Strong tissue-of-origin accuracy: Reaches 94.9% balanced accuracy and a weighted F1 of 96.2% on held-out tissue classification.
Independent validation: Generalizes to 734 pediatric TARGET tumor samples, with 84.6% agreement with the expected tissue of origin.
Interactive and reusable: Ships with a web demo and a fixed checkpoint that loads without retraining.
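The two headline metrics above weight classes differently, which matters when tissue categories are unevenly sized. The following is a minimal, standard-library sketch of how balanced accuracy and weighted F1 are conventionally defined; it is not the authors' evaluation code, and the toy labels are made up for illustration.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare tissues count as much as common ones."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / support[c] for c in support) / len(support)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for c, n in support.items():
        precision = correct[c] / predicted[c] if predicted[c] else 0.0
        recall = correct[c] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Toy labels (hypothetical, not data from the paper): 8 liver, 2 lung,
# with one lung sample misclassified as liver.
y_true = ["liver"] * 8 + ["lung"] * 2
y_pred = ["liver"] * 8 + ["liver", "lung"]

print(balanced_accuracy(y_true, y_pred))  # 0.75, versus a plain accuracy of 0.9
print(weighted_f1(y_true, y_pred))
```

In this toy case a single error in the small class pulls balanced accuracy well below plain accuracy, which is why balanced accuracy is the more demanding figure for a label set with many tissue categories of unequal size.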

Technical Details

The architecture is a supervised variational autoencoder, with a denoising variant for added robustness. It is trained on an HDF5 compendium split into 118,263 training and 28,274 test samples. Inputs are the 16,115 genes shared across the source datasets; the encoder produces a 121-dimensional latent vector that is jointly optimized for reconstruction and for predicting one of 42 tissue categories. On held-out data the model attains 94.9% balanced accuracy and a weighted F1 of 96.2% for tissue-of-origin classification. External validation on 734 pediatric tumor samples from the TARGET project yields 84.6% agreement with the expected tissue of origin, which suggests that the representation transfers to an out-of-distribution cohort distinct from the adult tissues that dominate training. The code is MIT-licensed; the trained weights (vae_tissue.final_model.pth), the HDF5 compendium, and precomputed embeddings are released on Zenodo under CC-BY-4.0.
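The joint objective described above can be illustrated with a framework-free sketch of a generic supervised VAE loss for a single sample: a reconstruction term, a KL term that regularizes the latent distribution, and a cross-entropy term for the tissue label. This is an assumption-laden illustration, not the released implementation. The weights `beta` and `gamma`, the use of mean squared error, and the masking noise in `corrupt` are hypothetical choices that the source does not specify, and the toy vectors are far smaller than the real 16,115-gene input and 121-dimensional latent.

```python
import math
import random


def corrupt(x, drop_prob, rng):
    """Masking noise for a denoising variant (one common choice, assumed here)."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def reconstruction_mse(x, x_hat):
    """Mean squared error between the clean input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed stably."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[label]


def supervised_vae_loss(x, x_hat, mu, logvar, logits, label, beta=1.0, gamma=1.0):
    # beta and gamma are hypothetical weights; the source does not state them.
    return (reconstruction_mse(x, x_hat)
            + beta * kl_to_standard_normal(mu, logvar)
            + gamma * cross_entropy(logits, label))


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]                     # toy "expression" vector
noisy = corrupt(x, drop_prob=0.25, rng=rng)  # a real encoder would see this
mu, logvar = [0.0, 0.0], [0.0, 0.0]          # stand-ins for encoder outputs
z = reparameterize(mu, logvar, rng)          # a real decoder would see this
logits = [0.0] * 42                          # uninformative scores for 42 tissues

# With a perfect reconstruction and a standard-normal posterior, only the
# classification term remains, and it reduces to log(42).
loss = supervised_vae_loss(x, x, mu, logvar, logits, label=7)
print(loss, math.log(42))
```

In the denoising setting the reconstruction target is the clean input `x` even though the encoder receives `noisy`, which is what pushes the latent code to ignore that kind of corruption.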

Applications

The model provides ready-to-use embeddings for bulk RNA-seq, supporting tissue-of-origin prediction, sample quality control, and the placement of new samples within a shared reference space. Because the latent representation is compact and tissue-aware, it can serve as a feature extractor for downstream classifiers or as a screening tool for cancers of unknown primary, where identifying the likely originating tissue informs diagnosis. An interactive web application lets researchers embed their own expression profiles without configuring the training pipeline, lowering the barrier for wet-lab groups.
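Placing a new sample within the reference space can be as simple as a nearest-neighbor lookup over precomputed embeddings. The sketch below assumes the embeddings are already available as plain vectors with tissue labels. The function names, the choice of Euclidean distance, and the 2-D toy vectors are all hypothetical, and it does not use the released checkpoint or the web application.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_tissue(query, reference, k=3):
    """Majority tissue label among the k reference embeddings nearest to `query`.

    `reference` is a list of (embedding, tissue_label) pairs.
    """
    nearest = sorted(reference, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings (hypothetical); real embeddings would be 121-dimensional.
reference = [
    ([0.0, 0.1], "liver"),
    ([0.2, 0.0], "liver"),
    ([0.1, 0.2], "liver"),
    ([5.0, 5.1], "lung"),
    ([5.2, 4.9], "lung"),
    ([4.9, 5.0], "lung"),
]

print(knn_tissue([0.1, 0.1], reference))  # liver
print(knn_tissue([5.0, 5.0], reference))  # lung
```

The same neighbor distances could also serve as a rough quality-control signal: a sample whose nearest neighbors are all far away, or whose neighbors disagree with its annotated tissue, may merit a closer look.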

Impact

By offering a single, validated latent space spanning three major public bulk expression compendia, the Flexynesis Tissue VAE reduces the need for ad hoc, dataset-specific embeddings and gives the bulk RNA-seq community a reusable representation analogous to those that foundation models have provided for single-cell data. Its transfer to pediatric tumors suggests utility beyond the adult-dominated training distribution, though performance on rarer tissues and on heavily perturbed disease states remains to be characterized more fully. Because the work is a recent preprint, long-term adoption is still emerging, but the public checkpoint, permissive code license, and live demo position it for immediate practical use.

