bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Flexynesis Tissue VAE

Max Delbrück Center for Molecular Medicine

A supervised variational autoencoder that learns a unified, tissue-aware latent representation of bulk RNA-seq, compressing 16,115 genes into 121 dimensions across 42 tissues.

Released: June 2026

The Flexynesis Tissue VAE is a supervised variational autoencoder that learns a unified, tissue-aware representation of bulk RNA-seq data. Despite the maturity of bulk transcriptomics, the field has lacked a single shared latent space that generalizes across the major public expression compendia. This model addresses that gap by training one encoder on harmonized data drawn from TCGA, GTEx, and ARCHS4, yielding compact embeddings that capture tissue identity while remaining robust to the technical heterogeneity that separates those resources.

Developed by Aakriti Pande, Bora Uyar, and Altuna Akalin at the Berlin Institute for Medical Systems Biology (BIMSB) of the Max Delbrück Center for Molecular Medicine, the work was posted to bioRxiv in June 2026. The model is presented in the paper "An atlas-scale generative model for unified representation learning of bulk RNA-seq data."

The name reflects the model's origin within the broader Flexynesis ecosystem, a deep-learning toolkit from the same lab for bulk multi-omics integration in precision oncology. The memory-safe HDF5 data loader built for this study was contributed upstream to Flexynesis. The tissue VAE itself, however, is a standalone, single-modality model focused on bulk RNA-seq representation, and is cataloged here as such.

#Key Features

  • Atlas-scale training: Learns from 118,263 bulk RNA-seq samples spanning TCGA, GTEx, and ARCHS4, harmonized to a common gene space and tissue ontology.
  • Compact latent space: Compresses 16,115 genes into a 121-dimensional representation, mapped to 42 UBERON tissue categories.
  • Supervised, denoising design: Tissue labels guide the encoder, and a denoising VAE variant improves robustness to technical noise across cohorts.
  • Strong tissue-of-origin accuracy: Reaches 94.9% balanced accuracy and a 96.2% weighted F1 on held-out tissue classification.
  • Independent validation: Generalizes to 734 pediatric TARGET tumor samples with 84.6% agreement against the expected tissue of origin.
  • Interactive and reusable: Ships with a web demo and a fixed checkpoint that loads without retraining.
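The two headline metrics in the list above weight classes differently: balanced accuracy is the unweighted mean of per-class recall, while weighted F1 averages per-class F1 by class support. A minimal sketch in plain Python follows; the function names and the toy labels are invented for illustration and are not taken from the paper or its code.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare tissues count as much as common ones."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator >= n > 0
        score += (n / total) * f1
    return score


# Toy labels, made up for illustration (not data from the paper).
y_true = ["liver", "liver", "liver", "liver", "lung", "lung"]
y_pred = ["liver", "liver", "liver", "lung", "lung", "lung"]

print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.3f}")
print(f"weighted F1:       {weighted_f1(y_true, y_pred):.3f}")
```

On an imbalanced tissue distribution the two numbers can diverge, which is presumably why both are reported.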

#Technical Details

The architecture is a supervised variational autoencoder, with a denoising variant for added robustness, trained on an HDF5 compendium split into 118,263 training and 28,274 test samples. Inputs are the 16,115 genes shared across the source datasets; the encoder produces a 121-dimensional latent vector that is jointly optimized for reconstruction and for predicting one of 42 tissue categories. On held-out data the model attains 94.9% balanced accuracy and 96.2% weighted F1 for tissue-of-origin classification. External validation on 734 pediatric tumor samples from the TARGET project yields 84.6% agreement with the expected tissue of origin, suggesting that the representation transfers to an out-of-distribution cohort distinct from the adult tissues that dominate training. The code is MIT-licensed; the trained weights (vae_tissue.final_model.pth), the HDF5 compendium, and precomputed embeddings are released on Zenodo under CC-BY-4.0.
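The joint objective described above can be sketched as a single forward pass in NumPy. This is an illustration under stated assumptions, not the released implementation: the single linear layers, the Gaussian input corruption, the squared-error reconstruction term, the classifier reading from the sampled latent vector, and the weights `beta` and `alpha` are all hypothetical choices. Only the three dimensions (16,115 genes, 121 latent units, 42 tissues) come from the text.

```python
import numpy as np

N_GENES, N_LATENT, N_TISSUES = 16115, 121, 42  # sizes quoted in the text


def init_params(rng, scale=0.01):
    """Random single-layer weights (assumption: real layer structure is not given here)."""
    return {
        "W_mu": rng.normal(0, scale, (N_GENES, N_LATENT)),
        "W_logvar": rng.normal(0, scale, (N_GENES, N_LATENT)),
        "W_dec": rng.normal(0, scale, (N_LATENT, N_GENES)),
        "W_cls": rng.normal(0, scale, (N_LATENT, N_TISSUES)),
    }


def supervised_vae_loss(params, x, y, rng, noise_std=0.1, beta=1.0, alpha=1.0):
    """Return (loss, z) for one batch; hyperparameters are hypothetical."""
    # Denoising: encode a corrupted input, but reconstruct the clean profile.
    x_noisy = x + rng.normal(0, noise_std, x.shape)
    mu = x_noisy @ params["W_mu"]
    logvar = x_noisy @ params["W_logvar"]
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
    # Reconstruction term: squared error against the clean input.
    x_hat = z @ params["W_dec"]
    recon = np.mean(np.sum((x_hat - x) ** 2, axis=1))
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = np.mean(-0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1))
    # Supervised term: softmax cross-entropy over tissue categories.
    logits = z @ params["W_cls"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(y)), y])
    return recon + beta * kl + alpha * ce, z


rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.normal(size=(8, N_GENES))         # synthetic stand-in for expression values
y = rng.integers(0, N_TISSUES, size=8)    # synthetic tissue labels
loss, z = supervised_vae_loss(params, x, y, rng)
print(z.shape, float(loss))
```

In training, the gradient of this combined loss would update the encoder so that the 121-dimensional vector both reconstructs expression and separates tissues; an autodiff framework would replace the hand-written forward pass.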

#Applications

The model provides ready-to-use embeddings for bulk RNA-seq, supporting tissue-of-origin prediction, sample quality control, and the placement of new samples within a shared reference space. Because the latent representation is compact and tissue-aware, it can serve as a feature extractor for downstream classifiers or as a screening tool for cancers of unknown primary, where identifying the likely originating tissue informs diagnosis. An interactive web application lets researchers embed their own expression profiles without configuring the training pipeline, lowering the barrier for wet-lab groups.
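One simple way to place new samples within a shared reference space is a nearest-centroid lookup over precomputed embeddings. The sketch below assumes embeddings are already available as NumPy arrays; the function names, the Euclidean metric, and the synthetic two-tissue data are illustrative assumptions, and the model's own classifier head is not reproduced here.

```python
import numpy as np


def tissue_centroids(ref_emb, ref_labels):
    """Mean embedding per tissue label in the reference set."""
    labels = np.unique(ref_labels)
    cents = np.stack([ref_emb[ref_labels == t].mean(axis=0) for t in labels])
    return labels, cents


def place_samples(new_emb, labels, cents):
    """Assign each new sample to its nearest centroid; also return that distance."""
    # Pairwise Euclidean distances, shape (n_new, n_tissues).
    d = np.linalg.norm(new_emb[:, None, :] - cents[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    return labels[nearest], d[np.arange(len(new_emb)), nearest]


rng = np.random.default_rng(0)
# Synthetic stand-in for reference embeddings: two well-separated tissues in 121-d.
ref_emb = np.vstack([
    rng.normal(0.0, 1.0, (50, 121)),
    rng.normal(5.0, 1.0, (50, 121)),
])
ref_labels = np.array(["liver"] * 50 + ["lung"] * 50)
labels, cents = tissue_centroids(ref_emb, ref_labels)

# Two synthetic query samples, one drawn near each tissue cluster.
new_emb = np.vstack([
    rng.normal(0.0, 1.0, (1, 121)),
    rng.normal(5.0, 1.0, (1, 121)),
])
pred, dist = place_samples(new_emb, labels, cents)
print(list(pred), dist.round(2))
```

The returned distance doubles as a rough quality-control signal: a sample far from every centroid may be mislabeled, degraded, or from a tissue absent in the reference, though any threshold would need to be calibrated on real embeddings.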

#Impact

By offering a single, validated latent space spanning three widely used bulk expression resources (TCGA, GTEx, and ARCHS4), the Flexynesis Tissue VAE reduces the need for ad hoc, dataset-specific embeddings. It gives the bulk RNA-seq community a reusable representation analogous to those that foundation models have provided for single-cell data. Its transfer to pediatric tumors suggests utility beyond the adult-dominated training distribution, though performance on rarer tissues and on heavily perturbed disease states remains to be characterized more fully. Because the work is a recent preprint, adoption cannot yet be assessed, but the public checkpoint, permissive code license, and live demo make it immediately usable.

Citation

Pande, A., et al. (2026) An atlas-scale generative model for unified representation learning of bulk RNA-seq data. bioRxiv.

DOI: 10.64898/2026.06.18.733198

Citations

Total Citations: 0
Influential: 0
References: 25

GitHub

Stars: 1
Forks: 0
Open Issues: 0
Contributors: 2
Last Push: 6d ago
Language: Python
License: MIT

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall: 84 (Open)
Usability (can I run it?): 86
Reproducibility (can I retrain it?): 95
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

gene_expression, generative, representation_learning, supervised, tissue_classification, transcriptomics, variational_autoencoder

Resources

GitHub Repository · Research Paper · Demo · Dataset