A generative diffusion-transformer foundation model that embeds H&E histology, RNA profiles, and clinical text in a shared latent space for zero-shot cross-modal synthesis.
MuPD (Multimodal Pathology Diffusion) is a generative foundation model for computational pathology that embeds hematoxylin and eosin (H&E)-stained histology, molecular RNA profiles, and clinical text into a shared latent space through a diffusion transformer. Rather than treating each modality in isolation, MuPD learns the joint distribution across them. This enables generation of one modality conditioned on any combination of the others, including cases where measurements are missing or expensive to acquire.
The model addresses a persistent obstacle in multimodal medical data: real-world pathology datasets are frequently incomplete, with morphology, transcriptomics, and annotations rarely all available for the same sample. By unifying these modalities in a single generative framework, MuPD can synthesize realistic histology from text prompts or RNA profiles, perform virtual staining, and augment scarce datasets with biologically plausible samples. This positions it alongside generative pathology models such as STMDiT while extending conditioning to text and molecular signals at foundation-model scale.
MuPD was developed by the Ruijiang Li lab at Stanford University, with first author Jinxi Xiang, and released as an arXiv preprint in April 2026. It is a companion to STORM, a representation-learning foundation model from the same group, with MuPD focusing on the generative side of the spatial-transcriptomics-and-histology problem.
MuPD is a diffusion transformer pretrained on a large multimodal corpus spanning 34 human organs: approximately 100 million H&E histology image patches, 1.6 million text-histology pairs, and 10.8 million RNA-histology pairs. The three modalities are projected into a shared latent space, over which the diffusion process is learned, enabling flexible conditional generation. On generation benchmarks, the authors report a 50% reduction in Fréchet Inception Distance (FID, where lower is better) for text-conditioned and image-to-image generation relative to specialized single-task models, and a 23% FID reduction for RNA-conditioned histology generation. In downstream evaluations, they report that synthetic augmentation improves few-shot classification by 47% and that virtual staining improves marker correlation by 37%.
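To make the FID figures concrete, the following is a minimal sketch of how a Fréchet distance between two feature sets is typically computed, together with the relative-reduction arithmetic behind a statement such as "50% reduction". It is not the authors' evaluation code. The function names `frechet_distance` and `relative_reduction` are illustrative, the random arrays stand in for features that would normally come from a pretrained Inception network, and the numbers passed to `relative_reduction` are made up for the example rather than taken from the paper.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_features) with n_features >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (clip guards against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline` (0.5 means 50%)."""
    return (baseline - new) / baseline


# Stand-in features (assumption: real use would extract Inception features).
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 4))
shifted = real + 0.5  # same covariance, mean moved by 0.5 in each dimension

fid_same = frechet_distance(real, real)        # close to 0
fid_shifted = frechet_distance(real, shifted)  # close to 4 * 0.5**2 = 1.0

# Illustrative numbers only: a baseline FID of 20 reduced to 10 is a 50% drop.
example_drop = relative_reduction(20.0, 10.0)
```

Because the distance is computed on fitted Gaussians, a pure mean shift contributes exactly its squared norm, which is why the shifted case lands near 1.0. Percentage reductions in FID are relative to the baseline score, so their practical size depends on how large that baseline was.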
MuPD supports computational pathology and spatial biology workflows where multimodal data is incomplete. Researchers can generate synthetic paired histology-transcriptomics-text data for augmentation, perform virtual staining to predict molecular markers from morphology, and prototype models on rare disease cohorts where labeled examples are scarce. The text-conditioning pathway allows histology synthesis directly from descriptive prompts, useful for exploratory studies and benchmark construction, while RNA-conditioned generation links transcriptional state to tissue appearance.
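The idea of conditioning on whichever modalities happen to be present can be sketched as a small planning step that runs before any generative model is called. Everything here is hypothetical: `Sample`, `plan_generation`, and the dictionary layout are invented for illustration and do not reflect a published MuPD interface, since code and weights were not public at release.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# The three modalities named in the text (identifiers are illustrative).
MODALITIES = ("histology", "rna", "text")


@dataclass
class Sample:
    """One tissue sample in which any modality may be missing (None)."""
    histology: Optional[Sequence[float]] = None
    rna: Optional[Sequence[float]] = None
    text: Optional[str] = None


def plan_generation(sample: Sample, target: str) -> dict:
    """Decide which available modalities can condition generation of `target`."""
    if target not in MODALITIES:
        raise ValueError(f"unknown modality: {target}")
    # Condition on every modality that is present, excluding the target itself.
    available = [
        m for m in MODALITIES
        if m != target and getattr(sample, m) is not None
    ]
    return {
        "target": target,
        "condition_on": available,
        # With nothing to condition on, generation would be unconditional.
        "unconditional": not available,
    }


# A sample with RNA and a text description but no histology image.
incomplete = Sample(rna=[0.1, 0.7, 0.2], text="moderately differentiated adenocarcinoma")
plan = plan_generation(incomplete, target="histology")

# A sample with only text: RNA would be generated from the description alone.
text_only_plan = plan_generation(Sample(text="normal colonic mucosa"), target="rna")
```

In a real pipeline the resulting plan would be handed to the model's conditioning mechanism. How MuPD represents absent modalities internally is not described in this summary, so the sketch stops at the planning step.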
MuPD demonstrates that a single diffusion-transformer foundation model can unify histology, transcriptomics, and clinical text for generative tasks across dozens of organs, extending generative pathology beyond unconditional or label-conditioned image synthesis toward fully cross-modal generation. The reported improvements in FID, few-shot classification, and virtual staining suggest practical value for data augmentation and modality imputation in settings where complete multimodal measurements are unavailable. These claims come from a recently released arXiv preprint and await peer review and independent validation, and code and model weights were not yet public at release. Even so, the work points toward generative tools that fill gaps in real-world pathology datasets.