Peking University / Macau University of Science and Technology / Sun Yat-sen University
A self-improving text-to-image diffusion foundation model that generates synthetic medical images across multiple modalities and organs to augment downstream clinical AI tasks.
MINIM (Medical Image-text geNeratIve Model) is a generative foundation model that synthesizes realistic medical images of multiple organs across several imaging modalities directly from free-text instructions. Rather than predicting structure or diagnosis from an existing scan, MINIM tackles the inverse problem: producing high-fidelity synthetic images on demand to expand scarce, privacy-constrained, or imbalanced medical imaging datasets. It was developed by Jinzhuo Wang and colleagues at Peking University, Macau University of Science and Technology, Sun Yat-sen University, and collaborating institutions, and published in Nature Medicine in February 2025.
Medical AI development is chronically bottlenecked by limited access to large, well-annotated, and demographically diverse image corpora. MINIM addresses this by acting as a single text-conditioned generator spanning optical coherence tomography (OCT), fundus photography, chest X-ray, chest CT, and brain MRI, with breast MRI added through transfer learning. A clinician can prompt it with a textual description of the desired anatomy and finding, and the model returns a corresponding synthetic image.
A defining feature is its self-improving training loop: after initial diffusion pretraining, the model is refined with reinforcement learning from radiologist feedback, progressively raising the realism and clinical plausibility of its outputs. The authors report that following this fine-tuning, 91% of MINIM-generated OCT images received the highest quality rating from clinicians.
MINIM is a latent text-to-image diffusion model built on a Stable Diffusion-style framework, using a U-Net denoiser with cross-attention to condition image generation on text. Modality labels and textual descriptions are concatenated and encoded with a BERT tokenizer to form the conditioning signal, and images are produced by iteratively reversing a learned Gaussian noising process. The training corpus pairs medical images with textual descriptions spanning the supported modalities and organs. After supervised diffusion training, a two-stage reinforcement-learning procedure incorporates radiologist feedback to align generations with expert judgments of clinical quality.
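The conditioning and noising steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The prompt format in `build_prompt`, the linear beta schedule (the common DDPM default), the step count of 1000, and all function names are assumptions made for illustration. The sketch shows only the closed-form forward noising step and its algebraic inversion given a noise estimate. In a real latent diffusion model the U-Net predicts that noise from the noisy latent and the text embedding, and sampling repeats the reverse step many times.

```python
import math
import random

def build_prompt(modality: str, description: str) -> str:
    """Concatenate a modality label and a free-text description (hypothetical format)."""
    return f"{modality}: {description}"

def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Per-step noise variances; a linear schedule is assumed, not taken from the source."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t: signal fraction kept at step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise

def recover_x0(xt, t, alpha_bars, predicted_noise):
    """Invert the forward step given a noise estimate (the quantity a denoiser predicts)."""
    a = alpha_bars[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, predicted_noise)]

# Toy usage: a 4-element "latent" stands in for an encoded image.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
prompt = build_prompt("OCT", "macular hole")
latent = [0.5, -1.0, 0.25, 2.0]
noisy, eps = add_noise(latent, 500, alpha_bars, rng)
restored = recover_x0(noisy, 500, alpha_bars, eps)  # exact only because eps is the true noise
```

With a perfect noise estimate the original latent is recovered exactly. Training a denoiser amounts to making its prediction close to `eps` for every step and every prompt.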
Image quality and utility were evaluated with objective metrics, namely Fréchet Inception Distance (FID), Inception Score (IS), multi-scale structural similarity (MS-SSIM), classification accuracy score, and image-image and image-text retrieval, alongside blinded clinician review. On downstream classification, augmenting real data with MINIM-generated images raised the accuracy of EGFR-mutation prediction on lung CT from 81.5% to 95.4% (at a 5:1 synthetic-to-real ratio) and of HER2-status prediction on breast MRI from 79.2% to 94.0%.
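To make the augmentation ratio and the reported gains concrete, the following sketch mixes synthetic samples into a real training set at a chosen ratio and works through the arithmetic on the accuracy figures quoted above. The helper names, the placeholder data, and the sampling strategy are hypothetical; the source does not describe how the authors assembled their augmented sets, and it states the 5:1 ratio only for the EGFR experiment.

```python
import random

def mix_real_and_synthetic(real, synthetic, ratio, seed=0):
    """Return a shuffled training list with `ratio` synthetic items per real item."""
    needed = int(round(ratio * len(real)))
    if needed > len(synthetic):
        raise ValueError("not enough synthetic samples for the requested ratio")
    rng = random.Random(seed)
    mixed = list(real) + rng.sample(list(synthetic), needed)
    rng.shuffle(mixed)
    return mixed

def accuracy_gain(before_pct, after_pct):
    """Absolute gain in percentage points and relative reduction in error rate."""
    gain = after_pct - before_pct
    error_reduction = gain / (100.0 - before_pct)
    return round(gain, 1), round(error_reduction, 3)

# Placeholder records stand in for (image, label) pairs.
real = [("real", i) for i in range(20)]
synthetic = [("synthetic", i) for i in range(200)]
train = mix_real_and_synthetic(real, synthetic, ratio=5)  # 20 real + 100 synthetic

egfr = accuracy_gain(81.5, 95.4)  # (13.9, 0.751)
her2 = accuracy_gain(79.2, 94.0)  # (14.8, 0.712)
```

Read this way, the reported figures correspond to gains of 13.9 and 14.8 percentage points, or roughly a 75% and 71% reduction in classification error respectively.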
MINIM is intended for medical-AI researchers and clinical informaticians who need to enlarge or rebalance training datasets without collecting and de-identifying additional patient scans. Synthetic images can augment diagnostic classifiers, seed self-supervised pretraining, and support automated radiology report generation. The reported HER2 and EGFR use cases illustrate how synthetic augmentation can sharpen biomarker and mutation prediction from routine imaging, which is relevant to precision-oncology workflows where labeled cases are scarce. The released code allows researchers to reproduce results and adapt the generator to new modalities via transfer learning.
MINIM demonstrates that a single text-conditioned generative model, refined with expert reinforcement signals, can produce synthetic medical images useful enough to materially improve downstream clinical tasks. By framing data scarcity as a generation problem and reporting double-digit accuracy gains on the lung CT and breast MRI prediction tasks, it strengthens the case for synthetic data as a practical lever in medical AI. Limitations remain: generated images can encode artifacts or biases from the training distribution, and synthetic augmentation must be validated against real held-out data before clinical use. In addition, the public release distributes weights via a third-party file host rather than a versioned model hub, and no formal model card or datasheet accompanies the code. As with all generative medical imaging, outputs require expert oversight, and the model is not intended for clinical decision-making without further validation.
Wang, J., et al. (2024) Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nature Medicine.
DOI: 10.1038/s41591-024-03359-y