Zhejiang University / University of Electronic Science and Technology of China / Alibaba / Hong Kong University of Science and Technology / National University of Singapore
Medical large vision-language model unifying image comprehension and generation in one autoregressive framework via heterogeneous LoRA knowledge adaptation.
HealthGPT is a medical large vision-language model (Med-LVLM) that unifies two capabilities usually handled by separate systems: understanding medical images (comprehension) and producing them (generation). Most medical multimodal models specialize in one or the other — answering questions about a chest X-ray, or synthesizing an image — because comprehension and generation place conflicting demands on the same network. HealthGPT addresses this tension within a single autoregressive transformer, allowing one model to both reason about and generate medical imagery across modalities such as CT, MRI, X-ray, and microscopy.
The central idea is Heterogeneous Low-Rank Adaptation (H-LoRA), which decouples the knowledge required for comprehension from that required for generation into separate low-rank adapter "plugins" attached to a frozen pre-trained large language model. This prevents the two task families from interfering with each other during training while keeping the parameter footprint small. A hierarchical visual perception module and a three-stage learning strategy progressively inject heterogeneous knowledge into the base LLM.
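The decoupling idea can be illustrated with a minimal sketch of task-routed low-rank adapters on a single frozen linear layer. This is a simplification under stated assumptions, not the paper's H-LoRA implementation: the class name `TaskRoutedLoRALinear`, the task labels, and the toy dimensions are all hypothetical, and pure Python lists stand in for tensors.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix used for the frozen weight and the A factors."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class TaskRoutedLoRALinear:
    """Frozen linear layer plus one low-rank adapter per task (hypothetical sketch)."""

    def __init__(self, d_in, d_out, rank, tasks, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.weight = random_matrix(d_out, d_in, rng)  # frozen base weight W
        self.scale = alpha / rank
        # Each task owns its own (A, B) pair. B starts at zero, so every
        # adapter is a no-op until it is trained.
        self.adapters = {
            task: {
                "A": random_matrix(rank, d_in, rng),        # rank x d_in
                "B": [[0.0] * rank for _ in range(d_out)],  # d_out x rank
            }
            for task in tasks
        }

    def forward(self, x, task):
        """Compute W x + scale * B_task (A_task x) for the selected task."""
        adapter = self.adapters[task]
        base = matvec(self.weight, x)
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_parameter_count(self, task):
        """Number of adapter parameters that would be updated for one task."""
        adapter = self.adapters[task]
        return sum(len(r) for r in adapter["A"]) + sum(len(r) for r in adapter["B"])

layer = TaskRoutedLoRALinear(d_in=8, d_out=8, rank=2,
                             tasks=("comprehension", "generation"))
x = [1.0] * 8

# Stand-in for a training step that touches only the generation adapter.
layer.adapters["generation"]["B"][0][0] = 0.5

out_comp = layer.forward(x, "comprehension")  # unchanged by the update above
out_gen = layer.forward(x, "generation")      # reflects the generation update
```

In this toy setup, updating the generation adapter leaves the comprehension path identical to the frozen base layer, which is the isolation property the paragraph above describes. Each adapter here holds 32 trainable values against 64 in the frozen weight; the gap widens quickly at realistic layer sizes.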
Developed by researchers at Zhejiang University, the University of Electronic Science and Technology of China, Alibaba, the Hong Kong University of Science and Technology, and the National University of Singapore, HealthGPT was released in February 2025 and accepted as a Spotlight paper at ICML 2025. Weights and the supporting dataset are distributed under the Apache-2.0 license.
HealthGPT builds on a pre-trained large language model: HealthGPT-M3 uses Microsoft Phi-3-mini, while the larger HealthGPT-L14 uses Phi-4. The LLM is paired with a CLIP-derived vision encoder and a VQ-based visual tokenizer that lets the autoregressive backbone emit image tokens for generation. Rather than fully fine-tuning the backbone, H-LoRA attaches task-specific low-rank adapters and routes comprehension and generation through distinct adapter sets. Training follows a three-stage curriculum that first aligns visual features, then specializes the heterogeneous plugins, and finally performs instruction tuning on the VL-Health corpus. Across medical multimodal comprehension benchmarks and generation tasks such as cross-modality synthesis and image super-resolution, HealthGPT reported performance competitive with or exceeding both unified vision models and medical-specialized baselines while using substantially fewer trainable parameters than full fine-tuning.
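To make the role of a VQ-based visual tokenizer concrete, the sketch below shows generic nearest-neighbor vector quantization: continuous patch features are mapped to discrete codebook indices, which an autoregressive model can then predict like ordinary tokens. The tiny 2-D codebook, the feature values, and the function names are illustrative assumptions and do not reflect HealthGPT's actual tokenizer or codebook size.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(patch_features, codebook):
    """Map each continuous patch feature to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)),
            key=lambda i: squared_distance(feature, codebook[i]))
        for feature in patch_features
    ]

def dequantize(token_ids, codebook):
    """Look up the codebook vector for each discrete image token."""
    return [codebook[i] for i in token_ids]

# Toy 4-entry codebook with 2-D vectors (real codebooks are far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical encoder outputs for three image patches.
patch_features = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

token_ids = quantize(patch_features, codebook)    # discrete image tokens
reconstructed = dequantize(token_ids, codebook)   # inputs to an image decoder
```

Generation runs this in reverse: the language model emits a sequence of token ids, the codebook lookup turns them back into vectors, and a decoder (not shown) renders pixels from those vectors.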
HealthGPT targets clinical and research workflows that mix interpretation and synthesis: answering questions about radiology and pathology images, drafting report-style descriptions, converting between imaging modalities (for example CT-to-MRI), enhancing image resolution, and reconstructing degraded scans. Because a single model covers both directions, it can streamline pipelines that would otherwise chain a comprehension model and a generative model together, benefiting radiologists, pathologists, and medical-imaging researchers exploring assistive diagnosis and data augmentation.
HealthGPT demonstrates that comprehension and generation can coexist in one medical foundation model without the destructive interference that has limited prior unified approaches, and it offers H-LoRA as a reusable recipe for adapting general LLMs to heterogeneous multimodal tasks. Its ICML 2025 Spotlight selection and fully open release (code, Apache-2.0 weights in multiple sizes, and the VL-Health dataset) have made it a reference point for unified medical vision-language modeling. It has since been extended by a follow-up HealthGPT-Pro series that adds text, 2D, and 3D volumetric support. As with all Med-LVLMs, generated images and comprehension outputs require expert validation before any clinical use.
Lin, T., et al. (2025) HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation. International Conference on Machine Learning.
DOI: 10.48550/arXiv.2502.09838