HealthGPT

Zhejiang University / University of Electronic Science and Technology of China / Alibaba / Hong Kong University of Science and Technology / National University of Singapore

Medical vision-language model that unifies image comprehension and generation in one autoregressive transformer via heterogeneous LoRA adapters.

Released: February 2025

HealthGPT is a medical large vision-language model (Med-LVLM) that unifies two capabilities usually handled by separate systems: understanding medical images (comprehension) and producing them (generation). Most medical multimodal models specialize in one or the other (answering questions about a chest X-ray, or synthesizing an image) because comprehension and generation place conflicting demands on the same network. HealthGPT addresses this tension within a single autoregressive transformer, allowing one model to both reason about and generate medical imagery across modalities such as CT, MRI, X-ray, and microscopy.

The central idea is Heterogeneous Low-Rank Adaptation (H-LoRA), which decouples the knowledge required for comprehension from that required for generation into separate low-rank adapter "plugins" attached to a frozen pre-trained large language model. This prevents the two task families from interfering with each other during training while keeping the parameter footprint small. A hierarchical visual perception module and a three-stage learning strategy progressively inject heterogeneous knowledge into the base LLM.
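To make the adapter idea concrete, here is a minimal pure-Python sketch of a frozen linear layer with one low-rank adapter per task family. It is an illustration of task-routed LoRA in general, not HealthGPT's actual H-LoRA implementation: the class name `TaskRoutedLoRALinear`, the `alpha / rank` scaling, the initialization, and the toy dimensions are all assumptions made for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class TaskRoutedLoRALinear:
    """Frozen linear layer plus one low-rank adapter per task (hypothetical sketch)."""

    def __init__(self, d_in, d_out, rank,
                 tasks=("comprehension", "generation"), alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.scale = alpha / rank  # common LoRA scaling; an assumption here
        # Base weight W (d_out x d_in): stays frozen, shared by all tasks.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        self.adapters = {}
        for task in tasks:
            # A (rank x d_in) gets a small random init; B (d_out x rank) starts
            # at zero, so each adapter is initially a no-op.
            A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
            B = [[0.0] * rank for _ in range(d_out)]
            self.adapters[task] = (A, B)

    def forward(self, x, task):
        """Compute W x + scale * B_task (A_task x) using only the chosen adapter."""
        A, B = self.adapters[task]
        base = matvec(self.W, x)
        delta = matvec(B, matvec(A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self, task):
        """Number of adapter parameters for one task: rank * (d_in + d_out)."""
        A, B = self.adapters[task]
        return len(A) * len(A[0]) + len(B) * len(B[0])


layer = TaskRoutedLoRALinear(d_in=8, d_out=8, rank=2)
x = [1.0] * 8
baseline = layer.forward(x, "comprehension")

# Updating only the generation adapter leaves comprehension outputs untouched.
layer.adapters["generation"][1][0][0] = 1.0
unchanged = layer.forward(x, "comprehension") == baseline
adapter_params = layer.trainable_params("generation")  # 2 * (8 + 8) = 32
full_params = 8 * 8                                     # 64 for the frozen W
```

Because each task reads and writes only its own `(A, B)` pair, an update made for generation cannot alter the comprehension path, which is the isolation property the paragraph above describes. The parameter count also shows why the footprint stays small: an adapter costs `rank * (d_in + d_out)` values rather than `d_in * d_out`.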

Developed by researchers at Zhejiang University, the University of Electronic Science and Technology of China, Alibaba, the Hong Kong University of Science and Technology, and the National University of Singapore, HealthGPT was released in February 2025 and accepted as a Spotlight paper at ICML 2025. Weights and the supporting dataset are distributed under the Apache-2.0 license.

Key Features

Unified comprehension and generation: A single autoregressive model handles 7 types of medical comprehension tasks (e.g., visual question answering, report reasoning) and 5 types of generation tasks (e.g., modality conversion, super-resolution, reconstruction).
H-LoRA knowledge adaptation: Heterogeneous low-rank adapters isolate comprehension and generation knowledge into separate plugins, resolving the data-conflict problem that degrades jointly trained unified models.
Hierarchical visual perception: Visual features are organized across levels of abstraction so the model can serve both high-level semantic understanding and fine-grained pixel-level generation.
VL-Health training corpus: A purpose-built medical vision-language dataset spanning comprehension and generation samples across multiple imaging modalities, released alongside the model.
Open weights and multiple sizes: Released in Phi-3-mini (HealthGPT-M3) and Phi-4 (HealthGPT-L14) variants under Apache-2.0, with public code and datasets.

Technical Details

HealthGPT builds on a pre-trained large language model: HealthGPT-M3 uses Microsoft Phi-3-mini, while the larger HealthGPT-L14 uses Phi-4. The LLM is paired with a CLIP-derived vision encoder and a VQ-based visual tokenizer that lets the autoregressive backbone emit image tokens for generation. Rather than full fine-tuning, H-LoRA attaches task-specific low-rank adapters and routes comprehension and generation through distinct adapter sets. Training follows a three-stage curriculum that first aligns visual features, then specializes the heterogeneous plugins, and finally performs instruction tuning on the VL-Health corpus. Across medical multimodal comprehension benchmarks and generation tasks such as cross-modality synthesis and image super-resolution, HealthGPT reported performance competitive with or exceeding both unified vision models and medical-specialized baselines while using substantially fewer trainable parameters than full fine-tuning.
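The role of a VQ-based visual tokenizer can be illustrated with a small sketch of generic vector quantization: each patch feature is replaced by the index of its nearest codebook entry, giving discrete tokens an autoregressive model can predict, and the indices are mapped back to vectors for decoding. The codebook, the patch features, and the function names `quantize` and `dequantize` are toy assumptions for illustration; the source does not specify HealthGPT's tokenizer at this level of detail.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Nearness is measured by squared Euclidean distance.
    """
    tokens = []
    for feature in features:
        distances = [
            sum((a - b) ** 2 for a, b in zip(feature, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens, codebook):
    """Look up the codebook vector for each token index."""
    return [list(codebook[t]) for t in tokens]


# Toy 4-entry codebook over 2-D features (a learned codebook would be far larger).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical patch features, e.g. produced by an image encoder.
patches = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.7, 0.9]]

tokens = quantize(patches, codebook)         # [0, 1, 2, 3]
reconstructed = dequantize(tokens, codebook)
```

In a generation setting, the language model would emit indices like `tokens` one at a time, and a separate decoder would turn the looked-up vectors back into pixels; that decoder is omitted here.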

Applications

HealthGPT targets clinical and research workflows that mix interpretation and synthesis: answering questions about radiology and pathology images, drafting report-style descriptions, converting between imaging modalities (for example CT-to-MRI), enhancing image resolution, and reconstructing degraded scans. Because a single model covers both directions, it can streamline pipelines that would otherwise chain a comprehension model and a generative model together, benefiting radiologists, pathologists, and medical-imaging researchers exploring assistive diagnosis and data augmentation.

Impact

HealthGPT demonstrates that comprehension and generation can coexist in one medical foundation model without the destructive interference that has limited prior unified approaches, and it offers H-LoRA as a reusable recipe for adapting general LLMs to heterogeneous multimodal tasks. Its ICML 2025 Spotlight selection and fully open release (code, Apache-2.0 weights in multiple sizes, and the VL-Health dataset) have made it a reference point for unified medical vision-language modeling. It has since been extended by a follow-up HealthGPT-Pro series adding text, 2D, and 3D volumetric support. As with all Med-LVLMs, generated images and comprehension outputs require expert validation before any clinical use.

Citation

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Preprint

Lin, T., et al. (2025) HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation. International Conference on Machine Learning.

DOI: 10.48550/arXiv.2502.09838

Recent citations

Papers that recently cited this model.

The Path to Self-Evolving Clinical Systems: Scaling Medical Agents from Assistance to Autonomy
Chunzheng Zhu, Lei Tian, Bohan Tan, et al.
Jul 2026
0
MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models
Hyunjae Kim, Dain Kim, Pan Xiao, et al.
Jul 2026
0
Towards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports
Bing Yan, Chunlei Li, Jingliang Hu, et al.
Jul 2026
0

Top citations

The most-cited papers that cite this model.

Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Yuxiang Lai, Jike Zhong, Ming Li, et al.
IEEE Transactions on Medical Imaging · Mar 2025
126
Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Yang Zhou, Sunzhu Li, Shunyu Liu, et al.
arXiv.org · Aug 2025
33
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity
Yuqian Yuan, Wenqiao Zhang, Xin Li, et al.
arXiv.org · Oct 2025
16
EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an Egocentric World?
Yuqian Yuan, Ronghao Dang, Long Li, et al.
arXiv.org · Jun 2025
16
Med-CMR: A Fine-Grained Benchmark Integrating Visual Evidence and Clinical Logic for Medical Complex Multimodal Reasoning
Haozhe Gong, Xiaozhong Ji, Yuansen Liu, et al.
arXiv.org · Nov 2025
12

Citations

Total Citations: 115

Influential: 11

References: 47

GitHub

Stars: 1.6K

Forks: 241

Open Issues: 12

Contributors: 5

Last Push: 2mo ago

Language: Python

License: Apache-2.0

HuggingFace

Downloads: 38

Likes: 9

Last Modified: 1y ago

Pipeline: any-to-any

Fields of citing research

Computer Science100%
Medicine79%
Engineering10%
Psychology2%
Law1%
Environmental Science1%
Biology1%
Linguistics1%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

68 (Partial)

Usability (can I run it?): 94

Reproducibility (can I retrain it?): 32

Model Openness Framework: Unclassified

No formal model card / data card

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset

