bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
Pathology foundation models
Pathology, Imaging

HealthGPT

Zhejiang University / University of Electronic Science and Technology of China / Alibaba / Hong Kong University of Science and Technology / National University of Singapore

Medical large vision-language model unifying image comprehension and generation in one autoregressive framework via heterogeneous LoRA knowledge adaptation.

Released: February 2025

HealthGPT is a medical large vision-language model (Med-LVLM) that unifies two capabilities usually handled by separate systems: understanding medical images (comprehension) and producing them (generation). Most medical multimodal models specialize in one or the other, for example answering questions about a chest X-ray or synthesizing an image, because comprehension and generation place conflicting demands on the same network. HealthGPT addresses this tension within a single autoregressive transformer, so one model can both reason about and generate medical imagery across modalities such as CT, MRI, X-ray, and microscopy.

The central idea is Heterogeneous Low-Rank Adaptation (H-LoRA), which decouples the knowledge required for comprehension from that required for generation into separate low-rank adapter "plugins" attached to a frozen pre-trained large language model. This prevents the two task families from interfering with each other during training while keeping the parameter footprint small. A hierarchical visual perception module and a three-stage learning strategy progressively inject heterogeneous knowledge into the base LLM.
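
To make the "separate adapter plugins on a frozen model" idea concrete, here is a minimal sketch of a single linear layer with one low-rank adapter per task family. It is an assumption-laden illustration, not the authors' H-LoRA: it uses plain LoRA with hard routing by task name, and the class name `TaskRoutedLoRALinear`, the sizes, and the `alpha` scaling are all hypothetical.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix, used for the frozen weight and down-projections."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TaskRoutedLoRALinear:
    """Frozen weight W plus one low-rank adapter (A, B) per task family.

    Illustrative only: plain LoRA with hard task routing, standing in for
    the paper's H-LoRA, whose exact routing is not described here.
    """

    def __init__(self, d_in, d_out, rank, tasks, alpha=1.0, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)  # frozen base weight, never updated
        self.scale = alpha / rank
        self.adapters = {
            t: {
                "A": rand_matrix(rank, d_in, rng),           # down-projection (rank x d_in)
                "B": [[0.0] * rank for _ in range(d_out)],   # up-projection, zero-initialized
            }
            for t in tasks
        }

    def forward(self, x, task):
        base = matvec(self.W, x)
        adapter = self.adapters[task]  # only the selected task's adapter is used
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = TaskRoutedLoRALinear(d_in=4, d_out=3, rank=2,
                             tasks=["comprehension", "generation"])
x = [1.0, 0.0, -1.0, 0.5]
before = layer.forward(x, "comprehension")

# Simulate training only the generation adapter.
layer.adapters["generation"]["B"] = [[1.0, 1.0] for _ in range(3)]
after = layer.forward(x, "comprehension")
print(before == after)  # the comprehension path is untouched
```

Because each task family reads only its own `A` and `B` matrices, an update to the generation adapter cannot change comprehension outputs, which is the kind of isolation the paragraph above attributes to H-LoRA.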

Developed by researchers at Zhejiang University, the University of Electronic Science and Technology of China, Alibaba, the Hong Kong University of Science and Technology, and the National University of Singapore, HealthGPT was released in February 2025 and accepted as a Spotlight paper at ICML 2025. Weights and the supporting dataset are distributed under the Apache-2.0 license.

#Key Features

  • Unified comprehension and generation: A single autoregressive model handles 7 types of medical comprehension tasks (e.g., visual question answering, report reasoning) and 5 types of generation tasks (e.g., modality conversion, super-resolution, reconstruction).
  • H-LoRA knowledge adaptation: Heterogeneous low-rank adapters isolate comprehension and generation knowledge into separate plugins, resolving the data-conflict problem that degrades jointly trained unified models.
  • Hierarchical visual perception: Visual features are organized across levels of abstraction so the model can serve both high-level semantic understanding and fine-grained pixel-level generation.
  • VL-Health training corpus: A purpose-built medical vision-language dataset spanning comprehension and generation samples across multiple imaging modalities, released alongside the model.
  • Open weights and multiple sizes: Two variants are released under Apache-2.0 with public code and datasets: HealthGPT-M3, built on Phi-3-mini, and HealthGPT-L14, built on Phi-4.
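
The small parameter footprint of low-rank adapters follows from simple arithmetic: a rank-r adapter on a d_in × d_out weight trains r · (d_in + d_out) values instead of d_in · d_out. The sketch below works through one case; the hidden size, rank, and number of adapters are hypothetical values chosen for illustration, not figures from the HealthGPT configuration.

```python
def full_finetune_params(d_in, d_out):
    """Trainable values when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank, num_adapters=1):
    """Trainable values for low-rank adapters: A is rank x d_in, B is d_out x rank."""
    return num_adapters * rank * (d_in + d_out)


# Hypothetical sizes, for illustration only.
d_in = d_out = 3072
rank = 64

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank, num_adapters=2)  # one adapter per task family

print(full)         # 9437184
print(lora)         # 786432
print(full / lora)  # 12.0
```

Under these assumed sizes, two separate adapters together train one twelfth of the values that full fine-tuning of the same layer would.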

#Technical Details

HealthGPT builds on a pre-trained large language model: HealthGPT-M3 uses Microsoft Phi-3-mini, and the larger HealthGPT-L14 uses Phi-4. The LLM is paired with a CLIP-derived vision encoder and a VQ-based visual tokenizer, which lets the autoregressive backbone emit image tokens for generation. Rather than fully fine-tuning the backbone, H-LoRA attaches task-specific low-rank adapters and routes comprehension and generation through distinct adapter sets. Training follows a three-stage curriculum that first aligns visual features, then specializes the heterogeneous plugins, and finally performs instruction tuning on the VL-Health corpus. On medical multimodal comprehension benchmarks and on generation tasks such as cross-modality synthesis and image super-resolution, the authors report performance competitive with or exceeding both unified vision models and medical-specialized baselines, with substantially fewer trainable parameters than full fine-tuning.
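
A VQ-based tokenizer turns continuous image features into discrete token ids that an autoregressive model can predict like words. The following is a generic nearest-codebook lookup, offered as a rough sketch of that idea only; the toy codebook, the two-dimensional "patch features", and the function names are assumptions and do not reflect HealthGPT's actual tokenizer.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def dequantize(tokens, codebook):
    """Look token ids back up in the codebook (the decoder-side step)."""
    return [codebook[t] for t in tokens]


# Toy codebook with four entries; real codebooks hold thousands of learned vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

tokens = quantize(patches, codebook)
print(tokens)                        # [0, 3, 2]
print(dequantize(tokens, codebook))  # [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
```

In a generation setting, the language model would predict a sequence of such ids and a separate decoder would turn the looked-up vectors back into pixels.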

#Applications

HealthGPT targets clinical and research workflows that mix interpretation and synthesis: answering questions about radiology and pathology images, drafting report-style descriptions, converting between imaging modalities (for example CT-to-MRI), enhancing image resolution, and reconstructing degraded scans. Because a single model covers both directions, it can streamline pipelines that would otherwise chain a comprehension model and a generative model together, benefiting radiologists, pathologists, and medical-imaging researchers exploring assistive diagnosis and data augmentation.

#Impact

HealthGPT demonstrates that comprehension and generation can coexist in one medical foundation model without the destructive interference that has limited prior unified approaches, and it offers H-LoRA as a reusable recipe for adapting general LLMs to heterogeneous multimodal tasks. Its ICML 2025 Spotlight selection and fully open release (code, Apache-2.0 weights in multiple sizes, and the VL-Health dataset) have made it a reference point for unified medical vision-language modeling. It has since been extended by a follow-up HealthGPT-Pro series adding text, 2D, and 3D volumetric support. As with all Med-LVLMs, generated images and comprehension outputs require expert validation before any clinical use.

Citation

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Preprint

Lin, T., et al. (2025) HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation. International Conference on Machine Learning.

DOI: 10.48550/arXiv.2502.09838

Citations

Total Citations: 99
Influential: 9
References: 47

GitHub

Stars: 1.6K
Forks: 239
Open Issues: 12
Contributors: 5
Last Push: 1mo ago
Language: Python
License: Apache-2.0

HuggingFace

Downloads: 21
Likes: 9
Last Modified: 1y ago
Pipeline: any-to-any

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Overall: 68 (Partial)
Usability (can I run it?): 94
Reproducibility (can I retrain it?): 32
Model Openness Framework: Unclassified
No formal model card / data card

Tags

histology, image_reconstruction, medical_image_generation, multimodal, parameter_efficient_fine_tuning, radiology, transformer, vision_language_model, vision_transformer, visual_question_answering

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset