HuatuoGPT-Vision

Shenzhen Research Institute of Big Data / Chinese University of Hong Kong, Shenzhen

Open medical multimodal LLMs (7B and 34B) for visual question answering over radiology, pathology, and endoscopy images, trained on PubMedVision.

Released: June 2024

HuatuoGPT-Vision is a family of open medical multimodal large language models (MLLMs) designed to understand and reason over biomedical images such as radiographs, CT and MRI scans, ultrasound, pathology slides, and endoscopy alongside natural-language questions. It was introduced in June 2024 by the FreedomIntelligence group at the Shenzhen Research Institute of Big Data and the Chinese University of Hong Kong, Shenzhen, in the paper "Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale."

The central problem the work addresses is data scarcity: general-purpose MLLMs perform poorly on medical visual tasks because high-quality, instruction-formatted medical image-text data is rare. While PubMed contains millions of figures paired with captions, that text is noisy, often describing only fragments of compound figures or referencing context outside the image. The authors' key contribution is PubMedVision, a curated dataset of roughly 1.29 million medical visual-question-answering (VQA) samples built by extracting PubMed image-text pairs and using GPT-4V to denoise and reformat them into aligned question-answer conversations.

Training on PubMedVision yields HuatuoGPT-Vision models that substantially improve medical visual reasoning over their general-purpose backbones, while the dataset itself is released for the community to reuse. The project positions itself as the medical-vision counterpart to the text-only HuatuoGPT medical assistant line.

Key Features

PubMedVision dataset: A 1.29M-sample medical VQA corpus (alignment and instruction-tuning splits, plus a Chinese subset) spanning CT, MRI, X-ray, ultrasound, microscopy, and endoscopy across many anatomical regions.
GPT-4V reformatting pipeline: Rather than using raw captions, the authors prompt GPT-4V to convert noisy PubMed image-text pairs into clean, image-grounded question-answer dialogues, improving data quality and alignment.
Two open model sizes: A 7B model built on Qwen2-7B and a 34B model built on Yi-1.5-34B, both released with weights under a permissive license for research use.
Strong open-source medical VQA performance: The 34B model leads open-source alternatives on standard benchmarks and is competitive on the MMMU Health & Medicine track.
Multilingual coverage: A Chinese-language subset extends medical visual reasoning beyond English-only data.

Technical Details

HuatuoGPT-Vision follows the LLaVA-style architecture: a CLIP-based vision encoder feeds image embeddings through a projection layer into a decoder-only LLM. The 7B variant uses Qwen2-7B as its language backbone and the 34B variant uses Yi-1.5-34B. Models are trained in two stages on PubMedVision, first aligning vision and language representations on the alignment split and then instruction-tuning on the conversation split. On medical VQA benchmarks the 7B model reports VQA-RAD 63.7%, SLAKE 76.2%, PathVQA 57.9%, and PMC-VQA 54.3%, while the 34B model improves these to 68.1%, 76.9%, 63.5%, and 58.2% respectively, alongside gains on OmniMedVQA and the MMMU medical track. The authors report that adding PubMedVision to training data consistently lifts performance across multiple base MLLMs, indicating the dataset, not just the released checkpoints, drives the improvement.

Applications

HuatuoGPT-Vision is intended as a research tool for medical multimodal AI: answering questions about radiology and pathology images, generating image-grounded descriptions, and serving as a stronger initialization for downstream clinical-vision tasks. The accompanying PubMedVision dataset is widely reusable for training or fine-tuning other medical MLLMs. The models are research artifacts and are not validated or approved for clinical decision-making; the authors caution against direct diagnostic use.

Impact

By open-sourcing both the 1.3M-sample PubMedVision dataset and the trained 7B and 34B checkpoints under Apache 2.0, the project lowered the barrier to building medical vision-language systems and provided a reusable data recipe for converting noisy biomedical figures into instruction data. PubMedVision has been adopted as a training resource by subsequent medical MLLM efforts, and HuatuoGPT-Vision serves as a frequently cited open baseline for medical VQA. Its main limitations are those of its sources: PubMed figures skew toward published, illustrative cases, and GPT-4V reformatting can propagate model errors into the data.

Citation

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Preprint

Chen, J., et al. (2024) HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. arXiv.org.

DOI: 10.48550/arXiv.2406.19280

Recent citations

Papers that recently cited this model.

Evaluating and Understanding Model Editing for Medical Vision Language Models
Guli Zhu, Chenwei Wu, Liyue Shen
Jul 2026
0
IRIS: An Intelligent Vision-Language System for Ocular Surface Diseases via Topic Tree and Scene-Driven VQA Generation
Hao Wei, Wenjin Qi, Dasen Dai, et al.
Jul 2026
0
Towards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports
Bing Yan, Chunlei Li, Jingliang Hu, et al.
Jul 2026
0

Top citations

The most-cited papers that cite this model.

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Junying Chen, Zhenyang Cai, Ke Ji, et al.
arXiv.org · Dec 2024
238
Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data
Chaoyi Wu, Xiaoman Zhang, Ya Zhang, et al.
Nature Communications · Aug 2025
228
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Jiazhen Pan, Che Liu, Junde Wu, et al.
International Conference on Medical Image Computing and Computer-Assisted Intervention · Feb 2025
171
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models
Yuxiang Lai, Jike Zhong, Ming Li, et al.
IEEE Transactions on Medical Imaging · Mar 2025
126
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Peng Xia, Peng Xia, Kangyu Zhu, et al.
International Conference on Learning Representations · Oct 2024
114

Citations

Total Citations204

Influential39

References34

GitHub

Stars399

Forks36

Open Issues24

Contributors2

Last Push1y ago

LanguagePython

HuggingFace

Downloads2.9K

Likes30

Last Modified2y ago

Pipelinetext-generation

Fields of citing research

Computer Science98%
Medicine93%
Engineering12%
Psychology1%
Linguistics1%
Biology1%

Share of papers citing this model.

Openness

bio.rodeo opennessOpen weights · open weights, closed recipe

52Partial

Usability — can I run it?54

Reproducibility — can I retrain it?41

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository Research Paper HuggingFace Model Dataset

Key Features

PubMedVision dataset: A 1.29M-sample medical VQA corpus (alignment and instruction-tuning splits, plus a Chinese subset) spanning CT, MRI, X-ray, ultrasound, microscopy, and endoscopy across many anatomical regions.

GPT-4V reformatting pipeline: Rather than using raw captions, the authors prompt GPT-4V to convert noisy PubMed image-text pairs into clean, image-grounded question-answer dialogues, improving data quality and alignment.

Two open model sizes: A 7B model built on Qwen2-7B and a 34B model built on Yi-1.5-34B, both released with weights under a permissive license for research use.

Strong open-source medical VQA performance: The 34B model leads open-source alternatives on standard benchmarks and is competitive on the MMMU Health & Medicine track.

Multilingual coverage: A Chinese-language subset extends medical visual reasoning beyond English-only data.

Technical Details

Applications

Impact

Recent citations

Papers that recently cited this model.

Evaluating and Understanding Model Editing for Medical Vision Language Models

Guli Zhu, Chenwei Wu, Liyue Shen

Jul 2026

IRIS: An Intelligent Vision-Language System for Ocular Surface Diseases via Topic Tree and Scene-Driven VQA Generation

Hao Wei, Wenjin Qi, Dasen Dai, et al.

Jul 2026

Towards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports

Bing Yan, Chunlei Li, Jingliang Hu, et al.

Jul 2026

HuatuoGPT-Vision

#Key Features

#Technical Details

#Applications

#Impact

Citation

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Recent citations

Evaluating and Understanding Model Editing for Medical Vision Language Models

IRIS: An Intelligent Vision-Language System for Ocular Surface Diseases via Topic Tree and Scene-Driven VQA Generation

Towards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports

Top citations

Related models

Citations

GitHub

HuggingFace

Fields of citing research

Openness

Tags

Resources

HuatuoGPT-Vision

#Key Features

#Technical Details

#Applications

#Impact

Citation

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Recent citations

Evaluating and Understanding Model Editing for Medical Vision Language Models

IRIS: An Intelligent Vision-Language System for Ocular Surface Diseases via Topic Tree and Scene-Driven VQA Generation

Towards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports

Top citations

Related models

Citations

GitHub

HuggingFace

Fields of citing research

Openness

Tags

Resources

Key Features

Technical Details

Applications

Impact

Key Features

Technical Details

Applications

Impact