Shenzhen Research Institute of Big Data / Chinese University of Hong Kong, Shenzhen
A family of open medical multimodal LLMs (7B and 34B) trained on PubMedVision, a 1.3M-sample medical VQA dataset distilled from PubMed image-text pairs.
HuatuoGPT-Vision is a family of open medical multimodal large language models (MLLMs) designed to understand and reason over biomedical images such as radiographs, CT and MRI scans, ultrasound, pathology slides, and endoscopy alongside natural-language questions. It was introduced in June 2024 by the FreedomIntelligence group at the Shenzhen Research Institute of Big Data and the Chinese University of Hong Kong, Shenzhen, in the paper "Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale."
The central problem the work addresses is data scarcity: general-purpose MLLMs perform poorly on medical visual tasks because high-quality, instruction-formatted medical image-text data is rare. While PubMed contains millions of figures paired with captions, that text is noisy, often describing only fragments of compound figures or referencing context outside the image. The authors' key contribution is PubMedVision, a curated dataset of roughly 1.29 million medical visual-question-answering (VQA) samples built by extracting PubMed image-text pairs and using GPT-4V to denoise and reformat them into aligned question-answer conversations.
Training on PubMedVision yields HuatuoGPT-Vision models that substantially improve medical visual reasoning over their general-purpose backbones, while the dataset itself is released for the community to reuse. The project positions itself as the medical-vision counterpart to the text-only HuatuoGPT medical assistant line.
HuatuoGPT-Vision follows the LLaVA-style architecture: a CLIP-based vision encoder feeds image embeddings through a projection layer into a decoder-only LLM. The 7B variant uses Qwen2-7B as its language backbone and the 34B variant uses Yi-1.5-34B. Models are trained in two stages on PubMedVision, first aligning vision and language representations on the alignment split and then instruction-tuning on the conversation split. On medical VQA benchmarks the 7B model reports VQA-RAD 63.7%, SLAKE 76.2%, PathVQA 57.9%, and PMC-VQA 54.3%, while the 34B model improves these to 68.1%, 76.9%, 63.5%, and 58.2% respectively, alongside gains on OmniMedVQA and the MMMU medical track. The authors report that adding PubMedVision to training data consistently lifts performance across multiple base MLLMs, indicating the dataset, not just the released checkpoints, drives the improvement.
HuatuoGPT-Vision is intended as a research tool for medical multimodal AI: answering questions about radiology and pathology images, generating image-grounded descriptions, and serving as a stronger initialization for downstream clinical-vision tasks. The accompanying PubMedVision dataset is widely reusable for training or fine-tuning other medical MLLMs. The models are research artifacts and are not validated or approved for clinical decision-making; the authors caution against direct diagnostic use.
By open-sourcing both the 1.3M-sample PubMedVision dataset and the trained 7B and 34B checkpoints under Apache 2.0, the project lowered the barrier to building medical vision-language systems and provided a reusable data recipe for converting noisy biomedical figures into instruction data. PubMedVision has been adopted as a training resource by subsequent medical MLLM efforts, and HuatuoGPT-Vision serves as a frequently cited open baseline for medical VQA. Its main limitations are those of its sources: PubMed figures skew toward published, illustrative cases, and GPT-4V reformatting can propagate model errors into the data.
Chen, J., et al. (2024) HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. arXiv.org.
DOI: 10.48550/arXiv.2406.19280Papers that recently cited this model.
The most-cited papers that cite this model.
Not enough data