bio.rodeo
ModelsOrganizationsLeaderboardAbout
bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

AboutFAQSubmit a modelContact
© 2026 Pulsatance. All rights reserved. ~
Built by Pulsatance
Pathology foundation models
PathologyLanguage model

HuatuoGPT-Vision

Shenzhen Research Institute of Big Data / Chinese University of Hong Kong, Shenzhen

A family of open medical multimodal LLMs (7B and 34B) trained on PubMedVision, a 1.3M-sample medical VQA dataset distilled from PubMed image-text pairs.

Released: June 2024

HuatuoGPT-Vision is a family of open medical multimodal large language models (MLLMs) designed to understand and reason over biomedical images such as radiographs, CT and MRI scans, ultrasound, pathology slides, and endoscopy alongside natural-language questions. It was introduced in June 2024 by the FreedomIntelligence group at the Shenzhen Research Institute of Big Data and the Chinese University of Hong Kong, Shenzhen, in the paper "Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale."

The central problem the work addresses is data scarcity: general-purpose MLLMs perform poorly on medical visual tasks because high-quality, instruction-formatted medical image-text data is rare. While PubMed contains millions of figures paired with captions, that text is noisy, often describing only fragments of compound figures or referencing context outside the image. The authors' key contribution is PubMedVision, a curated dataset of roughly 1.29 million medical visual-question-answering (VQA) samples built by extracting PubMed image-text pairs and using GPT-4V to denoise and reformat them into aligned question-answer conversations.

Training on PubMedVision yields HuatuoGPT-Vision models that substantially improve medical visual reasoning over their general-purpose backbones, while the dataset itself is released for the community to reuse. The project positions itself as the medical-vision counterpart to the text-only HuatuoGPT medical assistant line.

#Key Features

  • PubMedVision dataset: A 1.29M-sample medical VQA corpus (alignment and instruction-tuning splits, plus a Chinese subset) spanning CT, MRI, X-ray, ultrasound, microscopy, and endoscopy across many anatomical regions.
  • GPT-4V reformatting pipeline: Rather than using raw captions, the authors prompt GPT-4V to convert noisy PubMed image-text pairs into clean, image-grounded question-answer dialogues, improving data quality and alignment.
  • Two open model sizes: A 7B model built on Qwen2-7B and a 34B model built on Yi-1.5-34B, both released with weights under a permissive license for research use.
  • Strong open-source medical VQA performance: The 34B model leads open-source alternatives on standard benchmarks and is competitive on the MMMU Health & Medicine track.
  • Multilingual coverage: A Chinese-language subset extends medical visual reasoning beyond English-only data.

#Technical Details

HuatuoGPT-Vision follows the LLaVA-style architecture: a CLIP-based vision encoder feeds image embeddings through a projection layer into a decoder-only LLM. The 7B variant uses Qwen2-7B as its language backbone and the 34B variant uses Yi-1.5-34B. Models are trained in two stages on PubMedVision, first aligning vision and language representations on the alignment split and then instruction-tuning on the conversation split. On medical VQA benchmarks the 7B model reports VQA-RAD 63.7%, SLAKE 76.2%, PathVQA 57.9%, and PMC-VQA 54.3%, while the 34B model improves these to 68.1%, 76.9%, 63.5%, and 58.2% respectively, alongside gains on OmniMedVQA and the MMMU medical track. The authors report that adding PubMedVision to training data consistently lifts performance across multiple base MLLMs, indicating the dataset, not just the released checkpoints, drives the improvement.

#Applications

HuatuoGPT-Vision is intended as a research tool for medical multimodal AI: answering questions about radiology and pathology images, generating image-grounded descriptions, and serving as a stronger initialization for downstream clinical-vision tasks. The accompanying PubMedVision dataset is widely reusable for training or fine-tuning other medical MLLMs. The models are research artifacts and are not validated or approved for clinical decision-making; the authors caution against direct diagnostic use.

#Impact

By open-sourcing both the 1.3M-sample PubMedVision dataset and the trained 7B and 34B checkpoints under Apache 2.0, the project lowered the barrier to building medical vision-language systems and provided a reusable data recipe for converting noisy biomedical figures into instruction data. PubMedVision has been adopted as a training resource by subsequent medical MLLM efforts, and HuatuoGPT-Vision serves as a frequently cited open baseline for medical VQA. Its main limitations are those of its sources: PubMed figures skew toward published, illustrative cases, and GPT-4V reformatting can propagate model errors into the data.

Citation

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Preprint

Chen, J., et al. (2024) HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. arXiv.org.

DOI: 10.48550/arXiv.2406.19280

Recent citations

Papers that recently cited this model.

Not enough citation data yet.

Top citations

The most-cited papers that cite this model.

Not enough citation data yet.

Citations

Total Citations187
Influential36
References34

GitHub

Stars399
Forks35
Open Issues24
Contributors2
Last Push1y ago
LanguagePython

HuggingFace

Downloads1.9K
Likes28
Last Modified1y ago
Pipelinetext-generation

Fields of citing research

Not enough data

Openness

bio.rodeo opennessOpen weights · open weights, closed recipe
52Partial
Usability — can I run it?54
Reproducibility — can I retrain it?41
Model Openness Framework
Unclassified
Restrictive license on core components

Tags

histologyinstruction_tuningmedical_image_understandingmultimodalradiologytransformervision_transformervisual_question_answering

Resources

GitHub RepositoryResearch PaperHuggingFace ModelDataset