Hong Kong University of Science and Technology
A 40B-parameter generalist medical vision-language foundation model spanning radiology, pathology, dermatology, retinography, and endoscopy.
MedDr is a generalist medical vision-language foundation model designed to interpret images and answer clinical questions across a broad range of medical specialties from a single set of weights. Whereas most medical AI systems are trained narrowly for one modality or task—a chest X-ray classifier, a dermatology grader, a pathology tile detector—MedDr targets the harder problem of a unified model that reasons over radiology, pathology, dermatology, retinography, and endoscopy alike, performing visual question answering, report generation, and diagnostic classification within a conversational interface.
The model was developed by the SMART Lab at the Hong Kong University of Science and Technology, led by Hao Chen, and released as a preprint in April 2024 (arXiv:2404.15127). At the time of release the authors described it as the largest open-source generalist foundation model tailored for medicine. MedDr is the centerpiece of a broader framework called GSCo (Generalist–Specialist Collaboration), in which the generalist model is paired with lightweight task-specific specialist models at inference time to improve diagnostic accuracy.
Its central methodological contribution is "diagnosis-guided bootstrapping," a data-construction strategy that converts large repositories of labeled medical images into high-quality image–text instruction data, sidestepping the scarcity of paired image–report corpora that has historically bottlenecked medical vision-language training.
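The idea behind diagnosis-guided bootstrapping can be illustrated with a minimal sketch. It is not the authors' pipeline: the `LabeledImage` record, the prompt wording, the `generate_report` callable, and the label-mention filter are all hypothetical, chosen only to show how a known diagnosis can steer report generation so that a labeled image becomes an image-text instruction pair.

```python
from dataclasses import dataclass


@dataclass
class LabeledImage:
    path: str       # location of the image file
    modality: str   # e.g. "dermoscopy"
    diagnosis: str  # ground-truth label from the source dataset


def build_report_prompt(sample: LabeledImage) -> str:
    """Prompt a captioning model, conditioning on the known diagnosis."""
    return (
        f"This is a {sample.modality} image. The confirmed diagnosis is "
        f"{sample.diagnosis}. Describe the visual findings that support "
        "this diagnosis in a short report."
    )


def bootstrap(samples, generate_report):
    """Turn labeled images into instruction-tuning records.

    generate_report is any callable (image_path, prompt) -> str, for
    example a wrapper around an existing vision-language model
    (hypothetical here; no specific model API is assumed).
    """
    records = []
    for s in samples:
        report = generate_report(s.path, build_report_prompt(s))
        # Assumed quality filter: drop reports that never mention the label.
        if s.diagnosis.lower() not in report.lower():
            continue
        records.append({
            "image": s.path,
            "instruction": f"Describe this {s.modality} image and give a diagnosis.",
            "response": report,
        })
    return records
```

Passing the report generator in as a callable keeps the sketch independent of any particular model; a real pipeline would likely add stronger quality checks than a substring match.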
MedDr is built on the InternVL vision-language architecture (OpenGVLab/InternVL-Chat-V1-2), comprising a vision transformer image encoder coupled to a large language model decoder, with roughly 40 billion parameters total in the released BF16 checkpoint. Training proceeds by first constructing instruction data through diagnosis-guided bootstrapping—generating descriptive reports from medical images and their labels—then integrating these with existing medical vision-language tasks (VQA, captioning, classification) for instruction tuning. The GSCo evaluation spanned 28 datasets and roughly 250,000 images across the supported modalities, assessing report generation, visual question answering, and image-level diagnosis. The authors report that pairing the generalist with specialists and retrieval augmentation improves performance over the generalist alone, particularly on out-of-distribution diagnostic tasks.
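The collaboration between generalist, specialists, and retrieval described above can be sketched as prompt construction. This is only an illustration under stated assumptions: the function names, the cosine-similarity retrieval over a small in-memory bank, and the prompt wording are hypothetical and not taken from the GSCo code. The sketch shows a specialist classifier's top predictions and the labels of similar reference cases being supplied to the generalist as hints.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, bank, k=2):
    """Return the labels of the k reference cases most similar to the query.

    bank is a list of (embedding, label) pairs (hypothetical format).
    """
    ranked = sorted(bank, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [label for _, label in ranked[:k]]


def build_prompt(question, specialist_probs, neighbor_labels, top_n=2):
    """Combine the question with specialist and retrieval hints."""
    top = sorted(specialist_probs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    hints = ", ".join(f"{name} ({p:.2f})" for name, p in top)
    similar = ", ".join(neighbor_labels)
    return (
        f"{question}\n"
        f"Specialist model suggestions: {hints}.\n"
        f"Diagnoses of similar reference cases: {similar}.\n"
        "Treat these as references, not ground truth."
    )
```

The resulting string would be sent to the generalist together with the image. Leaving the final decision to the generalist, rather than voting over model outputs, is one possible design choice and is assumed here for simplicity.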
MedDr is aimed at researchers building multimodal clinical assistants and at studies probing how far a single generalist model can go across heterogeneous medical imaging. Practical use cases include drafting preliminary radiology and pathology reports, answering image-grounded clinical questions, and serving as a flexible backbone that can be combined with specialist classifiers in a collaborative diagnostic pipeline. Because the weights are openly licensed, it is also a convenient starting point for fine-tuning on institution-specific datasets or new modalities.
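For anyone planning to load or fine-tune the checkpoint, a back-of-the-envelope estimate of weight memory is a useful first check. The calculation below assumes exactly 40 billion parameters (the text above says "roughly") and 2 bytes per parameter for BF16; it counts weights only and ignores activations, optimizer state, and any cache, all of which add to the real footprint.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


# Assumed round figure of 40e9 parameters stored in BF16 (2 bytes each).
bf16_gib = weight_memory_gib(40e9, 2)
print(f"BF16 weights: about {bf16_gib:.1f} GiB")  # about 74.5 GiB
```

At roughly 74.5 GiB for the weights alone, the model would typically need to be sharded across several accelerators or quantized to fit on a single device.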
MedDr contributed to a wave of open generalist medical vision-language models that challenged the prevailing one-model-per-task paradigm, and its diagnosis-guided bootstrapping offered a reusable recipe for turning abundant labeled-image archives into vision-language training data. The accompanying GSCo framework articulated a pragmatic middle path—rather than expecting a generalist to dominate every task, it formalized collaboration between broad and narrow models. As with all current medical foundation models, the work remains a research artifact: it is not a cleared clinical device, evaluations rest largely on retrospective benchmarks, and performance varies across modalities, so outputs require expert oversight before any clinical use.
He, S., et al. (2024) GSCo: Towards Generalizable AI in Medicine via Generalist-Specialist Collaboration.
DOI: 10.48550/arXiv.2404.15127