Chongqing University of Technology
A 4.2B-parameter lightweight biomedical vision-language assistant built on Phi-2 that outperforms larger LLaVA-Med models on medical visual question answering.
SigPhi-Med is a lightweight biomedical multimodal small language model (MSLM) designed to answer questions about medical images while remaining small enough to deploy in resource-constrained healthcare settings. Medical vision-language assistants such as LLaVA-Med have demonstrated strong performance on tasks like medical visual question answering (VQA), but their 7B-to-13B-parameter backbones make them expensive to serve. SigPhi-Med targets this gap, showing that careful component selection and training design can let a much smaller model match or surpass these larger systems.
Developed by researchers at Chongqing University of Technology and published in the Journal of Biomedical Informatics in 2025, SigPhi-Med couples Microsoft's 2.7B-parameter Phi-2 language model with a vision encoder to form a roughly 4.2B total-parameter assistant. The work is accompanied by extensive ablation studies that isolate how the choice of small language model, the vision encoder and its input resolution, the training strategy, and the quality and quantity of training data each affect downstream performance.
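To make the size gap concrete, the following back-of-the-envelope sketch estimates the memory needed just to hold each model's weights. The parameter counts come from the figures quoted above; the bytes-per-parameter table and the function names are illustrative assumptions, and the estimate ignores activations, the KV cache, and runtime overhead, so real serving costs will be higher.

```python
# Rough weight-only memory estimate. Assumption: memory = parameters x bytes per parameter.
# This ignores activations, KV cache, and framework overhead.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts (in billions) as quoted in the text above.
MODELS = {
    "SigPhi-Med": 4.2,
    "LLaVA-Med-v1.5 (7B)": 7.0,
    "LLaVA-Med (13B)": 13.0,
}


def weight_memory_gib(params_billion: float, dtype: str = "fp16") -> float:
    """Return the approximate size of the weights in GiB for a given precision."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 2**30


def footprint_table(dtype: str = "fp16") -> dict:
    """Map each model name to its estimated weight footprint in GiB (one decimal)."""
    return {name: round(weight_memory_gib(size, dtype), 1) for name, size in MODELS.items()}


if __name__ == "__main__":
    for name, gib in footprint_table("fp16").items():
        print(f"{name}: ~{gib} GiB of weights at fp16")
```

Under these assumptions, the 4.2B model needs roughly a third of the weight memory of a 13B model at the same precision, which is the gap the deployment argument rests on.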
Rather than introducing a fundamentally new architecture, SigPhi-Med contributes a recipe: it demonstrates which design decisions matter most when shrinking a biomedical VLM, and provides an openly released model and code as a reproducible baseline for efficient multimodal models in specialized domains.
SigPhi-Med follows the LLaVA-style design of a vision encoder connected to a language model through a projection module, but substitutes a compact Phi-2 language backbone to keep the total model near 4.2B parameters. Training uses the TinyLLaVA Factory framework with a two-stage paradigm: feature-alignment pretraining followed by instruction tuning. The training corpus draws on the biomedical LLaVA-Med dataset and the larger-scale PubMedVision dataset, and the paper analyzes how data quality and scale affect final accuracy. On three standard medical VQA benchmarks, VQA-RAD (radiology), SLAKE (radiology), and Path-VQA (pathology), SigPhi-Med reports higher overall performance than LLaVA-Med-v1.5 (7B), LLaVA-Med (13B), and Med-MoE despite its smaller size. The accompanying ablations attribute these gains to the choice of small language model and the vision encoder's input resolution.
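One way to see why the vision encoder's input resolution matters in a LLaVA-style design is to count how many visual tokens the projector hands to the language model. The sketch below assumes a ViT-style encoder that splits a square image into non-overlapping patches and emits one token per patch with no class token. The resolutions and patch size in the usage example are illustrative values, not the configuration reported for SigPhi-Med.

```python
def visual_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image, assuming non-overlapping
    patches and no extra class token (an assumption about the encoder)."""
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def lm_sequence_length(image_size: int, patch_size: int, text_tokens: int) -> int:
    """Total tokens the language model sees when projected visual tokens
    are concatenated with the text prompt tokens."""
    return visual_token_count(image_size, patch_size) + text_tokens


if __name__ == "__main__":
    # Hypothetical resolutions with a 14-pixel patch, for illustration only.
    for size in (224, 336, 384):
        print(size, visual_token_count(size, 14))
```

Higher resolution gives the language model finer visual detail but lengthens its input sequence quadratically in the image side length, which is the trade-off a resolution ablation probes.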
SigPhi-Med is intended as a cost-effective medical image assistant for tasks such as answering open- and closed-form questions about radiology and pathology images. Its small footprint makes it attractive for deployment in resource-limited clinical or research environments where serving 7B-13B models is impractical, and for institutions exploring on-premises medical AI. Beyond direct use, the released model and ablation findings serve as a reproducible baseline and design reference for researchers building efficient multimodal assistants in biomedicine and other specialized domains.
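Open- and closed-form questions are usually scored differently. The sketch below assumes a convention common in medical VQA evaluation: exact-match accuracy for closed-form answers (such as yes/no) and token-level recall of the reference answer for open-form answers. The source does not specify SigPhi-Med's scoring scripts, so the normalization rule and function names here are hypothetical.

```python
import re


def normalize(text: str) -> list:
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def closed_accuracy(predictions: list, references: list) -> float:
    """Fraction of closed-form answers that match the reference exactly
    after normalization (assumed scoring rule)."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty prediction and reference lists")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def open_recall(prediction: str, reference: str) -> float:
    """Fraction of reference tokens that appear in the prediction
    (assumed scoring rule for open-form answers)."""
    ref_tokens = normalize(reference)
    if not ref_tokens:
        return 0.0
    pred_tokens = set(normalize(prediction))
    return sum(tok in pred_tokens for tok in ref_tokens) / len(ref_tokens)


if __name__ == "__main__":
    print(closed_accuracy(["Yes.", "no"], ["yes", "yes"]))
    print(open_recall("The lesion is in the left lung", "left lung"))
```

Recall rewards a free-text answer for containing the reference terms without penalizing extra words, so it can overstate quality for verbose outputs; this is one reason benchmark scores alone say little about clinical reliability.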
SigPhi-Med reinforces a broader trend toward smaller, efficient medical vision-language models by showing that a 4.2B-parameter assistant can outperform models up to roughly three times its size (7B and 13B) on established VQA benchmarks. Its ablation study clarifies which components drive performance when compressing biomedical VLMs, providing practical guidance for the field. As an openly released model with public code and weights, it lowers the barrier to deploying and studying medical multimodal assistants. The main caveats are those common to medical VQA systems: benchmark accuracy does not guarantee clinical reliability, and the model inherits biases and coverage limits from its training datasets.
Zhou, F., et al. (2025) SigPhi-Med: A lightweight vision-language assistant for biomedicine. Journal of Biomedical Informatics.
DOI: 10.1016/j.jbi.2025.104849