SigPhi-Med

Biomedical vision-language assistant for medical visual question answering, pairing Phi-2 with a vision encoder in a 4.2B-parameter model.

Released: July 2025

Parameters: 4.2 Billion

SigPhi-Med is a lightweight biomedical multimodal small language model (MSLM) designed to answer questions about medical images while remaining small enough to deploy in resource-constrained healthcare settings. Medical vision-language assistants such as LLaVA-Med have demonstrated strong performance on tasks like medical visual question answering (VQA), but their 7B-to-13B-parameter backbones make them expensive to serve. SigPhi-Med targets this gap, showing that careful component selection and training design can let a much smaller model match or surpass these larger systems.

Developed by researchers at Chongqing University of Technology and published in the Journal of Biomedical Informatics in 2025, SigPhi-Med couples Microsoft's 2.7B-parameter Phi-2 language model with a vision encoder to form a roughly 4.2B total-parameter assistant. The work is accompanied by extensive ablation studies that isolate how the choice of small language model, the vision encoder and its input resolution, the training strategy, and the quality and quantity of training data each affect downstream performance.

Rather than introducing a fundamentally new architecture, SigPhi-Med contributes a recipe: it demonstrates which design decisions matter most when shrinking a biomedical VLM, and provides an openly released model and code as a reproducible baseline for efficient multimodal models in specialized domains.

Key Features

Compact 4.2B-parameter model: Built on Microsoft's Phi-2 (2.7B) small language model, SigPhi-Med is far smaller than LLaVA-Med-v1.5 (7B) and LLaVA-Med (13B) while delivering competitive or better accuracy.
Strong medical VQA performance: Achieves superior overall results across the VQA-RAD, SLAKE, and Path-VQA benchmarks compared to larger biomedical assistants, including the Med-MoE (2.7B x 4) mixture-of-experts model.
Systematic ablation insights: Quantifies the contribution of the language model, vision encoder and resolution, training strategy, and training-data quality and quantity, offering actionable guidance for building efficient domain MSLMs.
TinyLLaVA-based, open implementation: Trained with the TinyLLaVA Factory framework and released with code on GitHub and weights on Hugging Face for reproducibility.

Technical Details

SigPhi-Med follows the LLaVA-style design of a vision encoder connected to a language model through a projection module, but substitutes a compact Phi-2 language backbone to keep the total model near 4.2B parameters. Training uses the TinyLLaVA Factory framework with a two-stage paradigm of feature-alignment pretraining followed by instruction tuning. The training corpus draws on the biomedical LLaVA-Med dataset and the larger-scale PubMedVision dataset, with the paper analyzing how data quality and scale affect final accuracy. On the three standard medical VQA benchmarks - VQA-RAD (radiology), SLAKE (radiology), and Path-VQA (pathology) - SigPhi-Med reports higher overall performance than LLaVA-Med-v1.5 (7B), LLaVA-Med (13B), and Med-MoE despite its smaller size, with accompanying ablations attributing these gains to the choice of small language model and vision encoder resolution.
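
Why input resolution matters in a LLaVA-style design can be illustrated with a little arithmetic. The sketch below assumes a ViT-style vision encoder that cuts a square image into non-overlapping patches, and a projector that emits one language-model token per patch. The function names, the resolutions, and the patch size of 14 are illustrative assumptions, not figures taken from the SigPhi-Med paper.

```python
def visual_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder yields for a square image.

    Assumption: non-overlapping patches, with any partial patch at the
    border dropped (as with an unpadded strided patch embedding).
    """
    if image_size <= 0 or patch_size <= 0:
        raise ValueError("image_size and patch_size must be positive")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


def sequence_length(image_size: int, patch_size: int, text_tokens: int) -> int:
    """Tokens the language model sees: projected image patches plus the question."""
    return visual_token_count(image_size, patch_size) + text_tokens


if __name__ == "__main__":
    # Hypothetical resolutions, chosen only to show how quickly the count grows.
    for size in (224, 336, 384):
        print(size, visual_token_count(size, patch_size=14))
```

Under these assumptions the visual token count grows quadratically with resolution, so a higher-resolution encoder gives the language model more image detail at the price of a longer input sequence.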

Applications

SigPhi-Med is intended as a cost-effective medical image assistant for tasks such as answering open- and closed-form questions about radiology and pathology images. Its small footprint makes it attractive for deployment in resource-limited clinical or research environments where serving 7B-13B models is impractical, and for institutions exploring on-premises medical AI. Beyond direct use, the released model and ablation findings serve as a reproducible baseline and design reference for researchers building efficient multimodal assistants in biomedicine and other specialized domains.
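
The footprint argument can be made concrete with a back-of-the-envelope estimate. The sketch below counts only the memory needed to hold model weights at a given numeric precision; it ignores activations, the KV cache, and framework overhead, so real serving requirements are higher. The helper name and the precision table are assumptions for illustration.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return params_billion * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    # Parameter counts as quoted on this page.
    for name, size in (("SigPhi-Med", 4.2), ("7B model", 7.0), ("13B model", 13.0)):
        print(f"{name}: {weight_memory_gb(size):.1f} GB in fp16")
```

At 16-bit precision this gives roughly 8.4 GB of weights for a 4.2B-parameter model, against about 14 GB for a 7B model and 26 GB for a 13B model.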

Impact

SigPhi-Med reinforces a broader trend toward smaller, efficient medical vision-language models by showing that a 4.2B-parameter assistant can outperform 7B and 13B models, the latter roughly three times its size, on established VQA benchmarks. Its careful ablation study clarifies which components drive performance when compressing biomedical VLMs, providing practical guidance for the field. As an openly released model with public code and weights, it lowers the barrier to deploying and studying medical multimodal assistants. The main caveats are those common to medical VQA systems: benchmark accuracy does not guarantee clinical reliability, and the model inherits biases and coverage limits from its training datasets.
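
For context on what benchmark accuracy measures, medical VQA answers are typically scored by string comparison. The sketch below follows a convention common on these benchmarks, simplified here: exact match for closed-form (for example yes/no) questions and token recall for open-form ones. It is an assumption for illustration, not the evaluation script used by the SigPhi-Med authors, and all function names are hypothetical.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def closed_form_correct(prediction: str, answer: str) -> bool:
    """Closed-form questions: exact match after normalization."""
    return normalize(prediction) == normalize(answer)


def open_form_recall(prediction: str, answer: str) -> float:
    """Open-form questions: fraction of reference tokens found in the prediction."""
    gold = normalize(answer)
    if not gold:
        return 0.0
    predicted = set(normalize(prediction))
    return sum(token in predicted for token in gold) / len(gold)
```

Under this scheme a prediction of "right lung" against a reference of "left lung" still earns a recall of 0.5, which shows how a clinically wrong answer can receive partial credit.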

Citation

SigPhi-Med: A lightweight vision-language assistant for biomedicine

Zhou, F., et al. (2025) SigPhi-Med: A lightweight vision-language assistant for biomedicine. Journal of Biomedical Informatics.

DOI: 10.1016/j.jbi.2025.104849

Recent citations

Papers that recently cited this model.

Multimodal AI in healthcare: Review of vision-language foundation models for real-world medical applications.
Taha Razzaq, Murtaza Taj, Asim Iqbal
Journal of Biomedical Informatics · Jul 2026
Citations: 0

Medical Multimodal Large Language Models: A Survey
Hanguang Xiao, Ningzhi Hui, Yong Xu, et al.
Information Fusion · Apr 2026
Citations: 0

LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration
Gökçe İnal, Pouyan Navard, Alper Yilmaz
Mar 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine
Hanguang Xiao, Feizhong Zhou, X. Liu, et al.
Information Fusion · May 2024
Citations: 116

Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye, Hao Tang
arXiv.org · Apr 2025
Citations: 23

Multimodal Large Language Models in Construction Education for Learning Human–Robot Collaboration: A Narrative Review
Ebenezer Olukanni, A. Akanmu, H. Jebelli
ASCE OPEN: Multidisciplinary Journal of Civil Engineering · 2026
Citations: 3

Scaling down, Powering up: A Survey on the Advancements of Small Vision-Language Models
Sheikh Iftekhar Ahmed, Muhammad Zubair Hasan, Abrar Jahin Niloy, et al.
Information Fusion · Oct 2025
Citations: 2

DeepSeek-R1 vs Open-Weight AI in Ophthalmology.
Zhenzhen Liu, Chengying Zhao, Haotian Lin
JAMA Ophthalmology · Sep 2025
Citations: 1

Citations

Total Citations: 9
Influential: 0
References: 14

GitHub

Stars: 5
Forks: 3
Open Issues: 0
Contributors: 1
Last Push: 8mo ago
Language: Python

HuggingFace

Downloads: 4
Likes: 0
Last Modified: 1y ago

Fields of citing research

Computer Science: 89%
Medicine: 67%
Engineering: 22%
Environmental Science: 11%
Physics: 11%
Education: 11%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 15 · Closed

Usability (can I run it?): 16

Reproducibility (can I retrain it?): 13

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository
Research Paper
HuggingFace Model
