bio.rodeo
Imaging · Language model

SigPhi-Med

Chongqing University of Technology

A 4.2B-parameter lightweight biomedical vision-language assistant built on Phi-2 that outperforms larger LLaVA-Med models on medical visual question answering.

Released: July 2025
Parameters: 4.2 Billion

SigPhi-Med is a lightweight biomedical multimodal small language model (MSLM) designed to answer questions about medical images while remaining small enough to deploy in resource-constrained healthcare settings. Medical vision-language assistants such as LLaVA-Med have demonstrated strong performance on tasks like medical visual question answering (VQA), but their 7B-to-13B-parameter backbones make them expensive to serve. SigPhi-Med targets this gap, showing that careful component selection and training design can let a much smaller model match or surpass these larger systems.

Developed by researchers at Chongqing University of Technology and published in the Journal of Biomedical Informatics in 2025, SigPhi-Med couples Microsoft's 2.7B-parameter Phi-2 language model with a vision encoder to form a roughly 4.2B total-parameter assistant. The work is accompanied by extensive ablation studies that isolate how the choice of small language model, the vision encoder and its input resolution, the training strategy, and the quality and quantity of training data each affect downstream performance.

Rather than introducing a fundamentally new architecture, SigPhi-Med contributes a recipe: it demonstrates which design decisions matter most when shrinking a biomedical VLM, and provides an openly released model and code as a reproducible baseline for efficient multimodal models in specialized domains.

#Key Features

  • Compact 4.2B-parameter backbone: Built on Microsoft's Phi-2 (2.7B) small language model, SigPhi-Med is far smaller than LLaVA-Med-v1.5 (7B) and LLaVA-Med (13B) while delivering competitive or better accuracy.
  • Strong medical VQA performance: Achieves superior overall results across the VQA-RAD, SLAKE, and Path-VQA benchmarks compared to larger biomedical assistants, including the Med-MoE (2.7B x 4) mixture-of-experts model.
  • Systematic ablation insights: Quantifies the contribution of the language model, vision encoder and resolution, training strategy, and training-data quality and quantity, offering actionable guidance for building efficient domain MSLMs.
  • TinyLLaVA-based, open implementation: Trained with the TinyLLaVA Factory framework and released with code on GitHub and weights on Hugging Face for reproducibility.
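The VQA benchmarks listed above contain both closed-form questions (for example yes/no) and open-form questions. As a minimal sketch of how such answers are often scored, assuming the convention of exact-match accuracy for closed questions and token recall for open questions, the function names and normalization below are illustrative and are not taken from the SigPhi-Med code:

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def closed_accuracy(predictions, references):
    """Fraction of closed-form answers that match the reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def open_recall(prediction, reference):
    """Share of reference tokens that also appear in the generated answer."""
    ref_tokens = normalize(reference)
    pred_tokens = set(normalize(prediction))
    if not ref_tokens:
        return 0.0
    return sum(token in pred_tokens for token in ref_tokens) / len(ref_tokens)


# Hypothetical answers, for illustration only.
print(closed_accuracy(["Yes.", "no"], ["yes", "yes"]))
print(open_recall("The lesion is in the left lung", "left lung"))
```

Published evaluation scripts may normalize answers differently, so scores from this sketch are not directly comparable to reported numbers.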

#Technical Details

SigPhi-Med follows the LLaVA-style design of a vision encoder connected to a language model through a projection module, but substitutes a compact Phi-2 language backbone to keep the total model near 4.2B parameters. Training uses the TinyLLaVA Factory framework with a two-stage paradigm of feature-alignment pretraining followed by instruction tuning. The training corpus draws on the biomedical LLaVA-Med dataset and the larger-scale PubMedVision dataset, with the paper analyzing how data quality and scale affect final accuracy. On the three standard medical VQA benchmarks - VQA-RAD (radiology), SLAKE (radiology), and Path-VQA (pathology) - SigPhi-Med reports higher overall performance than LLaVA-Med-v1.5 (7B), LLaVA-Med (13B), and Med-MoE despite its smaller size, with accompanying ablations attributing these gains to the choice of small language model and vision encoder resolution.
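The data flow described above can be illustrated with a toy sketch. It assumes the usual LLaVA-style arrangement in which projected image patch features are prepended to the text token embeddings, and in which the first stage updates only the projector while the second stage also updates the language model. The dimensions, the single linear projector, and the freezing schedule are all assumptions for illustration; the actual SigPhi-Med configuration is defined in its TinyLLaVA Factory recipe.

```python
import random

VISION_DIM = 8   # hypothetical vision-encoder feature width
LM_DIM = 16      # hypothetical language-model embedding width


def make_linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) matrix standing in for a learned projector."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(patch_features, weights):
    """Map each patch feature vector from vision space into LM embedding space."""
    return [[sum(w * x for w, x in zip(row, patch)) for row in weights]
            for patch in patch_features]


def build_lm_input(patch_features, text_embeddings, weights):
    """Prepend projected image tokens to the text token embeddings."""
    return project(patch_features, weights) + text_embeddings


# Assumed LLaVA-style schedule: which modules receive gradient updates per stage.
TRAINABLE = {
    "stage1_feature_alignment": {
        "vision_encoder": False, "projector": True, "language_model": False},
    "stage2_instruction_tuning": {
        "vision_encoder": False, "projector": True, "language_model": True},
}

rng = random.Random(0)
weights = make_linear(VISION_DIM, LM_DIM, rng)
patches = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]  # 4 image patches
text = [[rng.random() for _ in range(LM_DIM)] for _ in range(3)]        # 3 text tokens
sequence = build_lm_input(patches, text, weights)
print(len(sequence), len(sequence[0]))  # sequence length, embedding width
```

In a real model the projector is learned and the language model consumes the combined sequence; this sketch only shows the shapes involved.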

#Applications

SigPhi-Med is intended as a cost-effective medical image assistant for tasks such as answering open- and closed-form questions about radiology and pathology images. Its small footprint makes it attractive for deployment in resource-limited clinical or research environments where serving 7B-13B models is impractical, and for institutions exploring on-premises medical AI. Beyond direct use, the released model and ablation findings serve as a reproducible baseline and design reference for researchers building efficient multimodal assistants in biomedicine and other specialized domains.
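To give a rough sense of the footprint difference, the sketch below estimates the memory needed for model weights alone from the nominal parameter counts quoted on this page. It is a back-of-the-envelope calculation under stated assumptions: it ignores activations, the KV cache, and framework overhead, and the function name is hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    """Approximate memory for weights alone, in GB (10**9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


# Nominal sizes as stated in the text above.
for name, size in [("SigPhi-Med", 4.2), ("LLaVA-Med-v1.5", 7.0), ("LLaVA-Med", 13.0)]:
    print(f"{name}: {weight_memory_gb(size):.1f} GB in fp16")
```

At 16-bit precision this works out to about 8.4 GB for a 4.2B-parameter model, against about 14 GB and 26 GB for 7B and 13B models.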

#Impact

SigPhi-Med reinforces a broader trend toward smaller, efficient medical vision-language models by showing that a 4.2B-parameter assistant can outperform models with roughly 1.7 to 3 times as many parameters (7B and 13B) on established VQA benchmarks. Its ablation study clarifies which components drive performance when compressing biomedical VLMs, providing practical guidance for the field. As an openly released model with public code and weights, it lowers the barrier to deploying and studying medical multimodal assistants. The main caveats are those common to medical VQA systems: benchmark accuracy does not guarantee clinical reliability, and the model inherits biases and coverage limits from its training datasets.

Citation

Zhou, F., et al. (2025) SigPhi-Med: A lightweight vision-language assistant for biomedicine. Journal of Biomedical Informatics.

DOI: 10.1016/j.jbi.2025.104849


Citations

Total Citations: 8
Influential: 0
References: 14

GitHub

Stars: 5
Forks: 3
Open Issues: 0
Contributors: 1
Last Push: 6mo ago
Language: Python

HuggingFace

Downloads: 3
Likes: 0
Last Modified: 1y ago


Openness

bio.rodeo openness: Closed (15), low usability and reproducibility
Usability (can I run it?): 16
Reproducibility (can I retrain it?): 13
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

histology, instruction_tuning, medical_image_understanding, multimodal, radiology, transformer, vision_transformer, visual_question_answering

Resources

GitHub Repository · Research Paper · HuggingFace Model