Shanghai AI Laboratory / Fudan University / Shanghai Jiao Tong University
Medical multimodal LLM (2B and 8B) trained for generalizable, step-by-step clinical reasoning via Mentor-Intern Collaborative Search.
Chiron-o1 is a family of medical multimodal large language models (MLLMs) built to perform deep, verifiable, step-by-step reasoning over clinical images and text, rather than producing single-shot answers to visual questions. It was introduced in June 2025 by researchers from Shanghai Artificial Intelligence Laboratory, Fudan University, and Shanghai Jiao Tong University, and the work was accepted to NeurIPS 2025.
The central problem the authors target is that most medical MLLMs answer visual questions directly, without an explicit reasoning trace, which limits both interpretability and the ability to generalize to complex clinical scenarios. High-quality medical chain-of-thought (CoT) supervision is scarce, and naive prompting of a single model tends to produce shallow or unreliable reasoning paths. Chiron-o1 addresses this with a data-generation strategy called Mentor-Intern Collaborative Search (MICS), which searches for effective reasoning paths by having strong "mentor" models propose reasoning steps and weaker "intern" models continue and stress-test them.
Released in 2B and 8B parameter sizes built on the InternVL3 backbone, Chiron-o1 reports state-of-the-art results across a range of medical visual question answering and reasoning benchmarks, positioning it among the open, reasoning-focused medical MLLMs alongside efforts such as HuatuoGPT-Vision, Med-R1, and MedVLM-R1.
Chiron-o1 fine-tunes the InternVL3 vision-language architecture (2B and 8B variants) on the MMRP dataset using a curriculum learning schedule. MICS generates the chain-of-thought training data: mentor models seed reasoning steps, intern models (Qwen2.5-VL-7B, Qwen2-VL-7B, InternVL3-8B) continue them, and the MICS-Score ranks candidate paths by how well the interns can follow and complete them, favoring reasoning that is both correct and learnable. On benchmarks, the 8B model reports 76.8% on VQA-RAD, 83.2% on SLAKE, 74.0% on PathVQA, 57.5% on PMC-VQA, and 54.6% on MMMU Health & Medicine, outperforming larger general-purpose and medical baselines such as HuatuoGPT-Vision-34B and Gemini-2.5-Pro on several of these tasks. On the held-out MMRP reasoning split, it reaches 92.1% (pure text) and 58.4% (multimodal) accuracy.
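The ranking idea behind the MICS-Score can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes, for illustration only, that a candidate step is scored by the fraction of intern models that reach the reference answer when continuing from it. The exact formula is not given here, and the names `Candidate`, `intern_completion_score`, `pick_best_step`, and the stub interns are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    mentor: str  # hypothetical label for the mentor model that proposed the step
    step: str    # text of the proposed reasoning step


# An intern maps (question, reasoning path so far) to a final answer string.
Intern = Callable[[str, Sequence[str]], str]


def intern_completion_score(question: str, prefix: Sequence[str],
                            candidate: Candidate, interns: Sequence[Intern],
                            reference_answer: str) -> float:
    """Assumed score: share of interns that reach the reference answer
    when continuing from prefix + candidate step."""
    path = list(prefix) + [candidate.step]
    target = reference_answer.strip().lower()
    hits = sum(intern(question, path).strip().lower() == target
               for intern in interns)
    return hits / len(interns)


def pick_best_step(question: str, prefix: Sequence[str],
                   candidates: Sequence[Candidate], interns: Sequence[Intern],
                   reference_answer: str) -> Tuple[float, Candidate]:
    """Rank the mentors' candidate steps and return (score, best candidate)."""
    scored: List[Tuple[float, Candidate]] = [
        (intern_completion_score(question, prefix, c, interns, reference_answer), c)
        for c in candidates
    ]
    return max(scored, key=lambda pair: pair[0])


def make_stub_intern(keyword: str) -> Intern:
    """Stand-in for a real intern model: it answers correctly only if the
    reasoning path mentions the finding it depends on."""
    def intern(question: str, path: Sequence[str]) -> str:
        return "pneumothorax" if any(keyword in s for s in path) else "unsure"
    return intern


interns = [make_stub_intern("pleural line"),
           make_stub_intern("pleural line"),
           make_stub_intern("lung markings")]
candidates = [
    Candidate("mentor_a", "The image is a chest radiograph."),
    Candidate("mentor_b", "A visible pleural line with absent lung markings beyond it."),
]

best_score, best = pick_best_step("What is the diagnosis?", [], candidates,
                                  interns, "pneumothorax")
print(best.mentor, best_score)
```

In this toy setup the second candidate is selected because every stub intern can finish the reasoning from it, which mirrors the stated preference for steps that are both correct and learnable. A full search would append the chosen step to the prefix and repeat until the path reaches an answer.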
Chiron-o1 targets medical visual question answering and reasoning across modalities including radiology, pathology, and general clinical imagery. Its explicit reasoning traces make it useful for research on interpretable clinical decision support, for medical education and tutoring, and as a base for further fine-tuning. The compact 2B variant runs on modest GPUs (roughly 8 GB of memory), lowering the barrier for academic groups and resource-constrained deployments to experiment with reasoning-capable medical MLLMs.
By reframing medical MLLM training around searched, ranked chain-of-thought data rather than direct-answer supervision, Chiron-o1 demonstrates that collaborative search over reasoning paths can yield more generalizable clinical reasoning. Its NeurIPS 2025 acceptance, openly released 2B and 8B weights, and MIT-licensed code make it a practical reference point for the growing class of reasoning-oriented medical foundation models. Key limitations include reliance on proprietary mentor models (GPT-4o, Gemini) to generate training data and licensing restrictions on parts of the underlying image corpus (e.g., Radiopaedia), which constrain full reproduction of the training dataset.
Sun, H., et al. (2025) Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search.
DOI: 10.48550/arXiv.2506.16962