Chiron-o1

Shanghai AI Laboratory / Fudan University / Shanghai Jiao Tong University

Medical multimodal LLM (2B and 8B) trained for generalizable, step-by-step clinical reasoning via Mentor-Intern Collaborative Search.

Released: June 2025

Chiron-o1 is a family of medical multimodal large language models (MLLMs) built to perform deep, verifiable, step-by-step reasoning over clinical images and text, rather than producing single-shot answers to visual questions. It was introduced in June 2025 by researchers from Shanghai Artificial Intelligence Laboratory, Fudan University, and Shanghai Jiao Tong University, and the work was accepted to NeurIPS 2025.

The central problem the authors target is that most medical MLLMs answer visual questions directly, without an explicit reasoning trace, which limits both interpretability and the ability to generalize to complex clinical scenarios. High-quality medical chain-of-thought (CoT) supervision is scarce, and naive prompting of a single model tends to produce shallow or unreliable reasoning paths. Chiron-o1 addresses this with a data-generation strategy called Mentor-Intern Collaborative Search (MICS), which searches for effective reasoning paths by having strong "mentor" models propose reasoning steps and weaker "intern" models continue and stress-test them.

Chiron-o1 is released in 2B and 8B parameter sizes, both built on the InternVL3 backbone. The authors report state-of-the-art results across a range of medical visual question answering and reasoning benchmarks, which places it among open, reasoning-focused medical MLLMs such as HuatuoGPT-Vision, Med-R1, and MedVLM-R1.

Key Features

Mentor-Intern Collaborative Search (MICS): A reasoning-path search scheme in which mentor models (GPT-4o, Gemini 2.5 Pro, Qwen2.5-VL-72B) propose initial reasoning steps, intern models continue them, and candidate paths are selected by an MICS-Score that measures how learnable a reasoning path is.
MMRP reasoning dataset: A ranked multimodal medical reasoning dataset combining simple QA pairs, image-text alignment annotations, and MICS-generated multimodal chain-of-thought data for complex cases.
Curriculum learning: Training proceeds from simpler alignment and QA tasks toward harder multimodal CoT, progressively building generalizable reasoning ability.
Two open sizes: Chiron-o1-2B (InternVL3-2B, ~8GB GPU) and Chiron-o1-8B (InternVL3-8B, ~19GB GPU), with released weights and MIT-licensed code.
Verifiable reasoning traces: Outputs include explicit step-by-step chains rather than opaque single answers, aiding interpretability in clinical contexts.
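
The easy-to-hard ordering in the curriculum learning item can be pictured as a sorted list of training stages. This is a minimal sketch, not the authors' training code: the `Stage` dataclass, the stage names, and the `curriculum_order` helper are hypothetical, and the difficulty ranks only encode the order stated above (alignment and QA first, multimodal CoT last). The relative order of the two simpler stages is not specified on this page, so they share a rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str        # hypothetical label for one MMRP data component
    difficulty: int  # assumed rank: lower ranks are trained first


def curriculum_order(stages):
    """Return stages sorted from easiest to hardest (stable for ties)."""
    return sorted(stages, key=lambda s: s.difficulty)


stages = [
    Stage("multimodal_cot", 2),        # MICS-generated chain-of-thought, hardest
    Stage("image_text_alignment", 1),  # simpler alignment data
    Stage("simple_qa", 1),             # simpler QA pairs
]

schedule = [s.name for s in curriculum_order(stages)]
print(schedule)
```

In a real pipeline each stage would map to a dataset and a fine-tuning run; here the list only shows that the chain-of-thought data comes last.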

Technical Details

Chiron-o1 fine-tunes the InternVL3 vision-language architecture (2B and 8B variants) on the MMRP dataset with a curriculum learning schedule. MICS generates the training CoT data: mentor models seed reasoning steps, and intern models (Qwen2.5-VL-7B, Qwen2-VL-7B, InternVL3-8B) continue them. The MICS-Score then ranks candidate paths by how well the interns can follow and complete them, favoring reasoning that is both correct and learnable. On benchmarks, the 8B model reports 76.8% on VQA-RAD, 83.2% on SLAKE, 74.0% on PathVQA, 57.5% on PMC-VQA, and 54.6% on MMMU Health & Medicine, outperforming larger general-purpose and medical baselines such as HuatuoGPT-Vision-34B and Gemini-2.5-Pro on several of these tasks. On the held-out MMRP reasoning split it reaches 92.1% (pure text) and 58.4% (multimodal) accuracy.
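
The mentor-intern loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the score is assumed here to be the fraction of interns that reach the reference answer when continuing a partial path, step selection is assumed to be greedy, and the callable interfaces, `mics_score`, `search_path`, and the toy models are all hypothetical stand-ins for real model calls.

```python
from typing import Callable, List

# Hypothetical interfaces: a mentor proposes the next step given the path so far;
# an intern continues a partial path and returns a final answer string.
Mentor = Callable[[str, List[str]], str]
Intern = Callable[[str, List[str]], str]


def mics_score(question, path, interns, reference):
    """Assumed score: fraction of interns whose answer matches the reference."""
    if not interns:
        raise ValueError("at least one intern is required")
    hits = sum(intern(question, path) == reference for intern in interns)
    return hits / len(interns)


def search_path(question, reference, mentors, interns, max_steps=3):
    """Greedy search: at each step keep the mentor proposal with the best score."""
    path: List[str] = []
    best_score = 0.0
    for _ in range(max_steps):
        candidates = [path + [mentor(question, path)] for mentor in mentors]
        scored = [(mics_score(question, c, interns, reference), c) for c in candidates]
        best_score, path = max(scored, key=lambda item: item[0])
        if best_score == 1.0:  # every intern already completes the path correctly
            break
    return path, best_score


# Toy stand-ins for real models, for illustration only.
def vague_mentor(question, path):
    return "look at the image"


def good_mentor(question, path):
    return "note the lobar consolidation"


def intern_a(question, path):
    return "pneumonia" if any("consolidation" in s for s in path) else "normal"


def intern_b(question, path):
    return "pneumonia" if path and "consolidation" in path[-1] else "unsure"


path, score = search_path(
    "What is the diagnosis?", "pneumonia",
    mentors=[vague_mentor, good_mentor], interns=[intern_a, intern_b],
)
print(path, score)
```

The design point the sketch tries to capture is that a mentor step is judged by what weaker models can do with it, not by the mentor's own confidence, which is one way to read "learnable" in the description above.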

Applications

Chiron-o1 targets medical visual question answering and reasoning across modalities including radiology, pathology, and general clinical imagery. Its explicit reasoning traces make it useful for research on interpretable clinical decision support, medical education and tutoring, and as a base for further fine-tuning. The compact 2B variant runs on modest GPUs (~8GB), lowering the barrier for academic groups and resource-constrained deployments to experiment with reasoning-capable medical MLLMs.
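
As a rough aid for choosing between the two sizes, the approximate GPU figures quoted on this page can be turned into a simple lookup. This is a sketch only: `VARIANT_MEMORY_GB` and `pick_variant` are hypothetical names, the figures are the approximate ~8GB and ~19GB values listed above rather than measurements, and real deployments would need extra headroom for long inputs and batching.

```python
# Approximate GPU memory figures quoted on this page (GB); not measured here.
VARIANT_MEMORY_GB = {"Chiron-o1-2B": 8, "Chiron-o1-8B": 19}


def pick_variant(available_gb):
    """Return the largest listed variant that fits, or None if neither fits."""
    fitting = [
        (mem, name)
        for name, mem in VARIANT_MEMORY_GB.items()
        if mem <= available_gb
    ]
    return max(fitting)[1] if fitting else None


print(pick_variant(12))  # Chiron-o1-2B
print(pick_variant(24))  # Chiron-o1-8B
print(pick_variant(6))   # None
```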

Impact

By reframing medical MLLM training around searched, ranked chain-of-thought data rather than direct-answer supervision, Chiron-o1 demonstrates that collaborative search over reasoning paths can yield more generalizable clinical reasoning. Its NeurIPS 2025 acceptance, openly released 2B and 8B weights, and MIT-licensed code make it a practical reference point for the growing class of reasoning-oriented medical foundation models. Key limitations include reliance on proprietary mentor models (GPT-4o, Gemini) to generate training data and licensing restrictions on parts of the underlying image corpus (e.g., Radiopaedia), which constrain full reproduction of the training dataset.

Citation

Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search

Preprint

Sun, H., et al. (2025) Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search.

DOI: 10.48550/arXiv.2506.16962

Recent citations

Papers that recently cited this model.

Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning
Junha Jung, Minbyul Jeong, Suhyeon Lim, et al.
Jun 2026 · Citations: 1 · Influential

EchoSonar-R: A Multi-View Reasoning-Enabled Model for Disease Classification and Report Generation in Echocardiography
Darya Taratynova, Ahmed Aly, Numan Saeed, et al.
Jun 2026 · Citations: 0

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages
T. Halder, Akash Ghosh, Subhadip Baidya, et al.
Jun 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
Jing Hao, Yuci Liang, Lizhuo Lin, et al.
arXiv.org · Nov 2025 · Citations: 10

Think Twice to See More: Iterative Visual Reasoning in Medical VLMs
Kaitao Chen, Shaohao Rui, Yankai Jiang, et al.
arXiv.org · Oct 2025 · Citations: 10 · Influential

MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
Chunzheng Zhu, Jiaqi Zeng, Junyue Jiang, et al.
Apr 2026 · Citations: 9

ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation
Jiarui Jin, Haoyu Wang, Xingliang Wu, et al.
arXiv.org · Feb 2026 · Citations: 4

CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
Akash Ghosh, Tajamul Ashraf, R. Singh, et al.
Mar 2026 · Citations: 2

Citations

Total Citations: 14
Influential: 1
References: 83

GitHub

Stars: 60
Forks: 8
Open Issues: 0
Contributors: 1
Last Push: 9mo ago
Language: Python
License: MIT

HuggingFace

Downloads: 21
Likes: 3
Last Modified: 1y ago
Pipeline: image-text-to-text

Fields of citing research

Computer Science100%
Medicine92%
Engineering8%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Openness score: 67 (Partial)
Usability (can I run it?): 87
Reproducibility (can I retrain it?): 48

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model

