VisionFM

Multi-modal ophthalmic foundation model for generalist eye AI, spanning fundus imaging and OCT for disease screening, segmentation, and biomarkers.

Released: October 2023

VisionFM is a multi-modal, multi-task vision foundation model built for generalist ophthalmic artificial intelligence. Rather than training a separate narrow model for each eye disease or imaging device, VisionFM learns broadly transferable representations of ocular tissue that can be adapted to a wide spectrum of downstream clinical tasks. It addresses a long-standing bottleneck in ophthalmic AI: most prior systems were single-task and single-modality, requiring large labeled datasets and costly retraining whenever a new disease, modality, or population was introduced.

Developed by the Advanced Biomedical Intelligence Lab (ABILab) at the Chinese University of Hong Kong (CUHK) together with a large collaborating clinical consortium, VisionFM was pretrained on 3.4 million ophthalmic images from 560,457 individuals, spanning a broad range of diseases, imaging devices, and demographics. The work was first released as a preprint in October 2023 and subsequently published in NEJM AI in 2024.

VisionFM sits alongside other retinal and ophthalmic foundation models such as RETFound, EyeFound, and EyeCLIP, but is distinguished by its breadth of imaging modalities and the diversity of clinical tasks it supports from a single pretrained backbone, including screening, diagnosis, prognosis, phenotype subclassification, and systemic biomarker prediction.

Key Features

Multi-modal coverage: VisionFM provides modality-specific encoders for eight ophthalmic imaging types: color fundus photography, optical coherence tomography (OCT), fundus fluorescein angiography (FFA), slit-lamp imaging, B-scan ultrasound, external eye imaging, MRI, and ultrasound biomicroscopy (UBM).
Multi-task generalization: A single pretrained foundation supports disease screening and diagnosis, prognosis, disease-phenotype subclassification, segmentation, landmark detection, and systemic biomarker and disease prediction.
Self-supervised pretraining: The model is pretrained without disease labels using a self-distillation approach, enabling it to learn from large unlabeled image corpora and transfer efficiently to labeled downstream tasks.
Synthetic data augmentation: The pretraining corpus is supplemented with generative synthetic ophthalmic images that passed visual Turing tests with practicing ophthalmologists, expanding data diversity.
Expert-level diagnosis: On 12 common eye diseases, VisionFM outperformed ophthalmologists at basic and intermediate experience levels in reported evaluations.

Technical Details

VisionFM uses a Vision Transformer (ViT) backbone trained with a DINO-style self-supervised self-distillation objective, with a separate encoder learned per imaging modality. The pretraining set comprises 3.4 million images from 560,457 individuals, augmented with synthetic data. Downstream task heads are attached and fine-tuned for classification, segmentation, and detection.

Across large-scale benchmarks for diagnosis, segmentation, and detection, VisionFM outperformed baseline deep neural networks and demonstrated strong generalization to new modalities and previously unseen datasets.

The official repository releases modality-specific pretrained weights for all eight modalities, fine-tuning code, fine-tuned weights on eight public multiclass disease-recognition datasets, and synthetic datasets. Public downstream datasets such as IDRiD, OCTID, and DRIVE are supported, and private evaluation data are available under a signed data-use agreement.
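To make the DINO-style self-distillation objective mentioned above more concrete, the following is a minimal pure-Python sketch of the general technique, not code from the VisionFM repository. The function names (`dino_loss`, `ema_update`, `update_center`), the temperatures, and the momentum values are illustrative assumptions; the actual VisionFM training configuration may differ. The idea is that a student network is trained to match a centered, sharpened teacher distribution, while the teacher's weights track the student through an exponential moving average.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy from the teacher distribution to the student distribution.

    The teacher logits are centered and sharpened with a low temperature.
    All default values here are assumptions chosen for illustration.
    """
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move teacher parameters toward the student with an exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def update_center(center, teacher_batch, momentum=0.9):
    """Update the running center from the mean teacher output over a batch."""
    batch_mean = [sum(column) / len(teacher_batch) for column in zip(*teacher_batch)]
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]


if __name__ == "__main__":
    center = [0.0, 0.0, 0.0]
    teacher = [1.0, 2.0, 3.0]
    # A student that agrees with the teacher incurs a lower loss than one that disagrees.
    print(dino_loss([1.0, 2.0, 3.0], teacher, center))
    print(dino_loss([3.0, 2.0, 1.0], teacher, center))
```

In a real pipeline the logits would come from two augmented views of the same unlabeled image passed through the student and teacher ViTs, and gradients would flow only through the student. The lists of floats here stand in for those tensors.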

Applications

VisionFM is designed for clinical and translational ophthalmology workflows where labeled data are scarce or where a unified system must handle many diseases and devices. Practical use cases include automated screening and triage for conditions such as diabetic retinopathy, glaucoma, and age-related macular degeneration; segmentation of retinal vessels and anatomical landmarks; disease prognosis; and the prediction of systemic biomarkers and diseases from ocular images. Researchers benefit from a pretrained backbone that can be fine-tuned on modest labeled datasets, lowering the barrier to building new ophthalmic AI applications.

Impact

By demonstrating that a single self-supervised backbone can generalize across eight imaging modalities and a wide range of clinical tasks, VisionFM helped establish the foundation-model paradigm in ophthalmology. Its publication in NEJM AI and the public release of pretrained weights, fine-tuning code, and synthetic data have made it a reference point for subsequent ophthalmic foundation models and comparative studies. Its main limitations are a research-and-education-only license that precludes commercial use, and reliance on partly private evaluation data, which complicates fully independent reproduction of some reported results.

Citation

VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence

Qiu, J., et al. (2024) VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence. NEJM AI.

DOI: 10.1056/AIoa2300221

Recent citations

Papers that recently cited this model.

IRIS: An Intelligent Vision-Language System for Ocular Surface Diseases via Topic Tree and Scene-Driven VQA Generation
Hao Wei, Wenjin Qi, Dasen Dai, et al.
Jul 2026
Citations: 0
A mixture of ophthalmic foundation models enables mitigation of altitude-associated domain shift
Peilun Shi, Jinjing Zhu, Yuan Xie, et al.
Science Bulletin · Jul 2026
Citations: 0
Fundus-based detection of ischemic stroke using an ophthalmology foundation models
Kai Wu, Yanting Qu, Dongmei Hao, et al.
International Conference on Biomedical Engineering and Medical Devices · Jun 2026
Citations: 0 · Influential

Top citations

The most-cited papers that cite this model.

Dietary Assessment With Multimodal ChatGPT: A Systematic Analysis
Frank P.-W. Lo, Jianing Qiu, Zeyu Wang, et al.
IEEE journal of biomedical and health informatics · Dec 2023
Citations: 49
MEDCO: Medical Education Copilots Based on A Multi-Agent Framework
Hao Wei, Jianing Qiu, Haibao Yu, et al.
ECCV Workshops · Aug 2024
Citations: 46
Data-Centric Foundation Models in Computational Healthcare: A Survey
Yunkun Zhang, Jin Gao, Zheling Tan, et al.
ACM Computing Surveys · Jan 2024
Citations: 42
Foundation models in ophthalmology
Mark A. Chia, F. Antaki, Yukun Zhou, et al.
British Journal of Ophthalmology · Jun 2024
Citations: 41
Visual–language foundation models in medicine
Chunyu Liu, Yixiao Jin, Zhouyu Guan, et al.
The Visual Computer · Jul 2024
Citations: 29

Citations

Total Citations: 58

Influential: 6

References: 48

GitHub

Stars: 129

Forks: 22

Open Issues: 6

Contributors: 1

Last Push: 1y ago

Language: Python

Fields of citing research

Medicine: 95%
Computer Science: 81%
Engineering: 36%
Education: 3%
Agricultural and Food Sciences: 2%
Economics: 2%
Linguistics: 2%
Environmental Science: 2%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 13 (Closed)

Usability (can I run it?): 14

Reproducibility (can I retrain it?): 11

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository · Research Paper · Research Paper
