Chinese University of Hong Kong
A multi-modal, multi-task vision foundation model for generalist ophthalmic AI, pretrained on 3.4M images from 560K+ individuals across 8 imaging modalities.
VisionFM is a multi-modal, multi-task vision foundation model built for generalist ophthalmic artificial intelligence. Rather than training a separate narrow model for each eye disease or imaging device, VisionFM learns broadly transferable representations of ocular tissue that can be adapted to a wide spectrum of downstream clinical tasks. It addresses a long-standing bottleneck in ophthalmic AI: most prior systems were single-task and single-modality, requiring large labeled datasets and costly retraining whenever a new disease, modality, or population was introduced.
Developed by the Advanced Biomedical Intelligence Lab (ABILab) at the Chinese University of Hong Kong (CUHK) together with a large collaborating clinical consortium, VisionFM was pretrained on 3.4 million ophthalmic images from 560,457 individuals, spanning a broad range of diseases, imaging devices, and demographics. The work was first released as a preprint in October 2023 and subsequently published in NEJM AI in 2024.
VisionFM sits alongside other retinal and ophthalmic foundation models such as RETFound, EyeFound, and EyeCLIP, but is distinguished by its breadth of imaging modalities and the diversity of clinical tasks it supports from a single pretrained backbone, including screening, diagnosis, prognosis, phenotype subclassification, and systemic biomarker prediction.
VisionFM uses a Vision Transformer (ViT) backbone trained with a DINO-style self-supervised self-distillation objective, with a separate encoder learned for each imaging modality. The pretraining set of 3.4 million images from 560,457 individuals is augmented with synthetic data. For downstream use, task heads are attached to the pretrained encoders and fine-tuned for classification, segmentation, and detection. Across large-scale benchmarks for these three task types, VisionFM outperformed baseline deep neural networks and generalized well to new modalities and previously unseen datasets. The official repository releases modality-specific pretrained weights for all eight modalities, fine-tuning code, fine-tuned weights for eight public multiclass disease-recognition datasets, and synthetic datasets. Public downstream datasets such as IDRiD, OCTID, and DRIVE are supported, and private evaluation data are available under a signed data-use agreement.
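To make the DINO-style self-distillation objective more concrete, the following is a minimal NumPy sketch of the loss as it is commonly formulated for DINO in general: a student network is trained to match the centred, sharpened output distribution of a teacher whose weights are an exponential moving average of the student's. This is not taken from the VisionFM code. The function names, temperatures, and momentum values are illustrative assumptions, and the toy logits stand in for the projection-head outputs of two augmented views of the same image.

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax over the last axis, numerically stabilised.
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    # Teacher targets: centred and sharpened with a low temperature.
    # In a real training loop these are treated as constants (no gradient).
    targets = softmax(teacher_logits - center, teacher_temp)
    # Student log-probabilities at a higher temperature.
    log_probs = np.log(softmax(student_logits, student_temp) + 1e-12)
    # Cross-entropy between teacher targets and student predictions,
    # averaged over the batch.
    return float(-(targets * log_probs).sum(axis=-1).mean())

def ema_update(teacher_weights, student_weights, momentum=0.996):
    # The teacher is not trained directly; its weights track an
    # exponential moving average of the student's weights.
    return momentum * teacher_weights + (1.0 - momentum) * student_weights

def update_center(center, teacher_logits, momentum=0.9):
    # Running mean of teacher outputs, subtracted before the softmax
    # to discourage collapse onto a single output dimension.
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)

# Toy example: one image, three output dimensions (assumed values).
teacher_out = np.array([[2.0, 0.0, 0.0]])
center = np.zeros(3)
agreeing_student = np.array([[2.0, 0.0, 0.0]])
disagreeing_student = np.array([[0.0, 2.0, 0.0]])

loss_agree = dino_loss(agreeing_student, teacher_out, center)
loss_disagree = dino_loss(disagreeing_student, teacher_out, center)
```

In this toy case the loss is lower when the student agrees with the teacher than when it disagrees. Under the per-modality design described above, one such student and teacher pair would presumably be trained for each imaging modality, although the details of how VisionFM organizes this are not covered here.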
VisionFM is designed for clinical and translational ophthalmology workflows where labeled data are scarce or where a unified system must handle many diseases and devices. Practical use cases include automated screening and triage for conditions such as diabetic retinopathy, glaucoma, and age-related macular degeneration; segmentation of retinal vessels and anatomical landmarks; disease prognosis; and the prediction of systemic biomarkers and diseases from ocular images. Researchers benefit from a pretrained backbone that can be fine-tuned on modest labeled datasets, lowering the barrier to building new ophthalmic AI applications.
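As a rough illustration of adapting a pretrained backbone with only a modest labeled set, the sketch below trains a small softmax classification head on top of fixed feature vectors. It is a generic linear-probe example, not the VisionFM fine-tuning code: the random features are a synthetic stand-in for encoder embeddings, and the function names, feature dimension, class count, and hyperparameters are all assumptions chosen for the demonstration.

```python
import numpy as np

def train_linear_head(features, labels, num_classes, lr=0.1, steps=300):
    # Softmax regression trained by full-batch gradient descent.
    # `features` plays the role of frozen encoder outputs, shape (n, d).
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy with respect to the logits.
        grad = (probs - onehot) / n
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    # Predicted class is the one with the largest logit.
    return (features @ W + b).argmax(axis=1)

# Synthetic stand-in for embeddings of 200 labeled images (assumed sizes).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
# Each class is shifted by +/-0.5 in every dimension so that the
# two classes are approximately linearly separable.
features = rng.normal(size=(200, 16)) + (labels[:, None] - 0.5)

W, b = train_linear_head(features, labels, num_classes=2)
accuracy = float((predict(features, W, b) == labels).mean())
```

Only the small head is trained here while the features stay fixed, which is the cheapest form of adaptation. Full fine-tuning, in which the encoder weights are also updated, follows the same pattern with gradients propagated into the backbone.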
By demonstrating that a single self-supervised backbone can generalize across eight imaging modalities and a wide range of clinical tasks, VisionFM helped establish the foundation-model paradigm in ophthalmology. Its publication in NEJM AI and the public release of pretrained weights, fine-tuning code, and synthetic data have made it a reference point for subsequent ophthalmic foundation models and comparative studies. Its main limitations are a research-and-education-only license that precludes commercial use, and reliance on partly private evaluation data, which complicates fully independent reproduction of some reported results.
Qiu, J., et al. (2024). VisionFM: a Multi-Modal Multi-Task Vision Foundation Model for Generalist Ophthalmic Artificial Intelligence. NEJM AI.
DOI: 10.1056/AIoa2300221