The Hong Kong Polytechnic University / Sun Yat-sen University / National University of Singapore / EPFL
A multimodal generalist foundation model for ophthalmic imaging, self-supervised on 2.78M images across 11 modalities for diagnosis, prognosis, and visual question answering.
EyeFound is a multimodal generalist foundation model for ophthalmic imaging, developed to provide a single pretrained backbone that generalizes across the many imaging types used in eye care. Ophthalmology is unusually multimodal: clinicians routinely combine color fundus photographs, optical coherence tomography (OCT), fluorescein and indocyanine green angiography, ultra-widefield imaging, and several other modalities to diagnose and monitor disease. Most prior medical AI models target a single modality and a single task, limiting their reuse. EyeFound instead learns transferable representations from large volumes of unlabeled, heterogeneous ophthalmic images that can then be adapted, with modest labeled data, to a wide range of downstream applications.
The model was introduced in May 2024 by Danli Shi, Weiyi Zhang, Mingguang He and colleagues, led from the School of Optometry at The Hong Kong Polytechnic University, with collaborators at Sun Yat-sen University (Zhongshan Ophthalmic Center), the National University of Singapore, and EPFL. EyeFound builds directly on the lineage of retina-specific foundation models such as RETFound, which was trained primarily on color fundus and OCT images, by extending self-supervised pretraining across a far broader set of ophthalmic modalities.
By covering 11 imaging modalities in one model, EyeFound aims to serve as a shared starting point for ophthalmic AI development, reducing the need to train bespoke models for every imaging device and clinical question.
EyeFound uses a Masked Autoencoder (MAE) framework for self-supervised pretraining. The encoder is a Vision Transformer of ViT-Large scale (24 transformer blocks, embedding dimension 1,024) paired with a lightweight ViT-Small decoder (8 blocks, embedding dimension 512); during pretraining, roughly 80% of image patches are masked and the model learns to reconstruct them from the remaining visible patches. The model was trained on 2.78 million retinal and other ophthalmic images drawn from 227 hospitals and spanning 11 imaging modalities; images were preprocessed to 256×256 and augmented into 224×224 crops. Pretraining ran for 50 epochs, 15 of them warmup, with a peak learning rate of 1×10⁻³. Downstream adaptation uses parameter-efficient Low-Rank Adaptation (LoRA). Across reported evaluations, EyeFound outperformed RETFound on eye disease diagnosis and systemic disease prediction and demonstrated strong zero-shot multimodal VQA performance.
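To make the pretraining numbers concrete, the following is a minimal sketch of the patch-masking arithmetic and a warmup learning-rate schedule. Several details are assumptions not stated in this summary: the 16×16 patch size (the usual ViT-Large/16 choice), uniform random masking, and the cosine decay after warmup (borrowed from the common MAE recipe). The function names are hypothetical and are not taken from the EyeFound code.

```python
import math
import random


def num_patches(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image.

    patch_size=16 is an assumption; the summary does not state it.
    """
    per_side = image_size // patch_size
    return per_side * per_side


def random_mask(n_patches: int, mask_ratio: float = 0.8, seed: int = 0):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    n_keep = int(n_patches * (1 - mask_ratio))
    order = list(range(n_patches))
    random.Random(seed).shuffle(order)
    visible = sorted(order[:n_keep])   # patches fed to the encoder
    masked = sorted(order[n_keep:])    # patches the decoder must reconstruct
    return visible, masked


def lr_at_epoch(epoch: float,
                peak_lr: float = 1e-3,
                warmup_epochs: int = 15,
                total_epochs: int = 50) -> float:
    """Linear warmup to peak_lr, then half-cycle cosine decay (assumed shape)."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    n = num_patches()
    visible, masked = random_mask(n)
    print(n, len(visible), len(masked))
    for epoch in (0, 5, 15, 30, 50):
        print(epoch, lr_at_epoch(epoch))
```

Under these assumptions a 224×224 crop yields 196 patches, of which 39 stay visible and 157 are masked, so the ViT-Large encoder processes only about a fifth of the tokens per image during pretraining.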
EyeFound is intended as a reusable backbone for ophthalmic AI research and clinical decision support. Researchers can adapt it to classify and grade conditions such as diabetic retinopathy, glaucoma, and age-related macular degeneration; to predict the incidence of systemic diseases from retinal images (oculomics); and to build question-answering tools that interpret multimodal eye scans. Because it covers many modalities, it is especially useful in settings that operate heterogeneous imaging equipment, and it lowers the labeling burden for groups developing new diagnostic models on limited annotated cohorts.
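The reduced labeling and training burden comes largely from LoRA, the adaptation method named in the pretraining summary: the pretrained weights stay frozen and only a small low-rank update is trained. Below is a minimal pure-Python sketch of that idea. The rank, the scaling factor `alpha`, the initialization, and the choice of which layers receive adapters are assumptions for illustration; `LoRALinear` and `lora_param_counts` are hypothetical names, not part of any EyeFound release.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen pretrained weight, d_out x d_in
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the adapted layer
        # initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, rank):
    """Return (frozen full-matrix parameters, trainable LoRA parameters)."""
    return d_in * d_out, rank * (d_in + d_out)


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
    print(layer.forward([1.0, 1.0]))
    # 1,024 is the encoder embedding dimension from the summary; rank 8 is assumed.
    print(lora_param_counts(1024, 1024, 8))
```

For a single 1,024×1,024 projection, a rank-8 adapter trains 16,384 parameters against 1,048,576 frozen ones, roughly 1.6% of the full matrix, which is why small annotated cohorts can be enough for fine-tuning.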
EyeFound contributes to the rapid expansion of medical imaging foundation models by demonstrating that a single self-supervised model can span the breadth of ophthalmic modalities rather than specializing in one. By extending the RETFound concept to 11 modalities and adding multimodal visual question answering, it helps chart a path toward generalist ophthalmic AI assistants. As a preprint, its results await peer review and independent external validation, and the availability of code and weights should be confirmed before clinical or production use. Nonetheless, it is a notable reference point in the emerging landscape of multimodal medical foundation models.
Shi, D., et al. (2024) EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging. arXiv.org.
DOI: 10.48550/arXiv.2405.11338