EyeFound

The Hong Kong Polytechnic University / Sun Yat-sen University / National University of Singapore / EPFL

Ophthalmic imaging foundation model pretrained on 2.78M images across 11 modalities for diagnosis, prognosis, and visual question answering.

Released: May 2024

EyeFound is a multimodal generalist foundation model for ophthalmic imaging, developed to provide a single pretrained backbone that generalizes across the many imaging types used in eye care. Ophthalmology is unusually multimodal: clinicians routinely combine color fundus photographs, optical coherence tomography (OCT), fluorescein and indocyanine green angiography, ultra-widefield imaging, and several other modalities to diagnose and monitor disease. Most prior medical AI models target a single modality and a single task, limiting their reuse. EyeFound instead learns transferable representations from large volumes of unlabeled, heterogeneous ophthalmic images that can then be adapted, with modest labeled data, to a wide range of downstream applications.

The model was introduced in May 2024 by Danli Shi, Weiyi Zhang, Mingguang He and colleagues, led from the School of Optometry at The Hong Kong Polytechnic University, with collaborators at Sun Yat-sen University (Zhongshan Ophthalmic Center), the National University of Singapore, and EPFL. EyeFound builds directly on the lineage of retina-specific foundation models such as RETFound, which was trained primarily on color fundus and OCT images, by extending self-supervised pretraining across a far broader set of ophthalmic modalities.

By covering 11 imaging modalities in one model, EyeFound aims to serve as a shared starting point for ophthalmic AI development, reducing the need to train bespoke models for every imaging device and clinical question.

Key Features

Multimodal coverage: Pretrained across 11 common ophthalmic imaging modalities, letting a single backbone support tasks that span fundus, OCT, angiography, and other image types rather than one modality at a time.
Self-supervised pretraining: Uses masked image modeling on unlabeled data, so the model learns from large image collections without requiring expert annotations for pretraining.
Generalist downstream adaptation: Fine-tunes efficiently to diverse tasks, including eye disease diagnosis, prediction of systemic disease events, and multimodal visual question answering.
Zero-shot VQA: Supports zero-shot multimodal visual question answering over ophthalmic images, pointing toward interactive, report-style clinical assistance.

Technical Details

EyeFound uses a Masked Autoencoder (MAE) framework for self-supervised pretraining. The encoder is a Vision Transformer at ViT-Large scale (24 transformer blocks, embedding dimension 1,024), paired with a lightweight decoder described as ViT-Small (8 blocks, embedding dimension 512). Pretraining masks roughly 80% of image patches and trains the model to reconstruct them. The pretraining set comprises 2.78 million ophthalmic images from 227 hospitals, spanning 11 imaging modalities; images were preprocessed to 256×256 and augmented to 224×224 crops. Pretraining ran for 50 epochs (15 of them warmup) with a peak learning rate of 1×10⁻³. Downstream adaptation uses parameter-efficient Low-Rank Adaptation (LoRA). In the authors' reported evaluations, EyeFound outperformed RETFound on eye disease diagnosis and systemic disease prediction and showed strong zero-shot multimodal VQA performance.
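To make the masking step concrete, the following is a minimal sketch of MAE-style random patch masking, not the authors' code. It assumes a 16×16 patch size, which this page does not state; the function name `random_patch_mask` and the fixed seed are illustrative only. Under that assumption a 224×224 input yields 196 patches, of which about 39 stay visible to the encoder at an 80% mask ratio.

```python
import numpy as np


def random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.8, seed=0):
    """Return (kept patch indices, binary mask) for MAE-style random masking.

    patch_size=16 is an assumption for illustration; the page gives only the
    224x224 input size and the ~80% mask ratio.
    """
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2            # 14 * 14 = 196
    num_keep = int(num_patches * (1 - mask_ratio)) # visible patches

    # Shuffle patch indices by sorting random noise, then keep the first few.
    rng = np.random.default_rng(seed)
    noise = rng.random(num_patches)
    shuffled = np.argsort(noise)
    keep_ids = np.sort(shuffled[:num_keep])

    # mask[i] == 1 means patch i is hidden and must be reconstructed.
    mask = np.ones(num_patches, dtype=int)
    mask[keep_ids] = 0
    return keep_ids, mask


keep_ids, mask = random_patch_mask()
print(len(keep_ids), int(mask.sum()))  # 39 visible, 157 masked
```

Only the visible patches are passed through the large encoder; the small decoder receives their embeddings plus placeholder tokens for the masked positions, which is what keeps pretraining at a high mask ratio comparatively cheap.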

Applications

EyeFound is intended as a reusable backbone for ophthalmic AI research and clinical decision support. Researchers can adapt it to classify and grade conditions such as diabetic retinopathy, glaucoma, and age-related macular degeneration; to predict the incidence of systemic diseases from retinal images (oculomics); and to build question-answering tools that interpret multimodal eye scans. Because it covers many modalities, it is especially useful in settings that operate heterogeneous imaging equipment, and it lowers the labeling burden for groups developing new diagnostic models on limited annotated cohorts.
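The adaptation route named in Technical Details is LoRA. As a rough illustration of why it lowers the cost of adapting a ViT-Large-scale backbone, here is a minimal NumPy sketch of a single LoRA-augmented linear layer. The class name `LoRALinear`, the rank of 8, and the scaling `alpha` of 16 are assumptions chosen for illustration; the page does not report the values used for EyeFound.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen, (out, in)
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))    # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Frozen path plus low-rank path; x has shape (batch, in_dim).
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


dim = 1024  # embedding dimension of the ViT-Large-scale encoder
W = np.random.default_rng(1).normal(size=(dim, dim))
x = np.random.default_rng(2).normal(size=(4, dim))
layer = LoRALinear(W)

print(layer.trainable_params(), W.size)  # 16384 trainable vs 1048576 frozen
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the two small matrices are updated during fine-tuning. For one 1,024×1,024 projection at rank 8, that is about 1.6% of the frozen weight count.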

Impact

EyeFound contributes to the rapid expansion of medical imaging foundation models by demonstrating that a single self-supervised model can span the breadth of ophthalmic modalities rather than specializing in one. By extending the RETFound concept to 11 modalities and adding multimodal visual question answering, it helps chart a path toward generalist ophthalmic AI assistants. Because it is a preprint, its results still await peer review and independent external validation, and the availability of code and weights should be confirmed before any clinical or production use. Even so, it is a notable reference point in the emerging landscape of multimodal medical foundation models.

Citation

EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging

Preprint

Shi, D., et al. (2024) EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging. arXiv.org.

DOI: 10.48550/arXiv.2405.11338

Recent citations

Papers that recently cited this model.

From generalization to precision: A large domain-specific pretrained model for specialized medical tasks.
Zhongwen Li, Yangyang Wang, Lei Wang, et al.
Cell Reports Medicine · Jul 2026 · 0 citations

EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP–OCT Pretraining
Zhuo Deng, Ruiheng Zhang, Ziheng Zhang, et al.
Jun 2026 · 0 citations

A gated task-attentive multi-task network for unified retinal image analysis
M. Sajid, Imran Qureshi, Muhammad Fareed Hamid, et al.
Scientific Reports · May 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

A multimodal visual–language foundation model for computational ophthalmology
Danli Shi, Weiyi Zhang, Jianchen Yang, et al.
npj Digital Medicine · Jun 2025 · 48 citations

Medical Multimodal Foundation Models in Clinical Diagnosis and Treatment: Applications, Challenges, and Future Directions
Kai Sun, Siyan Xue, Fuchun Sun, et al.
Artif. Intell. Medicine · Dec 2024 · 39 citations

Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
Yexin Liu, Zhengyang Liang, Yueze Wang, et al.
Computer Vision and Pattern Recognition · Jun 2024 · 29 citations

EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis
Danli Shi, Weiyi Zhang, Jianchen Yang, et al.
arXiv.org · Sep 2024 · 26 citations

Artificial Intelligence for Optical Coherence Tomography in Glaucoma
Mak B. Djulbegovic, Henry Bair, D. T. Gonzalez, et al.
Translational Vision Science & Technology · Jan 2025 · 20 citations

Citations

Total citations: 43

Influential: 4

References: 17

Fields of citing research

Share of papers citing this model.

Medicine: 95%
Computer Science: 90%
Engineering: 43%
Biology: 5%
Physics: 2%

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

4 · Closed

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 0 (not reproducible)

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

Research Paper
