The Hong Kong Polytechnic University / EPFL / Clemson University / Zhejiang University School of Medicine / Shanghai Jiao Tong University / Monash University
A CLIP-based visual-language foundation model for multi-modal ophthalmic imaging, enabling zero-shot disease detection across 11 modalities including fundus, OCT, and slit-lamp.
EyeCLIP is a visual-language foundation model for computational ophthalmology that learns shared representations across many distinct eye-imaging modalities from largely unlabeled clinical data. Ophthalmic care relies on a heterogeneous mix of imaging — color fundus photography, OCT, angiography, slit-lamp, and more — but most prior deep-learning models are trained for a single modality and a single disease, limiting their utility for the long tail of rare conditions and for systemic diseases that manifest in the eye. EyeCLIP addresses this by adapting the contrastive language-image pretraining (CLIP) paradigm to the multi-modal ophthalmic setting, where paired clinical text is often sparse.
The model was developed by researchers at The Hong Kong Polytechnic University (School of Optometry and Research Centre for SHARP Vision) together with collaborators at EPFL, Clemson University, Zhejiang University, Shanghai Jiao Tong University, Monash University, and several eye hospitals. It was first released as a preprint in September 2024 and published in npj Digital Medicine in June 2025.
By combining self-supervised image reconstruction with both image-image and image-text contrastive objectives, EyeCLIP builds a unified embedding space that supports zero-shot and few-shot disease recognition, visual question answering, and cross-modal retrieval — tasks that are difficult for conventional single-modality classifiers.
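The combination of objectives named above can be sketched as a weighted sum of two contrastive terms and one reconstruction term. The following is a minimal illustration in plain Python, not the authors' implementation: the function names, the loss weights (`w_ii`, `w_it`, `w_rec`), the temperature of 0.07, and the choice of a symmetric InfoNCE loss and a masked mean-squared error are all assumptions made for this example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE: row i of `anchors` is the positive for row i of `positives`."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    n = len(a)
    # Cosine-similarity logits, scaled by the (assumed) temperature.
    logits = [[sum(x * y for x, y in zip(a[i], p[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only (mask[i] truthy)."""
    errs = [sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
            for p, t, m in zip(pred, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0

def joint_loss(img_a, img_b, img_for_text, txt, recon, target, mask,
               w_ii=1.0, w_it=1.0, w_rec=1.0):
    """Weighted sum of image-image, image-text, and reconstruction terms.

    The weights are hypothetical; the source does not state how the terms
    are balanced.
    """
    return (w_ii * info_nce(img_a, img_b)
            + w_it * info_nce(img_for_text, txt)
            + w_rec * masked_mse(recon, target, mask))
```

In this sketch the image-text term would be computed only for the subset of images that have a paired report, while the image-image and reconstruction terms can use every image, which is one way sparse text supervision could be accommodated.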
EyeCLIP extends the CLIP framework: alongside a vision-transformer image encoder and a text encoder, it adds an image decoder following the Masked Autoencoders (MAE) design, so that masked image reconstruction is optimized jointly with the two contrastive losses. It was pretrained on 2,777,593 multi-modal ophthalmic images and 11,180 clinical reports drawn from 128,554 patients. Only a minority of images in this corpus have paired text, which motivates the self-supervised component. Across 14 benchmark datasets, EyeCLIP reached state-of-the-art results in disease classification, visual question answering, and cross-modal retrieval. Reported zero-shot AUROCs on color fundus photography are roughly 0.68–0.76 for diabetic retinopathy and glaucoma, OCT classification reaches an AUROC of up to 0.80, and text-to-image retrieval on the Retina Image Bank achieves a mean recall of around 50%. Systemic-disease prediction from UK Biobank retinal images was also demonstrated.
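For readers unfamiliar with retrieval metrics, the recall figure quoted above is typically derived from recall@K. The sketch below shows that computation on a toy similarity matrix. It assumes one matching image per text query (at the same index) and does not reproduce the paper's exact evaluation protocol; in particular, which values of K are averaged into "mean recall" is not stated here.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose matching image is ranked in the top k.

    similarity[i][j] is the score between text query i and image j;
    the ground-truth match for query i is assumed to be image i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank image indices from most to least similar to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy scores for three text queries against three images.
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0: correct image ranked first
    [0.8, 0.5, 0.1],  # query 1: correct image ranked second
    [0.2, 0.3, 0.1],  # query 2: correct image ranked last
]
recalls = [recall_at_k(toy_similarity, k) for k in (1, 2, 3)]
mean_recall = sum(recalls) / len(recalls)
```

Image-to-text retrieval can be scored with the same function by transposing the similarity matrix.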
EyeCLIP is intended as a general-purpose backbone for ophthalmic AI. Because it works in zero- and few-shot regimes, clinicians and researchers can apply it to screening and triage tasks where labeled training data is scarce — including rare diseases and modalities underrepresented in public datasets. Its unified embedding space supports building diagnostic classifiers with minimal fine-tuning, answering clinical questions about images, and retrieving similar cases for education or second opinions. The demonstrated ability to flag systemic conditions from retinal imaging also points to applications in population-level health screening.
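Zero-shot use of a CLIP-style model generally reduces to comparing one image embedding against the embeddings of a few text prompts. The sketch below illustrates that step with hand-written toy vectors standing in for encoder outputs. The prompt wording, the temperature value, and the function names are assumptions for illustration; loading the released EyeCLIP weights is deliberately not shown, because the repository's API is not described in this text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_probs(image_emb, prompt_embs, temperature=0.01):
    """Softmax over temperature-scaled similarities to each class prompt.

    prompt_embs maps a class label to the embedding of its text prompt.
    The temperature is a hypothetical value, not taken from the paper.
    """
    labels = list(prompt_embs)
    logits = [cosine(image_emb, prompt_embs[label]) / temperature
              for label in labels]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {label: e / z for label, e in zip(labels, exps)}

# Toy stand-ins for encoder outputs (real embeddings have many more dimensions).
image_embedding = [0.9, 0.1]
prompt_embeddings = {
    "diabetic retinopathy": [1.0, 0.0],
    "no diabetic retinopathy": [0.0, 1.0],
}
probs = zero_shot_probs(image_embedding, prompt_embeddings)
prediction = max(probs, key=probs.get)
```

Because no classifier is trained, adding a new condition amounts to adding another prompt to the dictionary, which is what makes the approach attractive when labeled data is scarce.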
EyeCLIP is among the first vision-language foundation models to span the full breadth of ophthalmic imaging rather than a single modality, and its publication in npj Digital Medicine reflects growing interest in multi-modal medical foundation models. By releasing code and pretrained weights, the authors lower the barrier for downstream groups to fine-tune for specific clinical tasks, situating EyeCLIP alongside contemporaries such as EyeFound and VisionFM in the emerging ophthalmic-foundation-model landscape. Its main limitations are the modest absolute zero-shot accuracy on some tasks and the sparsity of paired text, which constrains the language side of the model and leaves room for improvement as larger annotated ophthalmic corpora become available.
Shi, D., et al. (2025) A multimodal visual–language foundation model for computational ophthalmology. npj Digital Medicine.
DOI: 10.1038/s41746-025-01772-2