EyeCLIP

The Hong Kong Polytechnic University / EPFL / Clemson University / Zhejiang University School of Medicine / Shanghai Jiao Tong University / Monash University

CLIP-based vision-language foundation model for eye imaging, enabling zero-shot disease detection and cross-modal retrieval across 11 modalities.

Released: June 2025

EyeCLIP is a visual-language foundation model for computational ophthalmology that learns shared representations across many distinct eye-imaging modalities from largely unlabeled clinical data. Ophthalmic care relies on a heterogeneous mix of imaging — color fundus photography, OCT, angiography, slit-lamp, and more — but most prior deep-learning models are trained for a single modality and a single disease, limiting their utility for the long tail of rare conditions and for systemic diseases that manifest in the eye. EyeCLIP addresses this by adapting the contrastive language-image pretraining (CLIP) paradigm to the multi-modal ophthalmic setting, where paired clinical text is often sparse.

The model was developed by researchers at The Hong Kong Polytechnic University (School of Optometry and Research Centre for SHARP Vision) together with collaborators at EPFL, Clemson University, Zhejiang University, Shanghai Jiao Tong University, Monash University, and several eye hospitals. It was first released as a preprint in September 2024 and published in npj Digital Medicine in June 2025.

By combining self-supervised image reconstruction with both image-image and image-text contrastive objectives, EyeCLIP builds a unified embedding space that supports zero-shot and few-shot disease recognition, visual question answering, and cross-modal retrieval — tasks that are difficult for conventional single-modality classifiers.

Key Features

Eleven imaging modalities: Trained across color fundus photography, FFA, ICGA, OCT, fundus autofluorescence, slit-lamp, scanning laser ophthalmoscopy, B-scan ultrasound, and more, learning a shared cross-modal representation.
Hybrid pretraining objective: Combines masked image reconstruction (an MAE-style decoder) with multi-modal image contrastive learning and image-text contrastive learning, allowing it to leverage the large fraction of images that lack paired text.
Zero- and few-shot disease detection: Recognizes conditions including diabetic retinopathy, glaucoma, and age-related macular degeneration without task-specific labels, helping address rare and long-tail diagnoses.
Systemic disease and cross-modal retrieval: Predicts systemic conditions from retinal images and retrieves relevant cases via text-to-image and image-to-image search.
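The zero-shot detection described above can be illustrated with a minimal CLIP-style sketch: an image embedding is compared with the embeddings of one text prompt per candidate diagnosis, and the closest prompt wins. This is not EyeCLIP's actual API. The function names, the toy 3-dimensional embeddings, and the temperature value are all assumptions made for illustration; in practice the embeddings would come from the model's image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, prompt_embs, temperature=0.07):
    """Rank candidate labels by image-text similarity.

    image_emb:   embedding of one image (list of floats)
    prompt_embs: dict mapping a label to the embedding of its text prompt
    temperature: hypothetical scaling value, not taken from the paper
    Returns a list of (label, probability) pairs, most likely first.
    """
    image = l2_normalize(image_emb)
    labels = list(prompt_embs)
    logits = []
    for label in labels:
        text = l2_normalize(prompt_embs[label])
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(cosine / temperature)
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Toy embeddings standing in for real encoder outputs (assumed values).
toy_prompts = {
    "diabetic retinopathy": [0.9, 0.1, 0.0],
    "glaucoma": [0.1, 0.9, 0.1],
    "normal fundus": [0.0, 0.2, 0.9],
}
toy_image = [0.8, 0.3, 0.1]

ranking = zero_shot_classify(toy_image, toy_prompts)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

No labeled training images are involved: adding a new candidate condition only requires adding another text prompt, which is why this setup suits rare and long-tail diagnoses.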

Technical Details

EyeCLIP extends the CLIP framework with a vision-transformer image encoder, a text encoder, and an added image decoder following the Masked Autoencoders (MAE) design, so that masked image reconstruction is optimized jointly with the two contrastive losses. It was pretrained on 2,777,593 multi-modal ophthalmic images and 11,180 clinical reports drawn from 128,554 patients. Only a minority of images in this corpus have paired text, which motivates the self-supervised component. Across 14 benchmark datasets, EyeCLIP reached state-of-the-art results in disease classification, visual question answering, and cross-modal retrieval. Reported zero-shot AUROCs on color fundus photography are roughly 0.68–0.76 for diabetic retinopathy and glaucoma, OCT classification reaches an AUROC of up to 0.80, and text-to-image retrieval on the Retina Image Bank achieves a mean recall of around 50%. Systemic-disease prediction from UK Biobank retinal images was also demonstrated.
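A rough sketch of how a reconstruction term and two contrastive terms might be combined into one training objective is given here. It is an illustration under stated assumptions, not the authors' implementation: the equal loss weights, the temperature, the symmetric InfoNCE form, and the pairing of images (for example, two modalities from the same patient) are assumptions, and all function names are hypothetical. Computing the reconstruction error only on masked patches follows the original MAE design.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i]."""
    queries = [normalize(q) for q in queries]
    keys = [normalize(k) for k in keys]
    loss = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, key) / temperature for key in keys]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_sum_exp - logits[i]
    return loss / len(queries)

def symmetric_contrastive(a, b, temperature=0.07):
    """Average the a-to-b and b-to-a InfoNCE losses, as in CLIP."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def masked_mse(patches, recon, mask):
    """Mean squared error over masked patches only.

    mask[i] is True where patch i was hidden from the encoder;
    at least one patch must be masked.
    """
    errors = [
        sum((t - r) ** 2 for t, r in zip(target, pred)) / len(target)
        for target, pred, hidden in zip(patches, recon, mask)
        if hidden
    ]
    return sum(errors) / len(errors)

def hybrid_loss(img_emb, paired_img_emb, txt_emb, patches, recon, mask,
                w_rec=1.0, w_img=1.0, w_txt=1.0):
    """Weighted sum of the three terms; the weights are assumed, not reported."""
    rec = masked_mse(patches, recon, mask)
    img = symmetric_contrastive(img_emb, paired_img_emb)  # image-image term
    txt = symmetric_contrastive(img_emb, txt_emb)         # image-text term
    return w_rec * rec + w_img * img + w_txt * txt

# Toy batch of two samples with 2-dimensional embeddings (assumed values).
img = [[1.0, 0.0], [0.0, 1.0]]
paired = [[0.9, 0.1], [0.2, 0.8]]
txt = [[0.8, 0.2], [0.1, 0.9]]
patches = [[1.0, 1.0], [2.0, 2.0]]
recon = [[0.0, 0.0], [2.0, 2.0]]
mask = [True, False]

total = hybrid_loss(img, paired, txt, patches, recon, mask)
print(f"hybrid loss: {total:.4f}")
```

In a setup like this, images without a paired report could still contribute through the reconstruction and image-image terms, with the image-text term applied only where text exists.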

Applications

EyeCLIP is intended as a general-purpose backbone for ophthalmic AI. Because it works in zero- and few-shot regimes, clinicians and researchers can apply it to screening and triage tasks where labeled training data is scarce — including rare diseases and modalities underrepresented in public datasets. Its unified embedding space supports building diagnostic classifiers with minimal fine-tuning, answering clinical questions about images, and retrieving similar cases for education or second opinions. The demonstrated ability to flag systemic conditions from retinal imaging also points to applications in population-level health screening.
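Retrieving similar cases from a shared embedding space can be sketched as a nearest-neighbour search scored with recall@K. The case identifiers, the toy embeddings, and the function names here are hypothetical, and the sketch does not reproduce the evaluation protocol of the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, gallery, k=3):
    """Return the ids of the k gallery items most similar to the query.

    gallery: list of (case_id, embedding) pairs
    """
    ranked = sorted(gallery, key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [case_id for case_id, _ in ranked[:k]]

def recall_at_k(queries, gallery, k):
    """Fraction of queries whose relevant case appears in the top k.

    queries: list of (relevant_case_id, query_embedding) pairs
    """
    hits = sum(1 for target, emb in queries
               if target in retrieve(emb, gallery, k))
    return hits / len(queries)

# Toy gallery and queries; embeddings are assumed stand-ins for encoder outputs.
gallery = [
    ("case-a", [1.0, 0.0]),
    ("case-b", [0.0, 1.0]),
    ("case-c", [0.7, 0.7]),
]
queries = [
    ("case-a", [0.9, 0.1]),  # text or image query whose match is case-a
    ("case-b", [0.6, 0.8]),  # closest to case-c, so it misses at k=1
]

print(retrieve(queries[0][1], gallery, k=2))
print(recall_at_k(queries, gallery, k=1))
print(recall_at_k(queries, gallery, k=2))
```

Because text and images share one embedding space, the same routine would serve both text-to-image and image-to-image search; only the encoder that produces the query embedding changes.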

Impact

EyeCLIP is among the first vision-language foundation models to span the full breadth of ophthalmic imaging rather than a single modality, and its publication in npj Digital Medicine reflects growing interest in multi-modal medical foundation models. By releasing code and pretrained weights, the authors lower the barrier for downstream groups to fine-tune for specific clinical tasks, situating EyeCLIP alongside contemporaries such as EyeFound and VisionFM in the emerging ophthalmic-foundation-model landscape. Its main limitations are the modest absolute zero-shot accuracy on some tasks and the sparsity of paired text, which constrains the language side of the model and leaves room for improvement as larger annotated ophthalmic corpora become available.

Citation

Shi, D., et al. (2025) A multimodal visual–language foundation model for computational ophthalmology. npj Digital Medicine.

DOI: 10.1038/s41746-025-01772-2

Recent citations

Papers that recently cited this model.

Beyond Metadata: CAPRA for Hidden Subgroup Analysis under Missing Metadata in Medical Imaging
Yawen Li, Yan Li, Zhe Xue, et al.
Jul 2026 · 0 citations
IRIS: An Intelligent Vision-Language System for Ocular Surface Diseases via Topic Tree and Scene-Driven VQA Generation
Hao Wei, Wenjin Qi, Dasen Dai, et al.
Jul 2026 · 0 citations
From generalization to precision: A large domain-specific pretrained model for specialized medical tasks.
Zhongwen Li, Yangyang Wang, Lei Wang, et al.
Cell Reports Medicine · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery
Guan-Feng Wang, Long Bai, Junyi Wang, et al.
Medical Image Anal. · Jan 2025 · 42 citations
EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model
Sijing Li, Tianwei Lin, Lingshuai Lin, et al.
ACM Multimedia · Apr 2025 · 21 citations
Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis
José Morano, Botond Fazekas, Emese Sukei, et al.
npj Digital Medicine · Jun 2025 · 16 citations · Influential
ViLReF: An Expert Knowledge Enabled Vision-Language Retinal Foundation Model
Shengzhu Yang, Jiawei Du, Jia Guo, et al.
Aug 2024 · 9 citations
From visual question answering to intelligent AI agents in ophthalmology
Xiaolan Chen, Ruoyu Chen, Pusheng Xu, et al.
British Journal of Ophthalmology · Aug 2025 · 8 citations

Citations

Total Citations: 65
Influential: 4
References: 58

GitHub

Stars: 87
Forks: 13
Open Issues: 2
Contributors: 1
Last Push: 4mo ago
Language: Python

Fields of citing research

Medicine: 98%
Computer Science: 89%
Engineering: 18%
Mathematics: 2%
Biology: 2%
Agricultural and Food Sciences: 2%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

15 · Closed

Usability (can I run it?): 16

Reproducibility (can I retrain it?): 15

Model Openness Framework: Unclassified · Restrictive license on core components

Resources

GitHub Repository · Research Paper · Official Website
