EchoCLIP

Cedars-Sinai Medical Center / UCLA / UCSF

CLIP-style vision-language model for echocardiography, pretrained on echo videos and cardiologist reports for zero-shot cardiac interpretation.

Released: May 2024

EchoCLIP is a vision-language foundation model for echocardiogram interpretation developed by Matthew Christensen, Milos Vukadinovic, Neal Yuan, and David Ouyang at Cedars-Sinai Medical Center, with collaborators at UCLA and UCSF. Published in Nature Medicine in May 2024, it addresses a central challenge in cardiac ultrasound: building a single model that generalizes across the many distinct interpretation tasks clinicians perform, rather than training a separate supervised network for each measurement or finding.

Echocardiography is the most common cardiac imaging modality, but conventional deep learning models for it are narrow, each trained end-to-end for one task such as ejection fraction regression or valve detection. EchoCLIP instead applies the contrastive language-image pretraining (CLIP) paradigm to echocardiography, learning a joint embedding space from raw echocardiogram frames and the free-text interpretations cardiologists write during routine clinical reading. Because supervision comes from existing clinical reports rather than task-specific labels, the model learns broadly transferable representations that support zero-shot prediction across a wide range of interpretation tasks.

The work is distinct from other echocardiography foundation models such as EchoJEPA (which uses a joint-embedding predictive, vision-only objective). EchoCLIP's defining feature is the alignment of images with natural-language reports, which enables both quantitative measurement estimation and language-driven capabilities like image-to-text search.

Key Features

Vision-language contrastive pretraining: An image encoder and a text encoder project echocardiogram frames and physician interpretations into a shared embedding space, letting the model relate images to natural-language clinical descriptions without task-specific labels.
Zero-shot interpretation: Despite never being explicitly trained for individual tasks, EchoCLIP estimates left ventricular ejection fraction and detects implanted cardiac devices directly from the learned embeddings.
Long-context variant (EchoCLIP-R): A second model uses a custom echocardiography tokenizer that compresses full reports to roughly 64 tokens, enabling robust image-to-text retrieval and report-level reasoning.
Patient and clinical-state recognition: EchoCLIP-R identifies unique patients across separate studies (AUC 0.86) and recognizes clinical transitions such as heart transplants (AUC 0.79) and cardiac surgery (AUC 0.77).
MIT-licensed weights with non-commercial code: The EchoCLIP and EchoCLIP-R weights are released under the MIT License (loadable via the open-source OpenCLIP framework), while the GitHub inference and demonstration scripts ship under a Cedars-Sinai Academic Software License restricted to academic/non-profit research, with commercial use and redistribution prohibited.
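The contrastive pretraining idea in the first feature can be illustrated with a minimal sketch. It assumes the standard symmetric CLIP loss over a batch of paired image and text embeddings. The function names, the toy 2-D embeddings, and the temperature of 0.07 are illustrative assumptions, not details taken from the EchoCLIP training code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image-to-text plus text-to-image, averaged."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch (assumed values): pair i of images matches pair i of texts.
images = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_loss(images, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy batch the loss is lower when each frame is paired with its own report embedding than when the reports are swapped, which is the signal that pulls matching image and text embeddings together during training.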

Technical Details

EchoCLIP pairs a ConvNeXt-Base image encoder with a decoder-only transformer text encoder matching the original CLIP architecture (77-token context, byte-pair encoding). It was trained on 1,032,975 cardiac ultrasound videos and their corresponding expert text interpretations, drawn from 224,685 studies across 99,870 patients imaged at Cedars-Sinai between 2011 and 2022. The long-context EchoCLIP-R variant replaces BPE with a template-based tokenizer built from common echocardiography concepts, reducing reports to roughly 64 tokens to fit full interpretations within context.
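A template-based tokenizer of the kind described for EchoCLIP-R can be sketched as follows. This is only an illustration of the general idea: each recognized report sentence becomes one template token plus tokens for its filled-in values. The three templates, the severity words, and the vocabulary layout are hypothetical and are not the actual EchoCLIP-R vocabulary.

```python
import re

# Hypothetical templates; the real template list is not given in this document.
TEMPLATES = [
    "Left ventricular ejection fraction is <num>%.",
    "A pacemaker lead is present.",
    "The left atrium is <severity> dilated.",
]
SEVERITIES = ["mildly", "moderately", "severely"]

def build_vocab():
    """Assign one id per template, per severity word, and per integer 0..100."""
    vocab = {"<unk>": 0}
    for item in TEMPLATES + SEVERITIES + [str(n) for n in range(101)]:
        vocab[item] = len(vocab)
    return vocab

def template_to_regex(template):
    """Turn a template into a regex whose groups capture the slot values."""
    pattern = re.escape(template)
    pattern = pattern.replace(re.escape("<num>"), r"(\d{1,3})")
    pattern = pattern.replace(re.escape("<severity>"),
                              "(" + "|".join(SEVERITIES) + ")")
    return re.compile(pattern)

def tokenize(report, vocab):
    """Emit one template id per matched sentence, followed by its slot ids."""
    ids = []
    sentences = [s for s in re.split(r"(?<=\.)\s+", report.strip()) if s]
    for sentence in sentences:
        for template in TEMPLATES:
            match = template_to_regex(template).fullmatch(sentence)
            if match:
                ids.append(vocab[template])
                ids.extend(vocab.get(g, vocab["<unk>"]) for g in match.groups())
                break
        else:
            ids.append(vocab["<unk>"])  # sentence matched no template
    return ids

vocab = build_vocab()
report = ("Left ventricular ejection fraction is 55%. "
          "A pacemaker lead is present. The left atrium is mildly dilated.")
token_ids = tokenize(report, vocab)
```

Here a three-sentence report becomes five tokens, whereas byte-pair encoding would spend several tokens on each sentence. That compression is what lets a full interpretation fit in a short context.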

On benchmarks, EchoCLIP predicts left ventricular ejection fraction with a mean absolute error of 8.4% on the held-out internal test set and 7.1% on the external EchoNet-Dynamic dataset from Stanford Healthcare. It identifies implanted intracardiac devices with AUCs of 0.84 for pacemakers, 0.92 for percutaneous mitral valve repair, and 0.97 for artificial aortic valves. These results were achieved without task-specific fine-tuning, using zero-shot prompting of the joint embedding space.
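One plausible way to turn a joint embedding space into a zero-shot ejection fraction estimate is sketched below: embed one text prompt per candidate value, rank the candidates by cosine similarity to the image embedding, and average the best matches. The exact prompting and aggregation procedure is not specified in this document, so `toy_embed`, `zero_shot_ef`, and `top_k` are hypothetical stand-ins, and the 2-D embeddings replace the real encoders.

```python
import math

def toy_embed(ef):
    """Stand-in for a real encoder: maps an EF value (0..100) to a 2-D unit vector."""
    angle = math.pi * ef / 200.0
    return [math.cos(angle), math.sin(angle)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_ef(image_emb, prompt_embs, top_k=5):
    """Average the EF values of the top_k prompts most similar to the image."""
    ranked = sorted(prompt_embs,
                    key=lambda ef: cosine(image_emb, prompt_embs[ef]),
                    reverse=True)
    top = ranked[:top_k]
    return sum(top) / len(top)

def mean_absolute_error(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

# One prompt embedding per candidate EF value, e.g. for a prompt such as
# "The left ventricular ejection fraction is 55%" (hypothetical wording).
prompt_embs = {ef: toy_embed(ef) for ef in range(101)}

true_efs = [35, 55, 62]
preds = [zero_shot_ef(toy_embed(ef), prompt_embs) for ef in true_efs]
mae = mean_absolute_error(preds, true_efs)
```

The same ranking step can be reused for binary findings such as device detection by comparing an image against a "present" prompt and scoring the similarity, with AUC computed over those scores.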

Applications

EchoCLIP can serve as a general-purpose backbone for cardiac imaging research and clinical decision support. Its zero-shot capabilities support automated estimation of functional measures like ejection fraction, detection of implanted devices and structural findings, and quality control or triage during echocardiographic reading. The language-aligned embedding space enables image-to-text search across echocardiogram archives, patient re-identification for longitudinal study linkage, and recognition of clinical transitions such as surgery or transplant. These capabilities are useful to cardiologists, sonographers, and researchers building downstream cardiac AI tools.
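Image-to-text search over an archive reduces to nearest-neighbor lookup in the shared embedding space. The sketch below ranks stored report embeddings by cosine similarity to a query image embedding. The report texts, the 3-D vectors, and the `retrieve` helper are made up for illustration; a real system would obtain the embeddings from the model's image and text encoders.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieve(query_emb, corpus, k=2):
    """Return the k report texts whose embeddings are closest to the query."""
    q = normalize(query_emb)
    scored = []
    for text, emb in corpus.items():
        score = sum(a * b for a, b in zip(q, normalize(emb)))  # cosine similarity
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Hypothetical archive: report text mapped to its (toy) text embedding.
corpus = {
    "A pacemaker lead is present.": [1.0, 0.0, 0.0],
    "Prosthetic aortic valve noted.": [0.0, 1.0, 0.0],
    "Normal left ventricular function.": [0.0, 0.0, 1.0],
}

query = [0.9, 0.3, 0.1]  # toy embedding of a query echocardiogram frame
top_reports = retrieve(query, corpus, k=2)
```

Linking studies from the same patient can follow the same pattern, with study embeddings compared to each other instead of to report embeddings.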

Impact

EchoCLIP demonstrated that large-scale vision-language pretraining on routinely generated clinical reports can yield a single echocardiography model that generalizes across diverse interpretation tasks without per-task supervision, an influential proof of concept for report-supervised medical imaging foundation models. By releasing MIT-licensed weights through the widely used echonet ecosystem (with accompanying inference code under a non-commercial academic license), it provided a reusable resource for the cardiac imaging research community. Key limitations include training on data from a single health system (with external validation limited to one public benchmark), the proprietary and non-shareable nature of the training corpus due to patient privacy, and the reliance on report quality and conventions that may not transfer across institutions.

Citation

Christensen, M., et al. (2024) Vision–language foundation model for echocardiogram interpretation. Nature Medicine.

DOI: 10.1038/s41591-024-02959-y

Recent citations

Papers that recently cited this model.

Motion-Conditioned Multi-View Fusion for Myocardial Infarction Localization from Echocardiography
Guang Yang, Wentian Xu, Siyue Wang, et al.
Jul 2026 · 0 citations

Language-Guided Segmentation of Medical Images: A Review of Foundation Models
Saqib Qamar
Bioengineering · Jul 2026 · 0 citations

Multimodal AI in healthcare: Review of vision-language foundation models for real-world medical applications
Taha Razzaq, Murtaza Taj, Asim Iqbal
Journal of Biomedical Informatics · Jul 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

A Vision-Language Foundation Model for Precision Oncology
Jinxi Xiang, Xiyue Wang, Xiaoming Zhang, et al.
Nature · Jan 2025 · 245 citations

A generalist medical language model for disease diagnosis assistance
Xiaohong Liu, Hao Liu, Guoxing Yang, et al.
Nature Medicine · Jan 2025 · 205 citations

Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset
L. Blankemeier, J. Cohen, Ashwin Kumar, et al.
Nature · Jun 2024 · 128 citations

A visual–omics foundation model to bridge histopathology with spatial transcriptomics
Weiqing Chen, Pengzhi Zhang, T. Tran, et al.
Nature Methods · May 2025 · 89 citations

CLIP in medical imaging: A survey
Zihao Zhao, Yuxiao Liu, Han Wu, et al.
Medical Image Analysis · Dec 2023 · 82 citations

Citations

Total Citations: 218
Influential: 13
References: 40

GitHub

Stars: 51
Forks: 14
Open Issues: 6
Contributors: 3
Last Push: 1y ago
Language: Python

Fields of citing research

Medicine: 96%
Computer Science: 91%
Engineering: 33%
Biology: 3%
Linguistics: 3%
Environmental Science: 1%
Chemistry: 0%
Education: 0%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Overall: 32 · Closed
Usability (can I run it?): 54
Reproducibility (can I retrain it?): 4

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · Dataset
