Cedars-Sinai Medical Center / UCLA / UCSF
A CLIP-based vision-language foundation model for echocardiography, trained on over 1 million echocardiogram videos paired with expert reports for zero-shot cardiac interpretation.
EchoCLIP is a vision-language foundation model for echocardiogram interpretation developed by Matthew Christensen, Milos Vukadinovic, Neal Yuan, and David Ouyang at Cedars-Sinai Medical Center, with collaborators at UCLA and UCSF. Published in Nature Medicine in May 2024, it addresses a central challenge in cardiac ultrasound: building a single model that generalizes across the many distinct interpretation tasks clinicians perform, rather than training a separate supervised network for each measurement or finding.
Echocardiography is the most common cardiac imaging modality, but conventional deep learning models for it are narrow, each trained end-to-end for one task such as ejection fraction regression or valve detection. EchoCLIP instead applies the contrastive language-image pretraining (CLIP) paradigm to echocardiography, learning a joint embedding space from raw echocardiogram frames and the free-text interpretations cardiologists write during routine clinical reading. Because supervision comes from existing clinical reports rather than task-specific labels, the model learns broadly transferable representations that support zero-shot prediction across a wide range of interpretation tasks.
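The contrastive objective described above can be illustrated with a minimal pure-Python sketch of a symmetric CLIP-style loss. This is not the authors' training code: the function names, the toy 2-D embeddings, and the temperature of 0.07 are assumptions chosen for illustration only.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    The temperature value is an assumption, not a reported EchoCLIP setting.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, scaled by 1/temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: row i should put its probability mass on column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same criterion on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: correctly paired embeddings versus deliberately mismatched ones.
aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each frame embedding lies closest to its own report embedding and high when the pairs are mismatched, which is the signal that pulls images and reports into a shared space.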
The work is distinct from other echocardiography foundation models such as EchoJEPA (which uses a joint-embedding predictive, vision-only objective). EchoCLIP's defining feature is the alignment of images with natural-language reports, which enables both quantitative measurement estimation and language-driven capabilities like image-to-text search.
EchoCLIP pairs a ConvNeXt-Base image encoder with a decoder-only transformer text encoder that matches the original CLIP architecture (77-token context, byte-pair encoding). It was trained on 1,032,975 cardiac ultrasound videos and their corresponding expert text interpretations, drawn from 224,685 studies across 99,870 patients imaged at Cedars-Sinai between 2011 and 2022. A long-context variant, EchoCLIP-R, replaces byte-pair encoding (BPE) with a template-based tokenizer built from common echocardiography concepts, reducing reports to roughly 64 tokens so that full interpretations fit within the context window.
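To make the idea of a template-based tokenizer concrete, the following is a rough sketch of how report sentences might be collapsed into concept tokens plus their extracted values. The templates, concept names, and the `template_tokenize` helper are all hypothetical; the actual EchoCLIP-R vocabulary and matching rules are not reproduced here.

```python
import re

# Hypothetical templates: (pattern, concept token). Illustrative only.
TEMPLATES = [
    (re.compile(r"left ventricular ejection fraction is (\d+)%", re.I), "LVEF"),
    (re.compile(r"(mild|moderate|severe) mitral regurgitation", re.I), "MR"),
    (re.compile(r"pacemaker lead", re.I), "PACEMAKER"),
]

def template_tokenize(report, max_tokens=77):
    """Map each sentence to a concept token followed by any captured values."""
    tokens = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        for pattern, concept in TEMPLATES:
            match = pattern.search(sentence)
            if match:
                tokens.append(concept)
                tokens.extend(group.lower() for group in match.groups())
    # Truncate to the context length (77 is the CLIP context size cited above).
    return tokens[:max_tokens]

report = (
    "The left ventricular ejection fraction is 55%. "
    "There is mild mitral regurgitation. "
    "A pacemaker lead is seen in the right ventricle."
)
tokens = template_tokenize(report)
print(tokens)                                  # concept-level tokens
print(len(report.split()), "words ->", len(tokens), "tokens")
```

Because each recognized phrase costs one token plus its values, a long free-text report shrinks to a short sequence, which is the property that lets full interpretations fit inside a fixed context window.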
On benchmarks, EchoCLIP predicts left ventricular ejection fraction with a mean absolute error of 8.4% on the held-out internal test set and 7.1% on the external EchoNet-Dynamic dataset from Stanford Health Care. It identifies implanted intracardiac devices with AUCs of 0.84 for pacemakers, 0.92 for percutaneous mitral valve repair, and 0.97 for artificial aortic valves. These results were achieved without task-specific fine-tuning, using zero-shot prompting of the joint embedding space.
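One plausible way to turn a joint embedding space into a zero-shot regressor is to score a prompt for every candidate ejection fraction value and aggregate the best-matching ones. The sketch below illustrates that idea with toy 2-D embeddings. The `toy_text_embedding` stand-in, the 0 to 100 candidate grid, and the top-20% median rule are assumptions for illustration, not a confirmed description of the published procedure.

```python
import math
from statistics import median

def toy_text_embedding(ef_percent):
    """Hypothetical stand-in for encoding a prompt such as
    'ejection fraction is {ef_percent}%' with a real text encoder."""
    angle = (ef_percent / 100.0) * (math.pi / 2)
    return (math.cos(angle), math.sin(angle))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def zero_shot_ef(image_embedding, candidates=range(0, 101), top_fraction=0.2):
    """Rank candidate EF prompts by similarity and take the median of the best ones."""
    ranked = sorted(
        candidates,
        key=lambda v: cosine(image_embedding, toy_text_embedding(v)),
        reverse=True,
    )
    k = max(1, int(len(ranked) * top_fraction))
    return median(ranked[:k])

# Toy "video" embedding placed where an EF of 55% would sit in this toy space.
image_embedding = toy_text_embedding(55)
predicted = zero_shot_ef(image_embedding)
print(predicted)
```

Taking a median over several top-ranked prompts, rather than the single best match, makes the estimate less sensitive to small fluctuations in similarity between neighboring values.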
EchoCLIP serves as a general-purpose backbone for cardiac imaging research and clinical decision support. Its zero-shot capabilities support automated estimation of functional measures like ejection fraction, detection of implanted devices and structural findings, and quality control or triage during echocardiographic reading. The language-aligned embedding space enables image-to-text search across echocardiogram archives, patient re-identification for longitudinal study linkage, and recognition of clinical transitions such as surgery or transplant. These capabilities are useful to cardiologists, sonographers, and researchers building downstream cardiac AI tools.
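Image-to-text search in a shared embedding space reduces to nearest-neighbor ranking by cosine similarity. The following minimal sketch assumes report embeddings have already been computed and stored. The example report sentences, the 3-D vectors, and the `retrieve_reports` helper are made up for illustration and do not come from the EchoCLIP release.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_reports(query_embedding, report_index, top_k=2):
    """Return the top_k report texts whose embeddings are closest to the query."""
    ranked = sorted(
        report_index,
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]

# Illustrative index of (report text, precomputed text embedding) pairs.
report_index = [
    ("Normal left ventricular systolic function.", [0.9, 0.1, 0.0]),
    ("Pacemaker lead visualized in the right ventricle.", [0.1, 0.9, 0.1]),
    ("Prosthetic aortic valve is well seated.", [0.0, 0.2, 0.9]),
]

# Toy embedding of a query echocardiogram, as an image encoder might produce.
query = [0.05, 0.3, 0.95]
results = retrieve_reports(query, report_index)
print(results)
```

The same ranking applied to image embeddings instead of report embeddings would give image-to-image lookup, which is the kind of operation that longitudinal study linkage relies on.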
EchoCLIP demonstrated that large-scale vision-language pretraining on routinely generated clinical reports can yield a single echocardiography model that generalizes across diverse interpretation tasks without per-task supervision, an influential proof of concept for report-supervised medical imaging foundation models. By releasing MIT-licensed weights through the widely used echonet ecosystem (with accompanying inference code under a non-commercial academic license), it provided a reusable resource for the cardiac imaging research community. Key limitations include training on data from a single health system (with external validation limited to one public benchmark), the proprietary and non-shareable nature of the training corpus due to patient privacy, and the reliance on report quality and conventions that may not transfer across institutions.
Christensen, M., et al. (2024). Vision–language foundation model for echocardiogram interpretation. Nature Medicine.
DOI: 10.1038/s41591-024-02959-y