bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.


EchoCLIP

Cedars-Sinai Medical Center / UCLA / UCSF

A CLIP-based vision-language foundation model for echocardiography, trained on over 1 million echocardiogram videos paired with expert reports for zero-shot cardiac interpretation.

Released: May 2024

EchoCLIP is a vision-language foundation model for echocardiogram interpretation developed by Matthew Christensen, Milos Vukadinovic, Neal Yuan, and David Ouyang at Cedars-Sinai Medical Center, with collaborators at UCLA and UCSF. Published in Nature Medicine in May 2024, it addresses a central challenge in cardiac ultrasound: building a single model that generalizes across the many distinct interpretation tasks clinicians perform, rather than training a separate supervised network for each measurement or finding.

Echocardiography is the most common cardiac imaging modality, but conventional deep learning models for it are narrow, each trained end-to-end for one task such as ejection fraction regression or valve detection. EchoCLIP instead applies the contrastive language-image pretraining (CLIP) paradigm to echocardiography, learning a joint embedding space from raw echocardiogram frames and the free-text interpretations cardiologists write during routine clinical reading. Because supervision comes from existing clinical reports rather than task-specific labels, the model learns broadly transferable representations that support zero-shot prediction across a wide range of interpretation tasks.

The work is distinct from other echocardiography foundation models such as EchoJEPA (which uses a joint-embedding predictive, vision-only objective). EchoCLIP's defining feature is the alignment of images with natural-language reports, which enables both quantitative measurement estimation and language-driven capabilities like image-to-text search.

#Key Features

  • Vision-language contrastive pretraining: An image encoder and a text encoder project echocardiogram frames and physician interpretations into a shared embedding space, letting the model relate images to natural-language clinical descriptions without task-specific labels.
  • Zero-shot interpretation: Despite never being explicitly trained for individual tasks, EchoCLIP estimates left ventricular ejection fraction and detects implanted cardiac devices directly from the learned embeddings.
  • Long-context variant (EchoCLIP-R): A second model uses a custom echocardiography tokenizer that compresses full reports to roughly 64 tokens, enabling robust image-to-text retrieval and report-level reasoning.
  • Patient and clinical-state recognition: EchoCLIP-R identifies unique patients across separate studies (AUC 0.86) and recognizes clinical transitions such as heart transplants (AUC 0.79) and cardiac surgery (AUC 0.77).
  • MIT-licensed weights with non-commercial code: The EchoCLIP and EchoCLIP-R weights are released under the MIT License (loadable via the open-source OpenCLIP framework), while the GitHub inference and demonstration scripts ship under a Cedars-Sinai Academic Software License restricted to academic/non-profit research, with commercial use and redistribution prohibited.
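The contrastive pretraining idea in the first bullet can be illustrated with a toy version of the symmetric CLIP-style loss. This is a minimal sketch using only the Python standard library, not the authors' training code: the two-dimensional embeddings are made up, and the temperature of 0.07 is an assumed value rather than one taken from the EchoCLIP paper.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity_logits(image_embs, text_embs, temperature):
    """Cosine similarity of every image with every text, divided by the temperature."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct match for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy terms."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))

# Toy batch: image i is paired with text i (hypothetical 2-D embeddings).
toy_images = [[1.0, 0.0], [0.0, 1.0]]
toy_texts = [[0.9, 0.1], [0.1, 0.9]]

loss_aligned = clip_style_loss(toy_images, toy_texts)
loss_shuffled = clip_style_loss(toy_images, toy_texts[::-1])
print(loss_aligned < loss_shuffled)
```

The loss is low when each image is most similar to its own report and high when the pairs are mismatched, which is the signal that pulls matching frames and interpretations together in the shared embedding space.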

#Technical Details

EchoCLIP pairs a ConvNeXt-Base image encoder with a decoder-only transformer text encoder that matches the original CLIP architecture (77-token context, byte-pair encoding). It was trained on 1,032,975 cardiac ultrasound videos and their corresponding expert text interpretations, drawn from 224,685 studies across 99,870 patients imaged at Cedars-Sinai between 2011 and 2022. The long-context EchoCLIP-R variant replaces BPE with a template-based tokenizer built from common echocardiography concepts, which shortens reports to roughly 64 tokens so that a full interpretation fits within the context window.
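To make the idea of a template-based tokenizer concrete, here is a small illustrative sketch. It is not the EchoCLIP-R tokenizer: the three templates, the token format, and the function name `tokenize_report` are all hypothetical, chosen only to show how matching whole report sentences against known phrasings can use far fewer tokens than splitting them into subwords.

```python
import re

# Hypothetical templates for illustration; the real EchoCLIP-R vocabulary is not reproduced here.
TEMPLATES = [
    r"left ventricular ejection fraction is (\d+)%",
    r"a pacemaker lead is present",
    r"the left atrium is (mildly|moderately|severely) dilated",
]

def tokenize_report(report):
    """Map each sentence to one template token plus one token per captured value."""
    tokens = []
    # Naive sentence split on periods; assumes the report contains no decimal numbers.
    for sentence in re.split(r"\.\s*", report.lower()):
        sentence = sentence.strip()
        if not sentence:
            continue
        for template_id, pattern in enumerate(TEMPLATES):
            match = re.fullmatch(pattern, sentence)
            if match:
                tokens.append(("TEMPLATE", template_id))
                tokens.extend(("VALUE", group) for group in match.groups())
                break
        else:
            # Sentences that match no template are kept as a single unknown token.
            tokens.append(("UNK", sentence))
    return tokens

example_report = (
    "Left ventricular ejection fraction is 55%. "
    "A pacemaker lead is present. "
    "The left atrium is mildly dilated."
)
example_tokens = tokenize_report(example_report)
print(len(example_report.split()), "words ->", len(example_tokens), "tokens")
```

In this toy case each sentence collapses to one template token plus at most one value token, which is the kind of compression that lets a long report fit in a short context.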

On benchmarks, EchoCLIP predicts left ventricular ejection fraction with a mean absolute error of 8.4% on the held-out internal test set and 7.1% on the external EchoNet-Dynamic dataset from Stanford Healthcare. It identifies implanted intracardiac devices with AUCs of 0.84 for pacemakers, 0.92 for percutaneous mitral valve repair, and 0.97 for artificial aortic valves. These results were achieved without task-specific fine-tuning, using zero-shot prompting of the joint embedding space.
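One way to picture zero-shot estimation of a continuous value is to score a prompt for each candidate value against the image embedding and aggregate the best-scoring candidates. The sketch below assumes that setup; the top-20% median rule, the function names, and the synthetic similarity scores are illustrative assumptions, not the procedure or data from the paper. It also shows how a mean absolute error like the ones quoted above is computed.

```python
from statistics import median

def estimate_from_prompt_scores(candidate_values, similarity_scores, top_fraction=0.2):
    """Return the median candidate value among the highest-scoring prompts."""
    ranked = sorted(zip(similarity_scores, candidate_values), reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return median(value for _, value in ranked[:keep])

def mean_absolute_error(predictions, targets):
    """Average absolute difference between predicted and reference values."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# One prompt per integer ejection fraction from 0 to 100 (hypothetical setup).
candidates = list(range(0, 101))

# Synthetic similarity scores that peak at 55; a real model would produce these
# by comparing an image embedding with each prompt's text embedding.
synthetic_scores = [-abs(value - 55) for value in candidates]

estimate = estimate_from_prompt_scores(candidates, synthetic_scores)
toy_mae = mean_absolute_error([50, 60], [55, 55])
print(estimate, toy_mae)
```

Taking a median over the top-ranked prompts rather than the single best one makes the estimate less sensitive to noise in any individual similarity score.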

#Applications

EchoCLIP serves as a general-purpose backbone for cardiac imaging research and clinical decision support. Its zero-shot capabilities support automated estimation of functional measures like ejection fraction, detection of implanted devices and structural findings, and quality control or triage during echocardiographic reading. The language-aligned embedding space enables image-to-text search across echocardiogram archives, patient re-identification for longitudinal study linkage, and recognition of clinical transitions such as surgery or transplant. These capabilities are useful to cardiologists, sonographers, and researchers building downstream cardiac AI tools.
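Image-to-text search in a shared embedding space reduces to nearest-neighbor lookup by cosine similarity. The following is a minimal sketch of that lookup with made-up three-dimensional embeddings and hypothetical report identifiers; in practice the vectors would come from the model's image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_reports(query_embedding, report_embeddings, k=2):
    """Return the ids of the k reports most similar to the query embedding."""
    scored = [
        (cosine_similarity(query_embedding, embedding), report_id)
        for report_id, embedding in report_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [report_id for _, report_id in scored[:k]]

# Hypothetical archive of report embeddings keyed by report id.
toy_archive = {
    "report_a": [1.0, 0.0, 0.0],
    "report_b": [0.8, 0.6, 0.0],
    "report_c": [0.0, 0.0, 1.0],
}
toy_query = [0.9, 0.1, 0.0]  # stands in for an echocardiogram image embedding

top_matches = retrieve_reports(toy_query, toy_archive, k=2)
print(top_matches)
```

The same similarity ranking, applied between embeddings of two studies instead of an image and a report, is one plausible basis for the patient re-identification use case.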

#Impact

EchoCLIP demonstrated that large-scale vision-language pretraining on routinely generated clinical reports can yield a single echocardiography model that generalizes across diverse interpretation tasks without per-task supervision, making it an influential proof of concept for report-supervised medical imaging foundation models. By releasing MIT-licensed weights through the widely used echonet ecosystem (with accompanying inference code under a non-commercial academic license), it provided a reusable resource for the cardiac imaging research community. Key limitations include training on data from a single health system, with external validation limited to one public benchmark; a training corpus that is proprietary and cannot be shared because of patient privacy; and reliance on report quality and conventions that may not transfer across institutions.

Citation

Christensen, M., et al. (2024) Vision–language foundation model for echocardiogram interpretation. Nature Medicine.

DOI: 10.1038/s41591-024-02959-y

Citations

Total Citations: 202
Influential: 13
References: 40

GitHub

Stars: 48
Forks: 14
Open Issues: 6
Contributors: 3
Last Push: 1y ago
Language: Python

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
32 · Closed
Usability (can I run it?): 54
Reproducibility (can I retrain it?): 4
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

cardiac_ultrasound, contrastive_learning, convnext, echocardiography, ejection_fraction_estimation, foundation_model, image_to_text_retrieval, multimodal, transformer, zero_shot_classification
