A Joint-Embedding Predictive foundation model for echocardiography, pretrained on 18M cardiac ultrasound videos to learn artifact-robust anatomical representations.
EchoJEPA is a self-supervised foundation model for echocardiography (cardiac ultrasound) developed by Alif Munim, Adibvafa Fallahpour, Teodora Szasz, Bo Wang, and colleagues at the University of Toronto, the Vector Institute, and the University Health Network. It was released as a preprint in February 2026 (arXiv:2602.02603). The model addresses a long-standing difficulty in ultrasound machine learning: echocardiograms are dominated by speckle, acoustic shadowing, and operator-dependent artifacts that confound pixel-reconstruction objectives and limit the transferability of supervised models across scanners and patient populations.
Rather than learning to reconstruct raw pixels, EchoJEPA adopts a Joint-Embedding Predictive Architecture (JEPA), in which the model predicts the latent representations of masked spatiotemporal regions from visible context. This latent predictive objective encourages the encoder to capture stable cardiac anatomy and motion while discarding stochastic ultrasound noise that has no predictable structure. EchoJEPA adapts the video JEPA paradigm (V-JEPA2) to the temporal characteristics of cardiac imaging, using higher frame sampling to resolve the rapid dynamics of the beating heart.
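The latent predictive objective described above can be sketched in a few lines. This is a minimal illustration and not EchoJEPA's training code: the L1 distance and the exponential-moving-average (EMA) target encoder follow common practice in video JEPA work and are assumptions here, and the names `sample_mask`, `latent_l1_loss`, and `ema_update` are hypothetical.

```python
import random

def sample_mask(num_tokens, mask_ratio, seed=0):
    """Choose which token indices are hidden from the context encoder."""
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_masked))

def latent_l1_loss(predicted, target, masked):
    """Mean absolute difference between predicted and target embeddings,
    computed over masked tokens only. No pixel values appear in the loss."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count

def ema_update(target_weights, online_weights, momentum=0.99):
    """Assumed target-encoder update: a slow moving average of the online encoder."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_weights, online_weights)]

# Toy example: 8 tokens with 2-dimensional embeddings.
masked = sample_mask(num_tokens=8, mask_ratio=0.75)
predicted = [[0.0, 0.0] for _ in range(8)]  # stand-in for predictor output
target = [[1.0, 3.0] for _ in range(8)]     # stand-in for target-encoder output
loss = latent_l1_loss(predicted, target, masked)
```

Because the loss compares embeddings rather than pixels, speckle that the target encoder does not represent contributes nothing to the training signal, which is the intuition the paragraph above describes.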
Trained on 18 million echocardiogram videos from roughly 300,000 patients, described as the largest pretraining corpus assembled for this modality, EchoJEPA produces general-purpose representations that transfer to clinical tasks including left ventricular ejection fraction (LVEF) estimation, right ventricular systolic pressure (RVSP) estimation, and echocardiographic view classification, with strong sample efficiency and cross-population generalization.
EchoJEPA is built on a Vision Transformer encoder trained with a joint-embedding predictive objective on spatiotemporal echocardiogram clips. The flagship EchoJEPA-G uses a ViT-Giant backbone with approximately 1.1 billion parameters, pretrained on 18.1 million proprietary echocardiogram videos from about 300,000 patients. A reproducible public variant, EchoJEPA-L, uses a ViT-Large encoder (about 307M parameters) pretrained on the 525,000-video MIMIC-IV-Echo dataset. Clips span roughly two seconds, sampled at 8 fps with patch size 16 and tubelet size 2.
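To make the clip settings concrete, the arithmetic below works out how many frames and spatiotemporal tokens one clip produces. The clip length, frame rate, patch size, and tubelet size come from the text; the 224×224 input resolution is an assumption made only for illustration, and `clip_token_count` is a hypothetical helper.

```python
def clip_token_count(duration_s, fps, height, width, patch_size, tubelet_size):
    """Return (num_frames, num_tokens) for a clip split into 3D tubelets."""
    num_frames = int(round(duration_s * fps))
    if num_frames % tubelet_size or height % patch_size or width % patch_size:
        raise ValueError("clip dimensions must be divisible by the patch/tubelet sizes")
    temporal_slots = num_frames // tubelet_size
    spatial_patches = (height // patch_size) * (width // patch_size)
    return num_frames, temporal_slots * spatial_patches

# 2 s at 8 fps, patch size 16, tubelet size 2 (from the text);
# 224x224 resolution is assumed, not stated in the source.
frames, tokens = clip_token_count(2.0, 8, 224, 224, patch_size=16, tubelet_size=2)
print(frames, tokens)  # 16 1568
```

Under that assumed resolution, each clip yields 16 frames and 8 × 14 × 14 = 1,568 tokens for the encoder.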
Downstream evaluation spans internal cohorts (Toronto, about 150,000 studies; Chicago, about 60,000 studies) and public benchmarks (EchoNet-Dynamic, 10,030 videos; EchoNet-Pediatric, 3,316 videos). On LVEF estimation EchoJEPA-G reaches a mean absolute error (MAE) of about 3.97 ejection-fraction percentage points, and on RVSP estimation an MAE of about 4.54 mmHg, improving over leading baselines by roughly 20% and 17% respectively. EchoJEPA-L achieves 85.5% view-classification accuracy. Robustness testing uses physics-informed perturbations (linear depth-attenuation ramps and Gaussian-weighted acoustic shadows of varying severity), under which the model's error rises by only about 2.3%.
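The two perturbation families and the error metrics can be illustrated with a simplified sketch. It is not the authors' evaluation code: frames are plain row-major lists of intensities, the shadow is modelled as a vertical band rather than following the scan-sector geometry, the error rise is assumed to be a relative change, and all function names and parameters are hypothetical.

```python
import math

def depth_attenuation(frame, severity):
    """Dim rows linearly with depth: gain 1.0 at the top row, 1 - severity at the bottom."""
    rows = len(frame)
    out = []
    for r, row in enumerate(frame):
        depth = r / (rows - 1) if rows > 1 else 0.0
        gain = 1.0 - severity * depth
        out.append([v * gain for v in row])
    return out

def gaussian_shadow(frame, center_col, width, severity):
    """Darken a vertical band whose strength falls off as a Gaussian across columns."""
    out = []
    for row in frame:
        new_row = []
        for c, v in enumerate(row):
            weight = math.exp(-0.5 * ((c - center_col) / width) ** 2)
            new_row.append(v * (1.0 - severity * weight))
        out.append(new_row)
    return out

def mean_absolute_error(predictions, labels):
    """Average absolute difference between predictions and reference values."""
    return sum(abs(p - t) for p, t in zip(predictions, labels)) / len(labels)

def relative_increase_pct(clean_error, perturbed_error):
    """Percentage by which the error grows after perturbation (assumed relative)."""
    return 100.0 * (perturbed_error - clean_error) / clean_error

# Toy 3x3 frame of uniform brightness.
frame = [[1.0, 1.0, 1.0] for _ in range(3)]
attenuated = depth_attenuation(frame, severity=0.5)
shadowed = gaussian_shadow(frame, center_col=1, width=1.0, severity=0.8)
```

In an evaluation of this kind, the same model would be scored on clean and perturbed inputs with `mean_absolute_error`, and the two scores compared with `relative_increase_pct`.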
EchoJEPA serves as a pretrained backbone for cardiology and cardiac imaging research, where labeled echocardiography data is scarce and expensive to annotate. Its representations support automated estimation of functional measurements such as ejection fraction and right ventricular systolic pressure, automated view recognition for protocol triage and quality control, and rapid adaptation to new tasks with minimal labeled data. The strong zero-shot transfer to pediatric imaging makes it attractive for populations and centers where large labeled datasets do not exist, and its robustness to acoustic artifacts suits deployment across heterogeneous scanners and acquisition conditions.
EchoJEPA demonstrates that latent predictive pretraining, rather than pixel reconstruction, is well matched to noise-dominated medical ultrasound, where much of the pixel signal is stochastic speckle. By assembling the largest reported echocardiography pretraining corpus and showing large gains in sample efficiency, robustness, and cross-population generalization, it provides a reusable foundation for cardiac ultrasound analysis and a template for applying JEPA-style objectives to other artifact-heavy imaging modalities. EchoJEPA is currently a preprint, accompanied by a publicly released ViT-Large variant trained on open MIMIC-IV-Echo data; peer-reviewed clinical validation and the generalizability of the proprietary-scale results to external real-world deployment remain to be established.
Munim, A., et al. (2026) EchoJEPA: A Latent Predictive Foundation Model for Echocardiography. arXiv.org.
DOI: 10.48550/arXiv.2602.02603