Mohamed bin Zayed University of Artificial Intelligence / Corniche Hospital
A CLIP-based vision-language foundation model for fetal ultrasound, pretrained on 210,035 image-caption pairs for plane classification, biometry, anomaly detection, and segmentation.
FetalCLIP is a vision-language foundation model purpose-built for fetal ultrasound image analysis. Fetal ultrasound is the primary modality for monitoring pregnancy, yet automated interpretation is uniquely difficult: anatomical structures vary rapidly with gestational age, image quality is operator-dependent, and labeled data is scarce because annotation requires specialized obstetric expertise. General-purpose medical imaging models trained on radiology or pathology images transfer poorly to this domain, motivating a dedicated foundation model.
Developed by researchers at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in collaboration with clinicians at Corniche Hospital (Abu Dhabi Health Services Company, SEHA) and released as a preprint in February 2025, FetalCLIP adapts the Contrastive Language-Image Pretraining (CLIP) paradigm to the fetal domain. It is pretrained on 210,035 fetal ultrasound images paired with text descriptions, a corpus the authors describe as the largest paired dataset of its kind used for foundation model development to date.
By learning a joint image-text embedding space, FetalCLIP produces universal representations that transfer across many clinically relevant downstream tasks, including in zero-shot settings where no task-specific training data is available. This positions it as a backbone for building obstetric ultrasound tools without retraining from scratch for each application.
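To make the zero-shot idea concrete, the following is a minimal sketch of classification in a joint image-text embedding space. It assumes the image and prompt embeddings have already been computed; the random vectors, the prompt strings, the temperature value, and the function name `zero_shot_classify` are illustrative stand-ins and are not taken from the FetalCLIP release.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Pick the text prompt whose embedding is most similar to the image.

    image_emb: (d,) image embedding; text_embs: (n_classes, d) prompt embeddings.
    Returns (predicted class index, softmax probabilities over classes).
    """
    # L2-normalise so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()  # numerical stability before exponentiating
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

# Hypothetical stand-ins for encoder outputs (768 = embedding size of the model).
rng = np.random.default_rng(0)
prompts = ["fetal brain plane", "fetal abdomen plane", "four-chamber heart view"]
text_embs = rng.normal(size=(len(prompts), 768))
# Simulate an image whose embedding lies close to the third prompt.
image_emb = text_embs[2] + 0.1 * rng.normal(size=768)

pred, probs = zero_shot_classify(image_emb, text_embs)
print(prompts[pred])
```

No classifier is trained here: adding a new class only requires writing a new text prompt and embedding it, which is why this setting needs no task-specific labels.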
FetalCLIP uses a dual-encoder CLIP architecture, initialized from a general medical-domain CLIP checkpoint and fine-tuned on fetal data using a modified OpenCLIP training pipeline. The image encoder is a ViT-L vision transformer with 24 transformer layers that operates on 224×224 inputs split into 14×14-pixel patches; the text encoder has 12 transformer layers, and both encoders project into a shared 768-dimensional embedding space. Pretraining maximizes the similarity of paired image-caption embeddings while minimizing that of unpaired examples.
On downstream evaluations, FetalCLIP reaches 87.1% F1 on zero-shot plane classification, an 83.5% prediction validity rate for gestational age estimation, and 78.72% AUROC for congenital heart defect (CHD) detection from four-chamber heart videos. For segmentation, it attains Dice similarity coefficients of 97.92% (brain view), 81.82% (abdomen view), and 72.91% (four-chamber view).
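The contrastive pretraining objective described above can be sketched as a symmetric cross-entropy over a batch similarity matrix, as in the original CLIP formulation. This is an illustrative NumPy version, not the authors' training code: the fixed temperature of 0.07 (CLIP learns this value during training), the batch size, and the random embeddings are assumptions made for the example.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Row i of image_embs and row i of text_embs form a true pair;
    every other combination in the batch acts as a negative.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch); diagonal = true pairs
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical batch: captions, well-aligned images, and mismatched images.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 768))
aligned = text + 0.1 * rng.normal(size=(8, 768))
mismatched = np.roll(aligned, 1, axis=0)  # every image paired with the wrong caption

loss_aligned = clip_contrastive_loss(aligned, text)
loss_mismatched = clip_contrastive_loss(mismatched, text)
print(loss_aligned < loss_mismatched)
```

The loss is low when each image is most similar to its own caption and high when pairs are scrambled, which is the signal that pulls matching image-caption embeddings together during pretraining.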
FetalCLIP serves as a reusable backbone for obstetric ultrasound AI, enabling automated standard-plane recognition, fetal biometry and gestational age estimation, screening for congenital heart defects, and segmentation of fetal anatomy. Because it transfers in zero-shot and low-label regimes, it is well suited to clinical and research settings where annotated fetal ultrasound data is limited—supporting sonographers and obstetricians with quality control, triage, and decision support, and giving researchers a starting point for new fetal-imaging tools without large labeled datasets.
FetalCLIP is among the first foundation models tailored specifically to fetal ultrasound, a domain underserved by general medical imaging models. By assembling what its authors describe as the largest paired fetal image-text corpus and demonstrating consistent gains across classification, biometry, anomaly detection, and segmentation, it establishes a strong reference point for vision-language modeling in obstetric imaging and lowers the barrier to building data-efficient prenatal screening tools. Its public release of weights and code under a non-commercial license supports reproducibility and downstream research. The license restricts commercial deployment, however, and because the model is still at the preprint stage, broader clinical validation remains future work.
Maani, F., et al. (2025) FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis. arXiv.org.
DOI: 10.48550/arXiv.2502.14807