FetalCLIP

Mohamed bin Zayed University of Artificial Intelligence / Corniche Hospital

Vision-language foundation model for fetal ultrasound, pretrained on 210,035 image-text pairs for plane classification, biometry, and segmentation.

Released: February 2025

FetalCLIP is a vision-language foundation model purpose-built for fetal ultrasound image analysis. Fetal ultrasound is the primary modality for monitoring pregnancy, yet automated interpretation is uniquely difficult: anatomical structures vary rapidly with gestational age, image quality is operator-dependent, and labeled data is scarce because annotation requires specialized obstetric expertise. General-purpose medical imaging models trained on radiology or pathology transfer poorly to this domain, motivating a dedicated foundation model.

FetalCLIP was developed by researchers at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in collaboration with clinicians at Corniche Hospital (Abu Dhabi Health Services Company, SEHA) and released as a preprint in February 2025. It adapts the Contrastive Language-Image Pretraining (CLIP) paradigm to the fetal domain and is pretrained on 210,035 fetal ultrasound images paired with text descriptions, which the authors describe as the largest paired dataset of its kind used for foundation model development to date.

By learning a joint image-text embedding space, FetalCLIP produces universal representations that transfer across many clinically relevant downstream tasks, including in zero-shot settings where no task-specific training data is available. This positions it as a backbone for building obstetric ultrasound tools without retraining from scratch for each application.
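To make the zero-shot idea concrete, the sketch below shows how a joint image-text embedding space can be used for classification: each candidate label is written as a text prompt, embedded, and compared with the image embedding by cosine similarity. This is a minimal illustration in plain Python, not FetalCLIP's released code or API. The function names, the toy 3-dimensional vectors, the label set, and the `logit_scale` value are all assumptions chosen for readability; real embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, prompt_embs, labels, logit_scale=100.0):
    """Pick the label whose prompt embedding is closest to the image embedding.

    image_emb   : list of floats standing in for an image-encoder output
    prompt_embs : one text embedding per candidate label, same dimension
    logit_scale : temperature multiplier (illustrative value, an assumption)
    """
    img = l2_normalize(image_emb)
    logits = []
    for emb in prompt_embs:
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)

    # Numerically stable softmax over the candidate labels.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy embeddings: hypothetical stand-ins for real encoder outputs.
labels = ["brain", "abdomen", "femur"]
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.2, 0.9, 0.1]

predicted, probs = zero_shot_classify(image_emb, prompt_embs, labels)
print(predicted, [round(p, 3) for p in probs])
```

No task-specific training happens here: changing the task only means changing the list of prompts, which is why a well-aligned embedding space can be reused across applications.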

Key Features

Domain-specific contrastive pretraining: Trained with image-caption contrastive learning on 210,035 fetal ultrasound pairs, aligning ultrasound imagery with clinical text to capture fetal-specific anatomical semantics.
Strong zero-shot transfer: Achieves an 87.1% F1 score on zero-shot fetal plane classification, substantially outperforming the SonoNet baseline (69.9%) without task-specific fine-tuning.
Multi-task versatility: A single backbone supports plane classification, gestational age estimation, congenital heart defect (CHD) detection, and anatomical segmentation.
Extended text encoder: The text encoder accepts up to 117 tokens (versus CLIP's standard 77) to accommodate detailed clinical descriptions and captions.
Label-efficient: Delivers strong performance even with limited labeled data, addressing the chronic annotation bottleneck in fetal imaging.
Publicly released: Code and pretrained weights are available on GitHub and Hugging Face under a non-commercial (CC-BY-NC-4.0) license.

Technical Details

FetalCLIP uses a dual-encoder CLIP architecture, initialized from a general medical-domain CLIP checkpoint and fine-tuned on fetal data using a modified OpenCLIP training pipeline. The image encoder is a ViT-L vision transformer with 24 transformer layers that operates on 224×224 inputs split into 14×14-pixel patches; the text encoder has 12 transformer layers, and both encoders project into a shared 768-dimensional embedding space. Pretraining maximizes the similarity of paired image-caption embeddings while minimizing that of unpaired examples.

On downstream evaluations, FetalCLIP reaches 87.1% F1 on zero-shot plane classification, an 83.5% prediction validity rate for gestational age estimation, and 78.72% AUROC for CHD detection from four-chamber heart videos. For segmentation it attains Dice similarity coefficients of 97.92% (brain view), 81.82% (abdomen view), and 72.91% (four-chamber view).
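The pretraining objective described above (pull paired image-caption embeddings together, push unpaired ones apart) is commonly implemented as a symmetric cross-entropy over a batch similarity matrix. The sketch below shows that standard CLIP-style formulation in plain Python. It is an illustration under stated assumptions, not the authors' training code: the function names, the toy 2-dimensional embeddings, and the `temperature` value of 0.07 are all assumed for the example.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample;
    every other combination in the batch is treated as a negative pair.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]

    # Cosine-similarity matrix scaled by the temperature (rows: images).
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transpose to get the text-to-image direction (rows: captions).
    logits_t = [list(col) for col in zip(*logits)]

    image_to_text = _cross_entropy_diagonal(logits)
    text_to_image = _cross_entropy_diagonal(logits_t)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two image embeddings with matching and swapped captions.
image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_contrastive_loss(image_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss(image_batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

The loss is near zero when each image is most similar to its own caption and grows when captions are mismatched, which is the signal that drives the two encoders toward a shared embedding space.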

Applications

FetalCLIP serves as a reusable backbone for obstetric ultrasound AI, enabling automated standard-plane recognition, fetal biometry and gestational age estimation, screening for congenital heart defects, and segmentation of fetal anatomy. Because it transfers in zero-shot and low-label regimes, it is well suited to clinical and research settings where annotated fetal ultrasound data is limited. It can support sonographers and obstetricians with quality control, triage, and decision support, and it gives researchers a starting point for new fetal-imaging tools without large labeled datasets.

Impact

FetalCLIP is among the first foundation models tailored specifically to fetal ultrasound, a domain underserved by general medical imaging models. By assembling what its authors describe as the largest paired fetal image-text corpus and reporting consistent gains across classification, biometry, anomaly detection, and segmentation, it establishes a strong reference point for vision-language modeling in obstetric imaging and lowers the barrier to building data-efficient prenatal screening tools. The public release of weights and code supports reproducibility and downstream research. The non-commercial license restricts commercial deployment, however, and because the model is at the preprint stage, broader clinical validation remains future work.

Citation

FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

Preprint

Maani, F., et al. (2025) FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis. arXiv.org.

DOI: 10.48550/arXiv.2502.14807

Recent citations

Papers that recently cited this model.

Towards Real-World Ultrasound Understanding: Large Vision-Language Models from Multi-Image Examinations with Long-Form Reports
Bing Yan, Chunlei Li, Jingliang Hu, et al.
Jul 2026
Citations: 0
SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis
Hang Su, Chao Sun, Zhaofan Li, et al.
Jun 2026
Citations: 0
From Point Estimates to Distributions: GMM Pooling for MIL in Preterm Birth Prediction
Hussain Alasmawi, Numan Saeed, S. Said, et al.
Jun 2026
Citations: 0

Top citations

The most-cited papers that cite this model.

CLIP in medical imaging: A survey.
Zihao Zhao, Yuxiao Liu, Han Wu, et al.
Medical Image Analysis · Dec 2023
Citations: 82
EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Chao-Yin She, Ruifang Lu, Lida Chen, et al.
arXiv.org · Sep 2025
Citations: 6
FetalAgents: A Multi-Agent System for Fetal Ultrasound Image and Video Analysis
Xiaotian Hu, Junwei Huang, Mingxuan Liu, et al.
Mar 2026
Citations: 2
Advancing Fetal Ultrasound Image Quality Assessment in Low-Resource Settings
Dongli He, Hu Wang, Mohammad Yaqub
arXiv.org · Jul 2025
Citations: 2
Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Xiao He, Huangxuan Zhao, Guojia Wan, et al.
arXiv.org · Oct 2025
Citations: 1

Citations

Total Citations: 22

Influential: 1

References: 54

GitHub

Stars: 70

Forks: 16

Open Issues: 4

Contributors: 2

Last Push: 5mo ago

Language: Python

HuggingFace

Downloads: 0

Likes: 2

Last Modified: 1y ago

Fields of citing research

Medicine: 100%
Computer Science: 95%
Engineering: 47%
Mathematics: 5%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

13 · Closed

Usability (can I run it?): 14

Reproducibility (can I retrain it?): 9

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository
Research Paper
Official Website
HuggingFace Model
