Mahmood Lab / Brigham and Women's Hospital
Vision-language foundation model for computational pathology, pretrained on 1.17M histopathology image-caption pairs with contrastive and captioning objectives.
CONCH (CONtrastive learning from Captions for Histopathology) is a vision-language foundation model for computational pathology developed by the Mahmood Lab at Harvard Medical School and Brigham and Women's Hospital. Published in Nature Medicine in March 2024, CONCH addresses a central challenge in pathology AI: the scarcity of large, labeled datasets that limits conventional supervised approaches. By leveraging paired images and free-text captions from biomedical literature, CONCH learns generalizable representations without requiring exhaustive expert annotation.
The model is pretrained on over 1.17 million histopathology image-caption pairs using the CoCa (Contrastive Captioners) framework, which combines a CLIP-style contrastive objective with a generative image captioning objective. This dual-objective pretraining allows a single model to support both discriminative tasks — such as classification and retrieval — and generative tasks such as automatic caption generation, making CONCH considerably more versatile than prior pathology encoders trained on either images alone or image-text pairs with only a contrastive loss.
CONCH was benchmarked on a suite of 14 computational pathology tasks and achieved state-of-the-art performance in zero-shot classification, image-text retrieval, image captioning, and tissue segmentation, outperforming both prior vision-language models and self-supervised pathology encoders.
CONCH consists of three jointly trained components: a ViT-B/16 vision encoder (~90M parameters) producing patch-level and global image representations; a 12-layer transformer text encoder (L12-E768-H12, ~110M parameters); and an autoregressive multimodal text decoder that shares weights with the text encoder. The two encoders are first pretrained independently, then jointly fine-tuned with the CoCa loss, which combines a contrastive alignment term between the global image and text embeddings with a next-token prediction captioning objective.
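The combined objective can be sketched numerically. The following is a minimal NumPy illustration of a CoCa-style loss, not the authors' implementation: `contrastive_loss` is a symmetric InfoNCE over paired global embeddings, `captioning_loss` is next-token cross-entropy over decoder logits, and the weights `lam_con`/`lam_cap` are illustrative placeholders rather than the published values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired global embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # true pairs lie on the diagonal
    loss_i2t = -np.log(softmax(logits, axis=1)[labels, labels]).mean()
    loss_t2i = -np.log(softmax(logits.T, axis=1)[labels, labels]).mean()
    return 0.5 * (loss_i2t + loss_t2i)

def captioning_loss(token_logits, target_ids):
    """Next-token cross-entropy: logits (B, T, V) vs. target ids (B, T)."""
    probs = softmax(token_logits, axis=-1)
    B, T = target_ids.shape
    picked = probs[np.arange(B)[:, None], np.arange(T)[None, :], target_ids]
    return -np.log(picked).mean()

def coca_loss(img_emb, txt_emb, token_logits, target_ids,
              lam_con=1.0, lam_cap=2.0):
    """Weighted sum of the two objectives (weights are illustrative)."""
    return (lam_con * contrastive_loss(img_emb, txt_emb)
            + lam_cap * captioning_loss(token_logits, target_ids))
```

In the real model both terms are minimized end-to-end by gradient descent; the sketch only shows how a single batch would be scored.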
The pretraining corpus of 1.17 million image-caption pairs was assembled from PubMed Central Open Access figures filtered to human histopathology with a YOLOv5-based detector, augmented with internally curated Mahmood Lab data. A GPT-style model split multi-panel figure captions into per-panel segments, and a CLIP-based matcher aligned each detected image region with its corresponding caption segment. Images span H&E, IHC, and special-stain preparations across numerous tissue types and cancer subtypes. The CONCH model weights are available on HuggingFace under a gated license requiring Mahmood Lab approval.
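The region-to-caption alignment step amounts to a similarity-based assignment, which can be illustrated with a short sketch. This is a hypothetical simplification, assuming each detected panel and each caption segment has already been embedded by a CLIP-style model; the actual pipeline's matching logic may differ.

```python
import numpy as np

def match_regions_to_captions(region_embs, caption_embs):
    """Assign each detected figure panel the caption segment with the
    highest cosine similarity.

    region_embs:  (R, D) embeddings of detected image regions
    caption_embs: (C, D) embeddings of split caption segments
    Returns an array of length R with the chosen caption index per region.
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = r @ c.T                 # (R, C) cosine similarity matrix
    return sims.argmax(axis=1)     # greedy per-region assignment
```

A greedy argmax can assign two panels to the same caption segment; a production pipeline would typically add one-to-one constraints or a similarity threshold to reject poor matches.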
CONCH is suited to a wide range of computational pathology research workflows. Pathologists and computational researchers can use its text-prompt-driven zero-shot capability to classify tissue types or cancer subtypes without curating labeled training sets. The image-text retrieval capability supports report-guided case search: retrieving similar cases from a database using a text description, or vice versa. Patch-level CONCH features serve as inputs to multiple instance learning (MIL) aggregators for whole-slide image classification and survival analysis. The generative decoder supports automated captioning of histopathology images, useful for report drafting research. The model's multi-stain training makes it applicable to IHC panels and special stains where purely H&E-trained models underperform.
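As one concrete illustration of the zero-shot workflow, the sketch below scores an image embedding against per-class prompt-template embeddings and averages over templates (prompt ensembling, one common way to reduce the prompt sensitivity noted below). The function name and the assumption of pre-computed embeddings are for illustration only; this is not the CONCH API.

```python
import numpy as np

def zero_shot_classify(img_emb, class_prompt_embs):
    """Pick the class whose prompt embeddings best match the image.

    img_emb:           (D,) image embedding from the vision encoder
    class_prompt_embs: dict mapping class name -> (P, D) array of text
                       embeddings for P prompt templates of that class
    Returns the name of the highest-scoring class.
    """
    v = img_emb / np.linalg.norm(img_emb)
    scores = {}
    for name, embs in class_prompt_embs.items():
        t = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        scores[name] = float((t @ v).mean())   # average over templates
    return max(scores, key=scores.get)
```

With a real model, the per-class arrays would come from encoding several phrasings of each class label (e.g. "an H&E image of ...") through the text encoder.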
CONCH represents a meaningful advance in the application of vision-language pretraining to computational pathology, demonstrating that large-scale paired biomedical text and images can be mined from scientific literature to train broadly capable models. Its Nature Medicine publication and open code release have contributed to growing adoption of the vision-language paradigm in pathology AI, complementing concurrent work such as PLIP and BiomedCLIP. Key limitations include patch-level inputs requiring external aggregation for whole-slide analysis, sensitivity of zero-shot performance to text prompt formulation, and gated access to model weights. CONCH has not been validated for clinical diagnostic use and requires independent clinical evaluation before any diagnostic application.
Lu, M. Y., et al. (2024). A visual-language foundation model for computational pathology. Nature Medicine.
DOI: 10.1038/s41591-024-02856-4