An 80M-parameter histology Vision Transformer foundation model that predicts spatial gene expression from H&E tissue images and transfers to tumor detection and spatial clustering.
Spatial transcriptomics (ST) jointly profiles gene expression and spatial context alongside histological images, but the technology remains costly and time-consuming, limiting routine clinical use. A practical workaround is to infer gene expression directly from inexpensive hematoxylin-and-eosin (H&E) tissue images, yet prior computational approaches have been constrained by limited accuracy and spatial resolution, often a consequence of small training sets and modest model capacity.
SpaFoundation, introduced in August 2025 by researchers at Central South University (Changsha, China), addresses this gap with a large-scale histology foundation model purpose-built to predict spatial gene expression from tissue images. Rather than training a task-specific predictor from scratch, it learns generalizable histological representations through domain-specific self-supervised pretraining, then applies them to spatial gene expression inference and related downstream tasks with minimal or no fine-tuning.
Within the landscape of histology foundation models, SpaFoundation is distinguished by its explicit focus on spatial omics: it pairs a general-purpose image encoder with the goal of high-resolution, transferable spatial gene expression prediction, which places it alongside contemporaries such as BRIDGE that also link histology and spatial transcriptomics.
SpaFoundation employs a teacher-student Vision Transformer (ViT) architecture that models dependencies among image patches, using an iBOT-style objective that jointly applies self-distillation and masked image modeling. The model has 80 million parameters and is pretrained on 1.79 million histology patches (the GitHub README cites approximately 1.84 million) spanning 26 tissue types, drawn from the HEST-1K spatial transcriptomics resource, which aggregates data from multiple platforms (including Spatial Transcriptomics, Visium, and Xenium) across human and mouse tissue. The authors validate the model on 117 samples and report that it consistently outperforms state-of-the-art baselines across four downstream tasks: spatial gene expression prediction, high-resolution gene expression inference, tumor detection, and spatial domain clustering. Downstream tumor-detection evaluation uses a cutaneous squamous cell carcinoma (cSCC) dataset (GEO accession GSE144240).
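To make the pretraining objective concrete, the following is a minimal NumPy sketch of an iBOT-style loss: a self-distillation term on the class token plus a masked image modeling term on masked patch tokens, with the teacher updated as an exponential moving average of the student. It is an illustration under stated assumptions, not the authors' implementation: the function names, temperatures, and momentum value are hypothetical, the inputs are assumed to be projection-head logits that have already been computed, and the teacher-output centering used in practice is omitted for brevity.

```python
import numpy as np

def softmax(logits, temp):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(teacher_probs, student_logits, student_temp):
    """Cross-entropy between teacher probabilities and student predictions."""
    log_p = np.log(softmax(student_logits, student_temp) + 1e-12)
    return -(teacher_probs * log_p).sum(axis=-1)

def ibot_style_loss(student_cls, teacher_cls, student_patch, teacher_patch,
                    mask, student_temp=0.1, teacher_temp=0.04):
    """Sum of a class-token self-distillation term and a masked-patch term.

    student_cls, teacher_cls:     (B, K) logits; in a real setup the two come
                                  from different augmented views of an image.
    student_patch, teacher_patch: (B, N, K) patch-token logits; the student
                                  sees the masked image, the teacher the full one.
    mask:                         (B, N) boolean, True where a patch was masked.
    Temperatures are assumed values, not taken from the paper.
    """
    # Self-distillation: the student matches the teacher's class-token distribution.
    loss_cls = cross_entropy(softmax(teacher_cls, teacher_temp),
                             student_cls, student_temp).mean()
    # Masked image modeling: only masked patch positions contribute.
    ce_patch = cross_entropy(softmax(teacher_patch, teacher_temp),
                             student_patch, student_temp)
    loss_mim = (ce_patch * mask).sum() / max(int(mask.sum()), 1)
    return loss_cls + loss_mim

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights follow the student as an exponential moving average."""
    return {name: momentum * teacher_params[name]
                  + (1.0 - momentum) * student_params[name]
            for name in teacher_params}
```

In this kind of setup only the student receives gradients; the teacher is never trained directly and changes only through `ema_update`, which is what keeps its targets stable enough for the student to distill from.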
SpaFoundation is aimed at researchers and pathologists who want spatial molecular insight without the expense of full spatial transcriptomics experiments. By inferring gene expression from routine H&E slides, it can extend molecular characterization to large image archives, support virtual ST for cohorts where sequencing is impractical, and provide transferable features for tumor detection and tissue-region clustering. Its open weights make it a candidate encoder for computational pathology and spatial omics pipelines that need a histology backbone tuned for expression-related tasks.
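As a rough picture of what using a frozen histology encoder for expression prediction can look like, the sketch below fits a ridge regression from per-spot patch embeddings to gene expression and scores it with per-gene Pearson correlation on held-out spots. Everything here is an assumption made for illustration: the embeddings are synthetic random vectors standing in for encoder outputs, the helper names are hypothetical, and a linear ridge probe is a common baseline rather than the prediction head described by the authors.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept.

    X: (n_spots, n_features) patch embeddings; Y: (n_spots, n_genes) expression.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    d = X.shape[1]
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b

def predict(X, W, b):
    """Predicted expression for each spot."""
    return X @ W + b

def per_gene_pearson(Y_true, Y_pred):
    """Pearson correlation across spots, computed separately for each gene."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0)) + 1e-12
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-in data: 200 spots, 16-dim "embeddings", 5 "genes".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
W_true = rng.normal(size=(16, 5))
Y = X @ W_true + 0.1 * rng.normal(size=(200, 5))

# Hold out the last 50 spots for evaluation.
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]

W, b = fit_ridge(X_train, Y_train, alpha=1.0)
scores = per_gene_pearson(Y_test, predict(X_test, W, b))
print("mean per-gene Pearson r:", float(scores.mean()))
```

With real data, `X` would hold embeddings extracted from H&E patches centered on each spot and `Y` the normalized expression of the genes of interest. Splitting by slide or patient rather than by spot gives a more honest estimate of how well predictions transfer to unseen tissue.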
By demonstrating that domain-specific pretraining on roughly 1.79 million histology patches yields representations that outperform task-specific baselines across several spatial omics tasks, SpaFoundation reinforces a broader trend toward foundation-model-driven inference of spatial gene expression from inexpensive imaging. Released openly with code and weights, it lowers the barrier for groups exploring image-to-expression prediction. As a recent preprint, its real-world adoption and independent benchmarking are still emerging, and reported gains should be read in the context of the authors' own evaluation; the model's reliance on H&E appearance also means inferred expression remains a prediction rather than a measurement.