Hierarchical single-cell foundation model trained with Masked Cell Modeling on 36.2M cells to produce zero-shot donor-level embeddings for disease and clinical prediction.
PULSAR (Patient Understanding Leveraging Single-cell universAl Representation) is a hierarchical, multi-scale single-cell foundation model developed by the Leskovec lab (SNAP group) at Stanford University and released as a bioRxiv preprint in November 2025. While most single-cell foundation models — Geneformer, scGPT, scFoundation, UCE — learn representations at the level of individual cells, PULSAR targets a higher unit of biological and clinical interest: the donor. It converts an unordered set of single cells profiled from a sample into a single fixed-length donor embedding that summarizes that individual's immune state, enabling patient-level inference directly from single-cell RNA-seq data.
The model addresses a persistent gap between single-cell analysis and translational application. Disease phenotypes, biomarker levels, and treatment responses are properties of patients, not isolated cells, yet aggregating per-cell representations into a patient summary typically discards the multicellular structure that encodes immune coordination. PULSAR instead models information flow explicitly across three scales — genes to cells to multicellular systems — so that the resulting donor embedding reflects both individual cell states and their composition within a sample.
PULSAR is currently specialized for peripheral immune profiling. It is pretrained broadly across tissues and then continually trained on peripheral blood mononuclear cell (PBMC) data, making it well suited to the large and growing body of blood-based single-cell cohorts used in immunology, autoimmunity, and clinical studies.
PULSAR is a Transformer encoder-decoder with approximately 87.4 million parameters and a 1,024-token context length. Gene-level inputs are represented using ESM2 protein embeddings, cell-level representations build on Universal Cell Embeddings (1,280-dimensional), and these are composed hierarchically into a 512-dimensional donor embedding. Pretraining proceeds in two stages: broad pretraining on 36.2 million cells from 6,807 samples spanning 53 tissues drawn from the CZ CELLxGENE Census, followed by continual pretraining on roughly 8.7 million blood/PBMC cells from 2,588 samples. The self-supervised objective is Masked Cell Modeling, which trains the model to reconstruct masked cells from the remaining cellular context of a sample, so that the donor embedding captures both cell identity and multicellular structure.
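The Masked Cell Modeling objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The 30% masking ratio, the zero vector standing in for a learned mask token, the mean-squared-error loss, and the mean-of-visible-cells "predictor" are all assumptions chosen for illustration, as is treating each of the 1,024 context tokens as one cell embedding.

```python
import numpy as np

CELL_DIM = 1280     # width of a Universal Cell Embedding, as stated in the text
CONTEXT_LEN = 1024  # context length from the text; one token per cell is an assumption


def mask_cells(cells, mask_ratio=0.3, rng=None):
    """Hide a random subset of cells; return (corrupted, mask).

    mask[i] is True for cells the model must reconstruct. The masking
    ratio is a hypothetical value, not one taken from the source.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = cells.shape[0]
    n_masked = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=n_masked, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    corrupted = cells.copy()
    corrupted[mask] = 0.0  # stand-in for a learned mask token (assumption)
    return corrupted, mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed over the masked cells only."""
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
cells = rng.normal(size=(CONTEXT_LEN, CELL_DIM)).astype(np.float32)
corrupted, mask = mask_cells(cells, mask_ratio=0.3, rng=rng)

# Toy predictor: guess every cell as the mean of the visible cells.
# A real model would replace this with the Transformer's output.
visible_mean = corrupted[~mask].mean(axis=0)
pred = np.tile(visible_mean, (CONTEXT_LEN, 1))
loss = masked_reconstruction_loss(pred, cells, mask)
```

Scoring the loss only on masked positions is what forces a model to infer hidden cells from the rest of the sample, which is the intuition behind learning multicellular structure rather than per-cell identity alone.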
The model is evaluated on translational tasks including zero-shot age regression (using subsampled OneK1K data), lupus disease classification, plasma-proteomics and vaccine-response prediction, and donor embedding search via an accompanying DONORxEMBED database. Cell-level attention provides interpretability by indicating which cells most influence a given donor-level prediction. Two checkpoints are released: PULSAR-pbmc, a zero-shot model for PBMC tasks, and PULSAR-aligned, a disease-label-aligned variant. The model is explicitly out of scope for non-PBMC tissues and cell-sorted samples.
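To make the idea of cell-level attention concrete, the following sketch pools 1,280-dimensional cell embeddings into a 512-dimensional donor vector with a single attention query and then ranks cells by their attention weight. The single-query pooling, the random `query` and `proj` parameters, and the function name `attention_pool` are hypothetical; the source does not describe PULSAR's attention mechanism at this level of detail.

```python
import numpy as np


def attention_pool(cells, query, proj):
    """Pool cell embeddings into one donor vector via softmax attention.

    cells: (n_cells, 1280) cell embeddings
    query: (1280,) hypothetical learned query vector
    proj:  (1280, 512) hypothetical projection to the donor space
    Returns (donor_embedding, attention_weights).
    """
    scores = cells @ query / np.sqrt(cells.shape[1])
    scores = scores - scores.max()  # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    donor = (weights @ cells) @ proj
    return donor, weights


rng = np.random.default_rng(0)
cells = rng.normal(size=(200, 1280))
query = rng.normal(size=1280)
proj = rng.normal(size=(1280, 512)) / np.sqrt(1280)

donor, weights = attention_pool(cells, query, proj)

# Cells with the largest weights contribute most to the donor embedding.
top_cells = np.argsort(weights)[::-1][:5]
```

In practice the indices in `top_cells` would be mapped back to cell metadata (for example, annotated cell types) to see which populations drive a prediction.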
PULSAR is aimed at researchers translating single-cell data into patient-level insight, particularly in immunology, autoimmunity, and clinical cohort analysis. Because it produces donor embeddings without requiring labels, it supports zero-shot disease classification, biomarker and clinical-endpoint prediction (such as plasma proteomics and vaccine response), and rapid reference mapping of new samples against existing cohorts via similarity search. Its generative branch lets investigators simulate cytokine perturbations in silico to hypothesize how an individual's immune landscape might respond to stimulation, complementing costly experimental perturbation screens. The PBMC focus aligns the model with the large repositories of blood-based single-cell data common in clinical and population immunology studies.
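Label-free donor embeddings lend themselves to nearest-neighbor reference mapping. The sketch below assumes cosine similarity and a majority vote over the five closest reference donors. The synthetic embeddings, the labels, and the helper names `cosine_top_k` and `knn_label` are illustrative only and do not reflect the DONORxEMBED interface.

```python
import numpy as np
from collections import Counter


def cosine_top_k(query, reference, k=5):
    """Return indices and cosine similarities of the k closest reference donors."""
    q = query / np.linalg.norm(query)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sims = r @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]


def knn_label(query, reference, labels, k=5):
    """Assign the majority label among the k nearest reference donors."""
    idx, _ = cosine_top_k(query, reference, k)
    return Counter(labels[i] for i in idx).most_common(1)[0][0]


# Synthetic 512-dimensional donor embeddings forming two well-separated groups.
rng = np.random.default_rng(0)
center = rng.normal(size=512)
healthy = center + 0.1 * rng.normal(size=(20, 512))
lupus = -center + 0.1 * rng.normal(size=(20, 512))
reference = np.vstack([healthy, lupus])
labels = ["healthy"] * 20 + ["lupus"] * 20

# A new donor drawn near the "healthy" group.
query = center + 0.1 * rng.normal(size=512)
neighbors, sims = cosine_top_k(query, reference, k=5)
predicted = knn_label(query, reference, labels, k=5)
```

Because no classifier is trained, the same lookup serves both similarity search against an existing cohort and a simple zero-shot label transfer.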
PULSAR reframes single-cell foundation modeling around the donor rather than the cell, directly targeting the translational questions that motivate much clinical single-cell profiling: who has disease, what their biomarkers are, and how they will respond. Its integration of ESM2 protein priors, Universal Cell Embeddings, and a multicellular hierarchy demonstrates a path for composing existing foundation models across biological scales, while open MIT-licensed code and weights with a from_pretrained API lower the barrier to adoption. The current release is limited to peripheral immune (PBMC) contexts and excludes non-blood tissues and sorted samples, so generalization to solid-tissue atlases remains future work. Licensing differs between artifacts: the bioRxiv preprint is posted under CC BY-NC-ND, while the released code and model weights are MIT-licensed.
Pang, K., et al. (2025) PULSAR: a Foundation Model for Multi-scale and Multi-cellular Biology. bioRxiv.
DOI: 10.1101/2025.11.24.685470