PULSAR

Hierarchical single-cell foundation model trained with Masked Cell Modeling on 36.2M cells to produce zero-shot donor-level embeddings for disease and clinical prediction.

Released: November 2025

Parameters: 87.4 Million

PULSAR (Patient Understanding Leveraging Single-cell universAl Representation) is a hierarchical, multi-scale single-cell foundation model developed by the Leskovec lab (SNAP group) at Stanford University and released as a bioRxiv preprint in November 2025. While most single-cell foundation models — Geneformer, scGPT, scFoundation, UCE — learn representations at the level of individual cells, PULSAR targets a higher unit of biological and clinical interest: the donor. It converts an unordered set of single cells profiled from a sample into a single fixed-length donor embedding that summarizes that individual's immune state, enabling patient-level inference directly from single-cell RNA-seq data.

The model addresses a persistent gap between single-cell analysis and translational application. Disease phenotypes, biomarker levels, and treatment responses are properties of patients, not isolated cells, yet aggregating per-cell representations into a patient summary typically discards the multicellular structure that encodes immune coordination. PULSAR instead models information flow explicitly across three scales — genes to cells to multicellular systems — so that the resulting donor embedding reflects both individual cell states and their composition within a sample.

PULSAR is currently specialized for peripheral immune profiling. It is pretrained broadly across tissues and then continually trained on peripheral blood mononuclear cell (PBMC) data, making it well suited to the large and growing body of blood-based single-cell cohorts used in immunology, autoimmunity, and clinical studies.

Key Features

Donor-level embeddings: PULSAR maps an unordered set of single cells from a sample to a 512-dimensional donor embedding, shifting the unit of representation from the cell to the patient and enabling direct sample-level prediction.
Hierarchical multi-scale design: The architecture integrates gene-to-cell-to-multicellular information flows, capturing both individual cell states and the multicellular composition that defines a donor's immune profile.
ESM2 protein priors: Gene tokens are grounded in ESM2 protein language model embeddings, injecting protein-sequence knowledge into the single-cell representation and connecting the model to the protein modeling landscape.
Masked Cell Modeling pretraining: A self-supervised Masked Cell Modeling objective — masking and reconstructing cells within a sample — trains the model without disease or phenotype labels, yielding broadly transferable zero-shot embeddings.
Zero-shot clinical inference: Donor embeddings support zero-shot disease classification (e.g., lupus), age regression, plasma-proteomic biomarker prediction, and reference mapping without task-specific fine-tuning.
Generative perturbation simulation: The encoder-decoder design enables in silico simulation of cytokine perturbations, predicting how a donor's cellular landscape shifts under defined immune stimuli.
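As a rough illustration of the set-to-vector idea behind donor embeddings, the sketch below pools a matrix of cell embeddings into one fixed-length vector with a single attention query. It is not PULSAR's architecture: the `pool_cells` function, the random `query`, and the random projection `w_out` are hypothetical stand-ins, and only the 1,280 and 512 dimensions come from the description above. The point it shows is that the output does not depend on the order in which cells are listed.

```python
import numpy as np

CELL_DIM = 1280   # cell embedding size quoted for Universal Cell Embeddings
DONOR_DIM = 512   # donor embedding size quoted for PULSAR


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def pool_cells(cells, query, w_out):
    """Attention-pool a (n_cells, CELL_DIM) matrix into one DONOR_DIM vector.

    Hypothetical helper: one query vector scores every cell, the scores
    become weights, and the weighted average is projected to DONOR_DIM.
    """
    scores = cells @ query / np.sqrt(cells.shape[1])   # one score per cell
    weights = softmax(scores)                          # sums to 1 over cells
    pooled = weights @ cells                           # weighted mean of cells
    return pooled @ w_out, weights


rng = np.random.default_rng(0)
cells = rng.normal(size=(200, CELL_DIM))               # synthetic cells
query = rng.normal(size=CELL_DIM)                      # stand-in for a learned query
w_out = rng.normal(size=(CELL_DIM, DONOR_DIM)) / np.sqrt(CELL_DIM)

donor, weights = pool_cells(cells, query, w_out)

# Shuffling the cells must not change the donor vector.
perm = rng.permutation(len(cells))
donor_shuffled, _ = pool_cells(cells[perm], query, w_out)

print(donor.shape)
print(np.allclose(donor, donor_shuffled))
```

The per-cell `weights` returned here are also the kind of quantity that can be inspected to see which cells dominate a pooled representation.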

Technical Details

PULSAR is a Transformer encoder-decoder with approximately 87.4 million parameters and a 1,024-token context length. Gene-level inputs are represented using ESM2 protein embeddings, cell-level representations build on Universal Cell Embeddings (1,280-dimensional), and these are composed hierarchically into a 512-dimensional donor embedding. Pretraining proceeds in two stages: broad pretraining on 36.2 million cells from 6,807 samples spanning 53 tissues drawn from the CZ CELLxGENE Census, followed by continual pretraining on roughly 8.7 million blood/PBMC cells from 2,588 samples. The self-supervised objective is Masked Cell Modeling, which trains the model to reconstruct masked cells from the remaining cellular context of a sample, so that the donor embedding captures both cell identity and multicellular structure.
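The Masked Cell Modeling objective can be sketched in a few lines, under several assumptions that are not stated in the source: that each of the 1,024 context tokens corresponds to one cell, that 25% of cells are masked, that masked cells are replaced with zeros rather than a learned mask embedding, and that reconstruction is scored with mean squared error on the masked positions only. The "model" here is a deliberately trivial baseline that predicts the mean of the visible cells.

```python
import numpy as np


def mask_cells(sample, mask_ratio, rng):
    """Return a corrupted copy of the sample and a boolean mask of hidden cells."""
    n_cells = sample.shape[0]
    n_masked = max(1, int(round(mask_ratio * n_cells)))
    masked_idx = rng.choice(n_cells, size=n_masked, replace=False)
    mask = np.zeros(n_cells, dtype=bool)
    mask[masked_idx] = True
    corrupted = sample.copy()
    corrupted[mask] = 0.0          # assumption: zeros stand in for a [MASK] embedding
    return corrupted, mask


def reconstruction_loss(predicted, target, mask):
    """Mean squared error computed on the masked cells only."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
sample = rng.normal(size=(1024, 1280))   # synthetic sample: 1,024 cells x 1,280 dims
corrupted, mask = mask_cells(sample, mask_ratio=0.25, rng=rng)

# Toy baseline: predict every cell as the mean of the visible (unmasked) cells.
visible_mean = corrupted[~mask].mean(axis=0)
prediction = np.tile(visible_mean, (sample.shape[0], 1))

loss = reconstruction_loss(prediction, sample, mask)
print(int(mask.sum()), "cells masked; baseline loss:", round(loss, 3))
```

A trained model would replace the mean baseline with predictions conditioned on the visible cells, which is what forces the sample-level representation to encode multicellular context.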

The model is evaluated on translational tasks including zero-shot age regression (using subsampled OneK1K data), lupus disease classification, plasma-proteomics and vaccine-response prediction, and donor embedding search via an accompanying DONORxEMBED database. Cell-level attention provides interpretability by indicating which cells most influence a given donor-level prediction. Two checkpoints are released: PULSAR-pbmc, a zero-shot model for PBMC tasks, and PULSAR-aligned, a disease-label-aligned variant. Non-PBMC tissues and cell-sorted samples are explicitly out of scope.
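One common way to assess frozen embeddings on a task such as age regression is to fit a small linear probe on top of them and score held-out donors. The sketch below does this with closed-form ridge regression on synthetic 512-dimensional vectors. It is an assumption about how such an evaluation could look, not the protocol used in the preprint; the data, the `fit_ridge` helper, and the `alpha` value are all illustrative.

```python
import numpy as np


def fit_ridge(x, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean
    gram = xc.T @ xc + alpha * np.eye(x.shape[1])
    coef = np.linalg.solve(gram, xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


rng = np.random.default_rng(0)
n_donors, dim = 1000, 512

# Synthetic stand-ins for frozen donor embeddings and donor ages.
embeddings = rng.normal(size=(n_donors, dim))
true_direction = rng.normal(size=dim) / np.sqrt(dim)
age = 50 + 15 * (embeddings @ true_direction) + rng.normal(scale=3.0, size=n_donors)

# Fit the probe on 800 donors and evaluate on the remaining 200.
train, test = slice(0, 800), slice(800, 1000)
coef, intercept = fit_ridge(embeddings[train], age[train], alpha=10.0)
predicted = embeddings[test] @ coef + intercept

r = pearson_r(predicted, age[test])
print(f"held-out Pearson r: {r:.2f}")
```

Because the embedding model itself is never updated, any signal the probe recovers must already be present in the embeddings.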

Applications

PULSAR is aimed at researchers translating single-cell data into patient-level insight, particularly in immunology, autoimmunity, and clinical cohort analysis. Because it produces donor embeddings without requiring labels, it supports zero-shot disease classification, biomarker and clinical-endpoint prediction (such as plasma proteomics and vaccine response), and rapid reference mapping of new samples against existing cohorts via similarity search. Its generative branch lets investigators simulate cytokine perturbations in silico to hypothesize how an individual's immune landscape might respond to stimulation, complementing costly experimental perturbation screens. The PBMC focus aligns the model with the large repositories of blood-based single-cell data common in clinical and population immunology studies.
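Reference mapping by similarity search reduces to a nearest-neighbour lookup over stored donor embeddings. The following sketch uses cosine similarity over a synthetic reference set; the `build_index` and `search` helpers and the donor identifiers are hypothetical and do not reflect the DONORxEMBED interface, which is not described here.

```python
import numpy as np


def build_index(embeddings):
    """L2-normalise rows so that a dot product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / norms


def search(index, donor_ids, query, top_k=3):
    """Return the top_k (donor_id, cosine similarity) pairs for one query vector."""
    q = query / np.linalg.norm(query)
    sims = index @ q
    best = np.argsort(-sims)[:top_k]          # highest similarity first
    return [(donor_ids[i], float(sims[i])) for i in best]


rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 512))       # synthetic reference cohort
donor_ids = [f"donor_{i:03d}" for i in range(500)]
index = build_index(reference)

# A new sample that is a slightly noisy copy of reference donor 42.
query = reference[42] + 0.1 * rng.normal(size=512)
hits = search(index, donor_ids, query)

for donor_id, similarity in hits:
    print(donor_id, round(similarity, 3))
```

Labels attached to the retrieved neighbours (for example disease status) can then be transferred to the query by majority vote, which is one simple route to label-free classification of a new sample.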

Impact

PULSAR reframes single-cell foundation modeling around the donor rather than the cell, directly targeting the translational questions that motivate much clinical single-cell profiling: who has disease, what their biomarkers are, and how they will respond. Its integration of ESM2 protein priors, Universal Cell Embeddings, and a multicellular hierarchy demonstrates a path for composing existing foundation models across biological scales, while open MIT-licensed code and weights with a from_pretrained API lower the barrier to adoption. The current release is limited to peripheral immune (PBMC) contexts and excludes non-blood tissues and sorted samples, so generalization to solid-tissue atlases remains future work. Licensing differs between artifacts: the bioRxiv preprint is posted under CC BY-NC-ND, while the released code and model weights are MIT-licensed.

Citation

PULSAR: a Foundation Model for Multi-scale and Multi-cellular Biology

Preprint

Pang, K., et al. (2025) PULSAR: a Foundation Model for Multi-scale and Multi-cellular Biology. bioRxiv.

DOI: 10.1101/2025.11.24.685470

Recent citations

Papers that recently cited this model.

From virtual experiments to biomedical insight with synthetic data
M. Victoriano, Milena Pavlović, G. K. Sandve, et al.
Nature Machine Intelligence · Jun 2026
0
GenePT Revisited: Do Better Text Embeddings Make Better Gene Embeddings?
Jonathan G. Hedley, Philip Torr, Kaspar Märtens
bioRxiv · Apr 2026
0

Citations

Total Citations: 2

Influential: 0

References: 0

GitHub

Stars: 34

Forks: 6

Open Issues: 2

Contributors: 1

Last Push: 3 months ago

Language: Python

License: MIT

HuggingFace

Downloads: 291

Likes: 3

Last Modified: 6 months ago

Fields of citing research

Computer Science: 100%
Biology: 50%
Medicine: 50%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Overall: 58 (Partial)

Usability (can I run it?): 92

Reproducibility (can I retrain it?): 29

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model
