A rigorous benchmarking study of scBERT and scGPT for cell type annotation, comparing foundation models against logistic regression baselines.
As foundation models adapted from natural language processing began proliferating in the single-cell biology space, the computational biology community lacked rigorous, apples-to-apples evaluations of how much these large pretrained models actually improve over simpler baselines. "A Deep Dive into Single-Cell RNA Sequencing Foundation Models," by Rebecca Boiarsky and colleagues at MIT CSAIL and the Broad Institute, directly addresses this gap. Published as a bioRxiv preprint in October 2023 and subsequently as a correspondence in Nature Machine Intelligence in December 2024, the study subjects two prominent scRNA-seq foundation models — scBERT and scGPT — to systematic critical evaluation on the task of cell type annotation.
The central finding is striking: for scBERT, a simple L1-regularized logistic regression classifier consistently matches or outperforms the pretrained model on cell type annotation, even in few-shot settings where only a fraction of labeled training data is available. For scGPT, the picture is more nuanced — pretraining does provide a measurable benefit over training from scratch — but the gains depend heavily on hyperparameter choices and random initialization, raising questions about robustness and reproducibility. Ablation experiments further show that removing scBERT's pretraining step does not meaningfully degrade fine-tuning performance, suggesting that the BERT-style masked gene expression pretraining task does not encode representations that transfer usefully to downstream annotation.
This work is best understood not as a standalone model but as a methodological benchmark and a call for higher standards in evaluating biological foundation models. The authors provide fully reproducible code and evaluation pipelines, making it a practical resource for researchers who want to audit new scRNA-seq models against honest baselines before adopting them.
The evaluation framework targets cell type annotation as its primary task, treating it as a multi-class classification problem in which a fine-tuned model assigns cell type labels to individual cells based on their gene expression profiles. scBERT is a BERT-style transformer that uses gene2vec embeddings to represent genes, pretrained with a masked-expression objective on large unlabeled scRNA-seq corpora. scGPT is a GPT-style autoregressive transformer pretrained on over 33 million human cells. Both models are fine-tuned on labeled scRNA-seq benchmark datasets and compared against an L1-regularized logistic regression trained on the same input features.
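To make the baseline concrete, an L1-regularized logistic regression of this kind can be fit directly on expression features with scikit-learn. This is a minimal sketch on synthetic data, not the authors' exact pipeline: the matrix `X`, the labels `y`, and all hyperparameter values here are hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 2000 cells x 300 genes, 4 hypothetical cell types.
rng = np.random.default_rng(0)
X = np.log1p(rng.poisson(1.0, size=(2000, 300)).astype(float))  # log-normalized counts
y = rng.integers(0, 4, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# The L1 penalty drives most gene coefficients to zero, yielding a sparse,
# marker-gene-style classifier that is cheap to train and easy to inspect.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=1000)
clf.fit(X_tr, y_tr)

sparsity = np.mean(clf.coef_ == 0)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}, zero coefficients: {sparsity:.0%}")
```

The few-shot comparisons in the study follow the same recipe, simply refitting the baseline on progressively smaller fractions of the labeled training cells.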
The ablation experiments systematically remove or randomize the pretrained weights to measure the isolated contribution of pretraining. Results show that for scBERT, the pretraining task — predicting masked gene expression values from context — does not produce representations that transfer to cell type annotation any better than a randomly initialized or linear model. For scGPT, pretraining does improve performance relative to no-pretraining controls, particularly on the Multiple Sclerosis dataset, though performance is sensitive to fine-tuning hyperparameters and initialization seeds. The study also probes whether the gene2vec embeddings used by scBERT contribute to performance, finding minimal independent effect. Accuracy and macro-averaged F1 scores are the primary reported metrics.
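The two reported metrics weigh errors differently: accuracy is dominated by abundant cell types, while macro-averaged F1 gives every cell type equal weight and so is far more sensitive to failures on rare populations. A small illustration with made-up labels (using scikit-learn's standard metric implementations, not the paper's code):

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: 8 abundant "T" cells, 2 "rare" cells.
y_true = ["T"] * 8 + ["rare"] * 2
# A degenerate classifier that predicts the majority class for every cell.
y_pred = ["T"] * 10

acc = accuracy_score(y_true, y_pred)  # 0.80: looks respectable
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)  # ~0.44
print(f"accuracy={acc:.2f}, macro F1={macro_f1:.2f}")
```

Here the majority-class predictor scores 80% accuracy while its macro F1 collapses to about 0.44, because the "rare" class contributes an F1 of zero that is averaged in with equal weight.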
This benchmark framework is directly applicable to any researcher evaluating new scRNA-seq foundation models before deployment in cell atlas annotation, disease subtype classification, or rare cell type discovery. The comparison protocols and baseline implementations can be adopted to audit future models — such as Geneformer, scFoundation, or CellPLM — under a consistent evaluation scheme. The findings also inform decisions about when the computational overhead of large pretrained models is justified versus when simpler, more interpretable classifiers should be preferred.
This study contributed to a broader critical conversation about the standards for evaluating biological foundation models that emerged across the field in 2023–2024. Its central result — that a logistic regression baseline suffices for scBERT-level cell type annotation — prompted responses from the scBERT development team and was cited in subsequent work examining the zero-shot and few-shot capabilities of scRNA-seq models. The expanded version published in Nature Machine Intelligence in December 2024 brought the findings to a wider audience and helped establish tighter expectations for what "improved performance" means in the single-cell foundation model literature. A key limitation of the study is its focus on cell type annotation as the sole evaluation task; other capabilities claimed by these models, such as gene regulatory network inference or batch integration, are not assessed, leaving open questions about the broader utility of pretraining in single-cell biology.
Boiarsky, R., et al. (2023) A Deep Dive into Single-Cell RNA Sequencing Foundation Models. bioRxiv.
DOI: 10.1101/2023.10.19.563100
Boiarsky, R., et al. (2024) Deeper evaluation of a single-cell foundation model. Nature Machine Intelligence.
DOI: 10.1038/s42256-024-00949-w