A rigorous benchmarking study of scBERT and scGPT for cell type annotation, comparing foundation models against logistic regression baselines.
As foundation models adapted from natural language processing began proliferating in the single-cell biology space, the computational biology community lacked rigorous, apples-to-apples evaluations of how much these large pretrained models actually improve over simpler baselines. "A Deep Dive into Single-Cell RNA Sequencing Foundation Models," by Rebecca Boiarsky and colleagues at MIT CSAIL and the Broad Institute, directly addresses this gap. Published as a bioRxiv preprint in October 2023 and subsequently as a correspondence in Nature Machine Intelligence in December 2024, the study subjects two prominent scRNA-seq foundation models — scBERT and scGPT — to systematic critical evaluation on the task of cell type annotation.
The central finding is striking: for scBERT, a simple L1-regularized logistic regression classifier consistently matches or outperforms the pretrained model on cell type annotation, even in few-shot settings where only a fraction of labeled training data is available. For scGPT, the picture is more nuanced — pretraining does provide a measurable benefit over training from scratch — but the gains depend heavily on hyperparameter choices and random initialization, raising questions about robustness and reproducibility. Ablation experiments further show that removing scBERT's pretraining step does not meaningfully degrade fine-tuning performance, suggesting that the BERT-style masked gene expression pretraining task does not encode representations that transfer usefully to downstream annotation.
This work is best understood not as a standalone model but as a methodological benchmark and a call for higher standards in evaluating biological foundation models. The authors provide fully reproducible code and evaluation pipelines, making it a practical resource for researchers who want to audit new scRNA-seq models against honest baselines before adopting them.
The evaluation framework targets cell type annotation as its primary task, treating it as a multi-class classification problem in which a fine-tuned model assigns cell type labels to individual cells based on their gene expression profiles. scBERT is a BERT-style transformer that uses gene2vec embeddings to represent genes, pretrained with a masked-expression objective on large unlabeled scRNA-seq corpora. scGPT is a GPT-style autoregressive transformer pretrained on over 33 million human cells. Both models are fine-tuned on labeled scRNA-seq benchmark datasets and compared against an L1-regularized logistic regression trained on the same input features.
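To make the baseline concrete, an L1-regularized logistic regression of this kind can be fit directly on expression features with scikit-learn. This is a minimal sketch on synthetic data, not the authors' exact pipeline: the matrix `X`, the labels `y`, and all hyperparameter values here are hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 2000 cells x 300 genes, 4 hypothetical cell types.
rng = np.random.default_rng(0)
X = np.log1p(rng.poisson(1.0, size=(2000, 300)).astype(float))  # log-normalized counts
y = rng.integers(0, 4, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# The L1 penalty drives most gene coefficients to zero, yielding a sparse,
# marker-gene-style classifier that is cheap to train and easy to inspect.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=1000)
clf.fit(X_tr, y_tr)

sparsity = np.mean(clf.coef_ == 0)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}, zero coefficients: {sparsity:.0%}")
```

The few-shot comparisons in the study follow the same recipe, simply refitting the baseline on progressively smaller fractions of the labeled training cells.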
The ablation experiments systematically remove or randomize the pretrained weights to measure the isolated contribution of pretraining. Results show that for scBERT, the pretraining task — predicting masked gene expression values from context — does not produce representations that transfer to cell type annotation any better than a randomly initialized or linear model. For scGPT, pretraining does improve performance relative to no-pretraining controls, particularly on the Multiple Sclerosis dataset, though performance is sensitive to fine-tuning hyperparameters and initialization seeds. The study also probes whether the gene2vec embeddings used by scBERT contribute to performance, finding minimal independent effect. Accuracy and macro-averaged F1 scores are the primary reported metrics.
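The two reported metrics weigh errors differently: accuracy is dominated by abundant cell types, while macro-averaged F1 gives every cell type equal weight and so is far more sensitive to failures on rare populations. A small illustration with made-up labels (using scikit-learn's standard metric implementations, not the paper's code):

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: 8 abundant "T" cells, 2 "rare" cells.
y_true = ["T"] * 8 + ["rare"] * 2
# A degenerate classifier that predicts the majority class for every cell.
y_pred = ["T"] * 10

acc = accuracy_score(y_true, y_pred)  # 0.80: looks respectable
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)  # ~0.44
print(f"accuracy={acc:.2f}, macro F1={macro_f1:.2f}")
```

Here the majority-class predictor scores 80% accuracy while its macro F1 collapses to about 0.44, because the "rare" class contributes an F1 of zero that is averaged in with equal weight.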
This benchmark framework is directly applicable to any researcher evaluating new scRNA-seq foundation models before deployment in cell atlas annotation, disease subtype classification, or rare cell type discovery. The comparison protocols and baseline implementations can be adopted to audit future models — such as Geneformer, scFoundation, or CellPLM — under a consistent evaluation scheme. The findings also inform decisions about when the computational overhead of large pretrained models is justified versus when simpler, more interpretable classifiers should be preferred.
This study contributed to a broader critical conversation about the standards for evaluating biological foundation models that emerged across the field in 2023–2024. Its central result — that a logistic regression baseline suffices for scBERT-level cell type annotation — prompted responses from the scBERT development team and was cited in subsequent work examining the zero-shot and few-shot capabilities of scRNA-seq models. The expanded version published in Nature Machine Intelligence in December 2024 brought the findings to a wider audience and helped establish tighter expectations for what "improved performance" means in the single-cell foundation model literature. A key limitation of the study is its focus on cell type annotation as the sole evaluation task; other capabilities claimed by these models, such as gene regulatory network inference or batch integration, are not assessed, leaving open questions about the broader utility of pretraining in single-cell biology.
Boiarsky, R., et al. (2023) A Deep Dive into Single-Cell RNA Sequencing Foundation Models. bioRxiv.
DOI: 10.1101/2023.10.19.563100
Boiarsky, R., et al. (2024) Deeper evaluation of a single-cell foundation model. Nature Machine Intelligence.
DOI: 10.1038/s42256-024-00949-w