A 4.9-billion-parameter diffusion language model for predicting genome-wide genetic perturbation responses, trained on the largest CRISPRi Perturb-seq dataset built to date.
X-Cell is a 4.9-billion-parameter diffusion language model for predicting genome-wide genetic perturbation responses in single cells, developed by Xaira Therapeutics and announced in March 2026 with an accompanying bioRxiv preprint. It is trained on X-Atlas/Pisces, the largest CRISPRi Perturb-seq corpus assembled to date, comprising 25.6 million perturbed single-cell transcriptomes spanning thousands of gene knockdowns across multiple cell contexts.
X-Cell is notable as the first virtual-cell model to demonstrate clear scaling laws in the perturbation domain: performance on out-of-distribution gene knockdowns improves predictably with both data and parameter count, mirroring the scaling behavior observed in language models. This puts CRISPR perturbation modeling on a similar empirical footing to natural-language modeling and supports the case for continued investment in larger Perturb-seq datasets.
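To make "improves predictably" concrete, the following is a minimal sketch of how a power-law scaling curve of the form error = a * size^(-b) can be fitted by linear regression in log-log space. The functional form, the function name `fit_power_law`, and every number below are illustrative assumptions; the data points are synthetic and are not results from the preprint.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(size).

    Returns (a, b) for the assumed form error = a * size ** -b.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic placeholder points generated from error = 2.0 * size ** -0.1.
# They stand in for (parameter count, held-out error) pairs and are NOT
# measurements reported for X-Cell.
sizes = [1e7, 1e8, 1e9, 5e9]
errors = [2.0 * s ** -0.1 for s in sizes]

a, b = fit_power_law(sizes, errors)
print(f"fitted a={a:.3f}, b={b:.3f}")
```

A fit of this kind is only informative if the points fall close to a straight line in log-log space; the same procedure applies when dataset size rather than parameter count is the variable on the x-axis.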
X-Cell uses a transformer backbone adapted for token-like representations of gene-expression count vectors. It is trained with a discrete diffusion process: the forward process masks gene-expression tokens, and the model learns to reconstruct them conditional on perturbation identity and baseline cell state. The training corpus, Xaira's proprietary X-Atlas/Pisces, combines published Perturb-seq datasets with substantial in-house data generation. Training was performed on standard transformer infrastructure; full hyperparameters are reported in the bioRxiv preprint.
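The masking idea can be illustrated with a generic absorbing-state ("mask") forward process, in which each token is independently replaced by a mask symbol with probability t. This is a sketch of the general technique under stated assumptions, not X-Cell's actual implementation: the sentinel `MASK`, the integer expression bins, the uniform sampling of t, and the dictionary layout are all hypothetical.

```python
import random

MASK = -1  # hypothetical sentinel; real expression tokens are assumed to be >= 0

def forward_mask(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]

def make_training_example(tokens, perturbation, baseline, rng):
    """Build one denoising example: masked input, conditioning, and targets.

    The targets are the original tokens at the masked positions, which a
    denoising model would be trained to reconstruct.
    """
    t = rng.random()  # assumed: noise level sampled uniformly from [0, 1)
    noised = forward_mask(tokens, t, rng)
    targets = [(i, tok) for i, (tok, n) in enumerate(zip(tokens, noised)) if n == MASK]
    return {
        "input": noised,
        "perturbation": perturbation,  # conditioning: which gene was knocked down
        "baseline": baseline,          # conditioning: unperturbed cell state
        "t": t,
        "targets": targets,
    }

rng = random.Random(0)
expression_bins = [3, 1, 4, 1, 5]  # placeholder binned expression values
example = make_training_example(expression_bins, "GENE_X", [2, 2, 2, 2, 2], rng)
print(example["input"], example["targets"])
```

At t = 0 the input is left untouched and at t = 1 every token is masked, so sampling t across the full range exposes the model to reconstruction problems of every difficulty.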
The model is benchmarked on held-out perturbation prediction (predicting the expression response to knockdowns not seen during training), held-out cell-context prediction, and downstream drug-target prioritization. X-Cell outperforms scGPT, Geneformer, and prior task-specific perturbation models at the largest scales tested.
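Held-out perturbation evaluation depends on splitting cells by knockdown target rather than at random, so that no test-set gene appears in training. The sketch below shows one way to build such a split; the function name `split_by_perturbation`, the `(cell_id, target_gene)` tuple layout, and the holdout fraction are assumptions for illustration, and the preprint's actual splitting protocol may differ.

```python
import random

def split_by_perturbation(cells, holdout_fraction, seed=0):
    """Split cells so that train and test share no knockdown target.

    cells: list of (cell_id, target_gene) tuples.
    Returns (train, test), where every test cell carries a held-out gene.
    """
    genes = sorted({gene for _, gene in cells})
    rng = random.Random(seed)
    rng.shuffle(genes)
    n_holdout = max(1, round(len(genes) * holdout_fraction))
    held_out = set(genes[:n_holdout])
    train = [c for c in cells if c[1] not in held_out]
    test = [c for c in cells if c[1] in held_out]
    return train, test

# Placeholder cells with hypothetical gene names.
cells = [
    ("c1", "GENE_A"), ("c2", "GENE_A"),
    ("c3", "GENE_B"), ("c4", "GENE_C"), ("c5", "GENE_D"),
]
train, test = split_by_perturbation(cells, holdout_fraction=0.25)
print(len(train), len(test))
```

A random split over cells would place cells from the same knockdown on both sides and would measure interpolation rather than generalization to unseen perturbations.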
X-Cell is designed for in silico target prioritization in early drug discovery. Pharma teams can rank candidate genetic targets by their predicted phenotypic effect in disease-relevant cell contexts before committing wet-lab resources. The model also supports counterfactual reasoning, that is, asking how a cell would respond to a perturbation under which it has never been measured, which is critical for novel target nomination.
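As an illustration of ranking by predicted effect, the sketch below scores each candidate knockdown by the cosine similarity between its predicted expression shift and a desired shift (for example, from a disease-like state toward a healthy one). The scoring rule, the function names, the gene names, and the vectors are all hypothetical; in practice the predicted shifts would come from a perturbation model, and the preprint may use a different prioritization criterion.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_targets(predicted_shifts, desired_shift):
    """Rank knockdown targets by how well their predicted shift matches the desired one.

    predicted_shifts: dict mapping gene name -> predicted expression change vector.
    Returns a list of (gene, score) pairs, best match first.
    """
    scores = {gene: cosine(shift, desired_shift) for gene, shift in predicted_shifts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Placeholder predictions over a two-gene readout; not model output.
predicted_shifts = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [-1.0, 0.0],
    "GENE_C": [1.0, 1.0],
}
ranked = rank_targets(predicted_shifts, desired_shift=[1.0, 0.0])
print([gene for gene, _ in ranked])
```

Cosine similarity ignores the magnitude of the shift, so a real pipeline would likely combine it with an effect-size term and with uncertainty estimates before nominating targets.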
X-Cell raises the ceiling on what foundation models can do for genetic perturbation prediction and reframes the perturbation-modeling problem as one that scales with data, compute, and parameters in the same way language modeling does. The model is the largest causal perturbation model built to date, and the X-Atlas/Pisces corpus it was trained on is itself a notable contribution. Xaira has not committed to fully open-sourcing the weights, though the preprint is public and reports detailed methodology. The work is likely to motivate larger Perturb-seq data-generation projects across the academic and commercial sectors.
Wang, C., et al. (2026) X-Cell: Scaling Causal Perturbation Prediction Across Diverse Cellular Contexts via Diffusion Language Models. bioRxiv.
DOI: 10.64898/2026.03.18.712807