A virtual cell foundation model pretrained on 23M+ cells from 5,000 patient samples for drug target and biomarker discovery.
ConvergeCELL is a virtual cell foundation model developed by Converge Bio, introduced in a May 2026 bioRxiv preprint. It addresses one of the central bottlenecks in translational genomics: the difficulty of converting large, noisy transcriptomic datasets into actionable therapeutic hypotheses. While single-cell RNA sequencing has expanded dramatically in scope, most drug discovery programs still struggle to reliably identify targets and biomarkers, because individual patient cohorts are often too small or too technically heterogeneous to support robust analysis on their own. ConvergeCELL is designed to close that gap by providing patient-level representations learned from a massive, disease-diverse pretraining corpus.
The model was pretrained on over 23 million cells drawn from more than 5,000 patient samples, 550 studies, and 40 clinical indications — one of the largest and most clinically diverse training sets reported for a single-cell foundation model. This breadth allows the model to capture cell-state and patient-level gene expression patterns that generalize across diseases, cohorts, and data modalities. Rather than treating each study in isolation, ConvergeCELL learns a shared representation space that links molecular signals to clinical context, enabling the platform to prioritize drug targets and biomarkers that are biologically grounded and clinically relevant.
A key differentiator is the model's end-to-end design as a discovery platform, not merely an embedding tool. The preprint demonstrates that when applied to independent disease cohorts — lupus, multiple myeloma, and sepsis — ConvergeCELL ranks approved or clinically validated targets (TNFSF13B for lupus, TNFRSF17/BCMA and CXCR4 for myeloma) within the top 0.3% of its gene-level rankings, outperforming comparable computational approaches.
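The "top 0.3%" claim reduces to a percentile computation over the model's gene-level scores. A minimal sketch of that computation, using synthetic scores (the gene count, score values, and the placement of TNFSF13B here are illustrative assumptions, not numbers from the preprint):

```python
import numpy as np

def gene_rank_percentile(scores: dict, gene: str) -> float:
    """Return the percentile rank of `gene` (0.0 = best-scored)
    when all genes are sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 100.0 * ranked.index(gene) / len(ranked)

# Synthetic example: ~20,000 genes with random scores, plus one
# known target given a very high score (purely illustrative).
rng = np.random.default_rng(0)
scores = {f"GENE{i}": float(s) for i, s in enumerate(rng.normal(size=20_000))}
scores["TNFSF13B"] = 6.0  # placed in the extreme upper tail by construction

pct = gene_rank_percentile(scores, "TNFSF13B")
print(f"TNFSF13B sits in the top {pct:.2f}% of rankings")
```

A gene ranked in the top 0.3% of roughly 20,000 genes corresponds to a rank of about 60 or better, which gives a sense of how stringent the reported benchmark is.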
ConvergeCELL uses a transformer-based architecture organized into a cell encoder, a patient-level aggregator, and a classification head. The cell encoder converts per-cell gene expression profiles into token representations, and the aggregator pools across all cells from a patient sample to form a compact patient embedding. Supervised contrastive learning is applied during pretraining so that patients sharing clinical characteristics (disease state, treatment response) cluster together in the representation space. The pretrained patient model and a distilled bulk model are released on HuggingFace under the Apache 2.0 license. Exact parameter counts are not reported in the preprint. Pretraining data spans more than 23 million cells from roughly 4,479–5,000 patient samples (the main text and abstract report slightly different counts), covering 550 studies and 40 clinical indications. Validation benchmarks across three disease areas — lupus, multiple myeloma, and sepsis — show the model ranks known drug targets (belimumab's target TNFSF13B, belantamab's target BCMA, and the myeloma mobilization target CXCR4) in the top 0.3% of all gene rankings.
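The preprint does not release training code, but supervised contrastive objectives of the kind described have a standard form (the SupCon loss of Khosla et al.). A minimal NumPy sketch over patient embeddings, where mean pooling stands in for the paper's aggregator and integer labels stand in for shared clinical characteristics (all shapes and values are illustrative assumptions):

```python
import numpy as np

def patient_embedding(cell_embeddings: np.ndarray) -> np.ndarray:
    """Pool per-cell embeddings into one patient vector.
    Mean pooling is a simple stand-in for the paper's aggregator."""
    return cell_embeddings.mean(axis=0)

def supcon_loss(z: np.ndarray, labels: np.ndarray, temperature: float = 0.1) -> float:
    """Supervised contrastive loss: patients with the same clinical
    label are pulled together; all others are pushed apart."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / temperature                        # pairwise similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    logits = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    exp = np.exp(logits)
    exp[self_mask] = 0.0                               # exclude self-pairs
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = -(log_prob * pos).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(per_anchor.mean())

# 200 synthetic cells pooled into one patient vector.
cells = np.random.default_rng(1).normal(size=(200, 2))
patient = patient_embedding(cells)

# Four patients, two clinical labels; same-label embeddings are close,
# so the loss is much lower than under a mismatched labeling.
z = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])
labels = np.array([0, 0, 1, 1])
print(supcon_loss(z, labels) < supcon_loss(z, np.array([0, 1, 0, 1])))  # True
```

The key property the loss enforces is the one the preprint describes: embeddings of clinically similar patients score a low loss only when they sit near each other in the representation space.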
ConvergeCELL is targeted at pharmaceutical and biotech teams conducting target identification and biomarker discovery. It is particularly valuable in settings where patient cohorts are small or technically heterogeneous — common in rare diseases or early clinical programs — where the pretrained atlas provides the statistical power that individual studies lack. Computational biologists can apply the platform to compare expression profiles across patient subgroups, identify genes linked to treatment response, and separate patient-specific noise from mechanistic disease signals. Because the patient model is also distilled into a bulk variant, organizations with legacy bulk RNA-seq cohorts can access single-cell-informed representations without re-collecting data.
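The bulk distillation idea can be illustrated with a toy regression: a student "bulk model" is fit to reproduce the single-cell model's patient embeddings from pseudo-bulk (cell-averaged) expression, after which new bulk profiles can be projected into the same representation space. This is a deliberately simplified sketch — a linear least-squares map on synthetic data, not the preprint's actual distillation procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_genes, emb_dim = 100, 50, 8

# Pseudo-bulk profiles: one averaged expression vector per patient (synthetic).
bulk = rng.normal(size=(n_patients, n_genes))

# Teacher signal: patient embeddings from the single-cell model,
# simulated here as a hidden linear map plus a little noise.
W_true = rng.normal(size=(n_genes, emb_dim))
teacher_emb = bulk @ W_true + 0.01 * rng.normal(size=(n_patients, emb_dim))

# Student bulk model: least-squares fit from bulk profiles to embeddings.
W_student, *_ = np.linalg.lstsq(bulk, teacher_emb, rcond=None)

# A new bulk RNA-seq sample can now be projected into the shared space.
new_bulk = rng.normal(size=(1, n_genes))
projected = new_bulk @ W_student
print(projected.shape)  # (1, 8)
```

The practical appeal is exactly this projection step: once the student is fit, a legacy bulk cohort needs no new single-cell sequencing to be placed alongside the pretrained patient atlas.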
ConvergeCELL represents an emerging class of disease-focused virtual cell models that prioritize clinical translatability over general-purpose biological understanding. Its pretraining across 40 clinical indications and direct validation against approved drugs positions it as one of the more directly application-ready single-cell foundation models described to date. The release of pretrained weights on HuggingFace under a permissive license lowers the adoption barrier for academic and industry groups. As a preprint from May 2026, the work has not yet undergone peer review, and independent replication of the target-ranking benchmarks across additional disease areas will be important for establishing the model's generality. The model adds to a growing landscape of patient-centric single-cell foundation models alongside approaches like scGPT and Geneformer, with a distinguishing focus on disease cohort representation and pharmaceutical target prioritization.