A virtual cell foundation model pretrained on 23M+ cells from 5,000 patient samples for drug target and biomarker discovery.
ConvergeCELL is a virtual cell foundation model developed by Converge Bio, introduced in a May 2026 bioRxiv preprint. It addresses one of the central bottlenecks in translational genomics: the difficulty of converting large, noisy transcriptomic datasets into actionable therapeutic hypotheses. While single-cell RNA sequencing has expanded dramatically in scope, most drug discovery programs still struggle to reliably identify targets and biomarkers, because individual patient cohorts are often too small or too technically heterogeneous to support robust analysis on their own. ConvergeCELL is designed to close that gap by providing patient-level representations learned from a massive, disease-diverse pretraining corpus.
The model was pretrained on over 23 million cells drawn from more than 5,000 patient samples, 550 studies, and 40 clinical indications — one of the largest and most clinically diverse training sets reported for a single-cell foundation model. This breadth allows the model to capture cell-state and patient-level gene expression patterns that generalize across diseases, cohorts, and data modalities. Rather than treating each study in isolation, ConvergeCELL learns a shared representation space that links molecular signals to clinical context, enabling the platform to prioritize drug targets and biomarkers that are biologically grounded and clinically relevant.
A key differentiator is the model's end-to-end design as a discovery platform, not merely an embedding tool. The preprint demonstrates that when applied to independent disease cohorts — lupus, multiple myeloma, and sepsis — ConvergeCELL ranks approved or clinically validated targets (TNFSF13B for lupus, TNFRSF17/BCMA and CXCR4 for myeloma) within the top 0.3% of its gene-level rankings, outperforming comparable computational approaches.
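The "top 0.3%" claim reduces to a percentile computation over the model's gene-level scores. A minimal sketch of that computation, using synthetic scores (the gene count, score values, and the placement of TNFSF13B here are illustrative assumptions, not numbers from the preprint):

```python
import numpy as np

def gene_rank_percentile(scores: dict, gene: str) -> float:
    """Return the percentile rank of `gene` (0.0 = best-scored)
    when all genes are sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 100.0 * ranked.index(gene) / len(ranked)

# Synthetic example: ~20,000 genes with random scores, plus one
# known target given a very high score (purely illustrative).
rng = np.random.default_rng(0)
scores = {f"GENE{i}": float(s) for i, s in enumerate(rng.normal(size=20_000))}
scores["TNFSF13B"] = 6.0  # placed in the extreme upper tail by construction

pct = gene_rank_percentile(scores, "TNFSF13B")
print(f"TNFSF13B sits in the top {pct:.2f}% of rankings")
```

A gene ranked in the top 0.3% of roughly 20,000 genes corresponds to a rank of about 60 or better, which gives a sense of how stringent the reported benchmark is.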
ConvergeCELL uses a transformer-based architecture organized into a cell encoder, a patient-level aggregator, and a classification head. The cell encoder converts per-cell gene expression profiles into token representations, and the aggregator pools across all cells from a patient sample to form a compact patient embedding. Supervised contrastive learning is applied during pretraining so that patients sharing clinical characteristics (disease state, treatment response) cluster together in the representation space. The pretrained patient model and a distilled bulk model are released on HuggingFace under the Apache 2.0 license. Exact parameter counts are not reported in the preprint. Pretraining data spans more than 23 million cells from roughly 4,479–5,000 patient samples (the main text and abstract report slightly different counts), covering 550 studies and 40 clinical indications. Validation benchmarks across three disease areas — lupus, multiple myeloma, and sepsis — show the model ranks known drug targets (belimumab's target TNFSF13B, belantamab's target BCMA, and the myeloma mobilization target CXCR4) in the top 0.3% of all gene rankings.
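The preprint does not release training code, but supervised contrastive objectives of the kind described have a standard form (the SupCon loss of Khosla et al.). A minimal NumPy sketch over patient embeddings, where mean pooling stands in for the paper's aggregator and integer labels stand in for shared clinical characteristics (all shapes and values are illustrative assumptions):

```python
import numpy as np

def patient_embedding(cell_embeddings: np.ndarray) -> np.ndarray:
    """Pool per-cell embeddings into one patient vector.
    Mean pooling is a simple stand-in for the paper's aggregator."""
    return cell_embeddings.mean(axis=0)

def supcon_loss(z: np.ndarray, labels: np.ndarray, temperature: float = 0.1) -> float:
    """Supervised contrastive loss: patients with the same clinical
    label are pulled together; all others are pushed apart."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / temperature                        # pairwise similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    logits = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    exp = np.exp(logits)
    exp[self_mask] = 0.0                               # exclude self-pairs
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    per_anchor = -(log_prob * pos).sum(axis=1) / np.maximum(pos.sum(axis=1), 1)
    return float(per_anchor.mean())

# 200 synthetic cells pooled into one patient vector.
cells = np.random.default_rng(1).normal(size=(200, 2))
patient = patient_embedding(cells)

# Four patients, two clinical labels; same-label embeddings are close,
# so the loss is much lower than under a mismatched labeling.
z = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])
labels = np.array([0, 0, 1, 1])
print(supcon_loss(z, labels) < supcon_loss(z, np.array([0, 1, 0, 1])))  # True
```

The key property the loss enforces is the one the preprint describes: embeddings of clinically similar patients score a low loss only when they sit near each other in the representation space.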
ConvergeCELL is targeted at pharmaceutical and biotech teams conducting target identification and biomarker discovery. It is particularly valuable in settings where patient cohorts are small or technically heterogeneous — common in rare diseases or early clinical programs — where the pretrained atlas provides the statistical power that individual studies lack. Computational biologists can apply the platform to compare expression profiles across patient subgroups, identify genes linked to treatment response, and separate patient-specific noise from mechanistic disease signals. Because the patient model is also distilled into a bulk variant, organizations with legacy bulk RNA-seq cohorts can access single-cell-informed representations without re-collecting data.
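The bulk distillation idea can be illustrated with a toy regression: a student "bulk model" is fit to reproduce the single-cell model's patient embeddings from pseudo-bulk (cell-averaged) expression, after which new bulk profiles can be projected into the same representation space. This is a deliberately simplified sketch — a linear least-squares map on synthetic data, not the preprint's actual distillation procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_genes, emb_dim = 100, 50, 8

# Pseudo-bulk profiles: one averaged expression vector per patient (synthetic).
bulk = rng.normal(size=(n_patients, n_genes))

# Teacher signal: patient embeddings from the single-cell model,
# simulated here as a hidden linear map plus a little noise.
W_true = rng.normal(size=(n_genes, emb_dim))
teacher_emb = bulk @ W_true + 0.01 * rng.normal(size=(n_patients, emb_dim))

# Student bulk model: least-squares fit from bulk profiles to embeddings.
W_student, *_ = np.linalg.lstsq(bulk, teacher_emb, rcond=None)

# A new bulk RNA-seq sample can now be projected into the shared space.
new_bulk = rng.normal(size=(1, n_genes))
projected = new_bulk @ W_student
print(projected.shape)  # (1, 8)
```

The practical appeal is exactly this projection step: once the student is fit, a legacy bulk cohort needs no new single-cell sequencing to be placed alongside the pretrained patient atlas.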
ConvergeCELL represents an emerging class of disease-focused virtual cell models that prioritize clinical translatability over general-purpose biological understanding. Its pretraining across 40 clinical indications and direct validation against approved drugs positions it as one of the more directly application-ready single-cell foundation models described to date. The release of pretrained weights on HuggingFace under a permissive license lowers the adoption barrier for academic and industry groups. As a preprint from May 2026, the work has not yet undergone peer review, and independent replication of the target-ranking benchmarks across additional disease areas will be important for establishing the model's generality. The model adds to a growing landscape of patient-centric single-cell foundation models alongside approaches like scGPT and Geneformer, with a distinguishing focus on disease cohort representation and pharmaceutical target prioritization.