Hangzhou Institute of Medicine, CAS
A multimodal single-cell foundation model with a multiway Transformer that jointly models scRNA-seq and scATAC-seq, including RNA-only, ATAC-only, and paired inputs.
CLM-X is a multimodal single-cell foundation model that jointly models single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq) within a single architecture. Transformer-based cell language models (CLMs) have become powerful tools for learning transferable cell representations, but most operate on a single modality. As multimodal single-cell profiling grows, the field has lacked a unified, flexible foundation model that can handle gene-expression and chromatin-accessibility data together while still working on datasets where only one modality is available.
Developed by Bowen Li and colleagues at the Hangzhou Institute of Medicine, Chinese Academy of Sciences, and posted to bioRxiv in February 2026, CLM-X is built on a multiway Transformer architecture. It uses a harmonized tokenization design and a stage-wise masked-reconstruction pretraining strategy so that RNA-only, ATAC-only, and paired RNA-ATAC inputs can all be processed within one framework. The multiway design lets the model route different modalities through shared and modality-specific pathways, learning representations that transfer across data types.
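To make the routing idea concrete, here is a minimal sketch of a multiway block, assuming the layout commonly used in multiway Transformers: one mixing step shared by all tokens, followed by a modality-specific transform. It is not CLM-X's implementation. The token form, the `shared_mix` stand-in for self-attention, and the toy `EXPERTS` functions are all hypothetical and chosen only to keep the example small.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical token form: (modality tag, scalar feature).
Token = Tuple[str, float]

def shared_mix(tokens: List[Token]) -> List[Token]:
    """Toy stand-in for shared self-attention: add the mean over all tokens."""
    mean = sum(value for _, value in tokens) / len(tokens)
    return [(modality, value + mean) for modality, value in tokens]

# Hypothetical modality-specific "experts" (stand-ins for per-modality feed-forward layers).
EXPERTS: Dict[str, Callable[[float], float]] = {
    "rna": lambda value: 2.0 * value,
    "atac": lambda value: value - 1.0,
}

def multiway_block(tokens: List[Token]) -> List[Token]:
    """Apply the shared step to every token, then route each token to its own expert."""
    if not tokens:
        raise ValueError("expected at least one token")
    mixed = shared_mix(tokens)
    return [(modality, EXPERTS[modality](value)) for modality, value in mixed]

# The same block accepts RNA-only, ATAC-only, and paired inputs.
rna_only = multiway_block([("rna", 1.0), ("rna", 3.0)])
atac_only = multiway_block([("atac", 2.0)])
paired = multiway_block([("rna", 1.0), ("atac", 3.0)])
```

In the paired case, the shared step lets RNA and ATAC tokens influence each other before each is transformed by its own expert. In the single-modality cases, the unused expert is simply never called, which is why no separate model is needed.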
CLM-X is pretrained on million-scale unimodal and multimodal datasets and evaluated on five downstream tasks across ten benchmark datasets. The work is distributed under a CC BY-NC 4.0 license.
CLM-X is a multiway Transformer foundation model pretrained with a stage-wise masked-reconstruction objective on million-scale unimodal and multimodal single-cell datasets. A harmonized tokenization scheme lets the model encode scRNA-seq and scATAC-seq consistently, while the multiway design supports RNA-only, ATAC-only, and paired RNA-ATAC inputs without separate models. The authors benchmark CLM-X on ten datasets across five tasks: batch correction, modality integration, cross-modal translation, cell type annotation, and perturbation prediction. They report that it consistently outperforms existing multimodal methods and unimodal foundation models, with the clearest gains in RNA-ATAC cross-modal translation and genetic-perturbation-response prediction. The preprint does not disclose a specific parameter count, and the availability of code and weights is not specified at the time of writing.
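As a rough illustration of what a masked-reconstruction objective involves, the sketch below hides a fraction of a cell's token values and scores predictions only at the hidden positions. It is a generic example under stated assumptions, not the CLM-X training code: the function names, the `MASK` placeholder, the mask ratio, and the squared-error loss are all hypothetical, and the stage-wise schedule is not modeled because its details are not given here.

```python
import random
from typing import List, Optional, Tuple

MASK = None  # hypothetical placeholder marking a hidden value

def mask_tokens(
    values: List[float], mask_ratio: float, rng: random.Random
) -> Tuple[List[Optional[float]], List[int]]:
    """Hide a random subset of values; return the corrupted list and the hidden indices."""
    n_mask = max(1, round(mask_ratio * len(values)))
    masked_idx = sorted(rng.sample(range(len(values)), n_mask))
    hidden = set(masked_idx)
    corrupted = [MASK if i in hidden else v for i, v in enumerate(values)]
    return corrupted, masked_idx

def masked_mse(predicted: List[float], target: List[float], masked_idx: List[int]) -> float:
    """Mean squared error computed only over the hidden positions."""
    return sum((predicted[i] - target[i]) ** 2 for i in masked_idx) / len(masked_idx)

# Toy cell with five feature values; 40% of them are hidden.
values = [0.0, 1.0, 2.0, 3.0, 4.0]
corrupted, masked_idx = mask_tokens(values, mask_ratio=0.4, rng=random.Random(0))

# A perfect reconstruction scores zero; one that is off by 1 everywhere scores one.
perfect_loss = masked_mse(values, values, masked_idx)
shifted_loss = masked_mse([v + 1.0 for v in values], values, masked_idx)
```

Restricting the loss to hidden positions is what forces a model to infer missing values from context rather than copy its input.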
CLM-X targets integrative single-cell analysis in which researchers combine transcriptomic and epigenomic measurements. Its unified modeling supports common workflows: correcting batch effects, integrating modalities, annotating cell types, translating between RNA and ATAC, and predicting responses to genetic perturbations. It is especially useful when datasets are partially paired or single-modality, a frequent situation in real multimodal atlases. Computational biologists building or analyzing single-cell multi-omic atlases are the primary beneficiaries.
CLM-X extends the single-cell foundation-model paradigm from unimodal expression toward unified RNA-plus-ATAC modeling, addressing an underexplored gap in flexible multimodal pretraining. Its reported advantages in cross-modal translation and perturbation prediction point toward foundation models that reason jointly about gene regulation and expression. Because CLM-X is a February 2026 bioRxiv preprint, the release of code and weights is not yet confirmed, and independent benchmarking against established multimodal integration tools will determine how broadly it is adopted. The CC BY-NC 4.0 license also restricts commercial use.