IBM Research
Open-source framework for building RNA and DNA foundation models, featuring WCED pretraining for transcriptomics and SNP-aware encoding for genomics.
BioMed Multi-Omic (BMFM-MultiOmic) is an open-source framework developed by IBM Research for building, training, and evaluating foundation models for both single-cell RNA transcriptomics and DNA sequence data. Rather than delivering a single pretrained model, BMFM-MultiOmic provides a modular software package that unifies diverse pretraining strategies, architectures, and fine-tuning objectives under a common declarative interface, enabling researchers to configure, train, and benchmark genomic and transcriptomic foundation models reproducibly. Two companion preprints published in 2025 describe the framework's two main model families: BMFM-RNA (arXiv:2506.14861) for transcriptomic foundation models, and BMFM-DNA (arXiv:2507.05265) for SNP-aware DNA sequence models.
BMFM-RNA addresses a reproducibility and comparability problem that has hampered progress in single-cell foundation models: different research groups use different training objectives, data preprocessing pipelines, and evaluation protocols, making it difficult to determine whether performance differences reflect genuine model improvements or experimental choices. The framework introduces a novel pretraining objective, the Whole-Cell Expression Decoder (WCED): an autoencoder-like training scheme that forces the model's CLS token to encode a complete cellular state by predicting the full-cell expression profile from a partial gene-expression input. Evaluated on over a dozen benchmark datasets with a standardized pipeline, BMFM-RNA models trained with WCED match or exceed scGPT on both zero-shot and fine-tuning tasks while using dramatically less training data, thanks to intelligent data selection strategies.
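The WCED idea can be sketched in a few lines of toy NumPy. Everything below is an illustrative assumption rather than the framework's actual architecture: the dimensions, the linear stand-in encoder, and the random masking scheme are placeholders, but the objective is the one described above: compress a partial expression profile into one CLS-like vector, then decode the full profile from that vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not BMFM-RNA's actual sizes).
batch, n_genes, hidden = 4, 100, 16

full = rng.standard_normal((batch, n_genes))            # full-cell expression profile
partial = full * (rng.random((batch, n_genes)) > 0.5)   # mask roughly half the genes
W_enc = 0.1 * rng.standard_normal((n_genes, hidden))    # stand-in for the transformer encoder
W_dec = 0.1 * rng.standard_normal((hidden, n_genes))    # whole-cell expression decoder head

cls = np.tanh(partial @ W_enc)    # (batch, hidden): single vector summarizing the cell
pred = cls @ W_dec                # (batch, n_genes): reconstruction of the FULL profile
wced_loss = float(np.mean((pred - full) ** 2))
```

The key property is the bottleneck: because the decoder sees only the one CLS-like vector, a low reconstruction loss is achievable only if that vector encodes the whole cellular state, not just the observed genes.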
BMFM-DNA addresses a distinct problem: most existing DNA language models ignore natural genetic variation, training instead on a single reference genome and treating all humans as genetically identical. The framework introduces BMFM-DNA-SNP, which encodes known single nucleotide polymorphisms (SNPs) from dbSNP directly into pretraining sequences using a population-frequency-aware representation, allowing the model to learn how genetic variants affect regulatory grammar without expanding sequence length.
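One way such an encoding can be realized is to reserve a dedicated symbol for each SNP, where the symbol's meaning is a probability distribution over nucleotides. The sketch below is a hedged illustration, not BMFM-DNA's actual scheme: the paper draws its extra characters from Li Sao, while this toy uses placeholder Greek letters, and the `{position: {allele: frequency}}` data layout is an assumption for demonstration.

```python
# Hypothetical sketch of frequency-aware SNP encoding. EXTRA_SYMBOLS is a
# stand-in alphabet (the real scheme uses characters from Li Sao).
EXTRA_SYMBOLS = iter("αβγδεζηθ")

def build_snp_vocab(snps):
    """snps: {position: {allele: frequency}} (e.g. derived from dbSNP).
    Returns a per-position symbol plus a lookup from each symbol to its
    nucleotide distribution, which the tokenizer vocabulary would carry."""
    sym_for_pos, dist_for_sym = {}, {}
    for pos, alleles in snps.items():
        sym = next(EXTRA_SYMBOLS)
        sym_for_pos[pos] = sym
        dist_for_sym[sym] = alleles
    return sym_for_pos, dist_for_sym

def encode(ref_seq, sym_for_pos):
    """Swap each SNP position's reference base for its symbol."""
    return "".join(sym_for_pos.get(i, b) for i, b in enumerate(ref_seq))

snps = {2: {"G": 0.7, "A": 0.3}, 7: {"T": 0.9, "C": 0.1}}
sym_for_pos, dist_for_sym = build_snp_vocab(snps)
print(encode("ACGTACGTAC", sym_for_pos))  # -> "ACαTACGβAC"
```

Note that the encoded string has the same length as the reference sequence: each SNP is one character, which is what allows variation to be injected "without expanding sequence length."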
BMFM-RNA models are 110-million-parameter BERT-style transformer encoders with 12 layers, 768-dimensional hidden states, and multi-field input handling that jointly tokenizes gene identifiers, expression values, and cell metadata. Pretraining datasets include PanglaoDB (over 1 million human cells from 74 tissues) and CELLxGENE (harmonized single-cell experiments). Four model variants are released, differing in pretraining objective: MLM with read-depth-aware data augmentation (MLM+RDA), MLM with multitask learning, WCED alone, and WCED with multitask learning. The WCED+Multitask model achieves the strongest zero-shot clustering performance (a 75.6% weighted average across benchmark datasets, 3.7 percentage points above scGPT) despite being trained on only 284,501 cells, 1% of CELLxGENE, rather than the millions used by competing models. Captum-based interpretability tools are included for gene-level attribution analysis.
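A minimal sketch of multi-field input handling, assuming the common design of summing a gene-identity embedding with a binned-expression embedding per token; the real tokenizer, binning scheme, and metadata handling may differ, and all sizes here are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_bins, hidden = 100, 10, 8  # toy sizes, not the model's real ones

gene_emb = rng.standard_normal((n_genes, hidden))  # one row per gene identifier
expr_emb = rng.standard_normal((n_bins, hidden))   # one row per expression bin

def embed_cell(gene_ids, expr_values):
    """Combine the two fields: discretize each expression value into a bin,
    then sum the gene-identity and expression-bin embeddings per token."""
    bins = np.minimum((expr_values * n_bins).astype(int), n_bins - 1)
    return gene_emb[gene_ids] + expr_emb[bins]

# One cell with three expressed genes and their normalized expression levels.
tokens = embed_cell(np.array([3, 17, 42]), np.array([0.05, 0.5, 0.99]))
```

Summing field embeddings keeps the sequence length equal to the number of genes while letting each token carry both identity and magnitude; metadata fields (e.g. a cell-level token) can be added the same way.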
BMFM-DNA models are 113-million-parameter transformers built on the ModernBERT backbone, with 22 hidden layers, 12 attention heads, 768-dimensional embeddings, and a maximum sequence length of 2,048 nucleotide tokens. Two variants are released: BMFM-DNA-REF, trained on the human reference genome (GRCh38) alone, and BMFM-DNA-SNP, trained on sequences augmented with 20 million dbSNP variants encoded using a novel character scheme in which each SNP position is represented by a character (drawn from the classical Chinese poem Li Sao) whose probability distribution over possible nucleotides reflects population allele frequencies. Pretraining uses approximately 10 million sequences of 1–10 kb from GRCh38, totaling roughly 60 billion nucleotides including reverse complements. BMFM-DNA-SNP achieves 93.5 F1 on promoter detection, 82.42 F1 on transcription factor binding prediction, and 90.0 AUC on SNP–disease association tasks, matching or exceeding DNABERT-2 despite training for 150,000 steps on a single species rather than DNABERT-2's multi-species, longer-training regime.
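The reverse-complement augmentation mentioned above, which doubles the effective nucleotide count of the corpus, is simple to illustrate (this is a generic genomics sketch, not code from the framework):

```python
# Reverse complement: complement each base (A<->T, C<->G), then reverse,
# yielding the sequence as read from the opposite DNA strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACGTT"))  # -> "AACGT"
```

Training on both strands exposes the model to the same regulatory elements in both orientations, which matters because binding motifs can occur on either strand.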
BMFM-MultiOmic serves two distinct research communities through its dual focus. For single-cell biologists, the BMFM-RNA framework provides a reproducible pipeline for training and evaluating transcriptomic foundation models on custom datasets, with particular utility for labs working on rare cell types, underrepresented tissues, or non-standard species where existing pretrained models may not transfer well. The WCED pretraining objective and intelligent data selection strategies mean that high-quality models can be trained with substantially fewer cells than competing approaches, a practical advantage when building disease-specific models from patient cohorts. For genomics researchers, BMFM-DNA-SNP is among the first foundation models explicitly designed to incorporate common genetic variation during pretraining rather than only at fine-tuning time, making it directly applicable to tasks such as variant effect scoring, eQTL prediction, GWAS prioritization, and regulatory element annotation, where genetic diversity is the primary signal of interest.
BMFM-MultiOmic represents IBM Research's contribution to the growing effort to standardize the development and evaluation of biological foundation models. By releasing a framework rather than just a model checkpoint, the team addresses the reproducibility crisis in single-cell and genomic AI: different groups training on the same data with different preprocessing choices have reported dramatically different performance numbers, making progress difficult to assess. The unified YAML-driven configuration system and bundled benchmark suite provide the community with the infrastructure needed to conduct fair comparisons. The WCED pretraining objective is a genuine methodological contribution that challenges the assumption that masked language modeling is the optimal self-supervised objective for transcriptomic data, and its competitive performance at 1% training data scale has direct implications for the economics of training large single-cell models. Pretrained checkpoints for both RNA and DNA model families are publicly available on HuggingFace. As with all foundation models trained on human data, performance on non-human species or non-standard experimental protocols should be validated before deployment.
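To give a sense of what a declarative run might look like, here is a hypothetical YAML fragment; every field name and value below is illustrative, not the framework's actual configuration schema:

```yaml
# Hypothetical sketch of a YAML-driven run configuration.
# Field names are illustrative assumptions, not BMFM-MultiOmic's real schema.
model:
  backbone: bert
  hidden_size: 768
  num_layers: 12
pretraining:
  objective: wced_multitask   # e.g. mlm_rda | mlm_multitask | wced | wced_multitask
  dataset: cellxgene
  data_fraction: 0.01         # the 1%-of-CELLxGENE regime described above
evaluation:
  tasks: [zero_shot_clustering, cell_type_annotation]
  seed: 42
```

The point of such a file is that the objective, data selection, and evaluation protocol all live in one versionable artifact, which is what makes cross-group comparisons reproducible.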