IBM Research
Open-source framework for building RNA and DNA foundation models, featuring WCED pretraining for transcriptomics and SNP-aware encoding for genomics.
BioMed Multi-Omic (BMFM-MultiOmic) is an open-source framework developed by IBM Research for building, training, and evaluating foundation models for both single-cell RNA transcriptomics and DNA sequence data. Rather than delivering a single pretrained model, BMFM-MultiOmic provides a modular software package that unifies diverse pretraining strategies, architectures, and fine-tuning objectives under a common declarative interface, enabling researchers to configure, train, and benchmark genomic and transcriptomic foundation models reproducibly. Two companion preprints published in 2025 describe the framework's two main model families: BMFM-RNA (arXiv:2506.14861) for transcriptomic foundation models, and BMFM-DNA (arXiv:2507.05265) for SNP-aware DNA sequence models.
BMFM-RNA addresses a reproducibility and comparability problem that has hampered progress in single-cell foundation models: different research groups use different training objectives, data preprocessing pipelines, and evaluation protocols, making it difficult to determine whether performance differences reflect genuine model improvements or experimental choices. The framework introduces a novel pretraining objective, the Whole-Cell Expression Decoder (WCED): an autoencoder-like training scheme that forces the model's CLS token to encode a complete cellular state by predicting the full-cell expression profile from a partial gene-expression input. Evaluated on over a dozen benchmark datasets with a standardized pipeline, BMFM-RNA models trained with WCED match or exceed scGPT on both zero-shot and fine-tuning tasks while using dramatically less training data, thanks to intelligent data selection strategies.
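The WCED idea can be sketched in a few lines of toy NumPy. Everything below is an illustrative assumption rather than the framework's actual architecture: the dimensions, the linear stand-in encoder, and the random masking scheme are placeholders, but the objective is the one described above: compress a partial expression profile into one CLS-like vector, then decode the full profile from that vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not BMFM-RNA's actual sizes).
batch, n_genes, hidden = 4, 100, 16

full = rng.standard_normal((batch, n_genes))            # full-cell expression profile
partial = full * (rng.random((batch, n_genes)) > 0.5)   # mask roughly half the genes
W_enc = 0.1 * rng.standard_normal((n_genes, hidden))    # stand-in for the transformer encoder
W_dec = 0.1 * rng.standard_normal((hidden, n_genes))    # whole-cell expression decoder head

cls = np.tanh(partial @ W_enc)    # (batch, hidden): single vector summarizing the cell
pred = cls @ W_dec                # (batch, n_genes): reconstruction of the FULL profile
wced_loss = float(np.mean((pred - full) ** 2))
```

The key property is the bottleneck: because the decoder sees only the one CLS-like vector, a low reconstruction loss is achievable only if that vector encodes the whole cellular state, not just the observed genes.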
BMFM-DNA addresses a distinct problem: most existing DNA language models ignore natural genetic variation, training instead on a single reference genome and treating all humans as genetically identical. The framework introduces BMFM-DNA-SNP, which encodes known single nucleotide polymorphisms (SNPs) from dbSNP directly into pretraining sequences using a population-frequency-aware representation, allowing the model to learn how genetic variants affect regulatory grammar without expanding sequence length.
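One way such an encoding can be realized is to reserve a dedicated symbol for each SNP, where the symbol's meaning is a probability distribution over nucleotides. The sketch below is a hedged illustration, not BMFM-DNA's actual scheme: the paper draws its extra characters from Li Sao, while this toy uses placeholder Greek letters, and the `{position: {allele: frequency}}` data layout is an assumption for demonstration.

```python
# Hypothetical sketch of frequency-aware SNP encoding. EXTRA_SYMBOLS is a
# stand-in alphabet (the real scheme uses characters from Li Sao).
EXTRA_SYMBOLS = iter("αβγδεζηθ")

def build_snp_vocab(snps):
    """snps: {position: {allele: frequency}} (e.g. derived from dbSNP).
    Returns a per-position symbol plus a lookup from each symbol to its
    nucleotide distribution, which the tokenizer vocabulary would carry."""
    sym_for_pos, dist_for_sym = {}, {}
    for pos, alleles in snps.items():
        sym = next(EXTRA_SYMBOLS)
        sym_for_pos[pos] = sym
        dist_for_sym[sym] = alleles
    return sym_for_pos, dist_for_sym

def encode(ref_seq, sym_for_pos):
    """Swap each SNP position's reference base for its symbol."""
    return "".join(sym_for_pos.get(i, b) for i, b in enumerate(ref_seq))

snps = {2: {"G": 0.7, "A": 0.3}, 7: {"T": 0.9, "C": 0.1}}
sym_for_pos, dist_for_sym = build_snp_vocab(snps)
print(encode("ACGTACGTAC", sym_for_pos))  # -> "ACαTACGβAC"
```

Note that the encoded string has the same length as the reference sequence: each SNP is one character, which is what allows variation to be injected "without expanding sequence length."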
BMFM-RNA models are 110-million-parameter BERT-style transformer encoders with 12 layers, 768-dimensional hidden states, and multi-field input handling that jointly tokenizes gene identifiers, expression values, and cell metadata. Pretraining datasets include PanglaoDB (over 1 million human cells from 74 tissues) and CELLxGENE (harmonized single-cell experiments). Four model variants are released, differing in pretraining objective: MLM with read-depth-aware data augmentation (MLM+RDA), MLM with multitask learning, WCED alone, and WCED with multitask learning. The WCED+Multitask model achieves the strongest zero-shot clustering performance (a 75.6% weighted average across benchmark datasets, 3.7 percentage points above scGPT) despite being trained on only 284,501 cells, 1% of CELLxGENE, rather than the millions used by competing models. Captum-based interpretability tools are included for gene-level attribution analysis.
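A minimal sketch of multi-field input handling, assuming the common design of summing a gene-identity embedding with a binned-expression embedding per token; the real tokenizer, binning scheme, and metadata handling may differ, and all sizes here are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_bins, hidden = 100, 10, 8  # toy sizes, not the model's real ones

gene_emb = rng.standard_normal((n_genes, hidden))  # one row per gene identifier
expr_emb = rng.standard_normal((n_bins, hidden))   # one row per expression bin

def embed_cell(gene_ids, expr_values):
    """Combine the two fields: discretize each expression value into a bin,
    then sum the gene-identity and expression-bin embeddings per token."""
    bins = np.minimum((expr_values * n_bins).astype(int), n_bins - 1)
    return gene_emb[gene_ids] + expr_emb[bins]

# One cell with three expressed genes and their normalized expression levels.
tokens = embed_cell(np.array([3, 17, 42]), np.array([0.05, 0.5, 0.99]))
```

Summing field embeddings keeps the sequence length equal to the number of genes while letting each token carry both identity and magnitude; metadata fields (e.g. a cell-level token) can be added the same way.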
BMFM-DNA models are 113-million-parameter transformers built on the ModernBERT backbone, with 22 hidden layers, 12 attention heads, 768-dimensional embeddings, and a maximum sequence length of 2,048 nucleotide tokens. Two variants are released: BMFM-DNA-REF, trained on the human reference genome (GRCh38) alone, and BMFM-DNA-SNP, trained on sequences augmented with 20 million dbSNP variants encoded using a novel character scheme in which each SNP position is represented by a character (drawn from the classical Chinese poem Li Sao) whose probability distribution over possible nucleotides reflects population allele frequencies. Pretraining uses approximately 10 million sequences of 1–10 kb from GRCh38, totaling roughly 60 billion nucleotides including reverse complements. BMFM-DNA-SNP achieves 93.5 F1 on promoter detection, 82.42 F1 on transcription factor binding prediction, and 90.0 AUC on SNP–disease association tasks, matching or exceeding DNABERT-2 despite training for 150,000 steps on a single species rather than DNABERT-2's multi-species, longer-training regime.
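The reverse-complement augmentation mentioned above, which doubles the effective nucleotide count of the corpus, is simple to illustrate (this is a generic genomics sketch, not code from the framework):

```python
# Reverse complement: complement each base (A<->T, C<->G), then reverse,
# yielding the sequence as read from the opposite DNA strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACGTT"))  # -> "AACGT"
```

Training on both strands exposes the model to the same regulatory elements in both orientations, which matters because binding motifs can occur on either strand.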
BMFM-MultiOmic serves two distinct research communities through its dual focus. For single-cell biologists, the BMFM-RNA framework provides a reproducible pipeline for training and evaluating transcriptomic foundation models on custom datasets, with particular utility for labs working on rare cell types, underrepresented tissues, or non-standard species where existing pretrained models may not transfer well. The WCED pretraining objective and intelligent data selection strategies mean that high-quality models can be trained with substantially fewer cells than competing approaches, a practical advantage when building disease-specific models from patient cohorts. For genomics researchers, BMFM-DNA-SNP is among the first foundation models explicitly designed to incorporate common genetic variation during pretraining rather than only at fine-tuning time, making it directly applicable to tasks such as variant effect scoring, eQTL prediction, GWAS prioritization, and regulatory element annotation, where genetic diversity is the primary signal of interest.
BMFM-MultiOmic represents IBM Research's contribution to the growing effort to standardize the development and evaluation of biological foundation models. By releasing a framework rather than just a model checkpoint, the team addresses the reproducibility crisis in single-cell and genomic AI: different groups training on the same data with different preprocessing choices have reported dramatically different performance numbers, making progress difficult to assess. The unified YAML-driven configuration system and bundled benchmark suite provide the community with the infrastructure needed to conduct fair comparisons. The WCED pretraining objective is a genuine methodological contribution that challenges the assumption that masked language modeling is the optimal self-supervised objective for transcriptomic data, and its competitive performance at 1% training data scale has direct implications for the economics of training large single-cell models. Pretrained checkpoints for both RNA and DNA model families are publicly available on HuggingFace. As with all foundation models trained on human data, performance on non-human species or non-standard experimental protocols should be validated before deployment.
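To give a sense of what a declarative run might look like, here is a hypothetical YAML fragment; every field name and value below is illustrative, not the framework's actual configuration schema:

```yaml
# Hypothetical sketch of a YAML-driven run configuration.
# Field names are illustrative assumptions, not BMFM-MultiOmic's real schema.
model:
  backbone: bert
  hidden_size: 768
  num_layers: 12
pretraining:
  objective: wced_multitask   # e.g. mlm_rda | mlm_multitask | wced | wced_multitask
  dataset: cellxgene
  data_fraction: 0.01         # the 1%-of-CELLxGENE regime described above
evaluation:
  tasks: [zero_shot_clustering, cell_type_annotation]
  seed: 42
```

The point of such a file is that the objective, data selection, and evaluation protocol all live in one versionable artifact, which is what makes cross-group comparisons reproducible.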