OncoBERT

BERT-style language model for somatic mutations, pretrained on cancer sequencing from 210,000+ patients for tumor subtyping and therapy response.

Released: February 2026

OncoBERT is a language model that learns contextual representations of somatic mutations from large-scale clinical cancer sequencing data. Somatic mutation profiling is central to cancer diagnosis and treatment selection, but most clinical interpretation focuses on individual actionable mutations, overlooking the broader mutational context that shapes tumor evolution and treatment response. OncoBERT instead treats a tumor's mutation profile much like a sentence, applying BERT-style masked-language-model pretraining to capture how mutations co-occur and interact.
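The "mutation profile as a sentence" idea can be illustrated with a minimal masking sketch. The source does not describe OncoBERT's tokenization or masking scheme, so everything here is an assumption made for illustration: the token strings, the `mask_mutation_profile` helper, the 15% mask rate borrowed from the original BERT recipe, and the simplification of always substituting `[MASK]` (BERT itself also uses random and unchanged replacements).

```python
import random

MASK = "[MASK]"  # hypothetical mask token; the real vocabulary is not public


def mask_mutation_profile(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one tumor's mutation "sentence".

    labels[i] holds the original token where a mask was applied, else None.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    # Guarantee at least one prediction target for a non-empty profile.
    if tokens and all(label is None for label in labels):
        i = rng.randrange(len(tokens))
        labels[i] = tokens[i]
        masked[i] = MASK
    return masked, labels


# Hypothetical mutation tokens for a single tumor (illustrative only).
profile = ["TP53_R175H", "KRAS_G12D", "SMAD4_loss", "CDKN2A_del"]
masked, labels = mask_mutation_profile(profile, seed=0)
```

During pretraining, a model would be asked to recover each masked token from the remaining mutations in the same profile, which is what pushes it to learn co-occurrence patterns.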

Developed by researchers at the National Cancer Institute and posted to bioRxiv in February 2026, OncoBERT was pretrained on clinical sequencing spanning more than 210,000 patients, 113 cancer types, and 20 institutions. From these learned representations it identifies robust, patient-specific mutational subtypes that transfer across diverse cohorts and different targeted sequencing panels.

These mutational subtypes are reported to be clinically meaningful: they associate with differential response to chemotherapy, targeted therapies, and immunotherapy, positioning OncoBERT as a foundation for context-aware interpretation of tumor genomes in precision oncology.

Key Features

Context-aware mutation modeling: Learns contextual embeddings of somatic mutations rather than scoring mutations in isolation, capturing the mutational context that shapes tumor behavior.
Large clinical pretraining: Trained on real-world sequencing from >210,000 patients across 113 cancer types and 20 institutions.
Cross-cohort subtyping: Recovers patient-specific mutational subtypes that generalize across cohorts and across different targeted sequencing panels.
Therapy-response association: Subtypes are linked to differential response to chemotherapy, targeted therapy, and immunotherapy, and improve prediction when combined with TMB and MSI biomarkers.

Technical Details

OncoBERT adapts the BERT transformer architecture to somatic mutation data, learning contextual representations from clinical sequencing of more than 210,000 patients across 113 cancer types and 20 institutions. The authors report that these representations yield patient-specific mutational subtypes that are robust across cohorts and panels, and that integrating them with clinically approved immunotherapy biomarkers, namely tumor mutational burden (TMB) and microsatellite instability (MSI), significantly improves prediction of clinical benefit. Incorporating matched tumor transcriptomic profiles further links the mutational subtypes to distinct cancer hallmark programs and tumor microenvironment states. As of this preprint, no public code repository or trained weights are available.
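One way to picture combining learned representations with TMB and MSI is a plain feature concatenation fed to a simple classifier. This is a sketch under stated assumptions, not the authors' method: the preprint's integration approach is not described in this summary, no code is released, and the `build_features` and `predict_benefit` helpers, the log transform of TMB, the logistic model, and all numeric values below are hypothetical.

```python
import math


def build_features(embedding, tmb, msi_high):
    """Concatenate a patient embedding with two biomarker features.

    TMB is log-transformed (an illustrative choice) and MSI status is
    encoded as a 0/1 indicator.
    """
    if tmb < 0:
        raise ValueError("TMB must be non-negative")
    return list(embedding) + [math.log1p(tmb), 1.0 if msi_high else 0.0]


def predict_benefit(features, weights, bias):
    """Logistic model: probability of clinical benefit for one patient."""
    if len(features) != len(weights):
        raise ValueError("feature and weight lengths differ")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder embedding; a real one would come from the pretrained model.
embedding = [0.12, -0.40, 0.88]
x = build_features(embedding, tmb=12.5, msi_high=False)

# Untrained weights, so the model is maximally uncertain.
p = predict_benefit(x, weights=[0.0] * len(x), bias=0.0)
```

In practice the weights would be fit on a cohort with known treatment outcomes, and the comparison of interest is whether the embedding features add predictive value beyond TMB and MSI alone.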

Applications

OncoBERT is aimed at precision oncology, supporting patient stratification and treatment selection from routine clinical sequencing panels. By assigning tumors to context-aware mutational subtypes and augmenting established biomarkers such as TMB and MSI, it could help oncologists and translational researchers identify patients more or less likely to benefit from specific chemotherapies, targeted agents, or immunotherapies, and connect genomic patterns to underlying tumor biology when transcriptomic data are available.

Impact

OncoBERT demonstrates that large-scale, multi-institution clinical sequencing can be leveraged with language-model pretraining to derive clinically meaningful, context-aware mutational subtypes that improve on single-mutation interpretation and complement approved biomarkers. Its scale across cancer types and institutions is a notable strength for generalizability. The main current limitation is practical access: as a February 2026 preprint, results await peer review and external validation, and no code or trained weights have been released, so independent reproduction is not yet possible.

Citation

Patkar, S., et al. (2026) OncoBERT: Context-Aware Modeling of Somatic Mutations for Precision Oncology. bioRxiv.

DOI: 10.64898/2026.02.18.706658

Citations

Total citations: 0

Influential: 0

References: 78

Openness

bio.rodeo openness: Closed · low usability and reproducibility

7 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 3

Model Openness Framework: Unclassified

Restrictive license on core components

Resources

Research Paper
