A self-supervised transformer for normalization-robust bulk RNA-seq representation learning, pretrained on harmonized TCGA Pan-Cancer data via TF-IDF gene ordering and masked gene modeling.
TifBERT is a self-supervised foundation model for bulk RNA-seq representation learning, developed by Seyedmohsen Hosseini and Divya Sharma at York University and released as a bioRxiv preprint in June 2026. While transformer-based foundation models have proliferated for single-cell transcriptomics, bulk RNA-seq — still the workhorse of translational genomics and large clinical cohorts — has received far less attention. TifBERT addresses that gap with a model designed to produce reusable, normalization-robust representations of whole-transcriptome bulk expression profiles.
The central innovation is how TifBERT turns an inherently unordered expression profile into something a transformer can consume. Rather than discretizing expression into bins, reconstructing numerical values, or restricting attention to a landmark gene panel, it converts each sample into a sample-specific gene sequence using term frequency-inverse document frequency (TF-IDF) ordering. This prioritizes genes that are both highly expressed within a sample and selectively expressed across the cohort. The model is then pretrained with a masked gene modeling objective that predicts gene identities from transcriptomic context — learning relationships between genes rather than memorizing absolute expression magnitudes.
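A minimal sketch of the two steps described above, under stated assumptions. This summary does not give TifBERT's exact TF-IDF formula, detection threshold, masking rate, or tokenization, so everything below is illustrative rather than the authors' implementation: the helper names `tfidf_gene_order` and `mask_genes`, the 15% mask rate, the `[MASK]` token, and the toy expression values are all hypothetical. Term frequency is taken as a gene's share of the sample's total expression, and the IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1, the same one scikit-learn's `TfidfTransformer` applies by default.

```python
import math
import random


def tfidf_gene_order(expr, genes, top_k=None):
    """Return one gene sequence per sample, sorted by descending TF-IDF.

    expr  : list of samples, each a list of non-negative expression values
    genes : gene symbols aligned with the columns of expr
    top_k : optional cap on sequence length (None keeps every detected gene)
    """
    n_samples, n_genes = len(expr), len(genes)
    # Document frequency: number of samples in which each gene is detected
    # (assumption: "detected" means expression > 0).
    df = [sum(1 for row in expr if row[j] > 0) for j in range(n_genes)]
    # Smoothed IDF (assumed form): genes detected everywhere get weight 1.0.
    idf = [math.log((1 + n_samples) / (1 + d)) + 1.0 for d in df]

    sequences = []
    for row in expr:
        total = sum(row) or 1.0  # guard against an all-zero sample
        scores = [(row[j] / total) * idf[j] for j in range(n_genes)]
        # Keep detected genes only; break ties by column index for determinism.
        order = sorted((j for j in range(n_genes) if row[j] > 0),
                       key=lambda j: (-scores[j], j))
        sequences.append([genes[j] for j in order[:top_k]])
    return sequences


def mask_genes(sequence, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Hide random gene tokens; labels hold the hidden identities, else None."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for gene in sequence:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(gene)   # the model would be trained to predict this
        else:
            masked.append(gene)
            labels.append(None)   # position ignored by the training loss
    return masked, labels


# Toy cohort (made-up values): ACTB and GAPDH are detected in every sample,
# KRT5 and ALB in one sample each.
genes = ["ACTB", "GAPDH", "KRT5", "ALB"]
expr = [
    [900.0, 800.0, 700.0, 0.0],
    [950.0, 700.0, 0.0, 600.0],
    [880.0, 820.0, 0.0, 0.0],
]
sequences = tfidf_gene_order(expr, genes)
for seq in sequences:
    print(seq, mask_genes(seq, mask_prob=0.5, rng=random.Random(1)))
```

In this toy cohort the ubiquitously detected ACTB and GAPDH are down-weighted, so KRT5 and ALB lead the sequences of the samples that express them even though their raw values are lower. Because term frequency here is a within-sample share, multiplying a sample by a positive constant leaves its sequence unchanged, which hints at why an ordering-based input can tolerate some upstream normalization differences.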
By avoiding expression binning, landmark-gene restriction, and external gene embeddings, TifBERT aims to be robust to the normalization scheme used upstream, a persistent source of irreproducibility when combining RNA-seq cohorts.
TifBERT is a transformer encoder pretrained on harmonized TCGA Pan-Cancer bulk
RNA-seq spanning approximately 10,000 genes, 33 cancer types, and five RNA-seq
normalization schemes, using masked gene modeling over TF-IDF-ordered gene
sequences. On TCGA cancer type classification across the 33 types, it reports
90.83% accuracy, 0.996 macro AUC-ROC, and 0.903 Matthews correlation
coefficient. It also captures pathway-level biology, with mean sample-wise and
pathway-wise Pearson correlations of 0.754 and 0.762 across 1,387 PARADIGM
pathway activities. The reference implementation (Python and Jupyter notebooks)
loads a fixed model.safetensors checkpoint at inference time. As of the
preprint, the pretrained weights are not publicly released (the inference
script hardcodes a local checkpoint path), and the repository carries no
license file. Portions of the codebase also still use the project's earlier
internal name "bulkGeneFormer."
TifBERT targets researchers working with bulk transcriptomic cohorts in cancer genomics and translational research, where data are routinely aggregated across studies that used different normalization pipelines. Its representations can support cancer type and subtype classification, pathway activity inference, and general-purpose embedding of samples for downstream analysis. Because it is reported to generalize zero-shot to healthy GTEx tissues, it may also be relevant to broader tissue-expression characterization beyond the oncology setting in which it was trained.
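For readers who want to run a similar pathway-activity evaluation, the sample-wise and pathway-wise Pearson figures quoted earlier can be read as averages taken along different axes of the same samples-by-pathways matrix. The sketch below shows one plausible way to compute both. It is an interpretation, not the authors' evaluation code: the function names, the NaN handling for constant vectors, and the toy numbers are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def mean_ignoring_nan(values):
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


def samplewise_pearson(true, pred):
    """Correlate each row (one sample across all pathways), then average."""
    return mean_ignoring_nan([pearson(t, p) for t, p in zip(true, pred)])


def pathwaywise_pearson(true, pred):
    """Correlate each column (one pathway across all samples), then average."""
    return mean_ignoring_nan(
        [pearson(t, p) for t, p in zip(zip(*true), zip(*pred))]
    )


# Toy matrices: rows are samples, columns are pathways (made-up values).
true = [[1.0, 2.0, 3.0], [2.0, 4.0, 7.0], [0.0, 1.0, 5.0]]
# Predictions equal the truth plus a different constant offset per sample.
offsets = [0.0, 10.0, -5.0]
pred = [[v + off for v in row] for row, off in zip(true, offsets)]

print("sample-wise :", samplewise_pearson(true, pred))
print("pathway-wise:", pathwaywise_pearson(true, pred))
```

A constant per-sample offset leaves every row perfectly correlated, so the sample-wise mean stays at 1 while the pathway-wise mean drops below it. The two summaries penalize different kinds of error, which is a reason to report both.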
TifBERT contributes to a growing effort to bring foundation-model methodology to bulk RNA-seq, a modality underserved relative to single-cell data despite its central role in clinical and population-scale genomics. Its TF-IDF ordering plus masked-gene-modeling recipe offers an alternative to discretization- and reconstruction-based approaches, and its emphasis on normalization robustness and stable, high-rank embeddings addresses practical reproducibility barriers in multi-cohort transcriptomics. As a preprint with weights not yet released, its real-world adoption remains to be established, but it stakes out a clear design direction for normalization-independent bulk transcriptomic foundation models.
Hosseini, S. & Sharma, D. (2026) TifBERT: a self-supervised foundation model for normalization-robust bulk RNA-seq representation learning. bioRxiv.
DOI: 10.64898/2026.06.08.728683