bio.rodeo

TifBERT

York University

A self-supervised transformer for normalization-robust bulk RNA-seq representation learning, pretrained on harmonized TCGA Pan-Cancer data via TF-IDF gene ordering and masked gene modeling.

Released: June 2026

TifBERT is a self-supervised foundation model for bulk RNA-seq representation learning, developed by Seyedmohsen Hosseini and Divya Sharma at York University and released as a bioRxiv preprint in June 2026. Transformer-based foundation models have proliferated for single-cell transcriptomics, but bulk RNA-seq, still the workhorse of translational genomics and large clinical cohorts, has received far less attention. TifBERT addresses that gap with a model designed to produce reusable, normalization-robust representations of whole-transcriptome bulk expression profiles.

The central innovation is how TifBERT turns an inherently unordered expression profile into something a transformer can consume. Rather than discretizing expression into bins, reconstructing numerical values, or restricting attention to a landmark gene panel, it converts each sample into a sample-specific gene sequence using term frequency-inverse document frequency (TF-IDF) ordering. This prioritizes genes that are both highly expressed within a sample and selectively expressed across the cohort. The model is then pretrained with a masked gene modeling objective that predicts gene identities from transcriptomic context, so it learns relationships between genes rather than memorizing absolute expression magnitudes.
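The TF-IDF ordering step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `tfidf_gene_order`, the within-sample normalization used for term frequency, and the smoothed IDF formula are all choices made here, since the source does not give the exact weighting.

```python
import numpy as np


def tfidf_gene_order(expr, gene_names, top_k=None):
    """Rank genes within each sample by a TF-IDF-style score (hypothetical helper).

    expr: (n_samples, n_genes) matrix of non-negative expression values.
    gene_names: sequence of n_genes gene identifiers.
    top_k: optionally keep only the k highest-scoring genes per sample.
    """
    expr = np.asarray(expr, dtype=float)
    n_samples = expr.shape[0]

    # Term frequency (assumed form): each gene's share of the sample total.
    totals = expr.sum(axis=1, keepdims=True)
    tf = expr / np.where(totals > 0, totals, 1.0)

    # Document frequency: number of samples in which the gene is detected.
    df = (expr > 0).sum(axis=0)

    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = np.log((1 + n_samples) / (1 + df)) + 1.0

    scores = tf * idf

    # Sort descending by score; a stable sort keeps ties in input gene order.
    order = np.argsort(-scores, axis=1, kind="stable")
    if top_k is not None:
        order = order[:, :top_k]

    names = np.asarray(gene_names)
    return [names[row].tolist() for row in order]


# Toy cohort: gene A is expressed everywhere, B and C in one sample each.
toy_expr = [[6, 5, 0],
            [10, 0, 1],
            [10, 0, 0]]
toy_order = tfidf_gene_order(toy_expr, ["A", "B", "C"])
print(toy_order[0])  # B ranks ahead of A despite lower raw expression
```

In this formulation, multiplying a sample by a positive constant leaves its term frequencies, and therefore its gene order, unchanged. That property holds for the sketch; whether TifBERT's own weighting behaves the same way is not stated in the source.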

By avoiding expression binning, landmark-gene restriction, and external gene embeddings, TifBERT aims to be robust to the normalization scheme used upstream, a persistent source of irreproducibility when combining RNA-seq cohorts.
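The masked gene modeling objective described above can be illustrated by the input corruption step alone. The sketch below assumes gene tokens are integer IDs and that every selected position is replaced by a single mask token. The 15% masking rate is BERT's conventional default rather than a value confirmed for TifBERT, and `mask_gene_tokens` is a hypothetical name.

```python
import numpy as np

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_gene_tokens(token_ids, mask_token_id, mask_prob=0.15, rng=None):
    """Randomly mask gene tokens for a BERT-style objective (hypothetical helper).

    Returns (inputs, labels): inputs has masked positions replaced by
    mask_token_id; labels holds the original gene ID at masked positions
    and IGNORE_INDEX everywhere else.
    """
    rng = np.random.default_rng() if rng is None else rng
    token_ids = np.asarray(token_ids)

    # Select each position independently with probability mask_prob.
    is_masked = rng.random(token_ids.shape) < mask_prob

    inputs = np.where(is_masked, mask_token_id, token_ids)
    labels = np.where(is_masked, token_ids, IGNORE_INDEX)
    return inputs, labels


# Two toy "sentences" of gene IDs; ID 0 is reserved here for the mask token.
batch = np.arange(1, 21).reshape(2, 10)
inputs, labels = mask_gene_tokens(batch, mask_token_id=0, rng=np.random.default_rng(0))
```

Because the labels are gene identities rather than expression values, the loss is a classification over the gene vocabulary, which matches the source's description of predicting gene identities from context.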

#Key Features

  • TF-IDF gene ordering: Converts each unordered expression profile into a sample-specific token sequence by ranking genes on within-sample expression and cross-cohort selectivity, sidestepping the need for expression discretization.
  • Masked gene modeling: A BERT-style self-supervised objective predicts masked gene identities from context, learning contextual gene relationships instead of reconstructing raw expression values.
  • Normalization robustness: Pretrained across five RNA-seq normalization schemes so representations transfer across heterogeneous cohorts without re-tuning to a specific pipeline.
  • Zero-shot tissue transfer: Applied to independent GTEx healthy tissues without retraining, it preserves tissue-level transcriptomic structure, demonstrating generalization beyond its cancer training distribution.
  • Rich embedding geometry: Reports an effective rank of 95.6, versus 6.3 for a comparison model, indicating more expressive, less-collapsed representations.
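For readers unfamiliar with the effective rank figure in the last feature above, the sketch below computes one common entropy-based version: the exponential of the Shannon entropy of the normalized singular values. The source does not say which definition TifBERT's evaluation uses, so both the formula and the column centering are assumptions made for illustration.

```python
import numpy as np


def effective_rank(embeddings, eps=1e-12):
    """Entropy-based effective rank of an (n_samples, dim) embedding matrix.

    Assumed definition: exp(H(p)), where p is the singular value spectrum
    normalized to sum to 1 and H is the Shannon entropy.
    """
    x = np.asarray(embeddings, dtype=float)

    # Center each dimension so a shared offset does not count as a direction.
    x = x - x.mean(axis=0, keepdims=True)

    singular_values = np.linalg.svd(x, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)

    # Drop numerically zero components to avoid log(0).
    p = p[p > eps]

    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


# A collapsed embedding: every sample lies along a single direction.
collapsed = np.outer(np.arange(1, 6), [1.0, 2.0, 3.0])
# A spread-out embedding: independent Gaussian noise in 8 dimensions.
spread = np.random.default_rng(0).normal(size=(2000, 8))
print(effective_rank(collapsed), effective_rank(spread))
```

Under this definition the value ranges from 1 (all variance along one direction) up to the embedding dimension (variance spread evenly), which is why a low number is read as representation collapse.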

#Technical Details

TifBERT is a transformer encoder pretrained with masked gene modeling over TF-IDF-ordered gene sequences. The pretraining data are harmonized TCGA Pan-Cancer bulk RNA-seq profiles spanning approximately 10,000 genes, 33 cancer types, and five RNA-seq normalization schemes. On TCGA cancer type classification across the 33 types, it reports 90.83% accuracy, 0.996 macro AUC-ROC, and a Matthews correlation coefficient of 0.903. It also captures pathway-level biology, with mean sample-wise and pathway-wise Pearson correlations of 0.754 and 0.762, respectively, across 1,387 PARADIGM pathway activities.

The reference implementation (Python and Jupyter notebooks) loads a fixed model.safetensors checkpoint at inference time. As of the preprint, the pretrained weights have not been publicly released (the inference script hardcodes a local checkpoint path), and the repository carries no license file. Portions of the codebase also still use the project's earlier internal name, "bulkGeneFormer."
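The difference between the sample-wise and pathway-wise correlations reported above comes down to which axis the Pearson correlation is computed along. The sketch below assumes predicted and reference pathway activities are stored as (samples × pathways) matrices. That layout and the helper name `mean_pearson` are illustrative choices, not details taken from the paper.

```python
import numpy as np


def mean_pearson(pred, true, axis):
    """Mean Pearson correlation between two matrices along one axis.

    axis=1: one correlation per sample, computed across pathways (sample-wise).
    axis=0: one correlation per pathway, computed across samples (pathway-wise).
    Rows or columns with zero variance are skipped.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)

    # Center along the chosen axis.
    pred_c = pred - pred.mean(axis=axis, keepdims=True)
    true_c = true - true.mean(axis=axis, keepdims=True)

    numerator = (pred_c * true_c).sum(axis=axis)
    denominator = np.sqrt((pred_c ** 2).sum(axis=axis) * (true_c ** 2).sum(axis=axis))

    valid = denominator > 0
    return float(np.mean(numerator[valid] / denominator[valid]))


# Synthetic stand-in data: 50 samples, 20 pathways, noisy predictions.
rng = np.random.default_rng(0)
reference = rng.normal(size=(50, 20))
predicted = reference + 0.5 * rng.normal(size=(50, 20))

sample_wise = mean_pearson(predicted, reference, axis=1)
pathway_wise = mean_pearson(predicted, reference, axis=0)
```

The two numbers answer different questions: sample-wise asks whether the pathway profile of each tumor is recovered, while pathway-wise asks whether variation in each pathway across the cohort is recovered.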

#Applications

TifBERT targets researchers working with bulk transcriptomic cohorts in cancer genomics and translational research, where data are routinely aggregated across studies that used different normalization pipelines. Its representations can support cancer type and subtype classification, pathway activity inference, and general-purpose embedding of samples for downstream analysis. Because it generalizes zero-shot to healthy GTEx tissues, it is also relevant to broader tissue-expression characterization beyond the oncology setting in which it was trained.

#Impact

TifBERT contributes to a growing effort to bring foundation-model methodology to bulk RNA-seq, a modality underserved relative to single-cell data despite its central role in clinical and population-scale genomics. Its TF-IDF ordering plus masked-gene-modeling recipe offers an alternative to discretization- and reconstruction-based approaches, and its emphasis on normalization robustness and stable, high-rank embeddings addresses practical reproducibility barriers in multi-cohort transcriptomics. As a preprint with weights not yet released, its real-world adoption remains to be established, but it stakes out a clear design direction for normalization-independent bulk transcriptomic foundation models.

Citation

TifBERT: a self-supervised foundation model for normalization-robust bulk RNA-seq representation learning

Hosseini, S. & Sharma, D. (2026) TifBERT: a self-supervised foundation model for normalization-robust bulk RNA-seq representation learning. bioRxiv.

DOI: 10.64898/2026.06.08.728683

Citations

Total Citations: 0
Influential: 0
References: 30

GitHub

Stars: 1
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 9d ago
Language: Jupyter Notebook

Openness

bio.rodeo openness: 17, Closed (low usability and reproducibility)
Usability (can I run it?): 11
Reproducibility (can I retrain it?): 23
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

gene_expression, representation_learning, cancer_type_classification, transformer, bert, foundation_model, self_supervised, zero_shot, transcriptomics
