bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Single-cell

TxFM

Recursion Pharmaceuticals

A self-supervised masked autoencoder for RNA-seq count data, pretrained on 1.4M public samples to learn transferable transcriptomic representations without per-dataset re-training.

Released: May 2026

TxFM (developed and released under the code name OpenTxFM) is a transcriptomics foundation model from Recursion Pharmaceuticals that learns reusable representations of bulk and single-cell RNA-seq data through self-supervised pretraining. Rather than training a bespoke model for each new dataset, TxFM is designed as an inductive learner: once pretrained, it produces embeddings for previously unseen samples without any per-dataset re-training, making it a drop-in feature extractor for downstream transcriptomic analysis.

The core idea, described in the paper "Effective Biological Representation Learning by Masking Gene Expression," is that masking gene expression counts and reconstructing them is a sufficient and surprisingly effective pretraining objective for RNA-seq. The authors report that this masked-autoencoder approach yields representations that outperform several existing transcriptomics foundation models trained on corpora more than 100x larger, suggesting that data curation and a well-matched objective can matter more than raw corpus size.

TxFM was introduced by Kian Kenyon-Dean, Alina Selega, Ihab Bendidi, and colleagues at Recursion, and presented at the ICLR 2026 Workshop on Foundation Models for Science. The accompanying code is being released openly via the OpenTxFM repository.

#Key Features

  • Masked gene-expression pretraining: TxFM is trained as a masked autoencoder over RNA-seq count vectors, learning to reconstruct held-out gene expression values from the observed context — a self-supervised objective requiring no labels.
  • Inductive, transferable representations: The pretrained model generalizes to new datasets and embeds unseen samples directly, eliminating the per-dataset re-training step required by transductive approaches.
  • Data efficiency over scale: The authors report that TxFM matches or exceeds transcriptomics foundation models pretrained on corpora more than 100x larger, emphasizing curated data quality and objective design over sheer corpus size.
  • Curated pretraining corpus: Pretraining uses DiverseRNA-1.4M, a collection of roughly 1.4 million curated public RNA-seq samples spanning diverse biological contexts.
  • Open release: Code is being published as OpenTxFM to support reproducibility and community use.

#Technical Details

TxFM is a self-supervised masked autoencoder that operates directly on RNA-seq count data, using a transformer-style encoder to learn contextual representations of gene expression. The training objective masks a subset of gene expression values and tasks the model with reconstructing them. This is analogous to masked-token objectives in language modeling, but adapted to the sparse, high-dimensional, count-valued structure of transcriptomic profiles. Pretraining is performed on DiverseRNA-1.4M, a curated set of approximately 1.4 million public RNA-seq samples assembled to span a broad range of tissues and conditions. In the paper's evaluations, the resulting representations outperform transcriptomics foundation models pretrained on substantially larger corpora across downstream benchmarks, which the authors take as evidence that a masking objective combined with careful curation can matter more than corpus scale. Because the work is a 2026 workshop paper, these results should be read as recent and preliminary compared with results from full peer-reviewed venues.
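The masking objective described above can be illustrated with a toy sketch. This is not the OpenTxFM implementation: the 25% mask ratio, the `log1p` transform, the `-1.0` sentinel for hidden genes, the squared-error loss, and the function names `mask_counts` and `masked_mse` are all assumptions made for illustration, since this page does not state the paper's actual choices. A mean-value baseline stands in for the transformer encoder.

```python
import math
import random


def mask_counts(counts, mask_ratio, rng):
    """Hide a random subset of genes; return (masked_input, mask_flags)."""
    n_masked = max(1, round(mask_ratio * len(counts)))
    masked_idx = set(rng.sample(range(len(counts)), n_masked))
    mask = [i in masked_idx for i in range(len(counts))]
    # Hidden genes get a sentinel (-1.0); visible genes are log1p-transformed.
    # Real log1p counts are always >= 0, so the sentinel cannot collide.
    masked_input = [-1.0 if m else math.log1p(c) for c, m in zip(counts, mask)]
    return masked_input, mask


def masked_mse(predicted, counts, mask):
    """Mean squared error on log1p counts, computed over masked genes only."""
    errors = [
        (p - math.log1p(c)) ** 2
        for p, c, m in zip(predicted, counts, mask)
        if m
    ]
    return sum(errors) / len(errors)


rng = random.Random(0)
counts = [0, 12, 0, 3, 150, 0, 7, 41]  # toy count vector for one sample
masked_input, mask = mask_counts(counts, mask_ratio=0.25, rng=rng)

# Stand-in "model" (assumption): predict the mean visible value for every gene.
visible = [x for x, m in zip(masked_input, mask) if not m]
baseline = sum(visible) / len(visible)
predicted = [baseline] * len(counts)

loss = masked_mse(predicted, counts, mask)
print(f"masked genes: {sum(mask)}, reconstruction loss: {loss:.3f}")
```

Only the hidden positions contribute to the loss, so the model is rewarded for inferring masked expression values from the visible context rather than for copying its input.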

#Applications

TxFM is intended as a general-purpose representation extractor for transcriptomic data, useful to computational biologists and drug-discovery teams who need transferable embeddings for tasks such as sample characterization, perturbation and phenotype analysis, and clustering or classification of RNA-seq profiles. Because the model is inductive, practitioners can embed new datasets without retraining, lowering the barrier to applying foundation-model representations in routine analysis pipelines and large-scale screening workflows of the kind central to Recursion's platform.
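A minimal sketch of what "embed new datasets without retraining" looks like in a downstream pipeline follows. Because no pretrained checkpoint or public API is described on this page, `embed` is a hypothetical stand-in for a frozen encoder (here just a normalized `log1p` transform), and the tissue labels and count values are invented toy data. The point is the workflow: the encoder is never updated, and only a lightweight nearest-centroid step uses labels.

```python
import math


def embed(counts):
    """Hypothetical frozen encoder; stands in for a pretrained model's embedding call."""
    logged = [math.log1p(c) for c in counts]
    norm = math.sqrt(sum(x * x for x in logged)) or 1.0
    return [x / norm for x in logged]


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def nearest_label(vec, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(vec, centroids[label]))


# Toy labeled reference samples (invented values, four genes each).
reference = {
    "liver": [[90, 2, 1, 0], [120, 5, 0, 1]],
    "brain": [[1, 0, 80, 60], [3, 1, 100, 75]],
}

# Fit the downstream step only: one centroid per label in embedding space.
centroids = {
    label: centroid([embed(sample) for sample in samples])
    for label, samples in reference.items()
}

# A previously unseen sample is embedded with the same frozen encoder.
new_sample = [2, 0, 95, 50]
predicted = nearest_label(embed(new_sample), centroids)
print(predicted)
```

With a real pretrained encoder, `embed` would be replaced by the model's forward pass, while the centroid step (or any other lightweight classifier or clustering method) would stay the same.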

#Impact

TxFM contributes to an ongoing debate in the single-cell and transcriptomics foundation-model community about whether bigger pretraining corpora reliably yield better biological representations. By reporting competitive or superior performance against models trained on more than 100x more data, the work argues that objective design and data curation can outweigh scale, a finding with practical implications for groups without access to massive compute or data. As of release, the OpenTxFM repository is still under construction and no pretrained checkpoint has been posted, so users should track the repository for model weights and full reproduction code. That limitation aside, the open-code commitment and the data-efficiency result make TxFM a noteworthy entry in the transcriptomics foundation-model landscape.

Citation

Preprint

DOI: 10.48550/arXiv.2605.31562


Openness

Unclassified
Restrictive license on core components

Tags

autoencoder, foundation_model, gene_expression, representation_learning, self_supervised, transcriptomics, transformer

Resources

GitHub Repository
Research Paper