TxFM

Transcriptomics foundation model from Recursion that masks and reconstructs RNA-seq gene expression counts to learn reusable sample embeddings.

Released: May 2026

TxFM (developed and released under the code name OpenTxFM) is a transcriptomics foundation model from Recursion Pharmaceuticals that learns reusable representations of bulk and single-cell RNA-seq data through self-supervised pretraining. Rather than training a bespoke model for each new dataset, TxFM is designed as an inductive learner: once pretrained, it produces embeddings for previously unseen samples without any per-dataset re-training, making it a drop-in feature extractor for downstream transcriptomic analysis.

The core idea, described in the paper "Effective Biological Representation Learning by Masking Gene Expression," is that masking gene expression counts and reconstructing them is a sufficient and surprisingly effective pretraining objective for RNA-seq. The authors report that this masked-autoencoder approach yields representations that outperform several existing transcriptomics foundation models trained on corpora more than 100x larger, suggesting that data curation and a well-matched objective can matter more than raw corpus size.

TxFM was introduced by Kian Kenyon-Dean, Alina Selega, Ihab Bendidi, and colleagues at Recursion, and presented at the ICLR 2026 Workshop on Foundation Models for Science. The accompanying code is being released openly via the OpenTxFM repository.

Key Features

Masked gene-expression pretraining: TxFM is trained as a masked autoencoder over RNA-seq count vectors, learning to reconstruct held-out gene expression values from the observed context — a self-supervised objective requiring no labels.
Inductive, transferable representations: The pretrained model generalizes to new datasets and embeds unseen samples directly, eliminating the per-dataset re-training step required by transductive approaches.
Data efficiency over scale: The authors report that TxFM matches or exceeds transcriptomics foundation models pretrained on corpora more than 100x larger, emphasizing curated data quality and objective design over sheer corpus size.
Curated pretraining corpus: Pretraining uses DiverseRNA-1.4M, a collection of roughly 1.4 million curated public RNA-seq samples spanning diverse biological contexts.
Open release: Code is being published as OpenTxFM to support reproducibility and community use.

Technical Details

TxFM is a self-supervised masked autoencoder operating directly on RNA-seq count data, with a transformer-style encoder learning contextual representations of gene expression. The training objective masks a subset of gene expression values and tasks the model with reconstructing them, analogous to masked-token objectives in language modeling but adapted to the sparse, high-dimensional, count-valued structure of transcriptomic profiles. Pretraining is performed on DiverseRNA-1.4M, a curated set of approximately 1.4 million public RNA-seq samples assembled to span a broad range of tissues and conditions. In the paper's evaluations, the resulting representations outperform transcriptomics foundation models pretrained on substantially larger corpora across downstream benchmarks, supporting the claim that the masking objective combined with careful curation is more important than corpus scale. As the work is a 2026 workshop paper, these results should be read as recent and preliminary relative to peer-reviewed venues.
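
The masking objective described above can be illustrated with a minimal NumPy sketch. It is not the TxFM implementation: the mask ratio, the log1p transform, zero-filling of masked positions, the mean-squared-error loss, and the function names `mask_expression` and `masked_reconstruction_loss` are all assumptions chosen for illustration, and a trivial stand-in replaces the real encoder.

```python
import numpy as np


def mask_expression(counts, mask_ratio=0.5, rng=None):
    """Hide a random subset of genes in each sample.

    counts: (n_samples, n_genes) array of non-negative RNA-seq counts.
    Returns the log1p-transformed input with masked entries zeroed,
    the log1p targets, and the boolean mask (True = hidden).
    mask_ratio and the log1p transform are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    targets = np.log1p(np.asarray(counts, dtype=np.float64))
    mask = rng.random(targets.shape) < mask_ratio
    masked_input = np.where(mask, 0.0, targets)
    return masked_input, targets, mask


def masked_reconstruction_loss(predictions, targets, mask):
    """Mean squared error computed only over the masked entries."""
    n_masked = mask.sum()
    if n_masked == 0:
        return 0.0
    squared_error = (predictions - targets) ** 2
    return float(squared_error[mask].sum() / n_masked)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = rng.poisson(lam=3.0, size=(4, 10))  # toy count matrix
    masked_input, targets, mask = mask_expression(counts, 0.5, rng)

    # Stand-in "model": predict each sample's mean visible value everywhere.
    visible = ~mask
    row_means = masked_input.sum(axis=1) / np.maximum(visible.sum(axis=1), 1)
    predictions = np.repeat(row_means[:, None], counts.shape[1], axis=1)

    print("masked fraction:", mask.mean())
    print("loss:", masked_reconstruction_loss(predictions, targets, mask))
```

Scoring only the masked entries means the model gets no credit for copying values it was shown, which is the usual design in masked-autoencoder training; whether TxFM scores visible entries as well is not stated here.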

Applications

TxFM is intended as a general-purpose representation extractor for transcriptomic data, useful to computational biologists and drug-discovery teams who need transferable embeddings for tasks such as sample characterization, perturbation and phenotype analysis, and clustering or classification of RNA-seq profiles. Because the model is inductive, practitioners can embed new datasets without retraining, lowering the barrier to applying foundation-model representations in routine analysis pipelines and large-scale screening workflows of the kind central to Recursion's platform.
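
The inductive workflow described above, embedding new data with a frozen model and handing the embeddings to a simple downstream method, might look like the sketch below. No pretrained checkpoint or public API is assumed: `FrozenEncoder` is a hypothetical stand-in (a fixed random projection of log1p counts), and the nearest-centroid classifier is just one illustrative downstream task.

```python
import numpy as np


class FrozenEncoder:
    """Hypothetical stand-in for a pretrained encoder.

    A fixed random projection of log1p counts replaces the real model;
    the point is only that its parameters never change after creation.
    """

    def __init__(self, n_genes, embedding_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(size=(n_genes, embedding_dim)) / np.sqrt(n_genes)

    def embed(self, counts):
        # Pure function of the input: no fitting on the incoming dataset.
        return np.log1p(np.asarray(counts, dtype=np.float64)) @ self.weights


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class in a labeled reference set."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_nearest_centroid(embeddings, classes, centroids):
    """Assign each embedding to the class with the closest centroid."""
    distances = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_genes = 50
    encoder = FrozenEncoder(n_genes)

    # Toy reference data: two conditions with different mean expression.
    reference = np.vstack(
        [rng.poisson(2.0, (20, n_genes)), rng.poisson(8.0, (20, n_genes))]
    )
    labels = np.array([0] * 20 + [1] * 20)
    classes, centroids = fit_centroids(encoder.embed(reference), labels)

    # A new, previously unseen dataset is embedded with the same frozen weights.
    new_samples = rng.poisson(8.0, (5, n_genes))
    print(predict_nearest_centroid(encoder.embed(new_samples), classes, centroids))
```

Only the lightweight downstream step (here, the centroids) is fitted per dataset; the encoder itself is never updated, which is what distinguishes this from a transductive approach.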

Impact

TxFM contributes to an ongoing debate in the single-cell and transcriptomics foundation-model community about whether bigger pretraining corpora reliably yield better biological representations. By reporting competitive or superior performance against models trained on more than 100x more data, the work argues that objective design and data curation can outweigh scale, a finding with practical implications for groups without access to massive compute or data. As of release, the OpenTxFM repository is still under construction and no pretrained checkpoint has been posted, so users should track the repository for model weights and full reproduction code. That limitation aside, the open-code commitment and the data-efficiency result make TxFM a noteworthy entry in the transcriptomics foundation-model landscape.

Citation

Effective Biological Representation Learning by Masking Gene Expression

Preprint

Kenyon-Dean, K., et al. (2026) Effective Biological Representation Learning by Masking Gene Expression.

DOI: 10.48550/arXiv.2605.31562

Citations

Total citations: 0
Influential: 0
References: 57

GitHub

Stars: 2
Forks: 0
Open issues: 0
Contributors: 1
Last push: 3 months ago

Openness

bio.rodeo openness: Closed (score 12), low usability and reproducibility
Usability (can I run it?): 11
Reproducibility (can I retrain it?): 16

Model Openness Framework: Unclassified (restrictive license on core components)
