A self-supervised masked autoencoder for RNA-seq count data, pretrained on 1.4M public samples to learn transferable transcriptomic representations without per-dataset retraining.
TxFM (developed and released under the code name OpenTxFM) is a transcriptomics foundation model from Recursion Pharmaceuticals that learns reusable representations of bulk and single-cell RNA-seq data through self-supervised pretraining. Rather than training a bespoke model for each new dataset, TxFM is designed as an inductive learner: once pretrained, it produces embeddings for previously unseen samples without any per-dataset retraining, making it a drop-in feature extractor for downstream transcriptomic analysis.
The core idea, described in the paper "Effective Biological Representation Learning by Masking Gene Expression," is that masking gene expression counts and reconstructing them is a sufficient and surprisingly effective pretraining objective for RNA-seq. The authors report that this masked-autoencoder approach yields representations that outperform several existing transcriptomics foundation models trained on corpora more than 100x larger, suggesting that data curation and a well-matched objective can matter more than raw corpus size.
TxFM was introduced by Kian Kenyon-Dean, Alina Selega, Ihab Bendidi, and colleagues at Recursion, and presented at the ICLR 2026 Workshop on Foundation Models for Science. The accompanying code is being released openly via the OpenTxFM repository.
TxFM is a self-supervised masked autoencoder that operates directly on RNA-seq count data, using a transformer-style encoder to learn contextual representations of gene expression. The training objective masks a subset of gene expression values and tasks the model with reconstructing them, analogous to masked-token objectives in language modeling but adapted to the sparse, high-dimensional, count-valued structure of transcriptomic profiles. Pretraining is performed on DiverseRNA-1.4M, a curated set of approximately 1.4 million public RNA-seq samples assembled to span a broad range of tissues and conditions. In the paper's evaluations, the resulting representations outperform transcriptomics foundation models pretrained on substantially larger corpora across downstream benchmarks, supporting the claim that a well-matched masking objective combined with careful curation can matter more than corpus scale. Because the work is a 2026 workshop paper, these results should be read as recent and preliminary.
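To make the masking objective concrete, the following is a minimal NumPy sketch of masked reconstruction on count data. It is not TxFM's implementation: the 50% mask ratio, the log1p transform, the use of zero as the mask value, the mean-squared-error loss, and the function names `mask_counts` and `masked_reconstruction_loss` are all assumptions chosen for illustration, and the model that would produce `pred` is omitted.

```python
import numpy as np

def mask_counts(counts, mask_ratio=0.5, rng=None):
    """Hide a random subset of gene values in each sample.

    Assumptions (not from the paper): masked entries are set to 0 and
    visible entries are log1p-transformed.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(counts.shape) < mask_ratio  # True = hidden from the model
    visible = np.where(mask, 0.0, np.log1p(counts))
    return visible, mask

def masked_reconstruction_loss(pred, counts, mask):
    """Mean squared error on log1p counts, scored on masked entries only."""
    target = np.log1p(counts)
    sq_err = (pred - target) ** 2
    return float(sq_err[mask].sum() / max(int(mask.sum()), 1))

# Toy batch: 4 samples x 10 genes of Poisson-distributed counts.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(4, 10)).astype(float)
visible, mask = mask_counts(counts, mask_ratio=0.5, rng=rng)

# A perfect reconstruction scores zero; echoing the masked input does not
# help, because the loss only looks at positions the model could not see.
perfect_loss = masked_reconstruction_loss(np.log1p(counts), counts, mask)
echo_loss = masked_reconstruction_loss(visible, counts, mask)
```

Scoring only the masked positions is what forces an encoder to infer hidden genes from the expression context of the visible ones, rather than copying its input.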
TxFM is intended as a general-purpose representation extractor for transcriptomic data, useful to computational biologists and drug-discovery teams who need transferable embeddings for tasks such as sample characterization, perturbation and phenotype analysis, and clustering or classification of RNA-seq profiles. Because the model is inductive, practitioners can embed new datasets without retraining, lowering the barrier to applying foundation-model representations in routine analysis pipelines and large-scale screening workflows of the kind central to Recursion's platform.
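The inductive workflow can be sketched as follows. Since no pretrained checkpoint is available, the encoder here is a stand-in: a fixed random projection of log1p counts, which is purely hypothetical and has nothing to do with TxFM's actual weights or API. The names `embed`, `nearest_centroid_predict`, `N_GENES`, and `EMBED_DIM` are likewise illustrative assumptions. The point is only the shape of the pipeline: the encoder is frozen, and new samples are embedded with a forward pass and no fitting step.

```python
import numpy as np

N_GENES, EMBED_DIM = 50, 8  # illustrative sizes, not TxFM's

# Stand-in for frozen pretrained weights (hypothetical; NOT TxFM's weights).
_FROZEN_W = np.random.default_rng(0).normal(size=(N_GENES, EMBED_DIM))

def embed(counts):
    """Map (samples x genes) counts to embeddings using fixed weights only."""
    return np.log1p(np.asarray(counts, dtype=float)) @ _FROZEN_W

def nearest_centroid_predict(train_emb, train_labels, new_emb):
    """Assign each new embedding the label of the closest class centroid."""
    classes = np.unique(train_labels)
    centroids = np.stack(
        [train_emb[train_labels == c].mean(axis=0) for c in classes]
    )
    dists = np.linalg.norm(new_emb[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# A small labeled reference set and a previously unseen dataset (toy data).
rng = np.random.default_rng(1)
labeled_counts = rng.poisson(3.0, size=(20, N_GENES))
labels = np.array([0] * 10 + [1] * 10)
new_counts = rng.poisson(3.0, size=(5, N_GENES))

# Embedding the new dataset never updates the encoder.
new_emb = embed(new_counts)
preds = nearest_centroid_predict(embed(labeled_counts), labels, new_emb)
```

Because each sample's embedding depends only on that sample and the frozen weights, new data can be embedded one sample or one batch at a time, and only the lightweight downstream step (here, a nearest-centroid classifier) is dataset-specific.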
TxFM contributes to an ongoing debate in the single-cell and transcriptomics foundation-model community about whether bigger pretraining corpora reliably yield better biological representations. By reporting competitive or superior performance against models trained on more than 100x more data, the work argues that objective design and data curation can outweigh scale, a finding with practical implications for groups without access to massive compute or data. As of release, the OpenTxFM repository is still under construction and no pretrained checkpoint has been posted, so users should expect to track the repository for the model weights and full reproduction code. That limitation aside, the open-code commitment and the data-efficiency result make TxFM a noteworthy entry in the transcriptomics foundation-model landscape.