In-context learning foundation model for cell-free RNA, pretrained on synthetic tasks from a cfRNA-specific structural causal model for few-shot cancer classification.
Cell-free RNA (cfRNA) circulating in human plasma offers a minimally invasive window into tissue physiology and disease, making it an attractive substrate for liquid-biopsy cancer detection. However, cfRNA data are unusually difficult for machine learning: measurements are extremely sparse and strongly zero-inflated, abundances follow heavy-tailed distributions, library sizes vary widely, and genes differ markedly in detectability. These properties differ substantially from bulk or single-cell transcriptomics, so models that perform well on conventional tabular or expression data tend to generalize poorly when applied directly to cfRNA, a problem compounded by the scarcity of large, well-labeled cfRNA cohorts.
cfRNA-ICL, introduced by Eigen Bio in a December 2025 bioRxiv preprint, addresses this gap with an in-context learning (ICL) approach adapted from the prior-data fitted network (PFN) / TabPFN family. Instead of training on real labeled cohorts, the model is meta-trained entirely on synthetic classification tasks sampled from a biologically grounded structural causal model (SCM) that is purpose-built to reproduce the statistical geometry of cfRNA. At inference time, classification happens in-context: labeled examples are supplied as context, and predictions for new samples are produced in a single forward pass with no task-specific gradient updates.
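The in-context inference pattern described above can be illustrated with a minimal sketch. This is a toy stand-in, not the cfRNA-ICL architecture: the function name `in_context_predict`, the `log1p` transform, the distance-based attention scores, and the `temperature` parameter are all assumptions chosen for illustration. It shows only the interface idea, in which labeled context samples and unlabeled query samples go in together and class probabilities come out in one pass with no parameter updates.

```python
import numpy as np


def in_context_predict(X_ctx, y_ctx, X_qry, n_classes, temperature=1.0):
    """Toy single-pass in-context classifier (illustrative only).

    Each query sample attends to every labeled context sample, and the
    attention weights are used to average the context labels. Nothing is
    trained or updated here; the labeled examples act purely as context.
    """
    # log1p compresses heavy-tailed counts (an assumed preprocessing step).
    Z_ctx = np.log1p(np.asarray(X_ctx, dtype=float))
    Z_qry = np.log1p(np.asarray(X_qry, dtype=float))

    # Attention scores: negative squared distance from query to context.
    diff = Z_qry[:, None, :] - Z_ctx[None, :, :]
    scores = -np.sum(diff ** 2, axis=-1) / temperature

    # Numerically stable softmax over the context axis.
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    # Attention-weighted average of one-hot context labels.
    one_hot = np.eye(n_classes)[np.asarray(y_ctx)]
    return weights @ one_hot  # shape: (n_queries, n_classes)


# Usage: four labeled context samples, two unlabeled queries.
X_ctx = np.array([[0, 0], [0, 1], [50, 50], [50, 60]])
y_ctx = np.array([0, 0, 1, 1])
X_qry = np.array([[0.0, 0.5], [55.0, 55.0]])
probs = in_context_predict(X_ctx, y_ctx, X_qry, n_classes=2)
pred = probs.argmax(axis=1)
```

A real PFN replaces the fixed distance kernel with learned transformer layers, so the mapping from context to prediction is itself the product of meta-training on the synthetic prior.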
The central contribution is the domain-specific prior. Where generic tabular ICL models draw tasks from generic SCMs, cfRNA-ICL's SCM encodes empirical cfRNA behavior, giving the model inductive biases matched to real plasma data. The authors report that this yields consistently stronger cancer classification than tabular ICL models trained on generic synthetic data, with the largest gains in few-shot settings.
cfRNA-ICL is a transformer-based prior-data fitted network in the lineage of TabPFN. Models in this family use self-attention among context (training) samples and cross-attention from query (test) samples to those context samples to approximate Bayesian inference over the synthetic prior. The defining departure from prior work is the generative prior itself: rather than sampling from generic structural causal models, the authors construct a cfRNA-specific SCM calibrated to empirical measurements of dropout, overdispersion, tissue-mixture-driven latent factors, compositional variability, and sequencing noise, producing a synthetic task distribution whose geometry mirrors real cfRNA. The model is evaluated on multiple cfRNA cancer classification benchmarks against tabular ICL models trained on generic synthetic data, with reported gains most pronounced in few-shot scenarios. The preprint is a single-version bioRxiv release (CC BY-NC-ND); specific architecture sizes, hyperparameters, and per-benchmark metrics should be confirmed against the full text.
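To make the idea of a cfRNA-shaped synthetic prior concrete, the following is a minimal sketch of a task sampler that combines the ingredients listed above: tissue-mixture latent factors, compositional normalization, library-size variation, overdispersed counts, and gene-specific dropout. It is not the authors' SCM. The function name `sample_cfrna_task`, every distribution choice, and every parameter value (for example the dispersion of 0.5 and the Beta(2, 2) detectability) are assumptions made purely for illustration.

```python
import numpy as np


def sample_cfrna_task(n_samples=64, n_genes=200, n_tissues=5, seed=0):
    """Sample one synthetic binary classification task with cfRNA-like counts.

    All distributions and constants are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Binary label, e.g. cancer vs. non-cancer.
    y = rng.integers(0, 2, size=n_samples)

    # Heavy-tailed tissue expression signatures (tissues x genes).
    signatures = rng.lognormal(mean=0.0, sigma=1.5, size=(n_tissues, n_genes))

    # Latent tissue mixture per sample: a Dirichlet draw built from
    # normalized gammas. The label shifts weight toward tissue 0.
    alpha = np.ones((n_samples, n_tissues))
    alpha[:, 0] += 2.0 * y
    g = rng.gamma(shape=alpha)
    mix = g / g.sum(axis=1, keepdims=True)

    # Compositional relative abundance: each row sums to 1.
    rel = mix @ signatures
    rel /= rel.sum(axis=1, keepdims=True)

    # Widely varying library sizes.
    lib = rng.lognormal(mean=np.log(2e4), sigma=0.8, size=(n_samples, 1))
    mu = rel * lib

    # Overdispersed counts via a gamma-Poisson (negative binomial) mixture,
    # with variance mu + dispersion * mu**2.
    dispersion = 0.5
    lam = rng.gamma(shape=1.0 / dispersion, scale=mu * dispersion)
    counts = rng.poisson(lam)

    # Gene-specific detectability: each gene has its own dropout rate.
    detect = rng.beta(2.0, 2.0, size=(1, n_genes))
    keep = rng.random((n_samples, n_genes)) < detect
    return counts * keep, y


# Usage: draw one task and inspect its sparsity.
counts, y = sample_cfrna_task(seed=0)
zero_fraction = float((counts == 0).mean())
```

In a PFN-style setup, many such tasks would be drawn with randomized generator settings, and each would be split into context and query samples for meta-training.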
cfRNA-ICL targets liquid-biopsy workflows where plasma cfRNA is profiled for cancer detection and classification. Because it learns in-context from a few labeled examples, it is well suited to settings with limited annotated samples, such as emerging assays, rare cancer types, or new cohorts where assembling large training sets is impractical. Beyond classification, its unsupervised representations could support exploratory analysis of cfRNA structure across oncology and other plasma-based applications, and the SCM-prior framework offers a template for building cfRNA models without large labeled datasets.
cfRNA-ICL demonstrates that tailoring the synthetic prior of an in-context learning model to the statistics of a difficult biological data type can outperform generic tabular foundation models on that domain. It extends the TabPFN/PFN paradigm into liquid biopsy and articulates a practical route toward foundation-scale cfRNA models that are intrinsically adapted to plasma cfRNA rather than retrofitted from general-purpose architectures. As a recent single-version preprint from an industry group, its benchmarks await independent validation and peer review, and code and pretrained weights were not located in public repositories at the time of writing; nonetheless it contributes a concrete strategy for the persistent challenge of label-scarce cfRNA modeling.