Self-supervised foundation model that learns reusable representations of cancer genomes from somatic SNVs and copy-number alterations across 33 tumor types.
TESSERA ("Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations") is a self-supervised foundation model that learns reusable representations of the cancer genome directly from somatic mutation profiles. Rather than building a separate predictor for each clinical question, TESSERA encodes a tumor's somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) into dense embeddings that transfer across many downstream tasks, mirroring the foundation-model paradigm that has reshaped protein and single-cell biology.
The model was developed by J.-W. Sidhom, A. S. Baras, O. Elemento, and M. A. Shah at Weill Cornell Medicine and released as a bioRxiv preprint in June 2026. It addresses a persistent gap in computational oncology: most genomic classifiers are trained end-to-end on a single label, making them brittle and hard to reuse. By pretraining on the somatic alteration landscape itself, TESSERA produces a general-purpose feature space that supports tumor classification, molecular subtyping, prognosis, and treatment-effect estimation from one set of learned representations.
A notable result is the model's interpretability: the authors derive a compact three-feature rule for colorectal cancer treatment selection based on TP53, KRAS, and chromosome 17p alterations, illustrating that the learned embeddings can yield clinically legible biomarkers rather than opaque scores.
tessera-foundation PyPI package with weights auto-downloaded via load_pretrained(), plus precomputed TCGA embeddings on Zenodo for ~1.9M SNVs and ~1.8M CNA segments.
TESSERA is pretrained on the TCGA Pan-Cancer Atlas, spanning more than 10,000 patients, over 3 million somatic variants, and 33 cancer types. The architecture uses custom attention, masking, and multiple-instance-learning layers to aggregate variable numbers of per-variant and per-segment alterations into a fixed sample-level representation; the joint model concatenates mean- and max-pooled features across modalities into a 3,714-dimensional sample embedding. Training is fully self-supervised and requires no clinical labels; a cross-modal InfoNCE objective aligns the SNV and CNA embedding spaces. The pretrained weights are approximately 185 MB and are hosted on the Hugging Face Hub (CC-BY-NC-4.0), while the GRCh37 reference is fetched automatically on first SNV inference. Code is released under the PolyForm Noncommercial License 1.0.0.
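The pooling and alignment ideas described above can be illustrated with a minimal, standard-library sketch. This is not the TESSERA implementation: the function names (pool_instances, sample_embedding, info_nce), the temperature value, and the assumption that both modalities have already been projected to a shared dimension before the contrastive loss are all hypothetical choices made for illustration. The attention and masking layers are omitted entirely.

```python
import math


def pool_instances(instances):
    """Mean- and max-pool a variable-length list of equal-length feature
    vectors into one fixed-length vector: [mean..., max...]."""
    if not instances:
        raise ValueError("need at least one instance to pool")
    dim = len(instances[0])
    mean = [sum(v[j] for v in instances) / len(instances) for j in range(dim)]
    peak = [max(v[j] for v in instances) for j in range(dim)]
    return mean + peak


def sample_embedding(snv_instances, cna_instances):
    """Concatenate pooled SNV and pooled CNA features for one sample."""
    return pool_instances(snv_instances) + pool_instances(cna_instances)


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(snv_batch, cna_batch, temperature=0.1):
    """One-directional InfoNCE over a batch: the SNV embedding of sample i
    should be most similar to the CNA embedding of the same sample i.
    Assumes (hypothetically) that both modalities share one dimension."""
    a = [_normalize(v) for v in snv_batch]
    b = [_normalize(v) for v in cna_batch]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)
```

Because pooling reduces any number of per-variant or per-segment vectors to a fixed length, samples with very different mutation burdens end up with embeddings of the same size. The contrastive loss is small when each sample's two modality embeddings agree with each other more than with those of other samples in the batch.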
TESSERA is aimed at cancer genomics researchers and computational oncologists who need flexible, reusable representations of tumor mutation profiles. Because embeddings transfer across tasks, the same model can power tumor-of-origin classification for cancers of unknown primary, unsupervised discovery of molecular subtypes, survival and prognostic modeling, and counterfactual estimation of treatment benefit. Its demonstrated portability to panel sequencing and cell-line data suggests utility in translational pipelines where whole-exome data are unavailable.
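As one illustration of reusing fixed embeddings for a downstream task, the sketch below fits a nearest-centroid classifier on labeled sample embeddings and assigns a new sample to the closest tumor type by cosine similarity. It is a generic baseline, not the authors' method; the helper names (fit_centroids, cosine, predict) are hypothetical, and the two-dimensional vectors are made-up toy values standing in for real embeddings.

```python
import math


def fit_centroids(embeddings, labels):
    """Average the embeddings of each label into one centroid per label."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(centroids, vec):
    """Return the label whose centroid is most similar to vec."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy stand-ins for precomputed embeddings (values are invented).
train_vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
train_labels = ["COAD", "COAD", "BRCA", "BRCA"]
centroids = fit_centroids(train_vectors, train_labels)
predicted = predict(centroids, [0.9, 0.3])
```

The same pattern, with the classifier swapped for a clustering or survival model, applies to the other tasks listed above, since only the component consuming the embeddings changes.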
By extending the self-supervised foundation-model approach from sequence and structure to the somatic alteration landscape, TESSERA offers a unified representation layer for cancer genome analysis that could reduce duplicated, label-hungry model development across oncology. The accompanying open package, pretrained weights, and precomputed TCGA feature sets lower the barrier to reuse, while the interpretable colorectal cancer decision rule shows the embeddings can surface actionable biomarkers. As a recent preprint, its real-world adoption and independent validation remain to be established, and its noncommercial licensing constrains downstream clinical deployment.