TESSERA

Self-supervised foundation model that embeds cancer genomes from somatic SNVs and copy-number alterations across 33 tumor types for tumor subtyping.

Released: June 2026

TESSERA ("Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations") is a self-supervised foundation model that learns reusable representations of the cancer genome directly from somatic mutation profiles. Rather than building a separate predictor for each clinical question, TESSERA encodes a tumor's somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) into dense embeddings that transfer across many downstream tasks, mirroring the foundation-model paradigm that has reshaped protein and single-cell biology.

The model was developed by J.-W. Sidhom, A. S. Baras, O. Elemento, and M. A. Shah at Weill Cornell Medicine and released as a bioRxiv preprint in June 2026. It addresses a persistent gap in computational oncology: most genomic classifiers are trained end-to-end on a single label, making them brittle and hard to reuse. By pretraining on the somatic alteration landscape itself, TESSERA produces a general-purpose feature space that supports tumor classification, molecular subtyping, prognosis, and treatment-effect estimation from one set of learned representations.

A notable result is the model's interpretability: the authors derive a compact three-feature rule for colorectal cancer treatment selection based on TP53, KRAS, and chromosome 17p alterations, illustrating that the learned embeddings can yield clinically legible biomarkers rather than opaque scores.

Key Features

Joint SNV + CNA encoding: Separate encoders produce 1,169-dimensional embeddings per somatic SNV and 688-dimensional embeddings per copy-number segment, capturing both mutational and structural genome alterations.
InfoNCE-aligned multimodal training: The SNV and CNA representations are aligned with a contrastive InfoNCE objective, yielding a shared embedding space across the two alteration modalities.
Reusable representations: A single pretrained model supports variant pathogenicity prediction, pan-cancer tumor-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation without retraining the encoders for each task.
Zero-shot and transfer learning: Embeddings can be applied off-the-shelf or used as features for lightweight downstream models, including cross-platform validation on targeted panel sequencing and cell lines.
Open tooling and precomputed features: Distributed as the tessera-foundation PyPI package with weights auto-downloaded via load_pretrained(), plus precomputed TCGA embeddings on Zenodo for ~1.9M SNVs and ~1.8M CNA segments.
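
The contrastive alignment named above can be illustrated with a minimal NumPy sketch of a symmetric InfoNCE loss. This is not the authors' implementation: the function name info_nce, the temperature value, and the assumption that SNV and CNA features have already been projected to a common dimension are all hypothetical choices made for illustration. Row i of each input is taken to describe the same tumor, so matched pairs sit on the diagonal of the similarity matrix.

```python
import numpy as np


def info_nce(snv: np.ndarray, cna: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE loss for two paired (n, d) arrays.

    Assumption: both modalities were already projected to the same
    dimension d; row i of `snv` and row i of `cna` are a positive pair.
    """
    # L2-normalise so the dot product becomes cosine similarity.
    snv = snv / np.linalg.norm(snv, axis=1, keepdims=True)
    cna = cna / np.linalg.norm(cna, axis=1, keepdims=True)
    logits = snv @ cna.T / temperature  # (n, n); diagonal holds matched pairs

    def diagonal_cross_entropy(z: np.ndarray) -> float:
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_softmax)))

    # Average the SNV->CNA and CNA->SNV directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


# Synthetic data only: nearly identical pairs versus unrelated pairs.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
aligned = info_nce(shared, shared + 0.01 * rng.normal(size=(8, 16)))
unrelated = info_nce(shared, rng.normal(size=(8, 16)))
print(aligned < unrelated)
```

The loss is low when each tumor's SNV vector is closest to its own CNA vector and high when the pairing carries no signal, which is the property a shared embedding space relies on.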

Technical Details

TESSERA is pretrained on the TCGA Pan-Cancer Atlas, spanning more than 10,000 patients, over 3 million somatic variants, and 33 cancer types. The architecture uses custom attention, masking, and multiple-instance-learning layers to aggregate variable numbers of per-variant and per-segment alterations into a fixed sample-level representation; the joint model concatenates mean- and max-pooled features across modalities into a 3,714-dimensional sample embedding. Training is fully self-supervised, requiring no clinical labels, with the cross-modal InfoNCE objective aligning SNV and CNA spaces. The pretrained weights are approximately 185 MB and are hosted on the Hugging Face Hub (CC-BY-NC-4.0), while the GRCh37 reference is fetched automatically on first SNV inference. Code is released under the PolyForm Noncommercial License 1.0.0.
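
The 3,714-dimensional figure follows from the per-alteration sizes quoted earlier: mean and max pooling each contribute 1,169 + 688 = 1,857 values, and 2 × 1,857 = 3,714. The sketch below reproduces only that arithmetic with plain mean/max pooling on random stand-in data. It does not model the attention, masking, or multiple-instance-learning layers, and the function names, the concatenation order, and the zero fallback for a tumor with no alterations in one modality are assumptions rather than documented behavior.

```python
import numpy as np

SNV_DIM, CNA_DIM = 1169, 688  # per-alteration embedding sizes quoted in the text


def pool(instances: np.ndarray, dim: int) -> np.ndarray:
    """Mean- and max-pool an (n, dim) array into one (2 * dim,) vector."""
    if instances.shape[0] == 0:
        # Assumption: represent an empty modality with zeros.
        return np.zeros(2 * dim)
    return np.concatenate([instances.mean(axis=0), instances.max(axis=0)])


def sample_embedding(snv: np.ndarray, cna: np.ndarray) -> np.ndarray:
    """Concatenate pooled SNV and CNA features into one sample-level vector."""
    return np.concatenate([pool(snv, SNV_DIM), pool(cna, CNA_DIM)])


# Random stand-ins for one tumor with 42 SNVs and 17 copy-number segments.
rng = np.random.default_rng(0)
tumor = sample_embedding(
    rng.normal(size=(42, SNV_DIM)),
    rng.normal(size=(17, CNA_DIM)),
)
print(tumor.shape)  # (3714,)
```

Because pooling collapses the alteration axis, tumors with very different mutation counts map to vectors of the same length, which is what lets one downstream model consume every sample.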

Applications

TESSERA is aimed at cancer genomics researchers and computational oncologists who need flexible, reusable representations of tumor mutation profiles. Because embeddings transfer across tasks, the same model can power tumor-of-origin classification for cancers of unknown primary, unsupervised discovery of molecular subtypes, survival and prognostic modeling, and counterfactual estimation of treatment benefit. Its demonstrated portability to panel sequencing and cell-line data suggests utility in translational pipelines where whole-exome data are unavailable.
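
As a rough illustration of the "embeddings as features" workflow, the sketch below fits a nearest-centroid classifier on sample-level vectors. Everything here is a stand-in: the embeddings are synthetic Gaussian clusters rather than TESSERA outputs, the 32-dimensional size is chosen for brevity, the helper names are hypothetical, and nearest-centroid is simply one example of a lightweight downstream model, not a method attributed to the authors.

```python
import numpy as np


def fit_centroids(X: np.ndarray, y: np.ndarray):
    """Return the sorted class labels and the mean embedding of each class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(X: np.ndarray, classes: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each row of X to the class with the nearest centroid."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]


# Synthetic stand-in embeddings: three tumor types, 20 samples each.
rng = np.random.default_rng(0)
dim = 32  # illustrative only; the text describes 3,714-d sample embeddings
names = ["BRCA", "COAD", "LUAD"]
labels = np.repeat(np.array(names), 20)
centers = {name: rng.normal(scale=3.0, size=dim) for name in names}
X = np.stack([centers[label] + rng.normal(size=dim) for label in labels])

# Even-indexed samples train the centroids, odd-indexed samples are held out.
classes, centroids = fit_centroids(X[::2], labels[::2])
predictions = predict(X[1::2], classes, centroids)
accuracy = float((predictions == labels[1::2]).mean())
print(accuracy)
```

The same pattern would apply to real precomputed embeddings: the pretrained encoders stay frozen and only the small model on top is fit to the task's labels.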

Impact

By extending the self-supervised foundation-model approach from sequence and structure to the somatic alteration landscape, TESSERA offers a unified representation layer for cancer genome analysis that could reduce duplicated, label-hungry model development across oncology. The accompanying open package, pretrained weights, and precomputed TCGA feature sets lower the barrier to reuse, while the interpretable colorectal cancer decision rule shows the embeddings can surface actionable biomarkers. As a recent preprint, its real-world adoption and independent validation remain to be established, and its noncommercial licensing constrains downstream clinical deployment.

Citation

Sidhom, J., et al. (2026). A Foundation Model for the Cancer Genome. bioRxiv.

DOI: 10.64898/2026.05.27.728319

Citations

Total citations: 0

Influential: 0

References: 0

GitHub

Stars: 5

Forks: 1

Open issues: 0

Contributors: 1

Last push: 28d ago

Language: Python

Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 28 (Closed)

Usability (can I run it?): 17

Reproducibility (can I retrain it?): 31

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset
