bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
DNA & Gene

TESSERA

Weill Cornell Medicine

Self-supervised foundation model that learns reusable representations of cancer genomes from somatic SNVs and copy-number alterations across 33 tumor types.

Released: June 2026

TESSERA ("Tumour Embeddings via Self-Supervised Encoding and Reconstruction of Alterations") is a self-supervised foundation model that learns reusable representations of the cancer genome directly from somatic mutation profiles. Rather than building a separate predictor for each clinical question, TESSERA encodes a tumor's somatic single-nucleotide variants (SNVs) and copy-number alterations (CNAs) into dense embeddings that transfer across many downstream tasks, mirroring the foundation-model paradigm that has reshaped protein and single-cell biology.

The model was developed by J.-W. Sidhom, A. S. Baras, O. Elemento, and M. A. Shah at Weill Cornell Medicine and released as a bioRxiv preprint in June 2026. It addresses a persistent gap in computational oncology: most genomic classifiers are trained end-to-end on a single label, making them brittle and hard to reuse. By pretraining on the somatic alteration landscape itself, TESSERA produces a general-purpose feature space that supports tumor classification, molecular subtyping, prognosis, and treatment-effect estimation from one set of learned representations.

A notable result is the model's interpretability: the authors derive a compact three-feature rule for colorectal cancer treatment selection based on TP53, KRAS, and chromosome 17p alterations, illustrating that the learned embeddings can yield clinically legible biomarkers rather than opaque scores.

#Key Features

  • Joint SNV + CNA encoding: Separate encoders produce 1,169-dimensional embeddings per somatic SNV and 688-dimensional embeddings per copy-number segment, capturing both mutational and structural genome alterations.
  • InfoNCE-aligned multimodal training: The SNV and CNA representations are aligned with a contrastive InfoNCE objective, yielding a shared embedding space across the two alteration modalities.
  • Reusable representations: A single pretrained model supports variant pathogenicity prediction, pan-cancer tumor-type classification, unsupervised molecular subtyping, prognostic stratification, and counterfactual treatment-effect estimation without task-specific retraining of the pretrained model itself.
  • Zero-shot and transfer learning: Embeddings can be applied off-the-shelf or used as features for lightweight downstream models, including cross-platform validation on targeted panel sequencing and cell lines.
  • Open tooling and precomputed features: Distributed as the tessera-foundation PyPI package with weights auto-downloaded via load_pretrained(), plus precomputed TCGA embeddings on Zenodo for ~1.9M SNVs and ~1.8M CNA segments.
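
To make the InfoNCE alignment in the list above concrete, here is a minimal standard-library sketch of a symmetric InfoNCE loss over paired SNV and CNA embeddings. It is an illustration, not TESSERA's training code: the function names, the cosine similarity, the temperature of 0.1, the symmetric averaging, and the assumption that both modalities have already been projected to a common dimension are all choices made for this example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax_at(logits, index):
    """Numerically stable log-softmax, evaluated at one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_z

def info_nce(snv_batch, cna_batch, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Row i of snv_batch and row i of cna_batch are assumed to come from the
    same tumor (the positive pair); every other row acts as a negative.
    The temperature value is an assumption chosen for illustration.
    """
    if len(snv_batch) != len(cna_batch):
        raise ValueError("batches must contain the same number of samples")
    a = [l2_normalize(v) for v in snv_batch]
    b = [l2_normalize(v) for v in cna_batch]
    n = len(a)
    # Temperature-scaled cosine similarity between every SNV/CNA pair.
    sim = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
            for j in range(n)] for i in range(n)]
    # SNV -> CNA direction: each row should select its own column.
    loss_ab = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # CNA -> SNV direction: each column should select its own row.
    loss_ba = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)

# Toy data: correctly paired embeddings versus the same embeddings mispaired.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
print(aligned < shuffled)  # True
```

The loss is small when each tumor's SNV embedding is closest to its own CNA embedding and large when the pairs are mismatched, which is the pressure that pulls the two modalities into a shared space.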

#Technical Details

TESSERA is pretrained on the TCGA Pan-Cancer Atlas, spanning more than 10,000 patients, over 3 million somatic variants, and 33 cancer types. The architecture uses custom attention, masking, and multiple-instance-learning layers to aggregate variable numbers of per-variant and per-segment alterations into a fixed sample-level representation; the joint model concatenates mean- and max-pooled features across modalities into a 3,714-dimensional sample embedding. Training is fully self-supervised, requiring no clinical labels, with the cross-modal InfoNCE objective aligning SNV and CNA spaces. The pretrained weights are approximately 185 MB and are hosted on the Hugging Face Hub (CC-BY-NC-4.0), while the GRCh37 reference is fetched automatically on first SNV inference. Code is released under the PolyForm Noncommercial License 1.0.0.
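
The 3,714-dimensional figure is consistent with concatenating mean- and max-pooled vectors for both modalities: 2 × (1,169 + 688) = 3,714. The sketch below shows only that dimension arithmetic with plain pooling. The function names, the concatenation order, the zero vector for an empty modality, and the omission of the attention and multiple-instance-learning layers are assumptions for illustration, not the released implementation.

```python
SNV_DIM = 1169  # per-SNV embedding size stated in the source
CNA_DIM = 688   # per-segment embedding size stated in the source

def mean_max_pool(instances, dim):
    """Pool a variable-length bag of dim-sized vectors into one 2*dim vector."""
    if not instances:
        # Assumption: a tumor with no alterations of this type maps to zeros.
        return [0.0] * (2 * dim)
    n = len(instances)
    mean = [sum(v[k] for v in instances) / n for k in range(dim)]
    peak = [max(v[k] for v in instances) for k in range(dim)]
    return mean + peak

def sample_embedding(snv_embs, cna_embs):
    """Concatenate pooled SNV and CNA features (order is an assumption)."""
    return mean_max_pool(snv_embs, SNV_DIM) + mean_max_pool(cna_embs, CNA_DIM)

# Toy tumor with 3 SNVs and 2 copy-number segments (values are placeholders).
snvs = [[float(i)] * SNV_DIM for i in range(3)]
cnas = [[float(i)] * CNA_DIM for i in range(2)]
emb = sample_embedding(snvs, cnas)
print(len(emb))  # 3714
```

Because pooling collapses the instance axis, tumors with very different mutation counts all map to vectors of the same length.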

#Applications

TESSERA is aimed at cancer genomics researchers and computational oncologists who need flexible, reusable representations of tumor mutation profiles. Because embeddings transfer across tasks, the same model can power tumor-of-origin classification for cancers of unknown primary, unsupervised discovery of molecular subtypes, survival and prognostic modeling, and counterfactual estimation of treatment benefit. Its demonstrated portability to panel sequencing and cell-line data suggests utility in translational pipelines where whole-exome data are unavailable.
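
As one illustration of a lightweight downstream model built on fixed embeddings, the sketch below fits a nearest-centroid tumor-type classifier. It does not call the tessera-foundation package: the two-dimensional vectors are made-up stand-ins for real sample embeddings, and the helper names and the choice of nearest-centroid classification are assumptions for this example rather than the authors' evaluation protocol.

```python
import math
from collections import defaultdict

def fit_centroids(embeddings, labels):
    """Average the embeddings of each class into one centroid per label."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))

# Placeholder 2-D vectors standing in for precomputed sample embeddings.
train_x = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, 0.0]]
train_y = ["COAD", "COAD", "BRCA", "BRCA"]  # TCGA study abbreviations

centroids = fit_centroids(train_x, train_y)
prediction = predict(centroids, [0.7, 0.0])
print(prediction)  # COAD
```

Only the centroids are fitted here, so the pretrained encoder stays frozen. The same pattern applies to any simple classifier or survival model trained on the embeddings.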

#Impact

By extending the self-supervised foundation-model approach from sequence and structure to the somatic alteration landscape, TESSERA offers a unified representation layer for cancer genome analysis that could reduce duplicated, label-hungry model development across oncology. The accompanying open package, pretrained weights, and precomputed TCGA feature sets lower the barrier to reuse, while the interpretable colorectal cancer decision rule shows the embeddings can surface actionable biomarkers. As a recent preprint, its real-world adoption and independent validation remain to be established, and its noncommercial licensing constrains downstream clinical deployment.

Citation

DOI: 10.64898/2026.05.27.728319


Openness

Unclassified
Restrictive license on core components

Tags

cancer_genomics, cell_type_annotation, contrastive_learning, foundation_model, multiple_instance_learning, representation_learning, self_supervised, transformer, variant_effect_prediction

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset