Protein

DecoderTCR

Chan Zuckerberg Initiative

A masked language model for T-cell receptor and peptide-MHC interaction prediction using compositional pretraining and entropy-guided non-autoregressive decoding.

Released: 2026

Overview

DecoderTCR is a masked language model for predicting interactions between T-cell receptors (TCRs) and peptide-MHC (pMHC) complexes, developed by Boqiao Lai, Melissa Englund, Ramit Bharanikumar, Isabel Nocedal, Ali Davariashtiyani, Jason Perera, and Aly A. Khan at the Chan Zuckerberg Biohub and the University of Chicago. The preprint was posted to bioRxiv in February 2026. DecoderTCR is part of CZI's Virtual Cells Platform and was supported by an NIH DP2 grant (DP2AI177884).

The central challenge in computational TCR immunology is data scarcity. TCRs are generated by V(D)J recombination, a combinatorial assembly process that creates an astronomically diverse repertoire of receptor sequences. Functionally, a TCR's biological identity is defined by its specific recognition of a peptide antigen presented by an MHC molecule — a pairing event that is rare, context-dependent, and extremely difficult to measure systematically. Databases of experimentally validated TCR-pMHC binding pairs are orders of magnitude smaller than databases of uncharacterized TCR sequences. This disparity creates a fundamental obstacle for supervised learning: the most important information to model is the least abundant.

DecoderTCR addresses this imbalance through two complementary innovations. First, a compositional continual pretraining curriculum trains the model on abundant unpaired sequence data before transferring to scarce paired interaction data, allowing it to learn robust representations of individual receptor components before attempting to model their joint recognition. Second, an Iterative Entropy-Guided Refinement (IEGR) decoding algorithm generates TCR sequences non-autoregressively by resolving high-confidence positions first and using them as context for uncertain positions, improving the biological plausibility of generated sequences.

Key Features

  • Compositional continual pretraining: The training curriculum proceeds in two phases. In the first phase, the model learns representations of individual TCR chains (alpha and beta) and pMHC components (peptide and MHC molecule) independently using the much larger corpus of unpaired sequence data. In the second phase, the model refines its understanding of cross-component dependencies using the smaller paired TCR-pMHC interaction dataset. This curriculum efficiently leverages data across both abundance regimes.
  • Iterative Entropy-Guided Refinement (IEGR): Rather than generating sequences one token at a time from left to right, IEGR decodes non-autoregressively by first identifying the positions with the lowest predictive entropy (those the model is most certain about) and committing tokens there. These resolved positions then provide context for iteratively resolving progressively more uncertain positions, producing sequences with realistic recombination statistics; a minimal decoding sketch follows this list.
  • Structural contact recovery without coordinate supervision: Representations learned by DecoderTCR recover known structural contacts between CDR3 loops and peptide residues without any explicit supervision on 3D coordinates, suggesting the model internalizes physically meaningful geometry from sequence data alone.
  • Zero-shot pMHC binding discrimination: DecoderTCR achieves strong zero-shot performance on held-out binding prediction tasks, making it useful even in the absence of epitope-specific fine-tuning data — a common situation for novel antigens or rare pathogen peptides.
  • Paired generation of TCR sequences: IEGR supports generation of paired TCR alpha-beta sequences conditioned on a target pMHC, enabling computational design of candidate TCRs for adoptive cell therapy without requiring exhaustive experimental screening.
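
The loop behind IEGR can be made concrete with a short sketch. The code below is a minimal illustration, not DecoderTCR's released implementation: predict_logits is a hypothetical stand-in for the model's forward pass (random logits here, purely so the example runs), and the per-iteration commit budget is an illustrative choice rather than the paper's schedule.

```python
# Minimal sketch of Iterative Entropy-Guided Refinement (IEGR):
# resolve the lowest-entropy masked positions first, then re-run the
# model with them as context for the remaining uncertain positions.
import numpy as np

VOCAB = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
MASK = "?"                            # placeholder mask token

def predict_logits(seq, rng):
    """Hypothetical stand-in for DecoderTCR's forward pass: one logit
    vector per position. Random values here, for illustration only."""
    return rng.normal(size=(len(seq), len(VOCAB)))

def iegr_decode(length, steps=4, seed=0):
    """Fill a fully masked sequence, committing the most confident
    (lowest-entropy) positions first on each pass."""
    rng = np.random.default_rng(seed)
    seq = [MASK] * length
    per_step = max(1, length // steps)  # illustrative commit budget
    while MASK in seq:
        logits = predict_logits(seq, rng)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        entropy = -(probs * np.log(probs)).sum(-1)
        # already-resolved positions are excluded from selection
        entropy[[i for i, t in enumerate(seq) if t != MASK]] = np.inf
        for i in np.argsort(entropy)[:per_step]:
            if seq[i] == MASK:
                seq[i] = VOCAB[int(np.argmax(probs[i]))]
    return "".join(seq)

print(iegr_decode(length=15))  # e.g. a CDR3-length candidate sequence
```

In a real decoder, conditioning on a target pMHC would enter through the input context rather than the unconditional toy setup shown here.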

Technical Details

DecoderTCR is a masked language model with a transformer encoder-decoder architecture trained using a masked token prediction objective over concatenated TCR chain and pMHC sequence representations. The compositional pretraining draws on large databases of unpaired TCR sequences (such as those deposited in immune repertoire databases including VDJdb, McPAS-TCR, and TCR3d) as well as pMHC binding datasets (such as IEDB). The continual pretraining strategy mitigates catastrophic forgetting by carefully controlling the learning rate schedule during the transition from component-level to interaction-level training.
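
As a rough sketch of how such a two-phase curriculum can be staged, the code below trains first on abundant unpaired component data, then continues on scarce paired data from a reduced, decaying peak learning rate. Every name here (masked_lm_step, the toy datasets) and every schedule constant is a hypothetical stand-in; the paper's actual optimizer settings are not reproduced.

```python
# Sketch of compositional continual pretraining with a softened
# phase transition, assuming a generic masked-LM training step.
import math

def masked_lm_step(model, batch, lr):
    """Placeholder for one masked-token-prediction update. A real
    implementation would mask residues in `batch` and take a gradient
    step on the cross-entropy at the masked positions."""
    model["updates"] += 1  # stand-in for an optimizer step
    model["last_lr"] = lr

def compositional_pretrain(model, unpaired_batches, paired_batches,
                           phase1_lr=1e-4, phase2_peak_lr=1e-5,
                           phase2_steps=1000):
    # Phase 1: component-level MLM on abundant unpaired TCR and pMHC
    # sequences, learning each part's marginal statistics in isolation.
    for batch in unpaired_batches:
        masked_lm_step(model, batch, lr=phase1_lr)
    # Phase 2: interaction-level MLM on scarce paired TCR-pMHC data.
    # A reduced peak LR with cosine decay is one plausible way to
    # soften the transition and limit forgetting of phase-1 features.
    for step, batch in enumerate(paired_batches):
        frac = min(step, phase2_steps) / phase2_steps
        lr = phase2_peak_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
        masked_lm_step(model, batch, lr=lr)

model = {"updates": 0, "last_lr": None}
compositional_pretrain(model,
                       unpaired_batches=[["CASSLGQAYEQYF"]] * 4,
                       paired_batches=[("CASSLGQAYEQYF", "GILGFVFTL")] * 4)
print(model["updates"], model["last_lr"])
```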

On held-out benchmarks, DecoderTCR achieves an AUROC of 0.96 for zero-shot pMHC binding prediction, approaching the performance of strongly supervised models without any epitope-specific training. For epitope-specific TCR recognition — identifying which TCRs from a pool will bind a given peptide antigen — the model achieves an AUROC of 0.76, competitive with supervised baselines trained on labeled pairs. The paper identifies an important prediction-generation gap: DecoderTCR's ability to discriminate binding from non-binding sequences is substantially stronger than its ability to generate novel binders on demand, pointing to an ongoing fundamental challenge in conditional sequence generation for immune receptors. The model is hosted on CZI's Virtual Cells Platform as version 0.1.
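
One standard way to turn a masked language model into a zero-shot binding discriminator is masked pseudo-log-likelihood scoring of the concatenated TCR-peptide input. Whether DecoderTCR's reported AUROC uses exactly this rule is an assumption of the sketch below, which also substitutes random logits for the real forward pass.

```python
# Sketch of zero-shot binding discrimination via masked
# pseudo-log-likelihood, one common scoring rule for masked LMs.
import numpy as np

VOCAB = list("ACDEFGHIKLMNPQRSTVWY")  # standard amino acids

def masked_logprobs(seq, pos, rng):
    """Placeholder for masking position `pos` of the concatenated
    TCR+peptide input and returning the model's log-probabilities
    there. Random logits stand in for the real forward pass."""
    logits = rng.normal(size=len(VOCAB))
    return logits - np.log(np.exp(logits).sum())  # log-softmax

def pseudo_log_likelihood(tcr_cdr3, peptide, seed=0):
    """Sum of per-residue log-probs, each computed with that residue
    masked. Higher totals indicate a pairing the model finds more
    plausible, which can be ranked or thresholded for discrimination."""
    rng = np.random.default_rng(seed)
    seq = tcr_cdr3 + peptide
    return sum(masked_logprobs(seq, i, rng)[VOCAB.index(seq[i])]
               for i in range(len(seq)))

# Illustrative comparison: a CDR3 scored against a known flu epitope
# and against a shuffled control peptide. With random logits the gap
# is meaningless; a trained model should score true binders higher.
print(pseudo_log_likelihood("CASSIRSSYEQYF", "GILGFVFTL"))
print(pseudo_log_likelihood("CASSIRSSYEQYF", "LFIGVLFGT"))
```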

Applications

DecoderTCR is suited for researchers in cancer immunotherapy, infectious disease, and autoimmunity who need to analyze TCR repertoires or design TCR sequences with defined specificity. In cancer immunotherapy, the model can prioritize candidate TCRs from tumor-infiltrating lymphocyte repertoires for adoptive cell therapy by ranking sequences by predicted pMHC binding affinity. For infectious disease research, zero-shot pMHC binding prediction enables rapid screening of immunodominant epitopes from novel pathogens without requiring prior experimental binding data. The structural contact recovery capability is useful for understanding the molecular basis of TCR-pMHC recognition and for guiding mutagenesis experiments to improve TCR affinity or specificity. Computational immunologists studying V(D)J recombination can also use IEGR-generated sequences as in silico surrogates for repertoire simulation studies.

Impact

DecoderTCR contributes two methodological advances to the TCR modeling field: a curriculum learning strategy that explicitly stages pretraining from marginal to joint sequence distributions, and a non-autoregressive decoding algorithm tailored to the combinatorial structure of immune receptor sequences. The compositional pretraining approach is generalizable beyond TCRs to any paired molecular interaction problem where co-complex data is scarce relative to unpaired component data — including antibody-antigen interactions and receptor-ligand pairs. The demonstrated zero-shot binding discrimination performance substantially exceeds naive baselines and narrows the gap to supervised methods, making the model immediately useful for researchers without access to large labeled training sets. As a v0.1 release with an active development roadmap through the Virtual Cells Platform, further improvements to generation quality and extensions to B-cell receptor modeling are anticipated.

Tags

TCR-pMHC binding prediction · immune repertoire analysis · transformer · self-supervised · contrastive learning · foundation model · T-cell receptor · immunology

Resources

Research Paper
Official Website