Harvard University / University of Copenhagen / Novo Nordisk / Microsoft Research / Technical University of Denmark
Transformer for few-shot protein fitness prediction that combines in-context learning with test-time training to adapt to new proteins and assays.
PRIMO (PRotein In-context Mutation Oracle) is a transformer-based framework for few-shot protein fitness prediction. Protein engineers frequently need to rank variants of a target protein after measuring only a handful of examples—often no more than a single 96-well plate—yet most supervised fitness models require hundreds of labeled observations, including a separate validation set to prevent overfitting. PRIMO addresses this gap by combining in-context learning (ICL) with test-time training (TTT), allowing it to adapt rapidly to a new protein or assay without large task-specific datasets and without a dedicated validation split.
The model was introduced in December 2025 by Felix Teufel, Aaron Kollasch, Yining Huang, Ole Winther, Kevin Yang, Pascal Notin, and Debora Marks, a team spanning Harvard Medical School, the University of Copenhagen, Novo Nordisk, Microsoft Research (Cambridge, MA), and the Technical University of Denmark. It builds on the ProteinGym / Tranception / ProteinNPT lineage from the Marks and Notin groups and was published at the AI for Science Workshop at NeurIPS 2025.
PRIMO's central idea is to pre-train a single model across many deep mutational scanning (DMS) assays so that it learns to extract fitness signal from labeled context sets, then adapt that model to each new task at inference time through test-time training. Unlike prior set-based methods such as ProteinNPT or Metalic, PRIMO handles both substitution and insertion/deletion (indel) variants, which broadens its applicability across protein engineering tasks.
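One way to picture test-time training without a validation set is to resample the few labeled variants into context/target episodes, so the model practices predicting hidden labels from the remaining ones. The sketch below only builds such episodes; the splitting scheme, the function name `sample_ttt_episodes`, the `target_frac` parameter, and the example sequences are all assumptions for illustration, not details taken from the PRIMO paper or code.

```python
import random

def sample_ttt_episodes(labeled, n_episodes, target_frac=0.25, seed=0):
    """Split a small labeled set into (context, targets) pairs.

    `labeled` is a list of (sequence, fitness) tuples. In each episode a
    random subset plays the role of targets (labels hidden from the model)
    and the remaining variants form the labeled context.
    Hypothetical scheme: the source does not describe PRIMO's TTT sampling.
    """
    if len(labeled) < 2:
        raise ValueError("need at least two labeled variants")
    rng = random.Random(seed)
    n_target = max(1, round(len(labeled) * target_frac))
    n_target = min(n_target, len(labeled) - 1)  # always keep some context
    episodes = []
    for _ in range(n_episodes):
        shuffled = labeled[:]          # copy so the caller's list is untouched
        rng.shuffle(shuffled)
        targets, context = shuffled[:n_target], shuffled[n_target:]
        episodes.append((context, targets))
    return episodes

# Made-up 8-shot measurement set (sequences and scores are placeholders).
shots = [
    ("MKTAYIAK", 0.90), ("MKTAYIAR", 0.40), ("MKTGYIAK", 0.75), ("MKTAYLAK", 0.10),
    ("MRTAYIAK", 0.55), ("MKTAYIAKG", 0.30), ("MKAYIAK", 0.20), ("MKTAWIAK", 0.65),
]
episodes = sample_ttt_episodes(shots, n_episodes=4)
```

Each episode reuses all eight measurements, so no variants have to be set aside permanently for validation.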
PRIMO is a masked language model with 6 PRIMO layers, a hidden size of 400, 8 attention heads, and a feedforward factor of 4. Amino acid sequences are embedded with the ESM-2 650M protein language model, and autoregressive zero-shot scores come from ProGen2-medium; both pretrained models stay frozen during training. Each PRIMO layer combines per-sequence self-attention with an attention-pooling step (3 pooled vectors per sequence) and pooled cross-sequence attention, keeping the dominant cost at O(NL²) for a set of N sequences of length L, rather than the O(N²L²) of full sequence-of-sequences attention. The model uses rotary positional embeddings, pre-LayerNorm, and skip connections.
Pre-training draws 150,000 sets of size N=32 (sequences cropped to 512 residues) from 116 ProteinGym DMS assays spanning stability, enzymatic activity, abundance, fluorescence, and binding, and runs on a single RTX 6000 GPU.
On a sequence-identity-controlled held-out split, PRIMO with TTT improves from an average Spearman correlation of 0.51 at zero shots to 0.67 at 128 shots. It outperforms Gaussian process, ridge regression, and random forest baselines at every shot count tested, and it beats Metalic on a clean split. On a new "natural evolution" benchmark (chorismate mutase, Rubisco, PPAT), PRIMO with TTT reaches a Spearman correlation of 0.30 at 32 shots, versus roughly 0.24 for the baselines.
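To make the O(NL²) versus O(N²L²) comparison concrete, the sketch below counts pairwise attention scores per layer for the pre-training setting quoted above (N=32 sequences, 512 residues, 3 pooled vectors). It is a rough back-of-the-envelope count that ignores heads and hidden size; the breakdown into three terms, and in particular the assumption that each pooled vector attends over all residues of its sequence, is an illustrative reading of the layer description and not the authors' cost model.

```python
def attention_score_counts(n_seqs, seq_len, n_pooled=3):
    """Count pairwise attention scores per layer (heads and hidden size ignored)."""
    per_sequence = n_seqs * seq_len ** 2        # self-attention inside each sequence
    pooling = n_seqs * n_pooled * seq_len       # assumed: pooled queries attend over residues
    cross_sequence = (n_seqs * n_pooled) ** 2   # pooled vectors attend across the set
    full = (n_seqs * seq_len) ** 2              # every residue attends to every residue
    return {
        "pooled_design": per_sequence + pooling + cross_sequence,
        "full": full,
    }

counts = attention_score_counts(n_seqs=32, seq_len=512)
ratio = counts["full"] / counts["pooled_design"]
print(counts["pooled_design"], counts["full"], round(ratio, 1))
```

Under these assumptions the per-sequence term dominates, so the saving over full attention is close to a factor of N.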
PRIMO targets protein engineering campaigns where labeled fitness data is scarce and expensive to generate. After measuring a small number of variants—for properties such as thermostability, enzymatic activity, binding affinity, or fluorescence—researchers can use PRIMO to prioritize promising candidates for the next experimental round, including designs involving insertions and deletions that many variant-effect models cannot score. Its ability to operate without a validation set makes it suitable for directed-evolution and machine-learning-guided design workflows constrained to a single plate of measurements.
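The prioritization step described above amounts to ranking unmeasured candidates by predicted fitness and filling the next plate from the top. The following is a minimal sketch of that selection logic only; `pick_next_round`, the candidate sequences, and the scores are hypothetical stand-ins, and the predictions would come from whatever model is in use.

```python
def pick_next_round(candidates, predicted_scores, measured, plate_size=96):
    """Return the top-scoring unmeasured candidates, best first."""
    if len(candidates) != len(predicted_scores):
        raise ValueError("one predicted score is required per candidate")
    measured = set(measured)
    ranked = sorted(
        (pair for pair in zip(candidates, predicted_scores) if pair[0] not in measured),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [seq for seq, _ in ranked[:plate_size]]

# Made-up candidates: substitutions plus one insertion and one deletion.
candidates = ["MKTAYIAK", "MKTAYIAR", "MKTGAYIAK", "MKAYIAK", "MKTAYLAK"]
predicted = [0.10, 0.55, 0.80, 0.35, 0.70]   # placeholder model outputs
already_measured = ["MKTAYIAK", "MKTAYLAK"]

next_round = pick_next_round(candidates, predicted, already_measured, plate_size=2)
print(next_round)  # ['MKTGAYIAK', 'MKTAYIAR']
```

Because the ranking only needs a score per sequence, variants of different lengths (indels) can be mixed freely with substitutions in the same candidate pool.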
PRIMO demonstrates that pre-training across diverse deep mutational scans, followed by efficient test-time adaptation, can deliver state-of-the-art few-shot fitness prediction while supporting both substitutions and indels. Equally notable is the paper's methodological critique: by exposing how sequence-identity overlap between train and test partitions inflates reported "zero-shot" performance, it underscores the need for fit-for-purpose data splits in protein fitness benchmarking. The model is trained only on ProteinGym's 116 assays, which the authors note limits pure in-context learning. Pretrained weights are not released, and the public code targets reproduction from ProteinGym rather than turnkey inference, so adoption currently requires retraining. Even so, PRIMO offers a clear template for ICL-plus-TTT approaches as larger, more diverse fitness datasets become available.
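The leakage concern can be illustrated with a small filter that drops training assays whose wild-type sequence is too similar to any held-out protein. This is a sketch under stated assumptions: `difflib.SequenceMatcher` is used as a crude stand-in for alignment-based sequence identity, the 0.3 threshold is arbitrary, and the assay names and sequences are invented; none of these reflect the split procedure actually used in the paper.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude similarity proxy in [0, 1]; NOT an alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def filter_training_assays(train, test, max_similarity=0.3):
    """Keep only training assays dissimilar to every held-out wild type.

    `train` and `test` map assay names to wild-type sequences.
    `max_similarity` is a hypothetical threshold chosen for illustration.
    """
    kept = {}
    for name, seq in train.items():
        if all(similarity(seq, held_out) <= max_similarity for held_out in test.values()):
            kept[name] = seq
    return kept

# Invented toy sequences: assay_A is a near-copy of the held-out protein.
test_assays = {"assay_T": "MKTAYIAKQR"}
train_assays = {"assay_A": "MKTAYIAKQK", "assay_B": "GGGSSSPPPL"}

clean_train = filter_training_assays(train_assays, test_assays)
print(sorted(clean_train))  # ['assay_B']
```

In practice a dedicated alignment or clustering tool would replace the similarity proxy, but the filtering logic stays the same: any training protein above the threshold is excluded before "zero-shot" performance is reported.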