bio.rodeo
Proust

ETH Zurich

A 309M-parameter causal protein language model that combines zero-shot fitness prediction with native sequence generation through efficient transformer design.

Released: February 2026
Parameters: 309 million

Proust is a 309-million-parameter causal protein language model designed to close a long-standing gap in the field. Masked protein language models (such as the ESM series) are strong at scoring the fitness effects of mutations, while causal (autoregressive) models are needed to generate new sequences, and each family has historically been weak at the other task. Proust is built to do both: it estimates variant fitness zero-shot while retaining the generative capability of an autoregressive model. It was introduced by Furkan Eris (ETH Zurich) in a February 2026 arXiv preprint.

The central claim of the work is that careful architectural design, rather than sheer scale, can make a causal model competitive with much larger masked models on fitness prediction. Proust reaches a Spearman correlation of 0.390 on the ProteinGym substitution benchmark, and the authors report state-of-the-art results on indel (insertion/deletion) tasks. The model nonetheless remains small and inexpensive to train relative to contemporary protein language models.

Proust is positioned alongside generative protein language models like ProtGPT2 and RITA, but with an explicit emphasis on matching the variant-effect performance usually associated with bidirectional masked models. Both code and pretrained weights have been released, so the model can be used directly for scoring and embedding extraction.

#Key Features

  • Dual capability: A single causal model performs both zero-shot fitness estimation and native autoregressive sequence generation, avoiding the usual tradeoff between masked and causal modeling objectives.
  • Efficient attention design: The GQA-S2 transformer uses grouped-query attention with key-value sharing and depthwise causal convolutions, reducing memory cost while preserving sequence modeling quality.
  • Strong indel performance: The authors report state-of-the-art results on insertion/deletion variant tasks, a regime where many protein language models struggle.
  • Compute-efficient training: The model was trained on roughly 33 billion tokens in about 40 B200 GPU-hours, modest compared to larger contemporaries.
  • Released weights and code: Pretrained checkpoints (nappenstance/proust_v0) and inference code are publicly available for log-likelihood scoring and embedding extraction.
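The training budget quoted in the list above can be turned into rough throughput figures. The sketch below uses only the reported numbers (about 33 billion tokens, about 40 B200 GPU-hours, 309 million parameters). The 6·N·D FLOP estimate is a common rule of thumb for dense transformers and is an assumption here, not a figure from the authors.

```python
# Back-of-the-envelope training throughput from the reported figures.
TOKENS = 33e9      # reported: roughly 33 billion training tokens
GPU_HOURS = 40     # reported: about 40 B200 GPU-hours
PARAMS = 309e6     # reported: 309 million parameters


def tokens_per_gpu_second(tokens: float, gpu_hours: float) -> float:
    """Average number of tokens processed per GPU-second."""
    return tokens / (gpu_hours * 3600)


def approx_training_flops(params: float, tokens: float) -> float:
    """Assumption: the 6*N*D rule of thumb for dense transformer training.

    It ignores attention and convolution costs, so treat it as an
    order-of-magnitude estimate only.
    """
    return 6 * params * tokens


throughput = tokens_per_gpu_second(TOKENS, GPU_HOURS)
tokens_per_param = TOKENS / PARAMS
total_flops = approx_training_flops(PARAMS, TOKENS)
sustained_flops_per_gpu = total_flops / (GPU_HOURS * 3600)

print(f"tokens per GPU-second:  {throughput:,.0f}")
print(f"tokens per parameter:   {tokens_per_param:.1f}")
print(f"approx. training FLOPs: {total_flops:.2e}")
print(f"approx. FLOP/s per GPU: {sustained_flops_per_gpu:.2e}")
```

Under these assumptions the run averages roughly 229,000 tokens per GPU-second and sees about 107 tokens per parameter. Both input figures are approximate in the source, so the derived values are indicative only.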

#Technical Details

Proust is a 24-layer decoder-only transformer with a hidden dimension of 1,024, 16 attention heads, and 2 key-value heads, totalling 309 million parameters. The architecture, termed GQA-S2, combines grouped-query attention with KV-sharing and rotary position information, augmented by depthwise causal convolutions and cross-layer value residuals to improve representation quality without increasing model size. It uses an ESM-style 32-token vocabulary (20 standard amino acids plus special tokens). Training consumed approximately 33 billion tokens in roughly 40 B200 GPU-hours.
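To illustrate why the small number of key-value heads matters, the sketch below compares the key-value cache size of the reported shape (16 query heads, 2 key-value heads) with a hypothetical full multi-head baseline of the same width. It assumes a head dimension of hidden_dim / n_query_heads, 2 bytes per cached value, and a sequence length of 1,024, none of which are stated in the source. It counts only the grouped-query effect and does not model the KV-sharing, convolutions, or value residuals of GQA-S2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    n_layers: int
    hidden_dim: int
    n_query_heads: int
    n_kv_heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_dim / n_query_heads.
        return self.hidden_dim // self.n_query_heads

    def kv_cache_bytes(self, seq_len: int, bytes_per_value: int = 2) -> int:
        """Bytes needed to cache keys and values for one sequence.

        Two tensors (K and V) per layer, each of shape
        seq_len x n_kv_heads x head_dim.
        """
        per_layer = 2 * seq_len * self.n_kv_heads * self.head_dim
        return self.n_layers * per_layer * bytes_per_value


# Shape reported for Proust: 24 layers, width 1,024, 16 heads, 2 KV heads.
proust_like = AttentionShape(n_layers=24, hidden_dim=1024,
                             n_query_heads=16, n_kv_heads=2)
# Hypothetical baseline: same width, one KV head per query head.
full_mha = AttentionShape(n_layers=24, hidden_dim=1024,
                          n_query_heads=16, n_kv_heads=16)

seq_len = 1024  # illustrative sequence length, not from the source
gqa_bytes = proust_like.kv_cache_bytes(seq_len)
mha_bytes = full_mha.kv_cache_bytes(seq_len)

print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB")
print(f"reduction: {mha_bytes // gqa_bytes}x")
```

With these assumptions the cache shrinks by the ratio of query heads to key-value heads, here 8x, independent of sequence length.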

On the ProteinGym substitution benchmark, Proust attains a Spearman correlation of 0.390, competitive with masked models several times larger. The authors also report state-of-the-art performance on indel tasks and strong results on the EVEREST viral fitness benchmarks. Code and weights are distributed under a PolyForm Noncommercial license, with weights downloaded automatically from Hugging Face on first use.
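The Spearman figure quoted above is a rank correlation between model scores and experimental measurements. The sketch below shows how such a value is computed for a single assay, using average ranks for ties. The scores and measurements are made-up illustrative numbers, and the way a benchmark aggregates correlations across assays is defined by the benchmark itself, not by this example.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative data only: model scores vs. measured fitness for 5 variants.
model_scores = [-1.2, -0.3, -2.5, 0.1, -0.8]
measured_fitness = [0.40, 0.90, 0.10, 0.70, 0.55]

rho = spearman(model_scores, measured_fitness)
print(f"Spearman rho = {rho:.3f}")
```

Because only ranks matter, the score scale is irrelevant: a model is rewarded for ordering variants correctly, not for predicting measured values.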

#Applications

Proust is intended for protein engineering and variant-effect workflows where both scoring and generation are useful. Because it produces zero-shot fitness estimates from sequence log-likelihoods, it can rank point mutations, insertions, and deletions without task-specific labeled data, which is valuable for prioritizing variants in directed-evolution and stability-engineering campaigns. Its autoregressive nature also allows sampling of novel candidate sequences, and its embedding interface supports downstream property prediction. The small footprint makes it practical for groups without large GPU budgets.

#Impact

Proust contributes to an ongoing line of work questioning whether large model scale is necessary for strong protein language modeling, showing that a 309M-parameter causal model can rival much larger masked models on fitness benchmarks while remaining generative. Because the preprint is recent (February 2026), broader adoption and independent validation are still emerging, and the reported benchmark numbers come from the authors themselves. The noncommercial license may limit some industrial use, but the public release of weights and inference code lowers the barrier for academic experimentation with efficient causal protein models.

Tags

variant_effect_prediction, protein_design, transformer, language_model, zero_shot, foundation_model