bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

Protein

PoET-2

OpenProtein.AI

Multimodal, retrieval-augmented protein foundation model that learns family-specific evolutionary constraints with optional structure conditioning.

Released: August 2025
Parameters: 182 Million

PoET-2 is a protein foundation model developed by OpenProtein.AI (the company founded by Tristan Bepler) and released as a preprint in August 2025. It is the successor to PoET, the original "Protein Evolutionary Transformer" that introduced retrieval-augmented, family-centric protein language modeling. Where most protein language models embed a single sequence and rely on scale to capture general evolutionary signal, PoET-2 conditions on a set of related sequences, and optionally on structure, at inference time. It learns the constraints specific to a given protein family in context rather than baking them entirely into fixed weights.

The model addresses a persistent gap in protein engineering: zero-shot and low-data prediction of how mutations affect function. By performing in-context learning over retrieved homologs, PoET-2 can adapt to a target family without retraining, and its optional structure conditioning lets it incorporate 3D information when an experimental or predicted structure is available. The result is a single model that supports both zero-shot variant effect prediction and controllable sequence generation.

Notably, PoET-2 is reported to reach performance competitive with much larger models at roughly 182 million parameters. This reinforces the argument from the PoET line of work that retrieval augmentation and family-centric conditioning can substitute for raw parameter scaling on many protein tasks.

#Key Features

  • Retrieval-augmented in-context learning: Conditions on sets of evolutionarily related sequences at inference time, capturing family-specific constraints without per-task fine-tuning.
  • Optional structure conditioning: Incorporates 3D structural information as an additional modality when available, making the model multimodal over sequence and structure.
  • Dual decoders: A causal (generative) decoder and a masked (bidirectional) decoder support both controllable sequence generation and rich representation learning from one backbone.
  • State-of-the-art zero-shot variant effects: The authors report state-of-the-art zero-shot variant effect prediction, including for multi-mutation variants and for insertion/deletion (indel) mutations that many models handle poorly.
  • Strong low-data supervision: Embeddings are reported to outperform prior methods on supervised sequence-function tasks, particularly when only small amounts of labeled data are available. The authors report a roughly 30-fold reduction in the experimental data needed for protein engineering.
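
To make the idea of family-conditioned zero-shot scoring concrete, here is a minimal sketch. It is not PoET-2's method: PoET-2 uses a learned transformer conditioned on retrieved homologs, whereas this toy swaps in a per-column amino-acid frequency profile with pseudocounts. What it shares with the real model is the scoring rule, a log-likelihood ratio between variant and wild type given the family context. The function names (`build_profile`, `variant_score`) and the toy aligned sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(homologs, pseudocount=1.0):
    """Per-position amino-acid log-probabilities estimated from aligned homologs.

    Stand-in for a learned family-conditioned model (assumption for illustration).
    """
    length = len(homologs[0])
    if any(len(seq) != length for seq in homologs):
        raise ValueError("homologs must be aligned to the same length")
    total = len(homologs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in homologs)
        profile.append({
            aa: math.log((counts.get(aa, 0) + pseudocount) / total)
            for aa in AMINO_ACIDS
        })
    return profile

def log_likelihood(sequence, profile):
    """Sum of per-position log-probabilities of the sequence under the profile."""
    return sum(profile[i][aa] for i, aa in enumerate(sequence))

def variant_score(wild_type, variant, profile):
    """Log-likelihood ratio; higher means more compatible with the family."""
    return log_likelihood(variant, profile) - log_likelihood(wild_type, profile)

# Toy, hand-written "family" of aligned homologs (hypothetical data).
homologs = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK", "MRTAYIAK"]
profile = build_profile(homologs)

wild_type = "MKTAYIAK"
conservative = "MKTAYLAK"  # I6L: leucine is observed at this position in the family
disruptive = "MKTAWIAK"    # Y5W: tryptophan is never observed at this position

score_conservative = variant_score(wild_type, conservative, profile)
score_disruptive = variant_score(wild_type, disruptive, profile)
print(f"I6L: {score_conservative:.3f}  Y5W: {score_disruptive:.3f}")
```

Because the substitution seen in the homologs is penalized less than the unseen one, the ranking follows the family's conservation pattern. A column-wise profile like this can only score substitutions on a fixed alignment; scoring indels requires a model that assigns likelihoods to whole sequences of varying length, which is where a generative decoder matters.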

#Technical Details

PoET-2 uses a hierarchical transformer encoder that is equivariant to the ordering of the sequences provided in context, paired with a dual-decoder architecture trained with both causal and masked language modeling objectives. This design lets the same model operate generatively (sampling new sequences) and bidirectionally (producing embeddings and scoring variants). Structure, when supplied, is treated as an additional input modality alongside sequence sets, and retrieval of homologous sequences provides the family-specific evolutionary context the model conditions on.

At approximately 182 million parameters, the model is small relative to many contemporary protein language models, yet it is reported to match or exceed their performance. On zero-shot variant effect prediction benchmarks the authors report state-of-the-art results, with particular gains on multi-mutant and indel variants. In supervised settings, PoET-2 embeddings improve sequence-function modeling, most markedly in the small-dataset regime. Detailed training-corpus composition and full benchmark tables are described in the arXiv preprint (2508.04724).
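
The order-equivariance property mentioned above can be illustrated with a small numerical check. The sketch below is an assumption-laden toy, not PoET-2's encoder: it applies one single-head self-attention layer across a set of per-sequence vectors with no positional encoding over the set, then confirms that permuting the input sequences permutes the outputs the same way. The function name `set_self_attention` and all shapes are hypothetical.

```python
import numpy as np

def set_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over rows of x, with no positional encoding.

    Each row stands in for one context sequence's summary vector.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_sequences, dim = 5, 8
x = rng.normal(size=(n_sequences, dim))  # random placeholder inputs
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

perm = rng.permutation(n_sequences)
out = set_self_attention(x, w_q, w_k, w_v)
out_permuted = set_self_attention(x[perm], w_q, w_k, w_v)

# Equivariance: shuffling the inputs shuffles the outputs identically.
is_equivariant = np.allclose(out_permuted, out[perm])
# Pooling over the set then gives an order-invariant family summary.
is_invariant_after_pooling = np.allclose(out_permuted.mean(axis=0), out.mean(axis=0))
print(is_equivariant, is_invariant_after_pooling)
```

The practical consequence is that the order in which retrieved homologs are listed should not change what the model learns from them, which is a sensible property for an unordered set of family members.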

#Applications

PoET-2 is aimed at protein engineering and design workflows where labeled functional data is scarce and expensive to generate. Researchers can use it for zero-shot ranking of candidate mutations (including multi-site and indel variants), for guiding directed-evolution and library-design campaigns, and for generating novel sequences within a target family under controllable constraints. Its embeddings serve as features for supervised predictors of properties such as affinity, stability, expression, and activity, where the model's strong low-data performance can substantially shrink the number of wet-lab measurements required. The code and weights are published on GitHub, and the model is also available through OpenProtein.AI's platform and documentation, which lowers the barrier for teams without large in-house training infrastructure. This page's openness rating flags a restrictive license on core components, so the license terms should be checked before use.
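
As a rough illustration of the embeddings-as-features workflow described above, the sketch below fits a closed-form ridge regression on a small labeled set and evaluates it on held-out variants. Everything here is an assumption for illustration: the embeddings are random placeholders rather than PoET-2 outputs, the "activity" labels are synthetic, and `fit_ridge` and `predict` are hypothetical helper names. Ridge regression is simply one common choice for low-data regimes, not a method the source attributes to the authors.

```python
import numpy as np

def fit_ridge(features, targets, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = features.mean(axis=0)
    y_mean = targets.mean()
    centered = features - x_mean
    gram = centered.T @ centered + alpha * np.eye(features.shape[1])
    weights = np.linalg.solve(gram, centered.T @ (targets - y_mean))
    intercept = y_mean - x_mean @ weights
    return weights, intercept

def predict(features, weights, intercept):
    """Linear prediction from embedding features."""
    return features @ weights + intercept

rng = np.random.default_rng(0)
n_variants, dim = 200, 16

# Placeholder embeddings and synthetic labels (not real model outputs or assay data).
embeddings = rng.normal(size=(n_variants, dim))
true_direction = rng.normal(size=dim)
activity = embeddings @ true_direction + 0.1 * rng.normal(size=n_variants)

# Low-data regime: train on 32 labeled variants, evaluate on the rest.
train, test = slice(0, 32), slice(32, n_variants)
weights, intercept = fit_ridge(embeddings[train], activity[train], alpha=1.0)
predicted = predict(embeddings[test], weights, intercept)

correlation = np.corrcoef(predicted, activity[test])[0, 1]
print(f"held-out Pearson r = {correlation:.3f}")
```

In a real campaign the embedding matrix would come from the model and the labels from assays; how many measurements are needed depends on how well the embeddings already separate functional from non-functional variants.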

#Impact

As the successor to PoET, PoET-2 extends a line of work arguing that retrieval augmentation and family-centric conditioning are an effective alternative to scaling parameters for protein modeling. The headline claim (competitive accuracy at ~182M parameters with a roughly 30-fold reduction in experimental data needed for engineering) is significant for groups operating under realistic labeling budgets, and the unified handling of zero-shot scoring, supervised representation learning, and controllable generation in one model is a practical advantage. As of release the work is a preprint, so its benchmark claims await broader independent replication, and the practical benefit of structure conditioning depends on the availability and quality of input structures. Independent comparisons against models such as ESM-2, ProGen, and other retrieval-augmented approaches will help establish where PoET-2's family-centric design offers the largest gains.

Citation

Understanding protein function with a multimodal retrieval-augmented foundation model

Preprint

Truong, T. F. & Bepler, T. (2025). Understanding protein function with a multimodal retrieval-augmented foundation model. arXiv preprint arXiv:2508.04724.

DOI: 10.48550/arXiv.2508.04724

Citations

Total Citations: 9

GitHub

Stars: 26
Forks: 6

Openness

Unclassified
Restrictive license on core components

Tags

foundation_model, language_model, protein_design, proteomics, representation_learning, retrieval_augmented, transformer, variant_effect_prediction

Resources

GitHub Repository, Research Paper, Official Website, Documentation