PoET-2

Multimodal, retrieval-augmented protein foundation model that learns family-specific evolutionary constraints with optional structure conditioning.

Released: August 2025

Parameters: 182 million

PoET-2 is a protein foundation model developed by OpenProtein.AI (the company founded by Tristan Bepler) and released as a preprint in August 2025. It is the successor to PoET, the original "Protein Evolutionary Transformer" that introduced retrieval-augmented, family-centric protein language modeling. Where most protein language models embed a single sequence and rely on scale to capture general evolutionary signal, PoET-2 conditions on a set of related sequences—and optionally structure—at inference time, learning the constraints specific to a given protein family in context rather than baking them entirely into fixed weights.

The model addresses a persistent gap in protein engineering: zero-shot and low-data prediction of how mutations affect function. By performing in-context learning over retrieved homologs, PoET-2 can adapt to a target family without retraining, and its optional structure conditioning lets it incorporate 3D information when an experimental or predicted structure is available. The result is a single model that supports both zero-shot variant effect prediction and controllable sequence generation.

Notably, PoET-2 is reported to reach performance competitive with much larger models at roughly 182 million parameters, reinforcing the argument from the PoET line of work that retrieval augmentation and family-centric conditioning can substitute for raw parameter scaling on many protein tasks.

Key Features

Retrieval-augmented in-context learning: Conditions on sets of evolutionarily related sequences at inference time, capturing family-specific constraints without per-task fine-tuning.
Optional structure conditioning: Incorporates 3D structural information as an additional modality when available, making the model multimodal over sequence and structure.
Dual decoders: A causal (generative) decoder and a masked (bidirectional) decoder support both controllable sequence generation and rich representation learning from one backbone.
State-of-the-art zero-shot variant effects: Achieves strong zero-shot variant effect prediction, including for multi-mutation variants and challenging insertion/deletion (indel) mutations that many models handle poorly.
Strong low-data supervision: Embeddings outperform prior methods on supervised sequence-function tasks, particularly when only small amounts of labeled data are available—reported to reduce the experimental data needed for protein engineering by roughly 30-fold.
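The zero-shot scoring mentioned in the list above can be illustrated with a log-likelihood ratio: a variant is scored by how much more or less probable it is than the wild type, given the family context. The following is a minimal sketch under stated assumptions, not PoET-2's implementation. A toy per-position frequency profile stands in for the model, and the names build_profile, log_likelihood, and zero_shot_score are hypothetical. Unlike PoET-2, this toy handles only equal-length substitution variants, not indels.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(homologs, pseudocount=1.0):
    """Per-position amino-acid log-probabilities from equal-length homologs.

    This profile is a toy stand-in for a context-conditioned model.
    """
    length = len(homologs[0])
    if any(len(seq) != length for seq in homologs):
        raise ValueError("toy profile requires equal-length (pre-aligned) sequences")
    total = len(homologs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in homologs)
        profile.append(
            {aa: math.log((counts[aa] + pseudocount) / total) for aa in AMINO_ACIDS}
        )
    return profile


def log_likelihood(seq, profile):
    """Sum of per-position log-probabilities of seq under the profile."""
    return sum(profile[i][aa] for i, aa in enumerate(seq))


def zero_shot_score(variant, wild_type, profile):
    """Log-likelihood ratio of variant versus wild type (higher is better)."""
    return log_likelihood(variant, profile) - log_likelihood(wild_type, profile)


# Hypothetical family context and candidate variants.
homologs = ["MKTAY", "MKTAY", "MKSAY", "MRTAY"]
profile = build_profile(homologs)
wild_type = "MKTAY"
variants = ["MKWAY", "MKTAW", "MKSAY"]

# Rank candidates from most to least compatible with the family context.
ranked = sorted(
    variants, key=lambda v: zero_shot_score(v, wild_type, profile), reverse=True
)
print(ranked)
```

The substitution seen in the homologs (T to S at the third position) ranks above substitutions the family never shows, which is the behavior a family-conditioned scorer is meant to capture.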

Technical Details

PoET-2 uses a hierarchical transformer encoder that is equivariant to the ordering of the sequences provided in context, paired with a dual decoder architecture trained with both causal and masked language modeling objectives. This design lets the same model operate generatively (sampling new sequences) and bidirectionally (producing embeddings and scoring variants).

Structure, when supplied, is treated as an additional input modality alongside sequence sets, and retrieval of homologous sequences provides the family-specific evolutionary context the model conditions on.

At approximately 182 million parameters, the model is small relative to many contemporary protein language models yet reported to match or exceed their performance. On zero-shot variant effect prediction benchmarks the authors report state-of-the-art results, with particular gains on multi-mutant and indel variants; in supervised settings, PoET-2 embeddings improve sequence-function modeling, most markedly in the small-dataset regime. Detailed training-corpus composition and full benchmark tables are described in the arXiv preprint (2508.04724).
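The order-equivariance property described above can be made concrete with a small example. One generic way to obtain it is to apply attention across the set of context items without any positional information, so that permuting the inputs simply permutes the outputs. The sketch below illustrates that general idea only; it is an assumption for illustration and is not claimed to be PoET-2's actual layer design. The function names set_attention and mean_pool are hypothetical, and each context sequence is reduced to a single small vector for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def set_attention(vectors):
    """Dot-product self-attention over a set, with no positional encoding."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
        )
    return outputs


def mean_pool(vectors):
    """Order-invariant summary of a set of vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


# Three toy "sequence" vectors standing in for encoded context sequences.
context = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 3.0, 1.0]]
perm = [2, 0, 1]
shuffled = [context[i] for i in perm]

out = set_attention(context)
out_shuffled = set_attention(shuffled)

# Equivariance: output j of the shuffled set matches output perm[j] of the original.
equivariant = all(
    abs(a - b) < 1e-9
    for j, i in enumerate(perm)
    for a, b in zip(out_shuffled[j], out[i])
)
print(equivariant)
```

Pooling the outputs (for example with mean_pool) turns the equivariant representation into an order-invariant one, so the result does not depend on how the retrieved homologs happen to be listed.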

Applications

PoET-2 is aimed at protein engineering and design workflows where labeled functional data is scarce and expensive to generate. Researchers can use it for zero-shot ranking of candidate mutations (including multi-site and indel variants), for guiding directed-evolution and library-design campaigns, and for generating novel sequences within a target family under controllable constraints. Its embeddings serve as features for supervised property predictors—affinity, stability, expression, activity—where the model's strong low-data performance can substantially shrink the number of wet-lab measurements required. The model is accessible both as open code and weights on GitHub and through OpenProtein.AI's platform and documentation, lowering the barrier for teams without large in-house training infrastructure.
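Using embeddings as features for a supervised property predictor can be sketched with a small ridge regression, a common choice when labeled data is scarce. This is a minimal illustration under stated assumptions: the embedding vectors and labels below are made-up placeholders (in practice they would come from the model and from wet-lab measurements), and ridge_fit and ridge_predict are hypothetical helper names, not part of any PoET-2 or OpenProtein.AI API.

```python
def ridge_fit(X, y, alpha=1.0):
    """Solve (X^T X + alpha * I) w = X^T y by Gaussian elimination."""
    d = len(X[0])
    # Augmented matrix [X^T X + alpha * I | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0) for j in range(d)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(d)
    ]
    # Forward elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (A[i][d] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w


def ridge_predict(X, w):
    """Linear prediction for each feature vector in X."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


# Placeholder 3-dimensional "embeddings" for six measured variants.
embeddings = [
    [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0],
]
# Synthetic labels generated from a known linear relationship.
true_w = [2.0, -1.0, 0.5]
labels = ridge_predict(embeddings, true_w)

weights = ridge_fit(embeddings, labels, alpha=1e-6)
# Predict the property of an unmeasured variant from its embedding.
prediction = ridge_predict([[1.0, 1.0, 1.0]], weights)[0]
print(weights, prediction)
```

With real data, alpha would normally be tuned by cross-validation, and the quality of the fit in the small-dataset regime depends on how informative the embeddings are.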

Impact

As the successor to PoET, PoET-2 extends a line of work arguing that retrieval augmentation and family-centric conditioning are an effective alternative to scaling parameters for protein modeling. The headline claim—competitive accuracy at ~182M parameters with a roughly 30-fold reduction in experimental data needed for engineering—is significant for groups operating under realistic labeling budgets, and the unified handling of zero-shot scoring, supervised representation learning, and controllable generation in one model is a practical advantage. As of release the work is a preprint, so its benchmark claims await broader independent replication, and the practical benefit of structure conditioning depends on the availability and quality of input structures. Independent comparisons against models such as ESM-2, ProGen, and other retrieval-augmented approaches will help establish where PoET-2's family-centric design offers the largest gains.

Citation

Understanding protein function with a multimodal retrieval-augmented foundation model

Preprint

Truong, T. F. & Bepler, T. (2025) Understanding protein function with a multimodal retrieval-augmented foundation model. arXiv.org.

DOI: 10.48550/arXiv.2508.04724

Recent citations

Papers that recently cited this model.

Flexible Flows for Biological Sequence Design
Yogesh Verma, Dani Korpela, H. Lahdesmaki, et al.
Jun 2026 · Citations: 0

EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering
Nicolas Deutschmann, C. Ferragu, Jonathan D. Ziegler, et al.
Mar 2026 · Citations: 2

Evolutionary profile enhancement improves protein function annotation for remote homologs
Shitong Dai, Jiaqi Luo, Yunan Luo
bioRxiv · Mar 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

Steering generative models for protein design: Aligning and conditioning strategies.
Filippo Stocco, Michele Garibbo, Noelia Ferruz
Current Opinion in Structural Biology · Nov 2025 · Citations: 5

EvoPool: Evolution-Guided Pooling of Protein Language Model Embeddings
Navid NaderiAlizadeh, Rohit Singh
bioRxiv · Feb 2026 · Citations: 3

EvoFlows: Evolutionary Edit-Based Flow-Matching for Protein Engineering
Nicolas Deutschmann, C. Ferragu, Jonathan D. Ziegler, et al.
Mar 2026 · Citations: 2

BioR5: A Three-Layer Architecture for Biological Reasoning in Scientific AI
Peng Ding, Thomas S. Brettin, Rick Stevens
SC25-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis · Nov 2025 · Citations: 1

Protein Language Model–Aligned Spectra Embeddings for De Novo Peptide Sequencing
Navid Naderializadeh, Christian Dallago, E. Soderblom, et al.
bioRxiv · Oct 2025 · Citations: 1

Citations

Total Citations: 112
Influential: 12
References: 47

GitHub

Stars: 27
Forks: 6
Open Issues: 1
Contributors: 1
Last Push: 11mo ago
Language: Python
License: Apache-2.0

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 33%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 37 (Closed)
Usability (can I run it?): 59
Reproducibility (can I retrain it?): 14

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · Official Website · Documentation
