Proteo-R1

Stanford University / University of Tokyo / RIKEN Center for Advanced Intelligence Project / Chinese University of Hong Kong

Reasoning-guided foundation model for de novo antibody CDR design, pairing a multimodal LLM understanding expert with a Boltz-1 diffusion expert.

Released: May 2026

Proteo-R1 is a reasoning-guided protein design foundation model for de novo design of antibody complementarity-determining regions (CDRs). Most generative protein models map a target directly to a designed sequence or structure, leaving the reasoning behind the choice of residues implicit. Proteo-R1 instead separates molecular understanding from geometric generation: a multimodal large language model first reasons over sequence and structure to identify the functionally critical residues, then passes those decisions as hard constraints to a diffusion model that builds the corresponding three-dimensional structure.

The system is built from two cooperating experts. The understanding expert couples a Qwen3-4B language model with a Protenix structural encoder, giving the LLM access to residue-level geometric context so it can reason about which positions drive binding. The generation expert is a Boltz-1-based conditional diffusion model that performs framework inpainting and diffusion sampling to design CDR loops under the constraints emitted by the understanding expert.
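The hand-off between the two experts can be pictured as a per-residue design mask. The sketch below is only an illustration and not the Proteo-R1 implementation: the Constraints class, the build_design_mask and masked_template functions, and the rule that pinned residues stay fixed inside the CDR are all assumptions made for this example, since the actual format of the emitted constraints is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Constraints:
    """Hypothetical container for what an understanding expert might emit."""
    # 0-based position -> one-letter amino acid that must appear at that position
    pinned: dict[int, str] = field(default_factory=dict)


def build_design_mask(sequence: str, cdr_span: tuple[int, int],
                      constraints: Constraints) -> list[bool]:
    """Return True for every position a generator would be free to redesign.

    Framework residues (outside cdr_span) stay fixed, mirroring framework
    inpainting; pinned CDR residues stay fixed as hard constraints.
    """
    start, end = cdr_span  # half-open interval [start, end)
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("cdr_span must lie inside the sequence")
    for pos in constraints.pinned:
        if not start <= pos < end:
            raise ValueError(f"pinned position {pos} is outside the CDR span")
    return [start <= i < end and i not in constraints.pinned
            for i in range(len(sequence))]


def masked_template(sequence: str, mask: list[bool],
                    constraints: Constraints) -> str:
    """Render a generator input where 'X' marks positions left open for design."""
    return "".join(
        "X" if mask[i] else constraints.pinned.get(i, aa)
        for i, aa in enumerate(sequence)
    )


if __name__ == "__main__":
    toy_sequence = "AAAACDEFGHAAAA"        # toy string, not a real antibody
    toy_constraints = Constraints(pinned={6: "W"})
    toy_mask = build_design_mask(toy_sequence, (4, 10), toy_constraints)
    print(masked_template(toy_sequence, toy_mask, toy_constraints))
```

In this toy case the framework flanks are copied through unchanged, one CDR position is pinned to W, and the remaining five CDR positions are left open for the generator.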

Proteo-R1 was introduced in 2026 by a collaboration led by researchers at Stanford University (including Jure Leskovec and Yejin Choi), with contributors from the University of Tokyo and RIKEN AIP (Naoto Yokoya, Masashi Sugiyama) and the Chinese University of Hong Kong (Pheng-Ann Heng). It was accepted to ICML 2026.

Key Features

Reasoning-then-design pipeline: A multimodal LLM identifies functionally critical residues and passes them as explicit constraints to the generator, separating molecular understanding from geometric generation rather than predicting structure end-to-end.
Dual-expert architecture: An understanding expert (Qwen3-4B paired with a Protenix encoder) handles residue-level reasoning, while a generation expert built on Boltz-1 conditional diffusion handles structure synthesis.
Antibody CDR specialization: The model targets de novo design of antibody CDR loops, including the difficult CDR-H3, the most variable and binding-relevant loop.
Inference-only release: The published checkpoints support an inference CLI (proteor1-prepare-cdr, proteor1-design); the framework ships with fixed weights and is not intended for user-side training.
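Before a first run it can help to confirm that the two CLI entry points are installed and, optionally, to fetch the checkpoints ahead of time. The sketch below is a minimal pre-flight check under stated assumptions: the helper names missing_cli_tools and prefetch_checkpoints are hypothetical, the checkpoint repository names are the ones given under Technical Details, and whether the Proteo-R1 CLI reuses the default HuggingFace cache populated by snapshot_download is an assumption. No CLI flags are shown because none are documented here.

```python
import shutil
from typing import Callable, Optional

# CLI entry points and checkpoint repositories as named in this document.
CLI_TOOLS = ("proteor1-prepare-cdr", "proteor1-design")
CHECKPOINTS = (
    "thinking-bio-lab/proteor1-understand",
    "thinking-bio-lab/proteor1-generate",
)


def missing_cli_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the CLI entry points that cannot be found on PATH."""
    return [tool for tool in CLI_TOOLS if which(tool) is None]


def prefetch_checkpoints() -> dict[str, str]:
    """Download both checkpoints into the local HuggingFace cache (needs network)."""
    # Imported lazily so the PATH check works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return {repo: snapshot_download(repo_id=repo) for repo in CHECKPOINTS}


if __name__ == "__main__":
    missing = missing_cli_tools()
    print("missing CLI tools:", ", ".join(missing) if missing else "none")
```

The which argument is injectable only so the check can be exercised without touching the real PATH; calling prefetch_checkpoints() is optional, since the checkpoints are reported to download automatically on first inference.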

Technical Details

Proteo-R1 is trained through a three-stage curriculum on protein structures from the Protein Data Bank (PDB) together with antibody-antigen complexes from SAbDab, producing fixed weights for inference. The understanding expert is a roughly 4-billion-parameter Qwen3 model augmented with a Protenix structural encoder; the generation expert is a Boltz-1-based conditional diffusion model. On the RAbD CDR-H3 design benchmark, Proteo-R1 reaches a DockQ of 0.801, substantially above the reported baseline of 0.473, indicating markedly more accurate reconstruction of bound antibody-antigen geometry. The reference implementation is released on GitHub under Apache 2.0, and the two checkpoints (thinking-bio-lab/proteor1-understand and thinking-bio-lab/proteor1-generate) download automatically from HuggingFace on first inference.
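For readers unfamiliar with the metric, DockQ combines the fraction of native contacts (Fnat), the interface RMSD, and the ligand RMSD into a single score between 0 and 1. The sketch below implements the standard DockQ definition and its conventional quality bands; it is not taken from the Proteo-R1 code, the function names are chosen for this example, and how the benchmark-level figures above were aggregated across complexes is not stated here.

```python
def scaled_rmsd(rmsd: float, d: float) -> float:
    """Map an RMSD in angstroms to (0, 1]; d sets where the value drops to 0.5."""
    if rmsd < 0:
        raise ValueError("rmsd must be non-negative")
    return 1.0 / (1.0 + (rmsd / d) ** 2)


def dockq(fnat: float, irmsd: float, lrmsd: float) -> float:
    """Standard DockQ: mean of Fnat, scaled interface RMSD, scaled ligand RMSD."""
    if not 0.0 <= fnat <= 1.0:
        raise ValueError("fnat must be in [0, 1]")
    # Scaling constants 1.5 A (interface) and 8.5 A (ligand) follow the DockQ definition.
    return (fnat + scaled_rmsd(irmsd, 1.5) + scaled_rmsd(lrmsd, 8.5)) / 3.0


def quality_band(score: float) -> str:
    """Conventional DockQ bands for a single complex."""
    if score >= 0.80:
        return "high"
    if score >= 0.49:
        return "medium"
    if score >= 0.23:
        return "acceptable"
    return "incorrect"


def relative_gain(new: float, baseline: float) -> float:
    """Fractional improvement of new over baseline."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    print(quality_band(0.801), quality_band(0.473))
    print(f"{relative_gain(0.801, 0.473):.1%}")
```

Under these conventional bands, 0.801 sits just inside the high-quality range and 0.473 inside the acceptable range, a relative gain of roughly 69%. The bands are defined for individual complexes, so applying them to a benchmark-level score is only indicative.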

Applications

Proteo-R1 is aimed at computational antibody engineering and de novo binder design. Given an antigen and an antibody framework, it proposes CDR sequences and structures predicted to bind. This supports therapeutic antibody discovery, affinity optimization, and prospective design campaigns whose outputs are then validated experimentally. Because the understanding expert exposes which residues it deems functionally important, the workflow can also help researchers interpret and prioritize candidate designs rather than treating generation as a black box.

Impact

By coupling a reasoning language model to a structure-generating diffusion model, Proteo-R1 illustrates a broader trend of bringing explicit, residue-level reasoning into protein design instead of relying solely on end-to-end generation. Its large reported gain on RAbD CDR-H3 (DockQ 0.801 vs. 0.473) suggests that constraint extraction by an understanding expert can meaningfully improve downstream geometric generation for antibodies. As an ICML 2026 contribution with open code and downloadable checkpoints, it offers a concrete template for reasoning-guided design that other groups can build on. Note that the HuggingFace checkpoints currently ship without a model card or a stated weights license (distinct from the Apache 2.0 code), so users should verify licensing terms before deployment.
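The licensing check mentioned above can be partly automated by reading the metadata the HuggingFace Hub exposes for a model repository. The following is a minimal sketch, assuming the huggingface_hub package is installed and that a declared license appears as a tag of the form license:<identifier>; the helper names are hypothetical, and a missing tag only means that no license is declared in the repository metadata, not that the weights are unrestricted.

```python
from typing import Iterable, Optional


def license_from_tags(tags: Iterable[str]) -> Optional[str]:
    """Extract the license identifier from Hub-style tags such as 'license:apache-2.0'."""
    for tag in tags:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return None


def declared_license(repo_id: str) -> Optional[str]:
    """Look up the license declared for a model repository on the Hub (needs network)."""
    # Imported lazily so the pure tag parser works without huggingface_hub installed.
    from huggingface_hub import HfApi

    info = HfApi().model_info(repo_id)
    return license_from_tags(info.tags or [])


if __name__ == "__main__":
    # Offline demonstration on hand-written tag lists.
    print(license_from_tags(["pytorch", "license:apache-2.0"]))
    print(license_from_tags(["pytorch"]))
```

Calling declared_license on each of the two checkpoint repositories would return None for as long as no license is declared in their metadata, which is the situation described above.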

Citation

Proteo-R1: Reasoning Foundation Models for De Novo Protein Design

Preprint

Wu, F., et al. (2026) Proteo-R1: Reasoning Foundation Models for De Novo Protein Design. arXiv.

DOI: 10.48550/arXiv.2605.02937

Recent citations

Papers that recently cited this model.

Deciphering Fingerprints of 3D Molecular Surfaces for Accurate Epitope Prediction
Fang Wu, Weihao Xuan, J. Leskovec, et al. · Jun 2026 · Citations: 0

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback
Zehong Wang, Yijun Ma, Connor R. Schmidt, et al. · Jun 2026 · Citations: 0

SurfDesign: Effective Protein Design on Molecular Surfaces
Fang Wu, Shuting Jin, Xiangru Tang, et al. · May 2026 · Citations: 1

Top citations

The most-cited papers that cite this model.

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Xinwu Ye, Yicheng Mao, Jia Zhang, et al. · arXiv.org · Feb 2026 · Citations: 3

SurfDesign: Effective Protein Design on Molecular Surfaces
Fang Wu, Shuting Jin, Xiangru Tang, et al. · May 2026 · Citations: 1

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback
Zehong Wang, Yijun Ma, Connor R. Schmidt, et al. · Jun 2026 · Citations: 0

Deciphering Fingerprints of 3D Molecular Surfaces for Accurate Epitope Prediction
Fang Wu, Weihao Xuan, J. Leskovec, et al. · Jun 2026 · Citations: 0

Citations

Total Citations: 1
Influential: 0
References: 60

GitHub

Stars: 62
Forks: 8
Open Issues: 1
Contributors: 2
Last Push: 2mo ago
Language: Python
License: Apache-2.0

HuggingFace

Downloads: 3.2K
Likes: 0
Last Modified: 2mo ago

Fields of citing research

Share of papers citing this model:

Computer Science: 100%
Biology: 50%
Chemistry: 25%
Physics: 25%

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 53 (Partial)
Usability (can I run it?): 69
Reproducibility (can I retrain it?): 26

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository
Research Paper
HuggingFace Models (two checkpoints)
