BioReason-Pro

Arc Institute / UCSF / University of Toronto / Vector Institute

Multimodal reasoning LLM for protein function prediction, fusing protein language model embeddings to emit interpretable GO-term reasoning traces.

Released: March 2026

BioReason-Pro is a multimodal reasoning large language model for protein function prediction, developed by the Arc Institute together with UCSF and the Bo Wang Lab (University of Toronto / Vector Institute / University Health Network) and released as a bioRxiv preprint in March 2026. It reframes function annotation — typically a multi-label classification problem over Gene Ontology (GO) terms — as a reasoning task, generating step-by-step explanations that connect a protein's domain architecture, interaction partners, and organism context to its molecular function, biological processes, and cellular localization.

Architecturally, BioReason-Pro fuses protein embeddings from the ESM3 protein language model with the Qwen3 language model, allowing it to ground natural-language reasoning in learned structural and evolutionary representations. A companion model, GO-GPT — an autoregressive transformer over GO terms that captures the ontology's hierarchical and cross-aspect dependencies — supplies structured label priors to the reasoning pipeline.
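The description above does not spell out how the protein embeddings reach the language model. One common pattern in multimodal LLMs is to project each protein embedding into the language model's embedding space and prepend the projected vectors to the text token embeddings. The sketch below illustrates that pattern only. It is an assumption, not a confirmed account of BioReason-Pro's fusion layer, and the names `project` and `fuse`, the dimensions, and the weights are all hypothetical toy values.

```python
from typing import List

Vector = List[float]


def project(embedding: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Linear map into the LLM embedding space: out[j] = sum_i x[i] * W[i][j] + b[j]."""
    return [
        sum(x * weight[i][j] for i, x in enumerate(embedding)) + bias[j]
        for j in range(len(bias))
    ]


def fuse(
    protein_embeddings: List[Vector],
    text_embeddings: List[Vector],
    weight: List[Vector],
    bias: Vector,
) -> List[Vector]:
    """Project every protein embedding, then prepend the results to the text sequence."""
    projected = [project(e, weight, bias) for e in protein_embeddings]
    return projected + text_embeddings


# Toy data (hypothetical): 2 residue embeddings of width 3, LLM width 2.
protein = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # shape 3 x 2
bias = [0.0, 0.0]
text = [[0.1, 0.2]]  # one text token embedding

fused = fuse(protein, text, weight, bias)
print(fused)  # [[2.0, 1.0], [0.5, 1.5], [0.1, 0.2]]
```

In a real system the projection would be a trained layer and the sequences would be tensors, but the shape logic is the same: the language model sees one sequence whose first positions carry protein information.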

The model addresses a long-standing weakness of black-box function predictors: their outputs are labels without rationale. By emitting explicit reasoning traces, BioReason-Pro produces annotations that human experts can scrutinize, and in a blinded comparison experts preferred its annotations over ground-truth UniProt entries in 79% of cases.

Key Features

Reasoning-based annotation: Instead of emitting bare GO labels, the model generates structured reasoning traces linking sequence features, interaction partners, and organism context to predicted functions.
Multimodal fusion: ESM3 protein embeddings are integrated with the Qwen3 LLM, grounding language reasoning in learned protein representations.
GO-GPT label prior: A companion autoregressive transformer models hierarchical and cross-aspect GO-term dependencies, supplying structured priors to the reasoner.
SFT plus reinforcement learning: Trained by supervised fine-tuning on synthetic reasoning traces, then optimized with reinforcement learning to sharpen accuracy.
Open release: Code, model weights, a web demo, and precomputed predictions for 223,000+ proteins are publicly available.
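To make the hierarchical dependency mentioned under the GO-GPT feature concrete: in the Gene Ontology, annotating a protein with a term implies annotation with all of that term's ancestors (the "true path rule"). The sketch below shows that closure on a toy graph. It is not GO-GPT's code, and the `TOY:` identifiers and parent links are hypothetical placeholders rather than real GO terms.

```python
from typing import Dict, Iterable, Set

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS: Dict[str, Set[str]] = {
    "TOY:protein_kinase": {"TOY:kinase"},
    "TOY:kinase": {"TOY:transferase"},
    "TOY:transferase": {"TOY:catalytic"},
    "TOY:catalytic": {"TOY:root"},
    "TOY:root": set(),
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect every ancestor of `term` by walking parent links depth-first."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the annotation set closed under the true path rule."""
    closed = set(terms)
    for term in list(closed):
        closed |= ancestors(term, parents)
    return closed


print(sorted(propagate({"TOY:kinase"}, PARENTS)))
# ['TOY:catalytic', 'TOY:kinase', 'TOY:root', 'TOY:transferase']
```

A label prior that respects this structure should not assign a high probability to a specific term while assigning a low one to its ancestors, which is one reason to model GO terms jointly rather than as independent labels.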

Technical Details

BioReason-Pro is built by coupling ESM3 protein embeddings with the Qwen3 language model. It is trained via supervised fine-tuning on synthetic reasoning traces generated by GPT-5 for over 130,000 proteins, then further optimized through reinforcement learning. On GO term prediction it reaches 73.6% Fmax, and an LLM judge assigns its functional summaries a score of 8/10. The inference pipeline runs on a single GPU and covers 200+ organisms. Weights for GO-GPT and both the SFT and RL variants of BioReason-Pro are released on HuggingFace, with precomputed predictions for 223,000+ proteins available as a dataset and through the web demo.
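For readers unfamiliar with the Fmax figure quoted above, the sketch below computes the protein-centric Fmax as it is commonly defined in CAFA-style evaluations. At each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all benchmark proteins, and Fmax is the best resulting F1. This is a minimal illustration with hypothetical toy data. It does not reproduce the authors' evaluation code, and details such as ancestor propagation or per-aspect scoring are omitted.

```python
from typing import Dict, Optional, Sequence, Set


def fmax(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    thresholds: Optional[Sequence[float]] = None,
) -> float:
    """Maximum protein-centric F1 over score thresholds."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:  # precision only counts proteins with a prediction
                precisions.append(hits / len(predicted))
            if true_terms:
                recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy data (hypothetical term names and scores).
truth = {"P1": {"A", "B"}, "P2": {"C"}}
predictions = {
    "P1": {"A": 0.9, "B": 0.2, "D": 0.3},
    "P2": {"C": 0.8, "A": 0.2},
}
print(round(fmax(predictions, truth), 3))  # 0.857
```

In the toy case the best threshold keeps only the two confident, correct predictions: precision is 1.0, recall is 0.75, and F1 is 6/7.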

Applications

BioReason-Pro is aimed at researchers who need not just a predicted function but a defensible rationale — for example, annotating proteins of unknown function from new genomes and metagenomes, prioritizing targets in therapeutic discovery, and curating or auditing existing database entries. The interpretable reasoning traces make it suitable as an assistant for expert biocurators, and its single-GPU inference and 200+ organism coverage lower the barrier for routine use.

Impact

By treating protein annotation as an interpretable reasoning problem and pairing it with an open release of code, weights, and large-scale precomputed predictions, BioReason-Pro pushes function prediction toward transparency and auditability rather than opaque scoring. The finding that experts preferred its annotations over UniProt ground truth in 79% of blinded cases is a notable signal of practical quality, though such expert-preference results depend on evaluation design and merit independent replication. The work also extends the Bo Wang Lab's earlier BioReason line, which coupled DNA foundation models with LLMs, to the protein domain.

Citation

Fallahpour, A., et al. (2026) BioReason-Pro: Advancing Protein Function Prediction with Multimodal Biological Reasoning. bioRxiv.

DOI: 10.64898/2026.03.19.712954

Recent citations

Papers that recently cited this model.

TheBioCollection: Unified Pre-Training Scale LLM Corpus for Biology
Hyunjin Seo, Hyeon Hwang, Gyubok Lee, et al.
Jul 2026
0
Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3
Jay Jung, Xiaohang Zhang, Shenghan Song, et al.
arXiv.org · Jun 2026
0
How Post-Training Shapes Biological Reasoning Models
Lukas Fesser, Hanlin Zhang, Michelle M. Li, et al.
Jun 2026
0 · Influential

Top citations

The most-cited papers that cite this model.

Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3
Jay Jung, Xiaohang Zhang, Shenghan Song, et al.
arXiv.org · Jun 2026
0
Bio-BLIP: A Multimodal Architecture for Transferable Reasoning in Genomic Variant Interpretation
Anvita Gupta, Anshul B Kundaje, Alejandro Buendia, et al.
bioRxiv · May 2026
0
Allos: an integrated Python toolkit for isoform-level single-cell and spatial in-situ transcriptomics
Eamon M McAndrew, Anna Diamant, Georges Vassaux, et al.
bioRxiv · Mar 2026
0
How Post-Training Shapes Biological Reasoning Models
Lukas Fesser, Hanlin Zhang, Michelle M. Li, et al.
Jun 2026
0 · Influential
MolDeTox: Evaluating Language Model's Stepwise Fragment Editing for Molecular Detoxification
Jueon Park, W. Jang, Jiwoo Lee, et al.
May 2026
0

Citations

Total Citations: 9
Influential: 1
References: 0

GitHub

Stars: 122
Forks: 15
Open Issues: 1
Contributors: 1
Last Push: 1mo ago
Language: Jupyter Notebook

Fields of citing research

Computer Science: 100%
Biology: 78%
Medicine: 44%
Chemistry: 22%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Overall: 58 (Partial)
Usability (can I run it?): 55
Reproducibility (can I retrain it?): 45
Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · Official Website · Demo

