bio.rodeo

© 2026 Pulsatance. All rights reserved.
Language model / Protein

BioBridge

Tongji University / Fudan University

A framework that bridges a protein language model and a general LLM via continual pretraining, giving large language models protein-domain reasoning without losing general capabilities.

Released: February 2026

Protein language models (PLMs) such as ESM2 capture rich sequence-level biology but cannot reason in natural language, while general-purpose large language models (LLMs) reason fluently but lack grounded protein knowledge. BioBridge, described in a February 2026 arXiv preprint from researchers at institutions including Tongji University and Fudan University, aims to combine these strengths: to give an LLM protein understanding while preserving the broad reasoning and knowledge it already has.

The central challenge is catastrophic forgetting: naively fine-tuning an LLM on protein data degrades its general abilities. BioBridge addresses this with Domain-Incremental Continual Pre-training (DICP), which trains on protein-domain knowledge alongside a general reasoning corpus so the model gains specialized competence without sacrificing its original skills. A cross-modal projector maps a frozen PLM's protein embeddings into the LLM's semantic space, letting the language model attend to protein representations as if they were another modality.
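The idea of training on protein-domain data and a general corpus together can be illustrated with a mixed sampling stream. This is a minimal sketch, not the authors' DICP implementation: the function name `interleave_corpora`, the `protein_fraction` knob, and the placeholder documents are all assumptions made for illustration, and the preprint's actual mixing ratio and schedule are not stated here.

```python
import random
from typing import Iterator, Sequence, Tuple


def interleave_corpora(
    protein_docs: Sequence[str],
    general_docs: Sequence[str],
    protein_fraction: float = 0.5,
    num_samples: int = 8,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, document) pairs from a mixed training stream.

    protein_fraction is a hypothetical knob: the probability that the
    next training example comes from the protein-domain corpus.
    """
    if not 0.0 <= protein_fraction <= 1.0:
        raise ValueError("protein_fraction must be in [0, 1]")
    rng = random.Random(seed)
    for _ in range(num_samples):
        if rng.random() < protein_fraction:
            yield "protein", rng.choice(protein_docs)
        else:
            yield "general", rng.choice(general_docs)


# Placeholder documents standing in for real corpora.
protein_docs = ["protein description A", "protein description B"]
general_docs = ["general reasoning text A", "general reasoning text B"]

stream = list(
    interleave_corpora(protein_docs, general_docs, protein_fraction=0.5, num_samples=1000)
)
protein_share = sum(1 for source, _ in stream if source == "protein") / len(stream)
print(f"protein share of the stream: {protein_share:.2f}")
```

Setting `protein_fraction` to 1.0 reproduces the naive fine-tuning case that the paragraph above warns about, since the model would then see no general-domain text at all.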

#Key Features

  • PLM-projector-LLM architecture: A frozen protein language model is connected to an LLM through a learned projector, aligning protein embeddings with the language model's semantic space.
  • Domain-Incremental Continual Pre-training (DICP): Trains on protein knowledge and a general reasoning corpus together to inject domain expertise while mitigating catastrophic forgetting.
  • Dual competence: Targets competitive results on protein benchmarks (EC, BindingDB) while remaining on par with the base LLM on general tasks (MMLU, RACE).
  • Multi-task and conversational: Supports protein property prediction and knowledge-based question answering within a single language-model interface.
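One way to picture the single language-model interface is as a token layout in which projected protein vectors occupy a slot inside an ordinary text prompt. The sketch below is illustrative only: the `<protein>` placeholder, the two prompt templates, and the `build_layout` helper are hypothetical and do not come from the preprint. Whitespace splitting stands in for a real tokenizer.

```python
from typing import List

PROTEIN_SLOT = "<protein>"  # hypothetical placeholder, not from the paper

# Hypothetical templates: both tasks are phrased as plain text prompts.
TEMPLATES = {
    "property": "Predict the subcellular localization of " + PROTEIN_SLOT + " .",
    "qa": "Given " + PROTEIN_SLOT + " , answer : {question}",
}


def build_layout(task: str, num_protein_tokens: int, question: str = "") -> List[str]:
    """Return the token layout the LLM would see for one request.

    "[P0]", "[P1]", ... stand in for the projected protein vectors that
    replace the placeholder in the text sequence.
    """
    prompt = TEMPLATES[task].format(question=question)
    layout: List[str] = []
    for token in prompt.split():
        if token == PROTEIN_SLOT:
            layout.extend(f"[P{i}]" for i in range(num_protein_tokens))
        else:
            layout.append(token)
    return layout


print(build_layout("property", num_protein_tokens=4))
print(build_layout("qa", num_protein_tokens=4, question="Does it bind zinc?"))
```

Both requests go through the same code path and differ only in their text, which is the sense in which property prediction and question answering share one interface.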

#Technical Details

BioBridge uses ESM2 as a frozen protein encoder and Qwen2.5-7B-Instruct as the language backbone. A Q-Former-style projector uses cross-attention to distill the variable-length protein embeddings into a fixed number of query tokens, which the LLM consumes alongside text. Training follows the DICP recipe, interleaving protein-domain data with a general corpus to limit forgetting. Reported results include localization (DeepLoc multi) at 0.815 versus ESM2's 0.759, metal-ion binding at 0.761 versus 0.712, and EC annotation at 0.743. General performance is mostly but not fully retained: for example, MMLU is 63.30 versus the base model's 70.41, a gap of about 7 points. As of the preprint, no weights or code have been released; architecture details and benchmark figures should be confirmed against the paper.
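The way a query-based projector turns a protein of any length into a fixed number of tokens can be sketched with single-head cross-attention. This is a simplified illustration under stated assumptions, not the BioBridge projector: a real Q-Former stacks several layers with self-attention and learned key/value projections, all of which are omitted here. The toy sizes (`PLM_DIM`, `LLM_DIM`, `NUM_QUERIES`) and the random weights are assumptions as well.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(queries: Matrix, residues: Matrix) -> Matrix:
    """Each query attends over all residue embeddings and returns one pooled vector."""
    dim = len(queries[0])
    pooled = []
    for query in queries:
        scores = [
            sum(q * r for q, r in zip(query, residue)) / math.sqrt(dim)
            for residue in residues
        ]
        weights = softmax(scores)
        pooled.append(
            [sum(w * residue[j] for w, residue in zip(weights, residues)) for j in range(dim)]
        )
    return pooled


def linear(vectors: Matrix, weight: Matrix) -> Matrix:
    """Map vectors of size d_in to size d_out; weight has shape d_in x d_out."""
    d_out = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
PLM_DIM, LLM_DIM, NUM_QUERIES = 8, 16, 4  # toy sizes, not the real model's
queries = random_matrix(NUM_QUERIES, PLM_DIM, rng)  # stand-in for learned queries
out_proj = random_matrix(PLM_DIM, LLM_DIM, rng)     # stand-in for a learned projection

for length in (50, 300):
    residues = random_matrix(length, PLM_DIM, rng)  # stand-in for frozen PLM output
    tokens = linear(cross_attention_pool(queries, residues), out_proj)
    print(length, "residues ->", len(tokens), "tokens of size", len(tokens[0]))
```

The number of output tokens depends only on the number of queries, so short and long proteins cost the LLM the same amount of context.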

#Applications

BioBridge is aimed at protein scientists who want to query and reason about proteins through a conversational language interface, asking about properties, function, or binding and receiving answers grounded in a protein encoder. By unifying protein property prediction and free-form question answering in one model, it points toward assistant-style tools that combine PLM accuracy with LLM usability for tasks such as annotation triage and hypothesis generation.

#Impact

BioBridge contributes to a growing line of work that fuses biomolecular encoders with general LLMs, and its focus on continual pretraining to avoid catastrophic forgetting is a notable design choice in that space. Its practical reach is currently limited by the absence of released weights or code, and as a February 2026 preprint its results await peer review and independent reproduction.

Tags

protein_property_prediction, question_answering, protein_function_prediction, transformer, language_model, continual_learning, multimodal, proteomics