Tongji University / Fudan University
A framework that bridges a protein language model and a general LLM via continual pretraining, giving large language models protein-domain reasoning without losing general capabilities.
Protein language models (PLMs) such as ESM2 capture rich sequence-level biology but cannot reason in natural language, while general-purpose large language models (LLMs) reason fluently but lack grounded protein knowledge. BioBridge, described in a February 2026 arXiv preprint from researchers at Chinese institutions (including Tongji and Fudan University), aims to combine these strengths: to give an LLM genuine protein understanding while preserving the broad reasoning and knowledge it already has.
The central challenge is catastrophic forgetting: naively fine-tuning an LLM on protein data degrades its general abilities. BioBridge addresses this with Domain-Incremental Continual Pre-training (DICP), which trains on protein-domain data alongside a general reasoning corpus so the model gains specialized competence without sacrificing its original skills. A cross-modal projector maps a frozen PLM's protein embeddings into the LLM's semantic space, letting the language model attend to protein representations as if they were another modality.
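The data-mixing idea behind DICP can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `interleave_corpora`, the `protein_fraction` ratio, and the toy documents are all hypothetical, and the preprint's actual mixing schedule should be taken from the paper. The sketch only shows how a training stream can draw from a protein corpus and a general corpus so that neither is seen in isolation.

```python
import random
from typing import Iterator, Sequence, Tuple


def interleave_corpora(
    protein_docs: Sequence[str],
    general_docs: Sequence[str],
    protein_fraction: float = 0.5,  # hypothetical ratio, not taken from the paper
    num_samples: int = 8,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, document) pairs sampled from both corpora.

    Each step picks the protein corpus with probability `protein_fraction`
    and the general corpus otherwise, so general-domain text keeps
    appearing throughout training instead of being dropped.
    """
    if not 0.0 <= protein_fraction <= 1.0:
        raise ValueError("protein_fraction must be in [0, 1]")
    rng = random.Random(seed)
    for _ in range(num_samples):
        if rng.random() < protein_fraction:
            yield "protein", rng.choice(protein_docs)
        else:
            yield "general", rng.choice(general_docs)


# Toy usage: placeholder strings stand in for real training documents.
protein_corpus = ["protein doc A", "protein doc B"]
general_corpus = ["general doc A", "general doc B"]
for source, doc in interleave_corpora(protein_corpus, general_corpus):
    print(source, "->", doc)
```

Sampling per example is only one option; mixing per batch or following a schedule that shifts the ratio over training would serve the same purpose of keeping general data in the stream.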
BioBridge uses ESM2 as a frozen protein encoder and Qwen2.5-7B-Instruct as the language backbone. A Q-Former-style projector uses a fixed number of query tokens that cross-attend to the protein embeddings, producing protein representations that the LLM consumes alongside text. Training follows the DICP recipe, interleaving protein-domain data with a general corpus to limit forgetting. Reported results include localization (DeepLoc multi) at 0.815 versus ESM2's 0.759, metal-ion binding at 0.761 versus 0.712, and EC annotation at 0.743. General performance is partly retained: MMLU, for example, is 63.30 versus the base model's 70.41, a drop of about 7 points. As of this preprint, the authors note no released weights or code; architecture and benchmark figures should be confirmed against the paper.
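To make the projector concrete, here is a minimal NumPy sketch of the general Q-Former-style idea: a fixed set of learned queries cross-attends to per-residue embeddings and returns the same number of output tokens regardless of protein length. It is an illustration under stated assumptions, not the BioBridge implementation. The class name `QueryProjector`, the single attention head, the random initialization, and the toy dimensions are all hypothetical; real ESM2 and Qwen2.5 hidden sizes are much larger and a trained projector would typically use multiple heads and layers.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class QueryProjector:
    """Single-head cross-attention from learned queries to protein embeddings.

    Hypothetical sketch: weights are random here; in practice they are trained.
    """

    def __init__(self, plm_dim: int, llm_dim: int, num_queries: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        # Learned query tokens: their count fixes the output length.
        self.queries = rng.normal(0.0, scale, (num_queries, llm_dim))
        self.w_q = rng.normal(0.0, scale, (llm_dim, llm_dim))
        self.w_k = rng.normal(0.0, scale, (plm_dim, llm_dim))
        self.w_v = rng.normal(0.0, scale, (plm_dim, llm_dim))

    def __call__(self, residue_emb: np.ndarray) -> np.ndarray:
        # residue_emb: (seq_len, plm_dim), one row per residue from the frozen PLM.
        q = self.queries @ self.w_q  # (num_queries, llm_dim)
        k = residue_emb @ self.w_k   # (seq_len, llm_dim)
        v = residue_emb @ self.w_v   # (seq_len, llm_dim)
        # Each query attends over all residues.
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (num_queries, seq_len)
        return attn @ v  # (num_queries, llm_dim)


# Toy usage: proteins of different lengths map to the same number of tokens.
projector = QueryProjector(plm_dim=16, llm_dim=32, num_queries=4)
short_protein = np.random.default_rng(1).normal(size=(10, 16))
long_protein = np.random.default_rng(2).normal(size=(250, 16))
print(projector(short_protein).shape, projector(long_protein).shape)
```

The fixed output length is the practical point of this design: the LLM sees a constant number of protein tokens per sequence, so context cost does not grow with protein length.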
BioBridge is aimed at protein scientists who want to query and reason about proteins through a conversational language interface, asking about properties, function, or binding and receiving answers grounded in a protein encoder. By unifying protein property prediction and free-form question answering in one model, it points toward assistant-style tools that combine PLM accuracy with LLM usability for tasks such as annotation triage and hypothesis generation.
BioBridge contributes to a growing line of work that fuses biomolecular encoders with general LLMs, and its focus on continual pretraining to avoid catastrophic forgetting is a notable design choice in that space. Its practical reach is currently limited by the absence of released weights or code, and as a February 2026 preprint its results await peer review and independent reproduction.