BC-Design

Biochemistry-aware inverse protein design framework that augments backbone geometry with physicochemical point clouds, reaching ~90% sequence recovery on CATH 4.2.

Released: November 2024

BC-Design is a biochemistry-aware framework for inverse protein design developed in Mark Gerstein's lab at Yale University, first posted to bioRxiv in November 2024. Inverse protein design (and its narrower form, inverse protein folding) asks the reverse of structure prediction: given a fixed three-dimensional backbone, find amino acid sequences that fold into it. This is a central task in protein engineering, where tools such as ProteinMPNN, PiFold, and ESM-IF have set the prevailing standard.

The key idea of BC-Design is that backbone geometry alone underspecifies the design problem. Two residues with nearly identical local geometry can demand very different chemistry depending on whether they sit in a buried hydrophobic core or on a polar, solvent-exposed surface. BC-Design therefore augments the geometric description with continuous, smoothly varying physicochemical fields — hydrophobicity and charge — sampled as point clouds spanning both the protein surface and interior. A dedicated biochemistry encoder processes these fields and a fusion module ("BC-Fusion") combines them with structural features, letting the model condition sequence generation on chemical context rather than shape alone.
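To make the idea of a smoothly varying physicochemical field concrete, the following is a minimal sketch rather than the authors' featurization. It assumes the Kyte-Doolittle hydropathy scale, crude formal charges at neutral pH, and a Gaussian kernel of width `sigma`; the function name `sample_fields` and all parameters are hypothetical. It also reads values from a known sequence purely to illustrate the representation.

```python
import numpy as np

# Kyte-Doolittle hydropathy scale. Using this particular scale is an
# assumption for illustration; BC-Design may define hydrophobicity differently.
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Crude formal charges at neutral pH (assumption: histidine treated as neutral).
FORMAL_CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def sample_fields(points, residue_xyz, sequence, sigma=4.0):
    """Return an (n_points, 2) array of [hydrophobicity, charge] values.

    Each point gets a Gaussian-weighted average of per-residue values,
    so the resulting field changes smoothly with position.
    """
    points = np.asarray(points, dtype=float)
    residue_xyz = np.asarray(residue_xyz, dtype=float)
    if len(sequence) != len(residue_xyz):
        raise ValueError("one coordinate is required per residue")
    values = np.array(
        [[KD_HYDROPATHY[aa], FORMAL_CHARGE.get(aa, 0.0)] for aa in sequence]
    )
    # Squared distance from every point to every residue: (n_points, n_res).
    d2 = ((points[:, None, :] - residue_xyz[None, :, :]) ** 2).sum(axis=-1)
    # Subtract the row minimum so the nearest residue always has weight 1;
    # this avoids 0/0 for far-away points and leaves the normalized weights unchanged.
    d2 = d2 - d2.min(axis=1, keepdims=True)
    weights = np.exp(-d2 / (2.0 * sigma**2))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values


# Toy example: an isoleucine and a lysine placed 100 Å apart.
residue_xyz = [[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]]
points = [[0.0, 0.0, 0.0], [50.0, 0.0, 0.0], [100.0, 0.0, 0.0]]
fields = sample_fields(points, residue_xyz, "IK")
print(fields)
```

Points sitting on a residue take that residue's values, while the midpoint blends the two equally. This is the sense in which the same backbone location can carry different chemical context depending on its surroundings.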

The authors report roughly 90% sequence recovery on the CATH 4.2 benchmark, a substantial jump over prior structure-only methods, with consistent gains across protein lengths, contact-order regimes, and major fold classes. Later preprint revisions reframe the work from pure inverse folding toward inverse design, emphasizing the generation of plausible functional variants rather than only recovering native sequences.
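Sequence recovery, the metric behind the ~90% figure, is the fraction of positions at which a designed sequence matches the native one. A minimal sketch for a single aligned pair follows; the function name is hypothetical, and benchmark protocols may differ in how per-protein values are aggregated (for example mean versus median across chains), which this sketch does not cover.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# One mismatch (I -> L) out of eight positions.
recovery = sequence_recovery("MKTAYIAK", "MKTAYLAK")
print(f"{recovery:.3f}")  # 0.875
```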

Key Features

Biochemistry-augmented inputs: Continuous hydrophobicity and charge fields are represented as point clouds over the protein surface and interior, supplying chemical context that backbone coordinates alone do not capture.
BC-Fusion architecture: A separate biochemistry encoder and a fusion module integrate physicochemical point-cloud features with geometric structure features before sequence decoding.
High sequence recovery: Reaches approximately 90% native sequence recovery on CATH 4.2, with robust generalization across fold classes, chain lengths, and contact-order regimes.
Controllable fidelity–diversity trade-off: Masking the biochemical features from 0% to 100% tunes outputs between faithful native-sequence recovery and more diverse candidate generation, including a backbone-only inference mode.
Demonstrated functional design: Case studies show increased enzyme–substrate affinity, improved peptide–receptor design accuracy, and state-of-the-art recovery and structural fidelity for antibody CDRH3 loop modeling.
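The masking control in the list above can be pictured as hiding a chosen fraction of the biochemical point features before they reach the model. The sketch below is an assumption about what such masking might look like, not the released implementation: it zeroes randomly selected rows of a hypothetical `(n_points, n_channels)` feature array, and the actual code may mask differently (for example with a learned mask token).

```python
import numpy as np


def mask_biochemical_features(features, mask_fraction, rng=None):
    """Zero out a random fraction of point-cloud feature rows.

    Returns the masked features and a boolean array marking the rows kept.
    mask_fraction=0.0 keeps everything; 1.0 leaves only backbone information.
    """
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be between 0 and 1")
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    n_points = features.shape[0]
    n_masked = int(round(mask_fraction * n_points))
    # Pick which points to hide by taking the head of a random permutation.
    masked_idx = rng.permutation(n_points)[:n_masked]
    keep = np.ones(n_points, dtype=bool)
    keep[masked_idx] = False
    return features * keep[:, None], keep


rng = np.random.default_rng(0)
features = np.ones((8, 2))  # 8 points, [hydrophobicity, charge] channels
masked, keep = mask_biochemical_features(features, 0.25, rng)
print(int(keep.sum()), "of", len(keep), "points kept")
```

Sweeping `mask_fraction` from 0.0 to 1.0 corresponds to moving from full biochemical conditioning toward the backbone-only mode described above.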

Technical Details

BC-Design encodes a protein as two parallel streams: a structural representation of the backbone and a biochemical representation in which hydrophobicity and charge are sampled as smoothly varying point clouds across surface and interior coordinates. The biochemistry (BC) encoder processes the physicochemical point clouds, and the BC-Fusion module merges them with structural features before the sequence is decoded over the 20 canonical amino acids.

The released implementation ships a single pretrained checkpoint, UBC2Model.ckpt (hosted on Hugging Face), which is used directly for inference. Both backbone-only prediction and partial biochemical-feature masking reuse this fixed checkpoint, so no retraining is required to explore the masking-controlled trade-off.

Evaluation spans CATH 4.2 (the primary benchmark, with ~90% reported recovery) along with the TS50, TS500, and AFDB2000 test sets; a full CATH 4.2 evaluation runs in roughly 3.5 hours on a single A100 GPU.

Code and weights are both released under the Apache 2.0 license, while the preprint text is governed separately by CC BY-NC-ND.
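The two-stream layout in this section can be illustrated with a generic fusion step. The sketch below uses single-head cross-attention, with residues as queries and biochemical points as keys and values, followed by a linear projection to 20 amino-acid classes. This is a swapped-in, textbook technique chosen for illustration; the internal layers of BC-Fusion are not described here, and every name, dimension, and random weight below is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def fuse(struct_feats, point_feats, w_q, w_k, w_v):
    """Cross-attention: each residue gathers context from the point cloud.

    struct_feats: (n_res, d_struct) per-residue structural features
    point_feats:  (n_points, d_point) per-point biochemical features
    Returns the fused features (n_res, d_struct + d_attn) and the
    attention weights (n_res, n_points).
    """
    q = struct_feats @ w_q            # (n_res, d_attn)
    k = point_feats @ w_k             # (n_points, d_attn)
    v = point_feats @ w_v             # (n_points, d_attn)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    context = attn @ v                # (n_res, d_attn)
    return np.concatenate([struct_feats, context], axis=-1), attn


# Random, untrained weights: this only demonstrates shapes and data flow.
rng = np.random.default_rng(0)
n_res, n_points, d_struct, d_point, d_attn = 5, 40, 16, 2, 8
struct_feats = rng.normal(size=(n_res, d_struct))
point_feats = rng.normal(size=(n_points, d_point))
w_q = rng.normal(size=(d_struct, d_attn))
w_k = rng.normal(size=(d_point, d_attn))
w_v = rng.normal(size=(d_point, d_attn))
w_out = rng.normal(size=(d_struct + d_attn, 20))

fused, attn = fuse(struct_feats, point_feats, w_q, w_k, w_v)
probs = softmax(fused @ w_out)        # (n_res, 20) per-residue distribution
print(fused.shape, probs.shape)
```

Concatenating the structural features with the attended context keeps the backbone signal intact, so the structural stream still carries information even when the biochemical stream contributes little.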

Applications

BC-Design targets fixed-backbone sequence design for protein engineering, where the goal is to find sequences compatible with a target structure and function. Its biochemistry-aware conditioning is aimed at function-sensitive design problems: the paper demonstrates enzyme sequence design that increases substrate affinity, peptide design against receptor targets, and antibody engineering through CDRH3 loop modeling, the most variable and design-critical antibody loop. The masking control lets practitioners dial between high-fidelity native recovery — useful for stabilizing or resurfacing an existing scaffold — and greater sequence diversity for exploring functional variants, making it relevant to enzyme engineering, therapeutic peptide and antibody design, and de novo scaffold sequencing.

Impact

BC-Design contributes a distinct angle to the crowded inverse-folding landscape: rather than scaling geometric encoders, it argues that explicit physicochemical context — hydrophobicity and charge as continuous spatial fields — is a missing signal that materially improves sequence recovery and functional design. The reported ~90% CATH 4.2 recovery is well above the levels established by structure-only baselines such as ProteinMPNN and PiFold, and the antibody, enzyme, and peptide case studies extend the claim from benchmark recovery toward practical function-aware design. As an early-stage preprint with a single released checkpoint, its broader adoption and independent validation are still developing, and head-to-head comparisons under matched evaluation protocols will determine how its gains hold up; the open Apache 2.0 release of both code and weights lowers the barrier for that scrutiny.

Citation

BC-Design: A Biochemistry-Aware Framework for Inverse Protein Design

Preprint

Tang, X., et al. (2025) BC-Design: A Biochemistry-Aware Framework for Inverse Protein Design. bioRxiv.

DOI: 10.1101/2024.10.28.620755

Recent citations

Papers that recently cited this model.

Pushing Biomolecular Utility-Diversity Frontiers with Supergroup Relative Policy Optimization
Xinwu Ye, He Cao, Hao Li, et al.
May 2026
Citations: 0

Joint Design of Protein Surface and Structure Using a Diffusion Bridge Model
Guanlue Li, Xufeng Zhao, Fang Wu, et al.
arXiv.org · Nov 2025
Citations: 0

Surface-based Molecular Design with Multi-modal Flow Matching
Fan Wu, Zhengyuan Zhou, Shuting Jin, et al.
Knowledge Discovery and Data Mining · Aug 2025
Citations: 4

Top citations

The most-cited papers that cite this model.

Surface-based Molecular Design with Multi-modal Flow Matching
Fan Wu, Zhengyuan Zhou, Shuting Jin, et al.
Knowledge Discovery and Data Mining · Aug 2025
Citations: 4

Pushing Biomolecular Utility-Diversity Frontiers with Supergroup Relative Policy Optimization
Xinwu Ye, He Cao, Hao Li, et al.
May 2026
Citations: 0

Joint Design of Protein Surface and Structure Using a Diffusion Bridge Model
Guanlue Li, Xufeng Zhao, Fang Wu, et al.
arXiv.org · Nov 2025
Citations: 0

Citations

Total Citations: 3
Influential: 0
References: 60

GitHub

Stars: 21
Forks: 4
Open Issues: 1
Contributors: 7
Last Push: 7mo ago
Language: Python
License: Apache-2.0

Fields of citing research

Computer Science: 100%
Biology: 67%
Medicine: 33%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall score: 75 (Open)
Usability (can I run it?): 86
Reproducibility (can I retrain it?): 84

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository · Research Paper · HuggingFace Model
