bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance

BC-Design

Gerstein Lab / Yale University

Biochemistry-aware inverse protein design framework that augments backbone geometry with physicochemical point clouds, with a reported ~90% sequence recovery on CATH 4.2.

Released: November 2024

BC-Design is a biochemistry-aware framework for inverse protein design developed in Mark Gerstein's lab at Yale University, first posted to bioRxiv in November 2024. Inverse protein design (and its narrower form, inverse protein folding) asks the reverse of structure prediction: given a fixed three-dimensional backbone, find amino acid sequences that fold into it. This is a central task in protein engineering, where tools such as ProteinMPNN, PiFold, and ESM-IF have set the prevailing standard.

The key idea of BC-Design is that backbone geometry alone underspecifies the design problem. Two residues with nearly identical local geometry can demand very different chemistry depending on whether they sit in a buried hydrophobic core or on a polar, solvent-exposed surface. BC-Design therefore augments the geometric description with continuous, smoothly varying physicochemical fields — hydrophobicity and charge — sampled as point clouds spanning both the protein surface and interior. A dedicated biochemistry encoder processes these fields and a fusion module ("BC-Fusion") combines them with structural features, letting the model condition sequence generation on chemical context rather than shape alone.
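To make the idea of a smoothly varying physicochemical field concrete, here is a minimal sketch that evaluates hydrophobicity and charge at arbitrary 3D points as a Gaussian-weighted average over nearby residues. Everything in it is an assumption for illustration: the Kyte-Doolittle scale, the integer side-chain charges, the Gaussian kernel, the `sigma` value, and the function names `field_at` and `sample_point_cloud` are not taken from BC-Design, whose actual feature construction may differ.

```python
import math

# Kyte-Doolittle hydropathy scale (an assumed choice; the scale BC-Design uses is not stated here).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified formal side-chain charges near neutral pH (histidine treated as neutral).
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def field_at(point, sequence, ca_coords, sigma=4.0):
    """Return (hydropathy, charge) at `point` as a Gaussian-weighted residue average."""
    w_sum = hyd = chg = 0.0
    for aa, xyz in zip(sequence, ca_coords):
        d2 = sum((p - c) ** 2 for p, c in zip(point, xyz))
        w = math.exp(-d2 / (2.0 * sigma ** 2))
        w_sum += w
        hyd += w * KD[aa]
        chg += w * CHARGE.get(aa, 0.0)
    if w_sum == 0.0:  # point is too far from every residue for any weight to survive
        return 0.0, 0.0
    return hyd / w_sum, chg / w_sum


def sample_point_cloud(points, sequence, ca_coords, sigma=4.0):
    """Attach (hydropathy, charge) to each sampled point: [(xyz, hyd, chg), ...]."""
    return [(p, *field_at(p, sequence, ca_coords, sigma)) for p in points]
```

In this toy version a point next to a leucine reads as hydrophobic and a point next to a lysine reads as polar and positively charged, even if the two residues have similar local backbone geometry, which is the kind of distinction the paragraph above describes.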

The authors report roughly 90% sequence recovery on the CATH 4.2 benchmark, a large margin over the structure-only baselines they compare against, with consistent gains across protein lengths, contact-order regimes, and major fold classes. Sequence recovery is the fraction of positions at which a designed sequence matches the native one. Later preprint revisions reframe the work from pure inverse folding toward inverse design, emphasizing the generation of plausible functional variants rather than only recovering native sequences.
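For reference, the recovery metric quoted above can be computed in a few lines. This is a generic sketch of per-sequence recovery, not the evaluation script from the BC-Design repository; the function name `sequence_recovery` is hypothetical, and benchmark papers differ in how they aggregate across proteins (for example, median versus mean), which this sketch does not address.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(native) != len(designed):
        raise ValueError("fixed-backbone design: sequences must have equal length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# One substitution in eight positions gives 7/8 recovery.
print(sequence_recovery("MKTAYIAK", "MKTAYLAK"))  # 0.875
```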

#Key Features

  • Biochemistry-augmented inputs: Continuous hydrophobicity and charge fields are represented as point clouds over the protein surface and interior, supplying chemical context that backbone coordinates alone do not capture.
  • BC-Fusion architecture: A separate biochemistry encoder and a fusion module integrate physicochemical point-cloud features with geometric structure features before sequence decoding.
  • High sequence recovery: Reaches approximately 90% native sequence recovery on CATH 4.2, with robust generalization across fold classes, chain lengths, and contact-order regimes.
  • Controllable fidelity–diversity trade-off: Masking the biochemical features from 0% to 100% tunes outputs between faithful native-sequence recovery and more diverse candidate generation, including a backbone-only inference mode.
  • Demonstrated functional design: Case studies show increased enzyme–substrate affinity, improved peptide–receptor design accuracy, and state-of-the-art recovery and structural fidelity for antibody CDRH3 loop modeling.
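The masking control in the list above can be pictured with a small sketch that hides a chosen fraction of per-point biochemical features before they reach the model. How BC-Design actually selects and encodes masked features is not described here, so the random per-point selection, the zero `mask_value`, and the name `mask_biochem_features` are all assumptions made for illustration.

```python
import random


def mask_biochem_features(features, mask_ratio, seed=None, mask_value=0.0):
    """Return a copy of per-point features with a random fraction replaced by `mask_value`.

    `features` is a list of per-point tuples such as (hydrophobicity, charge).
    mask_ratio=0.0 keeps everything; mask_ratio=1.0 hides everything, which
    corresponds to a backbone-only input in this toy setting.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    n_mask = round(mask_ratio * len(features))
    masked_idx = set(rng.sample(range(len(features)), n_mask))
    return [
        (mask_value,) * len(f) if i in masked_idx else tuple(f)
        for i, f in enumerate(features)
    ]
```

Sweeping `mask_ratio` from 0.0 to 1.0 and scoring the resulting designs is one way to trace out the fidelity-diversity curve without touching the model weights.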

#Technical Details

BC-Design encodes a protein as two parallel streams: a structural representation of the backbone, and a biochemical representation in which hydrophobicity and charge are sampled as smoothly varying point clouds across surface and interior coordinates. The biochemistry (BC) encoder processes the physicochemical point clouds, and the BC-Fusion module merges them with structural features before the sequence is decoded over the 20 canonical amino acids.

The released implementation ships a single pretrained checkpoint, UBC2Model.ckpt, hosted on Hugging Face and used directly for inference. Backbone-only prediction and partial biochemical-feature masking both reuse this fixed checkpoint, so exploring the masking-controlled trade-off requires no retraining.

Evaluation covers CATH 4.2 (the primary benchmark, with ~90% reported recovery) along with the TS50, TS500, and AFDB2000 test sets; a full CATH 4.2 evaluation takes roughly 3.5 hours on a single A100 GPU. Code and weights are both released under the Apache 2.0 license, while the preprint text is licensed separately under CC BY-NC-ND.
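The two-stream encoding described in this section can be illustrated with a deliberately simple stand-in for the fusion step: each residue's structural feature vector is concatenated with the mean of the biochemical features of point-cloud samples within a fixed radius. This is not the BC-Fusion module, which is a learned component whose internals are not detailed here; the radius pooling, the 8 Å default, and the name `fuse_streams` are assumptions chosen only to show how per-point and per-residue features can be brought onto a common index.

```python
import math


def fuse_streams(residue_coords, residue_feats, cloud_points, cloud_feats, radius=8.0):
    """Concatenate each residue's structural features with pooled nearby chemistry.

    Returns one fused vector per residue: structural features followed by the
    mean biochemical features of cloud points within `radius` of the residue.
    Residues with no nearby points get zeros for the biochemical part.
    """
    n_chem = len(cloud_feats[0]) if cloud_feats else 0
    fused = []
    for xyz, feat in zip(residue_coords, residue_feats):
        nearby = [
            cf for p, cf in zip(cloud_points, cloud_feats)
            if math.dist(xyz, p) <= radius
        ]
        if nearby:
            pooled = [sum(col) / len(nearby) for col in zip(*nearby)]
        else:
            pooled = [0.0] * n_chem
        fused.append(list(feat) + pooled)
    return fused
```

Passing an empty point cloud leaves only the structural features, which loosely mirrors the backbone-only inference mode in this toy setting.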

#Applications

BC-Design targets fixed-backbone sequence design for protein engineering, where the goal is to find sequences compatible with a target structure and function. Its biochemistry-aware conditioning is aimed at function-sensitive design problems: the paper demonstrates enzyme sequence design that increases substrate affinity, peptide design against receptor targets, and antibody engineering through modeling of CDRH3, the most variable and design-critical antibody loop. The masking control lets practitioners move between high-fidelity native recovery, which is useful for stabilizing or resurfacing an existing scaffold, and greater sequence diversity for exploring functional variants. This makes the method relevant to enzyme engineering, therapeutic peptide and antibody design, and sequence design for de novo scaffolds.
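When exploring the diversity end of that trade-off, it helps to quantify how different a batch of designs actually is. The sketch below uses mean pairwise normalized Hamming distance, a common and simple choice; the source does not say which diversity measure the authors use, and the name `mean_pairwise_diversity` is hypothetical.

```python
from itertools import combinations


def mean_pairwise_diversity(sequences):
    """Mean fraction of differing positions over all pairs of equal-length designs."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(sequences, 2))
    total = sum(
        sum(a != b for a, b in zip(s, t)) / length
        for s, t in pairs
    )
    return total / len(pairs)
```

A value of 0.0 means every design is identical, and values near 1.0 mean the designs share almost no positions; reporting it alongside native recovery gives a two-number summary of where a given masking level sits.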

#Impact

BC-Design contributes a distinct angle to the crowded inverse-folding landscape. Rather than scaling geometric encoders, it argues that explicit physicochemical context, in the form of hydrophobicity and charge as continuous spatial fields, is a missing signal that materially improves sequence recovery and functional design. The reported ~90% CATH 4.2 recovery is well above the levels established by structure-only baselines such as ProteinMPNN and PiFold, and the antibody, enzyme, and peptide case studies extend the claim from benchmark recovery toward practical function-aware design. The work remains an early-stage preprint with a single released checkpoint, so adoption and independent validation are still limited, and head-to-head comparisons under matched evaluation protocols will determine how well its gains hold up. The open Apache 2.0 release of both code and weights lowers the barrier to that scrutiny.

Citation

BC-Design: A Biochemistry-Aware Framework for Inverse Protein Design

Preprint

Tang, X., et al. (2025) BC-Design: A Biochemistry-Aware Framework for Inverse Protein Design. bioRxiv.

DOI: 10.1101/2024.10.28.620755

Recent citations

Papers that recently cited this model.

  • Pushing Biomolecular Utility-Diversity Frontiers with Supergroup Relative Policy Optimization
    Xinwu Ye, He Cao, Hao Li, et al.
    May 2026 · Citations: 0

  • Joint Design of Protein Surface and Structure Using a Diffusion Bridge Model
    Guanlue Li, Xufeng Zhao, Fang Wu, et al.
    arXiv.org · Nov 2025 · Citations: 0

  • Surface-based Molecular Design with Multi-modal Flow Matching
    Fan Wu, Zhengyuan Zhou, Shuting Jin, et al.
    Knowledge Discovery and Data Mining · Aug 2025 · Citations: 4

Citations

Total Citations: 3
Influential: 0
References: 60

GitHub

Stars: 21
Forks: 4
Open Issues: 1
Contributors: 7
Last Push: 7mo ago
Language: Python
License: Apache-2.0

Fields of citing research

  • Computer Science: 100%
  • Biology: 67%
  • Medicine: 33%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 75 (Open)
Usability (can I run it?): 86
Reproducibility (can I retrain it?): 84
Model Openness Framework: Unclassified (restrictive license on core components)

Resources

  • GitHub Repository
  • Research Paper
  • HuggingFace Model