ProteinInvBench is a standardized benchmarking framework for protein inverse folding, introduced at NeurIPS 2023 as part of the Datasets and Benchmarks Track. Protein inverse folding — the task of designing amino acid sequences that will fold into a specified three-dimensional backbone structure — is a foundational problem for computational protein design. Despite rapid progress in this area, prior work had converged on a single dataset (CATH) and a single metric (sequence recovery), making it difficult to compare methods rigorously or understand their behavior in practically relevant scenarios. ProteinInvBench addresses this gap by providing a unified framework that spans multiple tasks, integrates eight published methods under identical conditions, and evaluates them across six complementary metrics.
The benchmark was developed by Zhangyang Gao, Cheng Tan, Yijie Zhang, Xingran Chen, Lirong Wu, and Stan Z. Li from the A4Bio research group. Beyond re-evaluating existing methods on updated data, ProteinInvBench extends the scope of the field by adapting models originally designed for single-chain proteins to two additional scenarios: multi-chain protein complexes and de novo backbone scaffolds. This expansion reflects the reality that most biologically and therapeutically relevant proteins operate as oligomers or require backbone geometries not represented in the training distribution.
By providing a shared codebase and reproducible evaluation pipeline, ProteinInvBench enables direct comparison between methods that had previously been assessed in incompatible experimental setups. The accompanying software repository, also distributed as OpenCPD, is designed so that researchers can add new models to the framework and re-run evaluations with minimal overhead.
ProteinInvBench is not a single neural network but a benchmarking infrastructure. The integrated models span the dominant architectural paradigms in structure-conditioned sequence design: graph neural networks (GraphTrans, StructGNN, GVP, GCA, AlphaDesign), autoregressive decoders (ProteinMPNN), and one-shot or refinement-based designs (PiFold, and KWDesign, which adds memory retrieval and iterative refinement). Each model receives protein backbone coordinates as input and produces a probability distribution over amino acid identities at each position. All methods are re-trained and evaluated on identical data splits to eliminate confounding differences in preprocessing, hyperparameter tuning budgets, or training duration.
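This shared interface can be sketched as follows. The function below is a deliberately toy stand-in, not any benchmarked model's architecture: the distance-based featurization and random projection are illustrative placeholders for the learned GNN or transformer encoders the real methods use.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def toy_inverse_folding(backbone_ca, rng=None):
    """Toy stand-in for a structure-conditioned sequence model.

    backbone_ca: (L, 3) array of C-alpha coordinates for an
    L-residue backbone. Returns an (L, 20) array of per-position
    probabilities over amino acid identities, matching the
    input/output contract shared by the benchmarked models.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    L = backbone_ca.shape[0]
    # Crude geometric feature: each residue's 8 smallest
    # C-alpha distances (real models use far richer features).
    diffs = backbone_ca[:, None, :] - backbone_ca[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)       # (L, L)
    feats = np.sort(dists, axis=1)[:, :8]        # (L, 8)
    proj = rng.standard_normal((feats.shape[1], 20))
    logits = feats @ proj                        # (L, 20)
    # Softmax over the amino-acid axis.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

coords = np.random.default_rng(1).standard_normal((12, 3)) * 5.0
probs = toy_inverse_folding(coords)
print(probs.shape)        # (12, 20): one distribution per residue
print(probs.sum(axis=1))  # each row sums to 1 (up to float error)
```

A designed sequence is then obtained by taking the argmax (or sampling) over each row and mapping indices into `AMINO_ACIDS`.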
The benchmark reveals meaningful performance differences across tasks and metrics that were invisible under single-metric evaluation. KWDesign achieves the highest sequence recovery across perturbation noise scales, with PiFold a close second, while the rankings shift when diversity or computational efficiency is prioritized. Under backbone coordinate perturbations, the relative ordering of methods changes substantially, showing that recovery on clean structures does not reliably predict performance under realistic noise. A single training epoch is cheap for every model except KWDesign, whose memory-retrieval stage dominates its cost, and both PiFold and KWDesign converge within 20 epochs, so training efficiency is not a barrier to adoption for most methods in the benchmark.
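Sequence recovery, the metric most prior work optimized in isolation, is simply the fraction of designed positions that reproduce the native residue. A minimal implementation (the short sequences below are made up for illustration):

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed sequence matches
    the native amino acid (higher is better)."""
    if len(designed) != len(native):
        raise ValueError("sequences must be aligned and equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

native   = "MKTAYIAKQR"
designed = "MKSAYIAKQW"  # hypothetical model output
print(f"recovery = {sequence_recovery(designed, native):.2f}")  # 0.80
```

The benchmark's point is precisely that this single number can be high while diversity, robustness, or structural self-consistency are poor, which is why it is reported alongside the other metrics.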
ProteinInvBench is primarily a tool for researchers developing or selecting protein inverse folding methods. Computational protein engineers can use the benchmark to identify which method performs best for their specific use case — for example, prioritizing diversity when designing sequence libraries for directed evolution, or prioritizing sc-TM scores when designing sequences intended for experimental validation. Method developers can use the framework to evaluate new models against a standardized set of baselines across all three design tasks without reimplementing competing methods. The multi-chain extension makes the benchmark relevant for antibody and protein complex design, where existing single-chain-only evaluations are insufficient.
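Diversity, unlike recovery, is scored over a set of sampled designs rather than a single output. One common formulation, used here as an illustrative convention rather than the benchmark's exact definition, is one minus the mean pairwise sequence identity across the library:

```python
from itertools import combinations

def pairwise_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def diversity(designs: list[str]) -> float:
    """1 - mean pairwise identity over all pairs of designs.

    0 means every sampled sequence is identical; values near 1
    mean the library is highly varied. This exact formula is an
    assumption for illustration; conventions vary across papers.
    """
    pairs = list(combinations(designs, 2))
    mean_id = sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_id

library = ["MKTAYI", "MKSAYI", "MRTAFI", "LKTAYV"]  # hypothetical samples
print(f"diversity = {diversity(library):.3f}")  # 0.417
```

A practitioner building a library for directed evolution would favor a method scoring well on this axis even at some cost in recovery, whereas one sequence destined for wet-lab validation argues for optimizing sc-TM instead.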
ProteinInvBench contributed to establishing more rigorous evaluation standards in the protein inverse folding literature, which had been criticized for inconsistent benchmarking practices. By demonstrating that method rankings change with the metric and task considered, the work motivates the community to report results across multiple evaluation dimensions rather than optimizing solely for sequence recovery, and subsequent inverse folding methods increasingly report sc-TM and diversity alongside recovery as standard practice. A notable limitation is that the benchmark focuses on backbone-conditioned design and does not evaluate partial-structure conditioning, side-chain packing quality, or experimental success rates, all of which remain active areas of benchmark development in the field.