A holistic evaluation framework for protein foundation models, assessing 25+ models across 8 tasks along four dimensions: quality, novelty, diversity, and robustness.
ProteinBench is a comprehensive evaluation framework designed to assess protein foundation models across the breadth of tasks they are deployed for in research and drug discovery. Developed by researchers at ByteDance Research, the framework addresses a critical gap in the field: despite the rapid proliferation of protein foundation models for design, folding, and dynamics, no unified standard existed for comparing their capabilities across diverse objectives. ProteinBench establishes that standard by organizing evaluation around a principled taxonomy of protein tasks and a multi-dimensional scoring approach that captures what matters to practitioners, not just peak performance on a single metric.
The framework covers eight task categories spanning three major problem areas: protein design (inverse folding, backbone design, sequence design, structure-sequence co-design, motif scaffolding, and antibody design), three-dimensional structure prediction, and conformational dynamics (single-state and multi-state/ensemble prediction). Across these tasks, ProteinBench evaluates more than 25 models, including widely used systems such as RFdiffusion, ProteinMPNN, ESM3, AlphaFold2, Chroma, EvoDiff, and AlphaFlow, among others.
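The taxonomy can be summarized compactly. The mapping below simply restates the problem areas and tasks listed above as a Python dictionary; the key and task names follow this description rather than the framework's internal code.

```python
# Problem areas and tasks as described in the text above; names are from this
# summary, not from ProteinBench's own codebase.
PROTEINBENCH_TASKS = {
    "protein design": [
        "inverse folding",
        "backbone design",
        "sequence design",
        "structure-sequence co-design",
        "motif scaffolding",
        "antibody design",
    ],
    "structure prediction": ["three-dimensional structure prediction"],
    "conformational dynamics": ["single-state prediction", "multi-state/ensemble prediction"],
}
```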
A central insight behind ProteinBench is that different researchers have different priorities. A laboratory designing de novo enzymes cares about novelty and structural quality. A group targeting a therapeutic protein may prioritize diversity across a sampled ensemble. By decomposing performance into four orthogonal dimensions — quality, novelty, diversity, and robustness — ProteinBench allows users to consult the leaderboard relative to their specific objective rather than relying on a single aggregate score that obscures tradeoffs.
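As a purely illustrative sketch of that workflow (not ProteinBench's own scoring code, which reports the four dimensions separately rather than aggregating them), a user could re-rank models with objective-specific weights along these dimensions; the dataclass and function names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ModelScores:
    """Per-model scores on the four ProteinBench dimensions (higher = better)."""
    name: str
    quality: float
    novelty: float
    diversity: float
    robustness: float

def rank_for_objective(models, weights):
    """Rank models by a user-chosen weighting of the four dimensions.

    `weights` maps dimension names to weights, e.g. {"quality": 0.5, "novelty": 0.4, ...}.
    This weighted aggregation is illustrative only; the leaderboard itself keeps the
    dimensions separate so tradeoffs stay visible.
    """
    def score(m):
        return sum(w * getattr(m, dim) for dim, w in weights.items())
    return sorted(models, key=score, reverse=True)

# e.g. a de novo enzyme design group might emphasize quality and novelty:
enzyme_weights = {"quality": 0.5, "novelty": 0.4, "diversity": 0.1, "robustness": 0.0}
```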
ProteinBench is not a trained model but an evaluation framework, so its technical contribution lies in the design of metrics and datasets rather than architectural innovations. Quality for generative tasks is assessed using structure prediction as an oracle: generated sequences, or sequences designed onto generated backbones, are folded with ESMFold and scored against the original input structure using self-consistency TM-score (scTM) and self-consistency RMSD (scRMSD), with pLDDT serving as an additional confidence proxy. Novelty is quantified by comparing generated structures against the entire PDB using Foldseek's fast structural alignment and taking the maximum TM-score over all database hits; structures with lower maximum TM-scores represent more genuinely novel folds. Diversity is measured by pairwise TM-scores within a sampled batch, supplemented by structural clustering. Robustness evaluations challenge models with inputs outside their training distribution, such as de novo backbones for sequence design models trained primarily on PDB chains.
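A minimal numpy sketch of how such self-consistency and diversity quantities can be computed from CA coordinate arrays is shown below. It assumes equal-length, residue-matched traces, uses a single Kabsch superposition rather than TM-align's full iterative search, and all function names are illustrative rather than taken from the ProteinBench codebase (which relies on standard tools such as ESMFold and Foldseek for the heavy lifting).

```python
import numpy as np

def kabsch_superpose(mobile, target):
    """Rigid-body superpose `mobile` onto `target` (both (L, 3) CA coordinate arrays)."""
    m_c, t_c = mobile - mobile.mean(0), target - target.mean(0)
    u, _, vt = np.linalg.svd(m_c.T @ t_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))           # reflection correction
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return m_c @ rot.T + target.mean(0)

def tm_score(mobile, target):
    """Approximate TM-score after one global superposition (same-length traces only);
    TM-align's iterative alignment search is omitted for brevity."""
    L = len(target)
    d0 = max(1.24 * (L - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    dist = np.linalg.norm(kabsch_superpose(mobile, target) - target, axis=1)
    return float(np.mean(1.0 / (1.0 + (dist / d0) ** 2)))

def self_consistency(designed_ca, refolded_ca):
    """scTM / scRMSD: compare the refolded structure of a design (e.g., from ESMFold,
    not shown here) to the designed backbone itself."""
    aligned = kabsch_superpose(refolded_ca, designed_ca)
    sc_rmsd = float(np.sqrt(np.mean(np.sum((aligned - designed_ca) ** 2, axis=1))))
    return {"scTM": tm_score(refolded_ca, designed_ca), "scRMSD": sc_rmsd}

def batch_pairwise_tm(ca_list):
    """Mean pairwise TM-score within a sampled batch; lower values indicate higher diversity."""
    tms = [tm_score(a, b) for i, a in enumerate(ca_list) for b in ca_list[i + 1:]]
    return float(np.mean(tms)) if tms else 1.0
```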
Key empirical findings from the benchmark include substantial length-dependent performance degradation across all backbone design methods beyond 300 residues, a consistent quality-diversity tradeoff in sequence design (DPLM achieves the highest pLDDT scores, around 85-93, but lower diversity; EvoDiff achieves broader structural diversity at lower quality), and a clear advantage for MSA-based folding methods over language-model-only approaches in single-state structure prediction. In antibody CDR-H3 design, dyMEAN achieves the highest amino acid recovery rate (40.95%), but all evaluated methods show substantial gaps relative to natural antibody benchmarks, underscoring that antibody design remains an open challenge.
ProteinBench is primarily a tool for the research community rather than an end-user application. It is most useful to computational biologists and ML researchers who are selecting a protein foundation model for a specific task and need to know which system will best serve their objective. Drug discovery teams evaluating generative design tools, academic groups benchmarking new model architectures, and platform developers building protein AI pipelines can all use the leaderboard and associated toolkit to make informed, evidence-based choices. The modular codebase supports adding custom models and evaluation tasks, making it a practical infrastructure layer for ongoing comparative research.
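As a hypothetical illustration of what such an extension point can look like (none of these class or function names come from the actual ProteinBench repository), a custom backbone-design model would typically be wrapped in a small adapter and registered with the evaluation harness:

```python
from abc import ABC, abstractmethod

# Hypothetical registry and adapter interface -- illustrative only, not ProteinBench's real API.
MODEL_REGISTRY = {}

def register_model(name):
    """Decorator that makes a wrapped model discoverable by name in an evaluation run."""
    def decorator(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

class BackboneDesigner(ABC):
    @abstractmethod
    def sample(self, length: int, num_samples: int):
        """Return `num_samples` generated backbones (e.g., CA coordinate arrays) of `length` residues."""

@register_model("my-diffusion-model")
class MyDiffusionModel(BackboneDesigner):
    def __init__(self, checkpoint_path: str = "weights.ckpt"):
        self.checkpoint_path = checkpoint_path   # load model weights here in practice

    def sample(self, length, num_samples):
        raise NotImplementedError("invoke the model's sampler here")
```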
ProteinBench establishes one of the first rigorous, multi-task benchmarking standards for protein foundation models, addressing a recognized reproducibility and comparability problem in the field. Its central finding — that no single model currently excels across all protein design objectives — has practical implications for how researchers should approach model selection and how developers should frame claims about model performance. The four-dimensional evaluation schema provides a vocabulary for discussing model tradeoffs that is now available to the broader community through the public leaderboard. Released in September 2024 as a preprint, the framework arrived at a moment when the protein AI landscape had become crowded enough to make principled benchmarking tools essential infrastructure, and it is expected to serve as a reference point for future model evaluations as the field continues to evolve.
Ye, F., et al. (2024). ProteinBench: A Holistic Evaluation of Protein Foundation Models. International Conference on Learning Representations.
DOI: 10.48550/arXiv.2409.06744