Python framework for building standardized protein structure datasets and benchmarks, with pre-processed data from PDB and AlphaFoldDB for deep learning evaluation.
ProteinShake is a Python software framework developed by Tim Kucera, Carlos Oliver, Dexiong Chen, and Karsten Borgwardt at the Borgwardt Lab (ETH Zurich / Max Planck Institute) that standardizes the process of building datasets and benchmarks for deep learning on protein 3D structures. Published at NeurIPS 2023 in the Datasets and Benchmarks Track, it addresses a longstanding bottleneck in the field: the absence of a shared, reproducible foundation for evaluating structure-based protein models across diverse biological prediction tasks.
Before ProteinShake, researchers working on protein structure representation learning routinely implemented their own data processing pipelines, constructed their own train/validation/test splits, and selected their own evaluation metrics. This fragmentation made it nearly impossible to draw reliable conclusions about which model architectures and data representations actually performed best across different tasks. ProteinShake unifies this landscape by providing a single entry point to curated, pre-processed structural datasets paired with biologically meaningful prediction tasks and rigorous evaluation protocols.
The framework supports data sourced from the RCSB Protein Data Bank (PDB) and the AlphaFold Protein Structure Database (AlphaFoldDB), giving users access to both experimentally determined structures and high-quality predicted models. Installation is straightforward via pip install proteinshake, and datasets can be loaded in a single line of code, making the framework accessible to researchers with limited data engineering experience.
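The one-line loading pattern can be sketched as follows. This follows the usage documented in the ProteinShake README at publication time, but exact class and method names may differ across versions; the snippet is guarded so it degrades gracefully when the library or its data is unavailable.

```python
# Hedged sketch of typical ProteinShake usage; API names follow the
# project's documentation, but may vary by version.
try:
    from proteinshake.tasks import EnzymeClassTask

    # Load a pre-processed task: structures are downloaded and cached,
    # then converted to residue graphs (8 Angstrom edge threshold) in a
    # format consumable by PyTorch Geometric.
    task = EnzymeClassTask().to_graph(eps=8.0).pyg()
except Exception:
    # proteinshake not installed, or data not reachable in this
    # environment; the sketch is illustrative either way.
    task = None
```

The task object bundles the dataset with its train/test split and evaluation metric, so model code only needs to consume tensors and report predictions.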
ProteinShake is organized around three core abstractions: datasets (collections of structures with annotations), tasks (prediction objectives with associated metrics), and representations (format converters that transform raw structures into tensors suitable for deep learning). The preprocessing pipeline handles structure cleaning, residue-level annotation extraction, and graph construction (where each residue is a node and edges connect spatially proximal residues within a configurable distance threshold). Voxel and point cloud representations are constructed from the same cleaned coordinate data, ensuring that modality comparisons are not confounded by differences in preprocessing.
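The residue-graph construction described above can be illustrated with a short self-contained sketch. This is not ProteinShake's internal code, just the epsilon-neighborhood idea in plain Python: one node per residue, with an edge wherever two C-alpha atoms lie within the distance threshold.

```python
# Illustrative sketch of epsilon-neighborhood graph construction
# (not ProteinShake's actual implementation).
from math import dist

def residue_graph(ca_coords, eps=8.0):
    """Build an epsilon-neighborhood graph over residue coordinates.

    ca_coords: list of (x, y, z) C-alpha positions, one per residue.
    eps: distance threshold in Angstroms.
    Returns undirected edges (i, j) with i < j.
    """
    n = len(ca_coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if dist(ca_coords[i], ca_coords[j]) <= eps]

# Toy example: three residues on a line, 5 Angstroms apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
edges = residue_graph(coords, eps=8.0)  # (0, 1) and (1, 2), not (0, 2)
```

Because voxel and point cloud representations start from the same cleaned coordinates, swapping `residue_graph` for a voxelization or point sampling step changes only the final conversion, not the upstream pipeline.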
The benchmark experiments reported in the NeurIPS 2023 paper yielded three key findings: pre-training on large unlabeled structural datasets consistently improves downstream task performance; the optimal data representation (graph, voxel, or point cloud) varies by task and cannot be determined a priori; and current structure-based models generalize poorly to proteins that are structurally dissimilar to the training set. These findings were obtained by evaluating multiple GNN, 3D-CNN, and point cloud architectures across all provided tasks under identical experimental conditions.
ProteinShake is primarily a tool for researchers developing or evaluating protein structure representation learning methods. Machine learning researchers can use it to benchmark new architectures on a battery of biologically relevant tasks without assembling their own datasets, while computational biologists can use the pre-processed datasets as starting points for specialized fine-tuning or transfer learning experiments. The framework's flexible representation system also makes it suitable for ablation studies comparing 3D structural encodings, and its extensible task API enables groups developing new evaluation paradigms to contribute benchmarks that are immediately compatible with existing baseline models.
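The shape of the task abstraction can be sketched with a toy stand-in class. The class and annotation names here are hypothetical, not ProteinShake's actual base-class API; the point is the pattern a contributed benchmark follows: a target extractor paired with a task-appropriate metric.

```python
# Hypothetical stand-in illustrating the task pattern, not the real
# ProteinShake Task base class.
class StabilityTask:
    """Toy task: predict a binary per-protein label, scored by accuracy.
    Real ProteinShake tasks pair curated targets with metrics in the
    same spirit."""

    def target(self, protein):
        # Extract the ground-truth label from a protein record
        # (here, a plain dict with a hypothetical 'stable' annotation).
        return protein["stable"]

    def evaluate(self, y_true, y_pred):
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return {"accuracy": correct / len(y_true)}

proteins = [{"stable": 1}, {"stable": 0}, {"stable": 1}]
task = StabilityTask()
y_true = [task.target(p) for p in proteins]
metrics = task.evaluate(y_true, [1, 0, 0])  # accuracy = 2/3
```

A task defined this way is immediately usable by any baseline model, since the model only ever sees inputs and labels and reports predictions back to `evaluate`.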
By standardizing the dataset and evaluation pipeline for structure-based protein deep learning, ProteinShake occupies a role analogous to benchmarks like ImageNet or GLUE in computer vision and NLP — establishing common ground for fair comparison. The NeurIPS 2023 paper's empirical finding that models generalize poorly to structurally novel proteins has direct implications for how the community should design training and evaluation sets, pushing toward harder, structure-based splits rather than the random or sequence-similarity splits commonly used at the time of publication. The framework has informed subsequent protein representation learning work and has spun off related projects including RNA-specific tooling (RNAGlib) and a broader biomolecule benchmark framework (bioverse). A key limitation is that ProteinShake focuses on single-chain structure classification tasks and does not currently cover protein-protein interactions, multi-chain assemblies, or dynamic conformational ensembles.
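The "harder, structure-based split" idea can be made concrete with a minimal sketch: rather than splitting proteins at random, hold out entire structural clusters so that every test protein is dissimilar to everything in training. The cluster assignments here are hypothetical toy data; in practice they would come from a structure-clustering tool.

```python
# Minimal sketch of a cluster-holdout (structure-based) split.
# Cluster labels are assumed given; the mapping below is toy data.
def cluster_holdout_split(protein_ids, cluster_of, test_clusters):
    """Partition proteins so whole clusters land in train or test."""
    train = [p for p in protein_ids if cluster_of[p] not in test_clusters]
    test = [p for p in protein_ids if cluster_of[p] in test_clusters]
    return train, test

cluster_of = {"1abc": "fold_A", "2xyz": "fold_A", "3def": "fold_B"}
train, test = cluster_holdout_split(list(cluster_of), cluster_of,
                                    test_clusters={"fold_B"})
# train = ['1abc', '2xyz'], test = ['3def']
```

Under a random split, members of `fold_A` could appear on both sides, letting a model score well by memorizing folds; the cluster holdout removes that shortcut and measures generalization to structurally novel proteins.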