Python framework for building standardized protein structure datasets and benchmarks, with pre-processed data from PDB and AlphaFoldDB for deep learning evaluation.
ProteinShake is a Python software framework developed by Tim Kucera, Carlos Oliver, Dexiong Chen, and Karsten Borgwardt at the Borgwardt Lab (ETH Zurich / Max Planck Institute) that standardizes the process of building datasets and benchmarks for deep learning on protein 3D structures. Published at NeurIPS 2023 in the Datasets and Benchmarks Track, it addresses a longstanding bottleneck in the field: the absence of a shared, reproducible foundation for evaluating structure-based protein models across diverse biological prediction tasks.
Before ProteinShake, researchers working on protein structure representation learning routinely implemented their own data processing pipelines, constructed their own train/validation/test splits, and selected their own evaluation metrics. This fragmentation made it nearly impossible to draw reliable conclusions about which model architectures and data representations actually performed best across different tasks. ProteinShake unifies this landscape by providing a single entry point to curated, pre-processed structural datasets paired with biologically meaningful prediction tasks and rigorous evaluation protocols.
The framework supports data sourced from the RCSB Protein Data Bank (PDB) and the AlphaFold Protein Structure Database (AlphaFoldDB), giving users access to both experimentally determined structures and high-quality predicted models. Installation is straightforward via pip install proteinshake, and datasets can be loaded in a single line of code, making the framework accessible to researchers with limited data engineering experience.
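The one-line loading pattern can be sketched as follows. This follows the usage documented in the ProteinShake README at publication time, but exact class and method names may differ across versions; the snippet is guarded so it degrades gracefully when the library or its data is unavailable.

```python
# Hedged sketch of typical ProteinShake usage; API names follow the
# project's documentation, but may vary by version.
try:
    from proteinshake.tasks import EnzymeClassTask

    # Load a pre-processed task: structures are downloaded and cached,
    # then converted to residue graphs (8 Angstrom edge threshold) in a
    # format consumable by PyTorch Geometric.
    task = EnzymeClassTask().to_graph(eps=8.0).pyg()
except Exception:
    # proteinshake not installed, or data not reachable in this
    # environment; the sketch is illustrative either way.
    task = None
```

The task object bundles the dataset with its train/test split and evaluation metric, so model code only needs to consume tensors and report predictions.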
ProteinShake is organized around three core abstractions: datasets (collections of structures with annotations), tasks (prediction objectives with associated metrics), and representations (format converters that transform raw structures into tensors suitable for deep learning). The preprocessing pipeline handles structure cleaning, residue-level annotation extraction, and graph construction (where each residue is a node and edges connect spatially proximal residues within a configurable distance threshold). Voxel and point cloud representations are constructed from the same cleaned coordinate data, ensuring that modality comparisons are not confounded by differences in preprocessing.
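The residue-graph construction described above can be illustrated with a short self-contained sketch. This is not ProteinShake's internal code, just the epsilon-neighborhood idea in plain Python: one node per residue, with an edge wherever two C-alpha atoms lie within the distance threshold.

```python
# Illustrative sketch of epsilon-neighborhood graph construction
# (not ProteinShake's actual implementation).
from math import dist

def residue_graph(ca_coords, eps=8.0):
    """Build an epsilon-neighborhood graph over residue coordinates.

    ca_coords: list of (x, y, z) C-alpha positions, one per residue.
    eps: distance threshold in Angstroms.
    Returns undirected edges (i, j) with i < j.
    """
    n = len(ca_coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if dist(ca_coords[i], ca_coords[j]) <= eps]

# Toy example: three residues on a line, 5 Angstroms apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
edges = residue_graph(coords, eps=8.0)  # (0, 1) and (1, 2), not (0, 2)
```

Because voxel and point cloud representations start from the same cleaned coordinates, swapping `residue_graph` for a voxelization or point sampling step changes only the final conversion, not the upstream pipeline.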
The benchmark experiments reported in the NeurIPS 2023 paper yielded three key findings: pre-training on large unlabeled structural datasets consistently improves downstream task performance; the optimal data representation (graph, voxel, or point cloud) varies by task and cannot be determined a priori; and current structure-based models generalize poorly to proteins that are structurally dissimilar to the training set. These findings were obtained by evaluating multiple GNN, 3D-CNN, and point cloud architectures across all provided tasks under identical experimental conditions.
ProteinShake is primarily a tool for researchers developing or evaluating protein structure representation learning methods. Machine learning researchers can use it to benchmark new architectures on a battery of biologically relevant tasks without assembling their own datasets, while computational biologists can use the pre-processed datasets as starting points for specialized fine-tuning or transfer learning experiments. The framework's flexible representation system also makes it suitable for ablation studies comparing 3D structural encodings, and its extensible task API enables groups developing new evaluation paradigms to contribute benchmarks that are immediately compatible with existing baseline models.
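The shape of the task abstraction can be sketched with a toy stand-in class. The class and annotation names here are hypothetical, not ProteinShake's actual base-class API; the point is the pattern a contributed benchmark follows: a target extractor paired with a task-appropriate metric.

```python
# Hypothetical stand-in illustrating the task pattern, not the real
# ProteinShake Task base class.
class StabilityTask:
    """Toy task: predict a binary per-protein label, scored by accuracy.
    Real ProteinShake tasks pair curated targets with metrics in the
    same spirit."""

    def target(self, protein):
        # Extract the ground-truth label from a protein record
        # (here, a plain dict with a hypothetical 'stable' annotation).
        return protein["stable"]

    def evaluate(self, y_true, y_pred):
        correct = sum(t == p for t, p in zip(y_true, y_pred))
        return {"accuracy": correct / len(y_true)}

proteins = [{"stable": 1}, {"stable": 0}, {"stable": 1}]
task = StabilityTask()
y_true = [task.target(p) for p in proteins]
metrics = task.evaluate(y_true, [1, 0, 0])  # accuracy = 2/3
```

A task defined this way is immediately usable by any baseline model, since the model only ever sees inputs and labels and reports predictions back to `evaluate`.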
By standardizing the dataset and evaluation pipeline for structure-based protein deep learning, ProteinShake occupies a role analogous to benchmarks like ImageNet or GLUE in computer vision and NLP — establishing common ground for fair comparison. The NeurIPS 2023 paper's empirical finding that models generalize poorly to structurally novel proteins has direct implications for how the community should design training and evaluation sets, pushing toward harder, structure-based splits rather than the random or sequence-similarity splits commonly used at the time of publication. The framework has informed subsequent protein representation learning work and has spun off related projects including RNA-specific tooling (RNAGlib) and a broader biomolecule benchmark framework (bioverse). A key limitation is that ProteinShake focuses on single-chain structure classification tasks and does not currently cover protein-protein interactions, multi-chain assemblies, or dynamic conformational ensembles.
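The "harder, structure-based split" idea can be made concrete with a minimal sketch: rather than splitting proteins at random, hold out entire structural clusters so that every test protein is dissimilar to everything in training. The cluster assignments here are hypothetical toy data; in practice they would come from a structure-clustering tool.

```python
# Minimal sketch of a cluster-holdout (structure-based) split.
# Cluster labels are assumed given; the mapping below is toy data.
def cluster_holdout_split(protein_ids, cluster_of, test_clusters):
    """Partition proteins so whole clusters land in train or test."""
    train = [p for p in protein_ids if cluster_of[p] not in test_clusters]
    test = [p for p in protein_ids if cluster_of[p] in test_clusters]
    return train, test

cluster_of = {"1abc": "fold_A", "2xyz": "fold_A", "3def": "fold_B"}
train, test = cluster_holdout_split(list(cluster_of), cluster_of,
                                    test_clusters={"fold_B"})
# train = ['1abc', '2xyz'], test = ['3def']
```

Under a random split, members of `fold_A` could appear on both sides, letting a model score well by memorizing folds; the cluster holdout removes that shortcut and measures generalization to structurally novel proteins.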