Open-source Python library for preprocessing protein structure data from PDB and SAbDab into curated datasets for deep learning, with built-in filtering and clustering.
ProteinFlow is an open-source Python library developed by Adaptyv Biosystems that standardizes the preprocessing of protein structure data for deep learning research. Released in September 2023, it addresses a persistent bottleneck in the field: the absence of a consistent, reproducible pipeline for transforming raw PDB and SAbDab entries into well-curated, feature-rich datasets. Without such tooling, researchers routinely build bespoke preprocessing scripts that differ in filtering logic, clustering strategy, and output format, which makes results difficult to compare and experiments hard to reproduce.
The library covers the full preprocessing workflow from data acquisition to dataset splitting. It downloads protein structures from the Protein Data Bank (PDB) or the Structural Antibody Database (SAbDab), applies configurable quality filters, clusters entries by sequence identity to prevent data leakage, extracts structural features, and partitions data into training, validation, and test sets. ProteinFlow supports all levels of protein organization (single chains, homomers, and heteromers) using PDB biounit definitions, making it applicable across a wide range of protein modeling tasks.
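The leakage-avoiding split described above can be illustrated with a minimal sketch. It assumes cluster assignments are already available (for example, from a sequence-identity clustering tool) and is not ProteinFlow's own implementation: the function name `split_by_cluster`, the split fractions, and the example IDs are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, valid_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole clusters to train/valid/test so no cluster spans two splits.

    cluster_of: dict mapping entry ID -> cluster ID (hypothetical input).
    """
    # Group entry IDs by their cluster.
    members = defaultdict(list)
    for entry_id, cluster_id in cluster_of.items():
        members[cluster_id].append(entry_id)

    # Shuffle clusters deterministically so the split is reproducible.
    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)

    # Fill test first, then valid, then train. Because clusters are kept
    # whole, the realized sizes only approximate the requested fractions.
    total = len(cluster_of)
    splits = {"train": [], "valid": [], "test": []}
    for cluster_id in clusters:
        if len(splits["test"]) < test_frac * total:
            target = "test"
        elif len(splits["valid"]) < valid_frac * total:
            target = "valid"
        else:
            target = "train"
        splits[target].extend(members[cluster_id])
    return splits


# Hypothetical example: 10 entries grouped into 5 clusters of 2.
example = {f"entry{i}": f"cluster{i // 2}" for i in range(10)}
splits = split_by_cluster(example, valid_frac=0.2, test_frac=0.2)
```

Splitting at the cluster level rather than the entry level is what keeps near-identical sequences from appearing on both sides of a train/test boundary.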
A key design goal is community standardization. The library ships with pre-computed datasets based on annual PDB release snapshots, using preprocessing parameters aligned with widely adopted community conventions. These ready-to-use benchmarking datasets lower the barrier to entry for groups who need a well-validated starting point without running the full pipeline themselves.
ProteinFlow is implemented as a Python library and can be installed via conda, pip, or Docker. The preprocessing pipeline proceeds through five stages: data acquisition from PDB or SAbDab, quality filtering based on configurable thresholds (resolution, sequence length, missing residues, experimental method), sequence-identity clustering using MMseqs2, structural feature extraction, and stratified dataset splitting. Key configurable parameters include resolution_thr for crystallographic quality, min_seq_id and max_seq_id for clustering bounds, missing_ends_thr for terminal residue tolerance, and pdb_snapshot for snapshot-specific reproducibility.
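As a rough illustration of the quality-filtering stage, the sketch below applies threshold checks to metadata records. Only the names `resolution_thr` and `missing_ends_thr` come from the text above; their exact meaning, the default values, the `min_length` and `max_length` fields, the record layout, and the function `passes_filters` are assumptions made for this example and are not taken from the library's API.

```python
from dataclasses import dataclass


@dataclass
class FilterConfig:
    # Illustrative values only; not the library's documented defaults.
    resolution_thr: float = 3.5    # maximum accepted resolution, in angstroms (assumed meaning)
    missing_ends_thr: float = 0.3  # maximum fraction of missing terminal residues (assumed meaning)
    min_length: int = 30           # hypothetical sequence-length bounds
    max_length: int = 10000


def passes_filters(entry, cfg):
    """Return True if a metadata record (hypothetical layout) meets every threshold."""
    return (
        entry["resolution"] <= cfg.resolution_thr
        and entry["missing_ends_frac"] <= cfg.missing_ends_thr
        and cfg.min_length <= entry["length"] <= cfg.max_length
    )


# Made-up records, one per failure mode plus one that passes.
entries = [
    {"id": "a", "resolution": 2.1, "missing_ends_frac": 0.05, "length": 250},
    {"id": "b", "resolution": 4.2, "missing_ends_frac": 0.05, "length": 250},  # resolution too poor
    {"id": "c", "resolution": 1.8, "missing_ends_frac": 0.50, "length": 250},  # too many missing ends
    {"id": "d", "resolution": 1.8, "missing_ends_frac": 0.00, "length": 12},   # too short
]
kept = [e["id"] for e in entries if passes_filters(e, FilterConfig())]
print(kept)
```

Keeping every threshold in one configuration object makes it easy to record the exact settings alongside a generated dataset, which is the kind of reproducibility the pipeline is meant to support.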
Processed datasets are stored as pickled Python dictionaries. Each entry contains backbone atom coordinates (crd_bb), sidechain coordinates (crd_sc), boolean residue masks (msk), amino acid sequences (seq), and CDR annotations (cdr) for antibody datasets. This compact, standardized output format is designed for direct consumption by downstream model training code. The library does not itself implement a neural network model; rather, it functions as data infrastructure enabling fair and reproducible comparisons across deep learning architectures.
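To make the output format concrete, the following sketch writes and reads back a small synthetic entry using only the standard library. The key names come from the description above; the nesting by chain ID, the number of atoms per residue, the mask semantics, and the file name are assumptions for illustration, and real files would be produced by the pipeline rather than built by hand.

```python
import pickle
import tempfile
from pathlib import Path

# Synthetic stand-in for one processed entry with two residues.
# Assumed layout: chain ID -> dict holding the keys named in the text.
entry = {
    "A": {
        "crd_bb": [[[0.0, 0.0, 0.0]] * 4, [[1.0, 1.0, 1.0]] * 4],    # per residue: 4 backbone atoms (assumed) x (x, y, z)
        "crd_sc": [[[0.0, 0.0, 0.0]] * 10, [[1.0, 1.0, 1.0]] * 10],  # per residue: sidechain atom slots (count assumed)
        "msk": [True, False],  # False marks a residue without usable coordinates (assumed)
        "seq": "AG",
    }
}

# Round-trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pickle"  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(entry, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Use the mask to keep only residues with usable coordinates.
for chain_id, chain in loaded.items():
    resolved = [aa for aa, ok in zip(chain["seq"], chain["msk"]) if ok]
    print(chain_id, len(chain["seq"]), "residues,", len(resolved), "resolved")
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from sources that are trusted.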
ProteinFlow is primarily aimed at computational biology researchers developing or benchmarking protein deep learning models. Groups working on protein sequence design can use it to generate structurally conditioned training sets. Structure prediction researchers can use its clustering and splitting tools to ensure their train/test partitions are non-redundant. Antibody engineering teams benefit from integrated SAbDab support and CDR annotations, which are essential for training models that operate on hypervariable loop regions. The pre-computed benchmark datasets are particularly useful for groups seeking to validate new methods against community-standard baselines without incurring the computational cost of running the full pipeline from scratch.
ProteinFlow fills a practical gap in the protein deep learning ecosystem by providing a shared, reproducible data preprocessing standard. Inconsistent preprocessing is a recognized source of incomparability between published models, and a community-adopted library with standardized output formats and benchmark splits directly reduces this problem. The library is maintained by Adaptyv Biosystems and is distributed under a permissive open-source license, lowering barriers to adoption in both academic and industrial settings. Its principal limitation is scope: ProteinFlow is a data preparation tool, not a model, and it does not perform any structural prediction or design itself. Its usefulness also depends on the quality and coverage of the underlying PDB and SAbDab databases, and it inherits any biases or gaps present in those sources.
Kozlova, E., Valentin, A., Khadhraoui, A., & Nakhaee-Zadeh Gutierrez, D. (2023). ProteinFlow: a Python Library to Pre-Process Protein Structure Data for Deep Learning Applications. bioRxiv, 2023.09.25.559346.
DOI: 10.1101/2023.09.25.559346