bio.rodeo

Protein

ProteinFlow

Adaptyvbio

Open-source Python library for preprocessing protein structure data from PDB and SAbDab into curated datasets for deep learning, with built-in filtering and clustering.

Released: 2023

Overview

ProteinFlow is an open-source Python library developed by Adaptyv Biosystems that standardizes the preprocessing of protein structure data for deep learning research. Released in September 2023, it addresses a persistent bottleneck in the field: the absence of a consistent, reproducible pipeline for transforming raw PDB and SAbDab entries into well-curated, feature-rich datasets. Without such tooling, researchers routinely build bespoke preprocessing scripts that differ in filtering logic, clustering strategy, and output format — making results difficult to compare and experiments hard to reproduce.

The library covers the full preprocessing workflow from data acquisition to dataset splitting. It downloads protein structures from the Protein Data Bank (PDB) or the Structural Antibody Database (SAbDab), applies configurable quality filters, clusters entries by sequence identity to prevent data leakage, extracts structural features, and partitions data into training, validation, and test sets. ProteinFlow supports all levels of protein organization — single chains, homomers, and heteromers — using Biounit PDB definitions, making it applicable across a wide range of protein modeling tasks.
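
The leakage-avoiding split described above can be illustrated with a minimal sketch. This is not ProteinFlow's implementation: the function name `split_by_cluster`, the input mapping from entry ID to cluster ID, and the 80/10/10 fractions are all assumptions made for illustration. The point it shows is that whole clusters, not individual entries, are assigned to a split.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/valid/test so no cluster spans two sets.

    cluster_of: dict mapping entry ID -> cluster ID (hypothetical input format).
    fractions:  target share of entries for train, valid and test.
    """
    # Group entry IDs by the cluster they belong to.
    members = defaultdict(list)
    for entry_id, cluster_id in cluster_of.items():
        members[cluster_id].append(entry_id)

    # Shuffle clusters reproducibly.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)

    names = ("train", "valid", "test")
    targets = dict(zip(names, (f * len(cluster_of) for f in fractions)))
    splits = {name: [] for name in names}

    for cluster_id in cluster_ids:
        # Put the cluster into the split that is furthest below its target size.
        name = max(names, key=lambda n: targets[n] - len(splits[n]))
        splits[name].extend(members[cluster_id])
    return splits


if __name__ == "__main__":
    toy = {f"entry{i}": i % 5 for i in range(20)}  # 20 entries in 5 clusters
    for name, ids in split_by_cluster(toy).items():
        print(name, len(ids))
```

Because a cluster is never divided, the realized split sizes only approximate the target fractions when clusters are large or few.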

A key design goal is community standardization. The library ships with pre-computed datasets based on annual PDB release snapshots, using preprocessing parameters aligned with widely adopted community conventions. These ready-to-use benchmarking datasets lower the barrier to entry for groups who need a well-validated starting point without running the full pipeline themselves.

Key Features

  • Multi-chain and multi-level support: Processes single-chain and multi-chain structures using Biounit PDB definitions, capturing all levels of protein organization from primary sequence through quaternary assembly.
  • Flexible featurization: Computes backbone atom coordinates (N, C, CA, O), sidechain coordinates, residue masks, secondary structure annotations, and torsion angles, with user-configurable feature combinations.
  • Sequence-identity clustering: Groups proteins by sequence similarity and uses these clusters to partition data into train/validation/test splits, minimizing leakage between sets.
  • SAbDab and CDR annotation support: Handles antibody-specific structures from SAbDab with automatic annotation of Complementarity-Determining Regions (CDRs) for antibody engineering applications.
  • Pre-computed benchmark datasets: Provides curated, ready-to-use datasets derived from annual PDB snapshots with community-standard preprocessing, enabling reproducible benchmarking across research groups.
  • Framework-compatible data loaders: Includes conversion utilities and loader integrations compatible with standard deep learning frameworks, enabling direct use in training pipelines.
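
As an illustration of the torsion-angle feature listed above, the following sketch computes backbone phi and psi angles from atom coordinates using the standard four-point dihedral formula. It is a generic geometry example, not ProteinFlow code: the function names and the per-residue dictionary layout are assumptions made for this sketch.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points (IUPAC sign convention)."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(residues, i):
    """Phi and psi (degrees) for an interior residue i.

    residues: list of dicts with "N", "CA" and "C" coordinates
              (a hypothetical layout chosen for this sketch).
    """
    if not 0 < i < len(residues) - 1:
        raise IndexError("phi/psi need both a previous and a next residue")
    prev, cur, nxt = residues[i - 1], residues[i], residues[i + 1]
    phi = dihedral(prev["C"], cur["N"], cur["CA"], cur["C"])
    psi = dihedral(cur["N"], cur["CA"], cur["C"], nxt["N"])
    return phi, psi
```

Terminal residues lack one of the two neighbours, so a real featurizer has to pad or mask those positions; the sketch simply raises an error for them.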

Technical Details

ProteinFlow is implemented as a Python library and can be installed via conda, pip, or Docker. The preprocessing pipeline proceeds through five stages: data acquisition from PDB or SAbDab, quality filtering based on configurable thresholds (resolution, sequence length, missing residues, experimental method), sequence-identity clustering using MMseqs2, structural feature extraction, and stratified dataset splitting. Key configurable parameters include resolution_thr for crystallographic quality, min_seq_id and max_seq_id for clustering bounds, missing_ends_thr for terminal residue tolerance, and pdb_snapshot for snapshot-specific reproducibility.
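
The quality-filtering stage can be pictured with a small sketch. Only the parameter names `resolution_thr` and `missing_ends_thr` come from the description above; the `FilterConfig` class, the `min_length` and `max_length` fields, the numeric values, and the entry layout are hypothetical and do not reflect the library's actual interface or defaults.

```python
from dataclasses import dataclass


@dataclass
class FilterConfig:
    resolution_thr: float = 3.5    # maximum accepted resolution in angstroms (illustrative)
    min_length: int = 30           # hypothetical name: shortest accepted chain
    max_length: int = 10_000       # hypothetical name: longest accepted chain
    missing_ends_thr: float = 0.3  # maximum fraction of residues missing at the termini


def missing_ends_fraction(mask):
    """Fraction of residues that are unresolved at the two ends of a chain.

    mask: list of booleans, True where the residue has coordinates.
    """
    n = len(mask)
    lead = next((i for i, ok in enumerate(mask) if ok), n)
    if lead == n:  # empty chain or nothing resolved
        return 1.0
    trail = next(i for i, ok in enumerate(reversed(mask)) if ok)
    return (lead + trail) / n


def passes_filters(entry, cfg):
    """Apply resolution, length and missing-ends checks to one entry."""
    return (
        entry["resolution"] <= cfg.resolution_thr
        and cfg.min_length <= len(entry["mask"]) <= cfg.max_length
        and missing_ends_fraction(entry["mask"]) <= cfg.missing_ends_thr
    )


if __name__ == "__main__":
    cfg = FilterConfig()
    entry = {"resolution": 2.1, "mask": [False] * 2 + [True] * 38}
    print(passes_filters(entry, cfg))
```

Keeping the thresholds in one configuration object makes it straightforward to record them alongside a generated dataset, which is what makes a preprocessing run reproducible.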

Processed datasets are stored as pickled Python dictionaries. Each entry contains backbone atom coordinates (crd_bb), sidechain coordinates (crd_sc), boolean residue masks (msk), amino acid sequences (seq), and CDR region annotations (cdr) for antibody datasets. This compact, standardized output format is designed for direct consumption by downstream model training code. The library does not itself implement a neural network model; rather, it functions as data infrastructure enabling fair and reproducible comparisons across deep learning architectures.
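
To make the output format concrete, here is a minimal sketch that round-trips a toy entry through `pickle` and uses the residue mask. The key names `crd_bb`, `crd_sc`, `msk` and `seq` come from the description above; everything else is an assumption for illustration, including the one-row-per-residue layout, the use of plain lists in place of arrays, and the helper names. The nesting of real files may differ.

```python
import pickle

REQUIRED_KEYS = ("crd_bb", "crd_sc", "msk", "seq")


def check_entry(entry):
    """Verify the expected keys exist and agree on the number of residues."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise KeyError(f"entry is missing keys: {missing}")
    n_residues = len(entry["seq"])
    for key in ("crd_bb", "crd_sc", "msk"):
        if len(entry[key]) != n_residues:
            raise ValueError(f"{key} has {len(entry[key])} rows, expected {n_residues}")
    return n_residues


def resolved_sequence(entry):
    """Sequence with unresolved residues (mask False) replaced by '-'."""
    return "".join(aa if ok else "-" for aa, ok in zip(entry["seq"], entry["msk"]))


# Toy entry: three residues, four backbone atoms each, middle residue unresolved.
toy_entry = {
    "seq": "MKV",
    "msk": [True, False, True],
    "crd_bb": [[[float(i), 0.0, 0.0]] * 4 for i in range(3)],
    "crd_sc": [[] for _ in range(3)],  # side-chain coordinates omitted in this toy case
}

# Serialize and restore in memory, standing in for writing and reading a file.
loaded = pickle.loads(pickle.dumps(toy_entry))

if __name__ == "__main__":
    print(check_entry(loaded), resolved_sequence(loaded))
```

Unpickling can execute arbitrary code, so pickled datasets should only be loaded from sources you trust.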

Applications

ProteinFlow is primarily aimed at computational biology researchers developing or benchmarking protein deep learning models. Groups working on protein sequence design can use it to generate structurally conditioned training sets. Structure prediction researchers can use its clustering and splitting tools to ensure their train/test partitions are non-redundant. Antibody engineering teams benefit from integrated SAbDab support and CDR annotations, which are essential for training models that operate on hypervariable loop regions. The pre-computed benchmark datasets are particularly useful for groups seeking to validate new methods against community-standard baselines without incurring the computational cost of running the full pipeline from scratch.

Impact

ProteinFlow fills a practical gap in the protein deep learning ecosystem by providing a shared, reproducible data preprocessing standard. Inconsistent preprocessing is a recognized source of incomparability between published models, and a community-adopted library with standardized output formats and benchmark splits directly reduces this problem. The library is maintained by Adaptyv Biosystems and is distributed under a permissive open-source license, lowering barriers to adoption in both academic and industrial settings. Its principal limitation is scope: ProteinFlow is a data preparation tool, not a model, and it does not perform any structural prediction or design itself. Its usefulness is therefore contingent on the quality and coverage of the underlying PDB and SAbDab databases, and it inherits any biases or gaps present in those sources.

Citation

ProteinFlow: a Python Library to Pre-Process Protein Structure Data for Deep Learning Applications

Preprint

Kozlova, E., Valentin, A., Khadhraoui, A., & Nakhaee-Zadeh Gutierrez, D. (2023). ProteinFlow: a Python Library to Pre-Process Protein Structure Data for Deep Learning Applications. bioRxiv, 2023.09.25.559346.

DOI: 10.1101/2023.09.25.559346

Metrics

GitHub

Stars: 277
Forks: 17
Open Issues: 2
Contributors: 5
Last Push: 2y ago
Language: Python
License: BSD-3-Clause

Citations

Total Citations: 7
Influential: 0
References: 69

Tags

preprocessing

Resources

GitHub Repository
Research Paper