Technical University of Berlin
A hierarchical three-stage cascade that generates chemically modified and macrocyclic peptides in HELM notation, supporting de novo design and constrained infilling.
PepForge is a generative deep learning platform for designing chemically modified and macrocyclic peptides, developed by Qingxin Wang and Roderich D. Süssmuth at the Technical University of Berlin and posted to bioRxiv in June 2026. Unlike protein language models that operate on the 20 canonical amino acids, PepForge generates molecules in HELM (Hierarchical Editing Language for Macromolecules) notation, the industry-standard representation that captures non-canonical monomers, side-chain modifications, branch points, and the non-linear connections that define cyclic and stapled peptides.
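To make the notation concrete, the following minimal sketch splits a HELM string into its polymer and connection sections. The `parse_helm` helper, the example string, and the `[dA]` monomer symbol (assumed here to denote D-alanine) are illustrative and not taken from the PepForge code base. The parser is deliberately naive: it ignores inline SMILES, nested brackets, polymer groups, and annotations.

```python
import re

def parse_helm(helm: str) -> dict:
    """Naively split a HELM string into polymers and connections (illustrative only)."""
    # HELM sections are separated by '$': polymers, connections, then further sections.
    sections = helm.split("$")

    polymers = {}
    for match in re.finditer(r"(\w+)\{([^}]*)\}", sections[0]):
        polymer_id, body = match.groups()
        # Monomers are dot-separated; multi-character symbols are wrapped in brackets.
        polymers[polymer_id] = [m.strip("[]") for m in body.split(".")]

    connections = []
    if len(sections) > 1 and sections[1]:
        for entry in sections[1].split("|"):
            source, target, bond = entry.split(",")
            left, right = bond.split("-")
            src_pos, src_r = left.split(":")
            dst_pos, dst_r = right.split(":")
            connections.append((source, int(src_pos), src_r, target, int(dst_pos), dst_r))

    return {"polymers": polymers, "connections": connections}

# Hypothetical four-residue peptide closed by a side-chain (R3-R3) bond between residues 1 and 4.
example = "PEPTIDE1{C.[dA].G.C}$PEPTIDE1,PEPTIDE1,1:R3-4:R3$$$V2.0"
parsed = parse_helm(example)
print(parsed["polymers"])
print(parsed["connections"])
```

The connection entry is what a plain amino-acid sequence cannot carry: it records which residues and which attachment points are bonded, independently of the linear monomer order.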
The central problem PepForge addresses is that therapeutic peptide chemistry is fundamentally graph-structured: macrocycles, disulfide bridges, and chemical staples cannot be expressed as a simple linear sequence, which limits sequence-only generators. PepForge tackles this by decomposing HELM molecule generation into three sequential sub-problems—structural layout, monomer content, and inter-residue connections—each handled by a specialized model in a cascade. This hierarchical factorization lets the system build chemically valid, topologically complex peptides while keeping each stage tractable.
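The three-stage factorization can be sketched as a pipeline of functions that each fill in one part of a shared draft. Everything below is a hypothetical stand-in: the `Draft` container, the `toy_*` stages, and `to_helm` are hard-coded placeholders for the learned Layout, Content, and Connection models, and none of the names come from the PepForge repository.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (chain index, position, R-group, chain index, position, R-group); positions are 1-based.
Bond = Tuple[int, int, str, int, int, str]

@dataclass
class Draft:
    chain_lengths: List[int] = field(default_factory=list)   # stage 1: layout
    monomers: List[List[str]] = field(default_factory=list)  # stage 2: content
    bonds: List[Bond] = field(default_factory=list)          # stage 3: connections

def toy_layout(draft: Draft) -> Draft:
    # Placeholder for a layout model: one chain of five residues.
    draft.chain_lengths = [5]
    return draft

def toy_content(draft: Draft) -> Draft:
    # Placeholder for a content model: fill each slot from a fixed pattern.
    pattern = ["C", "A", "G", "K", "C"]
    draft.monomers = [[pattern[i % len(pattern)] for i in range(n)]
                      for n in draft.chain_lengths]
    return draft

def toy_connection(draft: Draft) -> Draft:
    # Placeholder for a connection model: bridge the first and last cysteine.
    for chain_idx, chain in enumerate(draft.monomers):
        cys = [i + 1 for i, m in enumerate(chain) if m == "C"]
        if len(cys) >= 2:
            draft.bonds.append((chain_idx, cys[0], "R3", chain_idx, cys[-1], "R3"))
    return draft

def run_cascade(stages: List[Callable[[Draft], Draft]]) -> Draft:
    draft = Draft()
    for stage in stages:  # each stage conditions on the output of the previous one
        draft = stage(draft)
    return draft

def to_helm(draft: Draft) -> str:
    polymers = "|".join("PEPTIDE%d{%s}" % (i + 1, ".".join(chain))
                        for i, chain in enumerate(draft.monomers))
    connections = "|".join(f"PEPTIDE{a + 1},PEPTIDE{c + 1},{p}:{r}-{q}:{s}"
                           for a, p, r, c, q, s in draft.bonds)
    return f"{polymers}${connections}$$$V2.0"

helm = to_helm(run_cascade([toy_layout, toy_content, toy_connection]))
print(helm)
```

Because each stage only reads what earlier stages wrote, any one of them can be swapped or constrained without touching the others, which is the practical appeal of the factorization.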
PepForge is distinct from sequence-centric peptide models such as PeptideCLM-2 and from mass-spec-oriented sequence-to-sequence tools. It is a generative design engine purpose-built for the modified-peptide chemical space, with seven pretrained checkpoints released publicly so users can sample new molecules without retraining.
PepForge was trained on a corpus of 383,817 unique HELM peptides, deduplicated by InChIKey and drawn from PubChem, CycPeptMPDB, ChEMBL, DBAASP, UniProt, and MacrocycleDB. The corpus covers 425 monomer types and is split roughly 307k/38k/38k for train/validation/test. The Layout stage is a compact GPT (d=64, 1 layer, test perplexity 2.24); the Content stage offers a GPT (d=768, 12 layers, PPL 6.61) and a BERT (d=768, 12 layers, PPL 9.15); the Connection stage is a graph attention network (GAT, d=768, 6 layers) reaching 0.971 edge-existence F1 and 0.912 macro-F1 on bond-type classification. A separate antimicrobial-peptide ensemble (LSTM and GCN members over SMILES and HELM inputs) supports activity scoring. The full checkpoint set is roughly 5.5 GB. Code is MIT-licensed; weights and datasets are released under CC-BY-4.0.
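For readers less familiar with the reported metrics, the sketch below shows the standard definitions of perplexity and macro-averaged F1 in plain Python. The helper names, the bond-type class labels, and the toy counts are assumptions made for illustration. The source does not state exactly how each number was computed, and a masked model such as BERT is often evaluated with a differently defined perplexity than an autoregressive GPT.

```python
import math
from typing import Dict, List, Tuple

def perplexity(token_nlls: List[float]) -> float:
    """Perplexity as exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = [f1(*c) for c in counts.values()]
    return sum(scores) / len(scores)

# Convert the reported perplexities into average nats per token.
reported_ppl = {"layout_gpt": 2.24, "content_gpt": 6.61, "content_bert": 9.15}
nats_per_token = {name: math.log(ppl) for name, ppl in reported_ppl.items()}

# Toy (tp, fp, fn) counts for two hypothetical bond-type classes.
toy_counts = {"backbone": (8, 2, 0), "disulfide": (3, 0, 3)}
toy_macro = macro_f1(toy_counts)
print(nats_per_token)
print(round(toy_macro, 3))
```

Read this way, a perplexity of 2.24 corresponds to roughly 0.81 nats of uncertainty per layout token, while the content models face a much larger effective vocabulary of monomers.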
PepForge targets medicinal chemists and peptide-therapeutics researchers who need to explore the modified-peptide design space beyond natural amino acids. It can propose novel macrocyclic or stapled scaffolds de novo, complete partial designs through masked infilling, and enforce hard constraints—fixed warhead monomers, required ring connections, or backbone layouts—during generation. Coupled with its antimicrobial-activity and ADMET predictors and an external-predictor plugin system, it supports a generate-then-score loop for antibacterial peptide discovery and other therapeutic campaigns, accessible via scripts or an included FastAPI/React web interface.
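A generate-then-score loop with hard constraints can be sketched as follows. This is a toy under stated assumptions: `infill` replaces the learned infilling model with uniform random sampling, `toy_score` (the fraction of cationic residues) stands in for an activity predictor, and the template, alphabet, and function names are hypothetical rather than part of PepForge's interface.

```python
import random
from typing import List, Optional, Tuple

def infill(template: List[Optional[str]], rng: random.Random,
           alphabet: List[str]) -> List[str]:
    # Fixed slots are copied verbatim (the hard constraint); None slots are sampled.
    return [slot if slot is not None else rng.choice(alphabet) for slot in template]

def toy_score(monomers: List[str]) -> float:
    # Stand-in for a predictor: fraction of lysine/arginine residues.
    return sum(m in {"K", "R"} for m in monomers) / len(monomers)

def generate_then_score(template: List[Optional[str]], n_samples: int,
                        top_k: int, seed: int = 0) -> List[Tuple[float, List[str]]]:
    rng = random.Random(seed)
    alphabet = ["A", "G", "K", "L", "R", "W"]
    # Sample candidates and drop exact duplicates.
    unique = {tuple(infill(template, rng, alphabet)) for _ in range(n_samples)}
    scored = [(toy_score(list(c)), list(c)) for c in unique]
    # Highest score first; ties are broken by monomer order for reproducibility.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]

# The two terminal cysteines and the central lysine are fixed; the rest is infilled.
template = ["C", None, None, "K", None, None, None, "C"]
hits = generate_then_score(template, n_samples=200, top_k=5)
for score, monomers in hits:
    print(f"{score:.2f}", ".".join(monomers))
```

Keeping the constraint inside the sampler, rather than filtering afterwards, means every candidate passed to the scorer already satisfies it, so no scoring budget is spent on invalid designs.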
PepForge contributes a structured, graph-aware approach to a domain that has been difficult for sequence-only generative models: the chemically modified and macrocyclic peptides that dominate modern peptide drug discovery. By releasing the HELM corpus, monomer libraries, seven pretrained checkpoints, and a constraint-driven generation interface, it lowers the barrier to designing non-canonical peptides without bespoke training infrastructure. As a recent preprint, its real-world adoption and experimentally validated hit rates remain to be established, and the reported metrics are intrinsic generation-quality benchmarks rather than wet-lab outcomes. Even so, the explicit modeling of layout, content, and connection offers a reusable template for chemically aware biomolecular generation.
Wang, Q. & Süssmuth, R. (2026) PepForge: Hierarchical HELM-Based Peptide Generation. bioRxiv.
DOI: 10.64898/2026.05.29.728379