PepForge

Generative model for chemically modified and macrocyclic peptides that builds molecules in HELM notation, supporting de novo design and infilling.

Released: June 2026

PepForge is a generative deep learning platform for designing chemically modified and macrocyclic peptides, developed by Qingxin Wang and Roderich D. Süssmuth at the Technical University of Berlin and posted to bioRxiv in June 2026. Unlike protein language models that operate on the 20 canonical amino acids, PepForge generates molecules in HELM (Hierarchical Editing Language for Macromolecules) notation, the industry-standard representation that captures non-canonical monomers, side-chain modifications, branch points, and the non-linear connections that define cyclic and stapled peptides.

The central problem PepForge addresses is that therapeutic peptide chemistry is fundamentally graph-structured: macrocycles, disulfide bridges, and chemical staples cannot be expressed as a simple linear sequence, which limits what sequence-only generators can produce. PepForge decomposes HELM molecule generation into three sequential sub-problems (structural layout, monomer content, and inter-residue connections), each handled by a specialized model in a cascade. This hierarchical factorization lets the system build chemically valid, topologically complex peptides while keeping each stage tractable.
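To make the three sub-problems concrete, the following minimal sketch splits a HELM string into a layout (polymer blocks and their lengths), content (the monomers in each block), and connections. It is illustrative only: `split_helm`, the `Connection` type, and the example peptide are hypothetical names and data, not PepForge's API, and treating "layout" as block names plus lengths is an assumption about what the block-level template contains. The parser covers only the simple-polymer and connection sections of straightforward HELM strings (no inline SMILES, repeats, or ambiguity syntax).

```python
import re
from typing import NamedTuple


class Connection(NamedTuple):
    """One HELM connection entry, e.g. PEPTIDE1,PEPTIDE1,1:R3-5:R3."""
    source_polymer: str
    target_polymer: str
    source_position: int  # 1-based monomer index within the source polymer
    source_rgroup: str
    target_position: int  # 1-based monomer index within the target polymer
    target_rgroup: str


# Matches entries such as "PEPTIDE1,PEPTIDE1,1:R3-5:R3".
_CONNECTION_RE = re.compile(r"^(\w+),(\w+),(\d+):(R\d+)-(\d+):(R\d+)$")


def split_helm(helm: str):
    """Split a simple HELM string into (layout, content, connections)."""
    sections = helm.split("$")
    polymer_section = sections[0]
    connection_section = sections[1] if len(sections) > 1 else ""

    # Content: monomer list per polymer block, e.g. {"PEPTIDE1": ["C", "A", ...]}.
    content = {}
    for block in polymer_section.split("|"):
        name, body = block.split("{", 1)
        content[name] = body.rstrip("}").split(".")

    # Connections: non-backbone bonds between attachment points (R-groups).
    connections = []
    for entry in connection_section.split("|"):
        if not entry:
            continue
        match = _CONNECTION_RE.match(entry)
        if match is None:
            raise ValueError(f"unsupported connection entry: {entry!r}")
        src, dst, i, r_i, j, r_j = match.groups()
        connections.append(Connection(src, dst, int(i), r_i, int(j), r_j))

    # Layout (assumed here): the block names and their lengths.
    layout = [(name, len(monomers)) for name, monomers in content.items()]
    return layout, content, connections


# Hypothetical example: a five-residue peptide with a Cys1-Cys5 side-chain bridge.
example = "PEPTIDE1{C.A.[dF].G.C}$PEPTIDE1,PEPTIDE1,1:R3-5:R3$$$"
layout, content, connections = split_helm(example)
```

The connection section is what a plain amino-acid sequence cannot carry: removing it from the example leaves a linear peptide with the same monomers but no ring.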

PepForge is distinct from sequence-centric peptide models such as PeptideCLM-2 and from mass-spec-oriented sequence-to-sequence tools. It is a generative design engine purpose-built for the modified-peptide chemical space, with seven pretrained checkpoints released publicly so users can sample new molecules without retraining.

Key Features

Three-stage hierarchical cascade: A Layout model generates the block-level template, a Content model fills in the monomer sequence, and a Connection model predicts inter-residue bonds, decomposing complex HELM generation into manageable steps.
HELM-native macrocycle support: The Connection stage performs binary edge classification on molecular graphs with R-group constraint enforcement, enabling valid cyclic, branched, and chemically modified topologies that linear generators cannot represent.
Flexible generation modes: Supports unconditional de novo generation, masked infilling of partial designs, and multi-level constrained generation where users fix the layout, specific monomers, or connections—all from the same frozen checkpoints.
Dual content models: Ships both an autoregressive GPT-style content model and a BERT-style masked model, giving users a choice between sequential sampling and bidirectional infilling.
Optional activity prediction: An accompanying antimicrobial-peptide ensemble and ADMET predictors (hemolysis, toxicity, half-life) can score generated candidates within the same workflow.
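One way to picture R-group constraint enforcement on top of binary edge classification is sketched below. This is an assumption about how such a constraint could be applied, not the method PepForge uses: candidate bonds are visited from most to least probable, and a bond is kept only if both of its attachment points are still unused. The function name `select_connections`, the 0.5 threshold, and the input format are hypothetical.

```python
def select_connections(candidates, free_rgroups, threshold=0.5):
    """Greedily accept predicted bonds without reusing any attachment point.

    candidates: list of (probability, (position, rgroup), (position, rgroup)).
    free_rgroups: dict mapping monomer position -> set of unused R-group labels.
    """
    # Copy so the caller's bookkeeping is not mutated.
    available = {pos: set(groups) for pos, groups in free_rgroups.items()}
    accepted = []
    for prob, end_a, end_b in sorted(candidates, key=lambda c: c[0], reverse=True):
        if prob < threshold:
            break  # remaining candidates are even less probable
        if end_a == end_b:
            continue  # an attachment point cannot bond to itself
        (pos_a, r_a), (pos_b, r_b) = end_a, end_b
        if r_a in available.get(pos_a, ()) and r_b in available.get(pos_b, ()):
            available[pos_a].discard(r_a)
            available[pos_b].discard(r_b)
            accepted.append((end_a, end_b))
    return accepted


# Hypothetical scores for three possible side-chain bridges among residues 1, 3, 5.
candidates = [
    (0.90, (1, "R3"), (5, "R3")),
    (0.80, (1, "R3"), (3, "R3")),  # conflicts with the first bond at 1:R3
    (0.30, (3, "R3"), (5, "R3")),  # below threshold
]
free = {1: {"R3"}, 3: {"R3"}, 5: {"R3"}}
bonds = select_connections(candidates, free)
```

Here only the 1:R3 to 5:R3 bridge survives: the second candidate would reuse an attachment point that is already consumed, and the third falls below the threshold.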

Technical Details

PepForge was trained on a corpus of 383,817 unique HELM peptides, deduplicated by InChIKey and drawn from PubChem, CycPeptMPDB, ChEMBL, DBAASP, UniProt, and MacrocycleDB. The corpus covers 425 monomer types and is split roughly 307k/38k/38k into train, validation, and test sets. The Layout stage is a compact GPT (d=64, 1 layer) with a test perplexity of 2.24. The Content stage offers a GPT (d=768, 12 layers, perplexity 6.61) and a BERT (d=768, 12 layers, perplexity 9.15). The Connection stage is a graph attention network (GAT; d=768, 6 layers) that reaches 0.971 F1 on edge existence and 0.912 macro-F1 on bond-type classification. A separate antimicrobial-peptide ensemble (LSTM and GCN members over SMILES and HELM inputs) supports activity scoring. The full checkpoint set is roughly 5.5 GB. Code is MIT-licensed; weights and datasets are released under CC-BY-4.0.
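For readers less familiar with the reported metrics, the sketch below shows the standard definitions of perplexity, F1, and macro-F1. It assumes the conventional formulas (natural-log likelihoods, unweighted class averaging); the source does not state exactly how PepForge computes them, and the function names and toy numbers are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1; each item is a (tp, fp, fn) tuple."""
    return sum(f1_score(*counts) for counts in per_class_counts) / len(per_class_counts)


# Toy numbers: a model that assigns probability 0.5 to every token has perplexity 2.
toy_ppl = perplexity([math.log(0.5)] * 4)
toy_macro = macro_f1([(8, 2, 2), (1, 1, 1)])
```

Under this definition a perplexity of 2.24 means the model is, on average, about as uncertain as a uniform choice among 2.24 tokens. Macro-F1 weights every bond type equally, so rare bond types count as much as common ones.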

Applications

PepForge targets medicinal chemists and peptide-therapeutics researchers who need to explore the modified-peptide design space beyond natural amino acids. It can propose novel macrocyclic or stapled scaffolds de novo, complete partial designs through masked infilling, and enforce hard constraints (fixed warhead monomers, required ring connections, or backbone layouts) during generation. Coupled with its antimicrobial-activity and ADMET predictors and an external-predictor plugin system, it supports a generate-then-score loop for antibacterial peptide discovery and other therapeutic campaigns. The workflow is accessible via scripts or an included FastAPI/React web interface.
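A generate-then-score loop of the kind described above can be sketched as follows. Everything here is a stand-in: `generate_then_score`, the toy candidates, and the toy scorers are hypothetical and do not reflect PepForge's scripts, predictors, or plugin interface. The sketch only shows the control flow of deduplicating sampled HELM strings, scoring each with several predictors, and keeping the best-ranked candidates.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Scorer = Callable[[str], float]


def generate_then_score(
    candidates: Iterable[str],
    scorers: Dict[str, Scorer],
    rank_key: Callable[[Dict[str, float]], float],
    top_k: int,
) -> List[Tuple[str, Dict[str, float]]]:
    """Score unique candidates with every scorer and return the top_k by rank_key."""
    seen = set()
    scored = []
    for helm in candidates:
        if helm in seen:
            continue  # skip exact duplicate strings
        seen.add(helm)
        scored.append((helm, {name: fn(helm) for name, fn in scorers.items()}))
    scored.sort(key=lambda item: rank_key(item[1]), reverse=True)
    return scored[:top_k]


# Toy stand-ins for sampled molecules and for activity / ADMET predictors.
toy_candidates = [
    "PEPTIDE1{K.L.K.L}$$$$",
    "PEPTIDE1{K.K.K.K}$$$$",
    "PEPTIDE1{K.L.K.L}$$$$",  # duplicate of the first candidate
    "PEPTIDE1{L.L.L.L}$$$$",
]
toy_scorers = {
    "activity": lambda helm: helm.count("K") / 4,   # placeholder, not a real model
    "hemolysis": lambda helm: helm.count("L") / 4,  # placeholder, not a real model
}
top = generate_then_score(
    toy_candidates,
    toy_scorers,
    rank_key=lambda s: s["activity"] - s["hemolysis"],
    top_k=2,
)
```

Passing the ranking rule as `rank_key` keeps the trade-off between predicted activity and predicted liabilities explicit and easy to change per campaign.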

Impact

PepForge contributes a structured, graph-aware approach to a domain that has been difficult for sequence-only generative models: chemically modified and macrocyclic peptides that dominate modern peptide drug discovery. By releasing the HELM corpus, monomer libraries, seven pretrained checkpoints, and a constraint-driven generation interface, it lowers the barrier to designing non-canonical peptides without bespoke training infrastructure. Because PepForge is a recent preprint, its real-world adoption and experimentally validated hit rates remain to be established, and the reported metrics are intrinsic generation-quality benchmarks rather than wet-lab outcomes. Even so, the explicit modeling of layout, content, and connection offers a reusable template for chemically aware biomolecular generation.

Citation

PepForge: Hierarchical HELM-Based Peptide Generation

Wang, Q. & Süssmuth, R. (2026) PepForge: Hierarchical HELM-Based Peptide Generation. bioRxiv.

DOI: 10.64898/2026.05.29.728379


Citations

Total Citations: 0
Influential: 0
References: 39

GitHub

Stars: 4
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 1mo ago
Language: Jupyter Notebook
License: MIT

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 1mo ago
Pipeline: text-generation


Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall: 94 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 92
Model Openness Framework: Class II (Open Tooling)
