bio.rodeo

Protein · Small molecule

PepForge

Technical University of Berlin

A hierarchical three-stage cascade that generates chemically modified and macrocyclic peptides in HELM notation, supporting de novo design and constrained infilling.

Released: June 2026

PepForge is a generative deep learning platform for designing chemically modified and macrocyclic peptides, developed by Qingxin Wang and Roderich D. Süssmuth at the Technical University of Berlin and posted to bioRxiv in June 2026. Unlike protein language models that operate on the 20 canonical amino acids, PepForge generates molecules in HELM (Hierarchical Editing Language for Macromolecules) notation, the industry-standard representation that captures non-canonical monomers, side-chain modifications, branch points, and the non-linear connections that define cyclic and stapled peptides.
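
To make the notation concrete, the sketch below parses a small HELM string into its monomer list and connection list. It is a simplified illustration written for this page, not PepForge code: the `parse_helm` and `split_monomers` helpers and the `Connection` type are hypothetical, and the parser handles only simple polymers and plain connections, ignoring inline SMILES monomers, polymer groups, and annotations.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """One HELM connection between two attachment points (hypothetical helper type)."""
    source_polymer: str
    target_polymer: str
    source_position: int  # 1-based monomer position
    source_rgroup: str    # e.g. "R3" for a side-chain attachment point
    target_position: int
    target_rgroup: str


def split_monomers(body: str) -> list[str]:
    """Split a polymer body on dots, keeping [bracketed] multi-letter IDs whole."""
    return [m.strip("[]") for m in re.findall(r"\[[^\]]+\]|[^.\[\]]+", body)]


def parse_helm(helm: str) -> tuple[dict[str, list[str]], list[Connection]]:
    """Parse the first two '$'-separated sections: simple polymers and connections."""
    polymers_part, connections_part = helm.split("$")[:2]

    polymers = {}
    for chunk in polymers_part.split("|"):
        name, body = re.fullmatch(r"(\w+)\{(.*)\}", chunk).groups()
        polymers[name] = split_monomers(body)

    connections = []
    for chunk in filter(None, connections_part.split("|")):
        src, dst, bond = chunk.split(",")
        left, right = bond.split("-")
        src_pos, src_r = left.split(":")
        dst_pos, dst_r = right.split(":")
        connections.append(
            Connection(src, dst, int(src_pos), src_r, int(dst_pos), dst_r)
        )
    return polymers, connections


# Illustrative string: a five-residue peptide with a D-phenylalanine ("dF")
# and a disulfide bridge between the side chains (R3) of Cys1 and Cys5.
example = "PEPTIDE1{C.[dF].A.K.C}$PEPTIDE1,PEPTIDE1,1:R3-5:R3$$$V2.0"
polymers, connections = parse_helm(example)
```

The connection section is what a plain amino-acid string cannot carry: the ring closure is stored as an explicit edge between two named attachment points rather than being implied by residue order.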

The central problem PepForge addresses is that therapeutic peptide chemistry is graph-structured: macrocycles, disulfide bridges, and chemical staples cannot be expressed as a simple linear sequence, so sequence-only generators cannot represent them. PepForge tackles this by decomposing HELM molecule generation into three sequential sub-problems (structural layout, monomer content, and inter-residue connections), each handled by a specialized model in a cascade. This hierarchical factorization lets the system build chemically valid, topologically complex peptides while keeping each stage tractable.

PepForge is distinct from sequence-centric peptide models such as PeptideCLM-2 and from mass-spec-oriented sequence-to-sequence tools. It is a generative design engine purpose-built for the modified-peptide chemical space, with seven pretrained checkpoints released publicly so users can sample new molecules without retraining.

#Key Features

  • Three-stage hierarchical cascade: A Layout model generates the block-level template, a Content model fills in the monomer sequence, and a Connection model predicts inter-residue bonds, decomposing complex HELM generation into manageable steps.
  • HELM-native macrocycle support: The Connection stage performs binary edge classification on molecular graphs with R-group constraint enforcement, enabling valid cyclic, branched, and chemically modified topologies that linear generators cannot represent.
  • Flexible generation modes: Supports unconditional de novo generation, masked infilling of partial designs, and multi-level constrained generation where users fix the layout, specific monomers, or connections, all from the same frozen checkpoints.
  • Dual content models: Ships both an autoregressive GPT-style content model and a BERT-style masked model, giving users a choice between sequential sampling and bidirectional infilling.
  • Optional activity prediction: An accompanying antimicrobial-peptide ensemble and ADMET predictors (hemolysis, toxicity, half-life) can score generated candidates within the same workflow.
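
One plausible way to combine binary edge classification with R-group constraints is a greedy filter over the classifier's scored candidate edges. The sketch below is an assumption about how such enforcement could work, not the method described in the preprint: the `CandidateEdge` type, the `enforce_rgroup_constraints` function, the 0.5 threshold, and the example probabilities are all hypothetical.

```python
from typing import NamedTuple


class CandidateEdge(NamedTuple):
    src: tuple[int, str]  # (1-based residue position, R-group label)
    dst: tuple[int, str]
    prob: float           # edge-existence probability from a classifier


def enforce_rgroup_constraints(
    candidates: list[CandidateEdge],
    available: dict[int, set[str]],
    threshold: float = 0.5,
) -> list[CandidateEdge]:
    """Greedily accept confident edges whose attachment points are free.

    An edge is rejected if either end names an R-group the monomer does not
    expose, or an attachment point already consumed by a stronger edge.
    """
    used: set[tuple[int, str]] = set()
    accepted = []
    for edge in sorted(candidates, key=lambda e: e.prob, reverse=True):
        if edge.prob < threshold:
            break  # remaining edges are even less confident
        ends = (edge.src, edge.dst)
        if edge.src == edge.dst:
            continue  # an attachment point cannot bond to itself
        if any(r not in available.get(pos, set()) for pos, r in ends):
            continue  # monomer has no such free R-group
        if any(end in used for end in ends):
            continue  # attachment point already taken
        used.update(ends)
        accepted.append(edge)
    return accepted


# Toy peptide C.A.K.A.C: free attachment points left after backbone bonding.
free_points = {1: {"R1", "R3"}, 3: {"R3"}, 5: {"R2", "R3"}}
scored = [
    CandidateEdge((1, "R3"), (5, "R3"), 0.93),  # disulfide: accepted
    CandidateEdge((2, "R3"), (3, "R3"), 0.80),  # Ala2 exposes no R3: rejected
    CandidateEdge((1, "R3"), (3, "R3"), 0.71),  # Cys1 R3 already used: rejected
    CandidateEdge((1, "R1"), (5, "R2"), 0.40),  # below threshold: rejected
]
bonds = enforce_rgroup_constraints(scored, free_points)
```

Filtering after classification keeps the classifier itself simple while guaranteeing that no attachment point is used twice in the final molecule.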

#Technical Details

PepForge was trained on a corpus of 383,817 unique HELM peptides, deduplicated by InChIKey and drawn from PubChem, CycPeptMPDB, ChEMBL, DBAASP, UniProt, and MacrocycleDB. The corpus covers 425 monomer types and is split roughly 307k/38k/38k into train/validation/test sets. The Layout stage is a compact GPT (d=64, 1 layer, test perplexity 2.24); the Content stage offers a GPT (d=768, 12 layers, perplexity 6.61) and a BERT (d=768, 12 layers, perplexity 9.15); the Connection stage is a graph attention network (GAT, d=768, 6 layers) reaching 0.971 edge-existence F1 and 0.912 macro-F1 on bond-type classification. A separate antimicrobial-peptide ensemble (LSTM and GCN members over SMILES and HELM inputs) supports activity scoring. The full checkpoint set is roughly 5.5 GB. Code is MIT-licensed; weights and datasets are released under CC-BY-4.0.
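
For readers less familiar with the reported metrics, the sketch below implements the standard textbook definitions of perplexity and macro-averaged F1. It assumes nothing about PepForge's evaluation code, which may differ in tokenization and averaging details; the bond-type labels in the example are illustrative only.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in count form."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# A model that spreads probability uniformly over 4 tokens has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 3)

# Illustrative bond-type labels (not the label set used in the preprint).
true_types = ["amide", "amide", "disulfide", "thioether"]
pred_types = ["amide", "disulfide", "disulfide", "thioether"]
score = macro_f1(true_types, pred_types)  # (2/3 + 2/3 + 1) / 3 = 7/9
```

Read this way, a perplexity of 2.24 means the Layout model is on average about as uncertain as a uniform choice among 2.24 tokens, which is consistent with layouts being a much smaller and more regular space than monomer content.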

#Applications

PepForge targets medicinal chemists and peptide-therapeutics researchers who need to explore the modified-peptide design space beyond natural amino acids. It can propose novel macrocyclic or stapled scaffolds de novo, complete partial designs through masked infilling, and enforce hard constraints (fixed warhead monomers, required ring connections, or backbone layouts) during generation. Coupled with its antimicrobial-activity and ADMET predictors and an external-predictor plugin system, it supports a generate-then-score loop for antibacterial peptide discovery and other therapeutic campaigns, accessible via scripts or an included FastAPI/React web interface.
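
A generate-then-score loop of this kind reduces to filtering candidates on liability predictors and ranking the survivors by predicted activity. The sketch below shows that shape under stated assumptions: `rank_candidates`, the 0.5 liability cutoff, and both toy predictors are hypothetical placeholders that only count residues, standing in for trained models; they do not reflect PepForge's plugin interface.

```python
from typing import Callable

Predictor = Callable[[str], float]  # maps a HELM string to a score in [0, 1]


def monomers_of(helm: str) -> list[str]:
    """Extract monomer IDs from a single unbranched polymer (toy helper)."""
    return helm[helm.index("{") + 1 : helm.index("}")].split(".")


def toy_activity(helm: str) -> float:
    """Placeholder activity score: fraction of lysines. Not a real model."""
    residues = monomers_of(helm)
    return residues.count("K") / len(residues)


def toy_hemolysis(helm: str) -> float:
    """Placeholder liability score: fraction of tryptophans. Not a real model."""
    residues = monomers_of(helm)
    return residues.count("W") / len(residues)


def rank_candidates(
    candidates: list[str],
    activity: Predictor,
    liabilities: dict[str, Predictor],
    max_liability: float = 0.5,
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Drop candidates failing any liability check, then rank by activity."""
    kept = []
    for helm in candidates:
        if any(predict(helm) > max_liability for predict in liabilities.values()):
            continue
        kept.append((activity(helm), helm))
    kept.sort(reverse=True)
    return kept[:top_k]


generated = [
    "PEPTIDE1{K.K.L.A.K}$$$$V2.0",
    "PEPTIDE1{W.W.W.K.K}$$$$V2.0",  # fails the toy hemolysis cutoff
    "PEPTIDE1{A.K.A.L.A}$$$$V2.0",
]
ranked = rank_candidates(generated, toy_activity, {"hemolysis": toy_hemolysis})
```

Passing predictors in as plain callables keeps the loop independent of any particular model, which is the property a plugin system needs.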

#Impact

PepForge contributes a structured, graph-aware approach to a domain that has been difficult for sequence-only generative models: the chemically modified and macrocyclic peptides that dominate modern peptide drug discovery. By releasing the HELM corpus, monomer libraries, seven pretrained checkpoints, and a constraint-driven generation interface, it lowers the barrier to designing non-canonical peptides without bespoke training infrastructure. As a recent preprint, its real-world adoption and experimentally validated hit rates remain to be established, and the reported metrics are intrinsic generation-quality benchmarks rather than wet-lab outcomes. Even so, the explicit modeling of layout, content, and connection offers a reusable template for chemically aware biomolecular generation.

Citation

PepForge: Hierarchical HELM-Based Peptide Generation

Wang, Q. & Süssmuth, R. (2026) PepForge: Hierarchical HELM-Based Peptide Generation. bioRxiv.

DOI: 10.64898/2026.05.29.728379

Citations

Total Citations: 0
Influential: 0
References: 39

GitHub

Stars: 4
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 7d ago
Language: Jupyter Notebook
License: MIT

HuggingFace

Downloads: 0
Likes: 0
Last Modified: 7d ago
Pipeline: text-generation

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 94 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 92
Model Openness Framework: Class II (Open Tooling)

Tags

antimicrobial_peptides, bert, de_novo_design, drug_discovery, generative, graph_neural_network, macrocyclic_peptides, peptide_design, self_supervised, transformer

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset