The first peptide language model trained on HELM notation, a DeBERTa encoder for medium-sized and non-canonical peptide property prediction.
HELM-BERT is the first peptide language model built on HELM (Hierarchical Editing Language for Macromolecules) notation, developed by Seungeon Lee, Takuto Koyama, Itsuki Maeda, Shigeyuki Matsumoto, and Yasushi Okuno at Kyoto University and released as a preprint in December 2025. It targets the "medium-sized" molecular regime — peptides larger than typical small molecules but smaller than proteins — where existing representations break down.
The core problem is one of representation. Atom-level SMILES strings produce very long token sequences for peptides and obscure cyclic topology, while amino-acid-level sequence representations cannot encode the non-canonical residues, backbone modifications, and macrocyclization that define modern therapeutic peptides. HELM notation resolves this tension by describing both monomer composition and connectivity in a single hierarchical syntax, making it a natural substrate for a language model that must reason about chemically diverse, often cyclic, peptides.
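To make the "composition plus connectivity" idea concrete, here is a minimal sketch of how a single-polymer HELM string can be split into monomer tokens and connection records. The example string is illustrative and not taken from the HELM-BERT corpus, the helper `parse_helm` is hypothetical, and this is not the tokenizer the authors use. It assumes the common monomer symbol `meA` for N-methyl-alanine, and it ignores multi-polymer strings, inline SMILES monomers, and annotations.

```python
import re

# Illustrative HELM string (an assumption, not from the paper): a 10-residue
# peptide with one non-canonical monomer ([meA]) and a side-chain bond between
# the cysteines at positions 8 and 3, which closes a macrocycle.
HELM = "PEPTIDE1{A.R.C.[meA].A.K.T.C.D.A}$PEPTIDE1,PEPTIDE1,8:R3-3:R3$$$"

# A monomer is either a bracketed multi-character symbol or a single letter.
MONOMER = re.compile(r"\[[^\]]+\]|[A-Za-z]")


def parse_helm(helm):
    """Split a single-polymer HELM string into (polymer id, monomers, bonds)."""
    # HELM sections are separated by '$': polymers, then connections, then others.
    sections = helm.split("$")
    polymer, connections = sections[0], sections[1]
    match = re.fullmatch(r"(\w+)\{(.*)\}", polymer)
    if match is None:
        raise ValueError("this sketch expects exactly one simple polymer")
    polymer_id, body = match.groups()
    monomers = MONOMER.findall(body)
    # Multiple connections are separated by '|'; drop the empty case.
    bonds = [c for c in connections.split("|") if c]
    return polymer_id, monomers, bonds


polymer_id, monomers, bonds = parse_helm(HELM)
print(polymer_id, len(monomers), monomers)
print(bonds)
```

The monomer list carries the sequence, including the non-canonical residue as a single token, while the connection record carries the cyclization. An atom-level SMILES string would have to spread both across many more tokens.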
By pretraining a DeBERTa-based encoder on HELM strings, HELM-BERT learns transferable representations that improve property prediction for exactly the peptide classes — macrocyclic and non-canonical — that have been hardest to model with prior cheminformatics tools.
HELM-BERT is a compact encoder with roughly 54.8M parameters: 6 transformer layers, a hidden size of 768, 12 attention heads, a 78-token HELM vocabulary, and a maximum sequence length of 512. It was pretrained with masked language modeling (span masking, p=0.15) on a curated corpus of 39,079 chemically diverse peptides drawn from ChEMBL, CREMP, CycPeptMPDB, and Propedia, covering both linear and cyclic structures. On downstream benchmarks, the authors report a Pearson correlation of about 0.82 for cyclic peptide membrane permeability regression on CycPeptMPDB (random split), and ROC-AUC values of about 0.97 and 0.99 for peptide-protein interaction classification on Propedia v2 and ChEMBL, respectively, outperforming SMILES-based language-model baselines on the same tasks. The authors also apply evidential deep learning for uncertainty quantification on the fine-tuned predictors.
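The span-masking objective mentioned above can be sketched as follows. This is a generic illustration, not the authors' implementation: the function name `span_mask`, the `[MASK]` token string, the uniform span-length distribution, and the `max_span` parameter are all assumptions, and the sketch interprets p=0.15 as the fraction of tokens masked in total.

```python
import random

MASK = "[MASK]"  # assumed mask-token string


def span_mask(tokens, mask_prob=0.15, max_span=3, seed=None):
    """Mask contiguous, non-overlapping spans covering ~mask_prob of the tokens.

    Returns (masked_tokens, labels); labels hold the original token at masked
    positions and None elsewhere, as a masked-LM loss would expect.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, round(n * mask_prob))  # number of tokens to mask
    masked = list(tokens)
    labels = [None] * n
    covered = set()
    attempts = 0
    while len(covered) < budget and attempts < 10 * n:
        attempts += 1
        # Span length is uniform in [1, max_span], capped by the remaining budget.
        span = min(rng.randint(1, max_span), budget - len(covered))
        start = rng.randrange(0, n - span + 1)
        positions = range(start, start + span)
        if any(i in covered for i in positions):
            continue  # reject spans that overlap an earlier one
        for i in positions:
            covered.add(i)
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels


tokens = ["A", "R", "C", "[meA]", "A", "K", "T", "C", "D", "A"] * 2
masked, labels = span_mask(tokens, seed=0)
print(masked)
```

Masking contiguous runs rather than isolated tokens forces the model to reconstruct a residue from more distant context, which is one common motivation for span masking over per-token masking.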
HELM-BERT is aimed at therapeutic peptide discovery, where macrocyclic and non-canonical peptides are increasingly important drug modalities. Medicinal chemists and computational scientists can fine-tune it to predict properties that gate peptide developability — most directly cyclic peptide membrane permeability and peptide-protein binding — without committing to atom-level SMILES pipelines that scale poorly for these molecules. Because it consumes HELM directly, it integrates naturally with peptide design platforms that already use HELM as their interchange format.
HELM-BERT demonstrates that HELM notation is a viable and advantageous foundation for peptide language modeling, establishing a representation choice distinct from the SMILES- and sequence-based conventions that dominate small-molecule and protein models. By outperforming SMILES-based baselines on permeability and interaction tasks while remaining a small, MIT-licensed, openly distributed model, it offers an accessible starting point for groups working on macrocyclic and non-canonical peptide therapeutics. Because the work is a December 2025 preprint, its broader adoption and independent benchmarking remain to be established, and the reported metrics should be read as the authors' own evaluation.