bio.rodeo

HELM-BERT

Kyoto University

The first peptide language model trained on HELM notation: a DeBERTa encoder for property prediction on medium-sized and non-canonical peptides.

Released: December 2025
Parameters: 54.8 million

HELM-BERT is the first peptide language model built on HELM (Hierarchical Editing Language for Macromolecules) notation, developed by Seungeon Lee, Takuto Koyama, Itsuki Maeda, Shigeyuki Matsumoto, and Yasushi Okuno at Kyoto University and released as a preprint in December 2025. It targets the "medium-sized" molecular regime — peptides larger than typical small molecules but smaller than proteins — where existing representations break down.

The core problem is one of representation. Atom-level SMILES strings produce very long token sequences for peptides and obscure cyclic topology, while amino-acid-level sequence representations cannot encode the non-canonical residues, backbone modifications, and macrocyclization that define modern therapeutic peptides. HELM notation resolves this tension by describing both monomer composition and connectivity in a single hierarchical syntax, making it a natural substrate for a language model that must reason about chemically diverse, often cyclic, peptides.
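
To make the "composition plus connectivity" idea concrete, here is a minimal sketch that splits a simple HELM string into its monomer list and its connection list. It is not the HELM-BERT tokenizer and not a full HELM parser: the function name `parse_simple_helm` and the example peptide are illustrative assumptions, and the code ignores groups, annotations, repeats, and inline SMILES monomers.

```python
import re

# Matches either a bracketed multi-character monomer such as [meA]
# or a single-letter natural amino acid such as A.
MONOMER_RE = re.compile(r"\[[^\]]+\]|[A-Za-z]")


def parse_simple_helm(helm):
    """Split a simple HELM string into polymers and connections.

    Simplified sketch: only the first two '$'-separated sections
    (simple polymers and connections) are interpreted.
    """
    sections = helm.split("$")

    # Section 1: polymers such as PEPTIDE1{A.[meA].C}, separated by '|'.
    polymers = {}
    for chunk in sections[0].split("|"):
        name, body = chunk.split("{", 1)
        polymers[name] = MONOMER_RE.findall(body.rstrip("}"))

    # Section 2: connections such as PEPTIDE1,PEPTIDE1,5:R3-8:R3.
    connections = []
    if len(sections) > 1 and sections[1]:
        for conn in sections[1].split("|"):
            source, target, bond = conn.split(",")
            left, right = bond.split("-")
            src_pos, src_r = left.split(":")
            dst_pos, dst_r = right.split(":")
            connections.append(
                (source, int(src_pos), src_r, target, int(dst_pos), dst_r)
            )
    return polymers, connections


# Illustrative cyclic peptide (hypothetical, not from the paper): a
# side-chain bridge (R3 to R3) between Cys 5 and Cys 8, plus one
# D-residue and one N-methylated residue.
example = "PEPTIDE1{[dF].P.[meA].L.C.G.A.C}$PEPTIDE1,PEPTIDE1,5:R3-8:R3$$$"
polymers, connections = parse_simple_helm(example)
print(polymers)
print(connections)
```

The point of the example is that the non-canonical residues and the ring closure both survive as explicit, short tokens, whereas an atom-level SMILES string for the same molecule would spread that information across a much longer sequence.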

By pretraining a DeBERTa-based encoder on HELM strings, HELM-BERT learns transferable representations that improve property prediction for exactly the peptide classes — macrocyclic and non-canonical — that have been hardest to model with prior cheminformatics tools.

Key Features

  • HELM-native modeling: The first language model to operate directly on HELM notation, capturing monomer identity and inter-monomer connectivity for both linear and cyclic peptides in one representation.
  • DeBERTa backbone: Uses disentangled attention (separate content and relative-position terms), an Enhanced Mask Decoder, and an n-gram induced input encoding (nGiE) convolutional layer to capture hierarchical dependencies within HELM sequences.
  • Non-canonical coverage: Pretraining spans chemically diverse peptides including macrocyclic and modified residues, addressing a gap left by amino-acid-level models.
  • Self-supervised pretraining: Trained with masked language modeling using span masking and a Warmup-Stable-Decay learning-rate schedule.
  • Distributed weights: Code is released on GitHub (clinfo/HELM-BERT) under an MIT license, with pretrained checkpoints available for downstream fine-tuning.
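
The Warmup-Stable-Decay schedule named in the list above can be written as a piecewise function of the training step. The following is a minimal sketch, assuming linear warmup, a constant plateau, and linear decay to zero. The function name `wsd_learning_rate`, the peak rate, and the phase fractions are illustrative assumptions, not values taken from the HELM-BERT preprint.

```python
def wsd_learning_rate(step, total_steps, peak_lr=1e-4,
                      warmup_frac=0.1, decay_frac=0.2):
    """Learning rate at `step` under a Warmup-Stable-Decay schedule.

    All default values are placeholders chosen for illustration.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start or decay_steps == 0:
        # Stable: hold the peak rate.
        return peak_lr
    # Decay: ramp linearly down to zero at the final step.
    return peak_lr * max(total_steps - step, 0) / decay_steps


for step in (0, 50, 99, 500, 900, 999):
    print(step, wsd_learning_rate(step, total_steps=1000))
```

Compared with a cosine schedule, the long constant phase means training length does not have to be fixed before the run starts; only the final decay phase depends on where training ends.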

Technical Details

HELM-BERT is a compact encoder with roughly 54.8M parameters: 6 transformer layers, a hidden size of 768, 12 attention heads, a 78-token HELM vocabulary, and a maximum sequence length of 512 tokens. It was pretrained with masked language modeling (span masking, p=0.15) on a curated corpus of 39,079 chemically diverse peptides drawn from ChEMBL, CREMP, CycPeptMPDB, and Propedia, covering both linear and cyclic structures. On downstream benchmarks, the authors report a Pearson correlation of about 0.82 for cyclic peptide membrane permeability regression on CycPeptMPDB (random split), and ROC-AUC values of about 0.97 and 0.99 for peptide-protein interaction classification on Propedia v2 and ChEMBL, respectively. On the same tasks, HELM-BERT outperforms SMILES-based language-model baselines. The authors also apply evidential deep learning to quantify uncertainty in the fine-tuned predictors.
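
The span-masking objective mentioned above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the span-length distribution (uniform up to `max_span`), the `[MASK]` symbol, the token list, and the function name `span_mask` are all assumptions. Only the masking rate of 0.15 comes from the description above.

```python
import random


def span_mask(tokens, mask_ratio=0.15, max_span=3,
              mask_token="[MASK]", seed=0):
    """Mask contiguous spans until about `mask_ratio` of tokens are hidden.

    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * mask_ratio))
    masked = list(tokens)
    labels = [None] * len(tokens)
    covered = set()

    attempts = 0
    while len(covered) < budget and attempts < 10 * len(tokens):
        attempts += 1
        # Never exceed the remaining masking budget.
        span_len = min(rng.randint(1, max_span), budget - len(covered))
        start = rng.randrange(0, len(tokens) - span_len + 1)
        span = range(start, start + span_len)
        if any(i in covered for i in span):
            continue  # Skip spans that overlap an earlier one.
        for i in span:
            covered.add(i)
            labels[i] = tokens[i]
            masked[i] = mask_token
    return masked, labels


# Hypothetical monomer-level tokens: the 20 natural amino acids.
tokens = list("ACDEFGHIKLMNPQRSTVWY")
masked, labels = span_mask(tokens)
print(masked)
print(labels)
```

During pretraining the model would be asked to recover the entries of `labels` from `masked`. Masking whole spans rather than isolated tokens forces predictions to rely on longer-range context, which is the usual motivation for the technique.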

Applications

HELM-BERT is aimed at therapeutic peptide discovery, where macrocyclic and non-canonical peptides are increasingly important drug modalities. Medicinal chemists and computational scientists can fine-tune it to predict properties that gate peptide developability — most directly cyclic peptide membrane permeability and peptide-protein binding — without committing to atom-level SMILES pipelines that scale poorly for these molecules. Because it consumes HELM directly, it integrates naturally with peptide design platforms that already use HELM as their interchange format.
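
When evaluating a fine-tuned predictor on the two task types named above, the metrics quoted in the technical details (Pearson correlation for permeability regression, ROC-AUC for binding classification) can be computed without extra dependencies. The sketch below is a plain-Python illustration with made-up numbers. The function names are assumptions, and in practice a library routine would normally be used instead.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of examples,
    which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up values for illustration only; not results from the paper.
measured_permeability = [-6.2, -5.1, -7.0, -4.8]
predicted_permeability = [-6.0, -5.4, -6.7, -5.0]
print(pearson_r(measured_permeability, predicted_permeability))

binds = [0, 0, 1, 1]
binding_scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(binds, binding_scores))
```

Both metrics depend strongly on how the data are split, so a score on a random split is not directly comparable to one obtained on a scaffold-based or similarity-based split.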

Impact

HELM-BERT provides evidence that HELM notation is a viable and advantageous foundation for peptide language modeling, establishing a representation choice distinct from the SMILES- and sequence-based conventions that dominate small-molecule and protein models. By outperforming SMILES-based baselines on permeability and interaction tasks while remaining a small, MIT-licensed, openly distributed model, it offers an accessible starting point for groups working on macrocyclic and non-canonical peptide therapeutics. Because the work is a December 2025 preprint, broader adoption and independent benchmarking remain to be established, and the reported metrics should be read as the authors' own evaluation.

Citation

Preprint

DOI: 10.48550/arXiv.2512.23175

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 80 (Open)
Usability (can I run it?): 99
Reproducibility (can I retrain it?): 62
Model Openness Framework: Unclassified (missing required components)

Tags

deberta, language_model, macrocycles, membrane_permeability_prediction, peptide_protein_interaction, peptides, property_prediction, representation_learning, self_supervised, transformer

Resources

GitHub Repository · Research Paper · HuggingFace Model