HELM-BERT

A peptide language model trained on HELM notation: a DeBERTa-based encoder for property prediction on macrocyclic and non-canonical medium-sized peptides.

Released: December 2025

Parameters: 54.8 Million

HELM-BERT is the first peptide language model built on HELM (Hierarchical Editing Language for Macromolecules) notation. It was developed by Seungeon Lee, Takuto Koyama, Itsuki Maeda, Shigeyuki Matsumoto, and Yasushi Okuno at Kyoto University and released as a preprint in December 2025. It targets the "medium-sized" molecular regime (peptides larger than typical small molecules but smaller than proteins), where existing representations break down.

The core problem is one of representation. Atom-level SMILES strings produce very long token sequences for peptides and obscure cyclic topology, while amino-acid-level sequence representations cannot encode the non-canonical residues, backbone modifications, and macrocyclization that define modern therapeutic peptides. HELM notation resolves this tension by describing both monomer composition and connectivity in a single hierarchical syntax, making it a natural substrate for a language model that must reason about chemically diverse, often cyclic, peptides.
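To make the "composition plus connectivity" idea concrete, here is a minimal sketch that splits a simple HELM string into its monomer list and its connection list. The function name `parse_helm` and the example peptide are illustrative assumptions, not part of HELM-BERT; the sketch is not the model's tokenizer and ignores groupings, annotations, and inline-SMILES monomers.

```python
import re

def parse_helm(helm: str):
    """Split a simple HELM string into polymers and connections.

    Hypothetical helper for illustration only: handles single-letter and
    [bracketed] monomers and skips the grouping/annotation sections.
    """
    # HELM sections are separated by "$": polymers, connections, then others.
    sections = helm.split("$")

    polymers = {}
    for poly_id, body in re.findall(r"(\w+)\{([^}]*)\}", sections[0]):
        # Monomers are separated by "."; multi-character ones are bracketed.
        polymers[poly_id] = re.findall(r"\[[^\]]+\]|[^.\[\]]+", body)

    connections = []
    if len(sections) > 1 and sections[1]:
        for conn in sections[1].split("|"):
            src, dst, bond = conn.split(",")
            left, right = bond.split("-")
            (i, r_i), (j, r_j) = left.split(":"), right.split(":")
            # (source polymer, monomer index, R-group, target polymer, ...)
            connections.append((src, int(i), r_i, dst, int(j), r_j))
    return polymers, connections


# Assumed example: a 5-residue peptide with one N-methylated residue,
# cyclized head-to-tail (residue 5's R2 bonded to residue 1's R1).
polymers, connections = parse_helm(
    "PEPTIDE1{A.[meA].G.F.L}$PEPTIDE1,PEPTIDE1,5:R2-1:R1$$$"
)
print(polymers)
print(connections)
```

The cyclization appears as a single short connection entry rather than as ring-closure digits scattered through a long atom-level string, which is the property the paragraph above attributes to HELM.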

By pretraining a DeBERTa-based encoder on HELM strings, HELM-BERT learns transferable representations that improve property prediction for exactly the peptide classes — macrocyclic and non-canonical — that have been hardest to model with prior cheminformatics tools.

Key Features

HELM-native modeling: The first language model to operate directly on HELM notation, capturing monomer identity and inter-monomer connectivity for both linear and cyclic peptides in one representation.
DeBERTa backbone: Uses disentangled attention (separating content and position terms), an Enhanced Mask Decoder, and an n-gram induced encoding (nGiE) convolutional layer to capture hierarchical dependencies within HELM sequences.
Non-canonical coverage: Pretraining spans chemically diverse peptides including macrocyclic and modified residues, addressing a gap left by amino-acid-level models.
Self-supervised pretraining: Trained with masked language modeling using span masking and a Warmup-Stable-Decay learning-rate schedule.
Distributed weights: Code is released on GitHub (clinfo/HELM-BERT) under an MIT license, with pretrained checkpoints available for downstream fine-tuning.
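The Warmup-Stable-Decay schedule named in the feature list can be sketched as a three-phase function of the training step. This is a generic illustration: the function name `wsd_lr`, the peak rate, the phase fractions, and the linear decay shape are all assumptions, not values taken from the HELM-BERT paper or repository.

```python
def wsd_lr(step, total_steps, peak_lr=1e-4, warmup_frac=0.1,
           decay_frac=0.2, min_lr=0.0):
    """Learning rate at `step` under a Warmup-Stable-Decay schedule.

    All default values are hypothetical placeholders for illustration.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Phase 1: linear warmup from near zero up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Phase 2: hold the peak rate constant.
        return peak_lr
    # Phase 3: linear decay from the peak rate down to min_lr.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


for s in (0, 99, 500, 900, 1000):
    print(s, wsd_lr(s, total_steps=1000))
```

One practical appeal of this shape is that the long constant phase lets training be extended or checkpointed without committing to a total step count up front; only the final decay phase depends on where training ends.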

Technical Details

HELM-BERT is a compact encoder with roughly 54.8M parameters: 6 transformer layers, a hidden size of 768, 12 attention heads, a 78-token HELM vocabulary, and a maximum sequence length of 512 tokens.

The model was pretrained with masked language modeling (span masking, p = 0.15) on a curated corpus of 39,079 chemically diverse peptides drawn from ChEMBL, CREMP, CycPeptMPDB, and Propedia, covering both linear and cyclic structures.

On downstream benchmarks, the authors report strong results for cyclic peptide membrane permeability regression on CycPeptMPDB (Pearson correlation of about 0.82 on a random split) and for peptide-protein interaction classification on Propedia v2 and ChEMBL (ROC-AUC of about 0.97 and 0.99, respectively), outperforming SMILES-based language-model baselines on the same tasks. They also apply evidential deep learning for uncertainty quantification on the fine-tuned predictors.
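The span-masking objective mentioned above can be illustrated with a small sketch that masks contiguous runs of tokens until about 15% of the sequence is covered. Only the 0.15 rate comes from the text; the function name `span_mask`, the maximum span length, the uniform span-length sampling, and the monomer-level token list are assumptions for illustration and may differ from the authors' implementation.

```python
import random

def span_mask(tokens, mask_rate=0.15, max_span=3, mask_token="[MASK]", seed=None):
    """Mask contiguous spans until roughly `mask_rate` of tokens are covered.

    Returns (masked_tokens, labels), where labels hold the original token at
    masked positions and None elsewhere. Hypothetical sketch only.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = min(n, max(1, round(n * mask_rate)))  # number of tokens to mask
    masked = list(tokens)
    labels = [None] * n
    covered = set()
    attempts = 0
    while len(covered) < budget and attempts < 10 * n:
        attempts += 1
        # Draw a span length, capped so the budget is never exceeded.
        span_len = min(rng.randint(1, max_span), budget - len(covered))
        start = rng.randrange(0, n - span_len + 1)
        span = range(start, start + span_len)
        if any(i in covered for i in span):
            continue  # skip spans that overlap an already masked position
        for i in span:
            labels[i] = tokens[i]
            masked[i] = mask_token
            covered.add(i)
    return masked, labels


# Assumed monomer-level tokens for a 20-residue peptide.
tokens = list("ACDEFGHIKLMNPQRSTVWY")
masked, labels = span_mask(tokens, seed=0)
print(masked)
```

Masking whole spans rather than isolated tokens forces the encoder to reconstruct several adjacent monomers from their context, which is a harder task than filling in a single position between two visible neighbors.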

Applications

HELM-BERT is aimed at therapeutic peptide discovery, where macrocyclic and non-canonical peptides are increasingly important drug modalities. Medicinal chemists and computational scientists can fine-tune it to predict properties that gate peptide developability — most directly cyclic peptide membrane permeability and peptide-protein binding — without committing to atom-level SMILES pipelines that scale poorly for these molecules. Because it consumes HELM directly, it integrates naturally with peptide design platforms that already use HELM as their interchange format.

Impact

HELM-BERT demonstrates that HELM notation is a viable and advantageous foundation for peptide language modeling, establishing a representation choice distinct from the SMILES- and sequence-based conventions that dominate small-molecule and protein models. By outperforming SMILES-based baselines on permeability and interaction tasks while remaining a small, MIT-licensed, openly distributed model, it offers an accessible starting point for groups working on macrocyclic and non-canonical peptide therapeutics. Because the work is a December 2025 preprint, its broader adoption and independent benchmarking remain to be established, and the reported metrics should be read as the authors' own evaluation.

Citation

HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction

Preprint

Lee, S., et al. (2025) HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction. arXiv.org.

DOI: 10.48550/arXiv.2512.23175


Citations

Total Citations: 0
Influential: 0
References: 0

GitHub

Stars: 14
Forks: 1
Open Issues: 0
Contributors: 1
Last Push: 9d ago
Language: Python
License: MIT

HuggingFace

Downloads: 592
Likes: 1
Last Modified: 9d ago
Pipeline: fill-mask


Openness

bio.rodeo openness: Fully open (usable and reproducible)
Score: 80 (Open)
Usability (can I run it?): 99
Reproducibility (can I retrain it?): 62

Model Openness Framework: Unclassified (missing required components)

