PeptideCLM-2

University of Texas at Austin / Novo Nordisk

Chemical language models pretrained on SMILES for therapeutic peptides, natively representing non-canonical residues, cyclization, and conjugation.

Released: April 2026

Therapeutic peptides sit in an awkward middle ground for computational modeling: they offer the binding specificity of proteins alongside the chemical diversity of small molecules, but neither standard protein language models nor small-molecule chemical models handle them well. Protein models are typically restricted to the 20 canonical amino acids and cannot represent the cyclizations, conjugations, and non-canonical residues that define modern peptide drugs, while atom-level chemical models struggle with the length of polymer-like peptide sequences. PeptideCLM-2 was built to close this gap by learning directly from the SMILES representation of complete molecules, treating peptides as chemistry rather than as a fixed amino-acid alphabet.

Developed by researchers at the University of Texas at Austin (Integrative Biology) and Novo Nordisk's Molecular AI group, and released as a 2026 bioRxiv preprint, PeptideCLM-2 is a suite of BERT-style transformer encoders pretrained on more than 100 million molecules. It is the successor to the original PeptideCLM, expanding substantially on that work in data scale, architectural depth, and benchmarking. Where the first model demonstrated feasibility for cyclic peptides, version 2 adds systematic parameter-scaling studies and evaluation across diverse biological phenotypes.

Key Features

Native peptide chemistry: By tokenizing raw SMILES rather than residue letters, the models represent non-canonical amino acids, cyclic backbones, and conjugated payloads that canonical protein language models cannot encode.
Compressed k-mer tokenizer: A custom 405-token vocabulary (160 single-atom plus 245 k-mer tokens) maps recurring substructural motifs to single tokens, shortening sequences by 38% for small molecules and 64% for natural peptides versus atom-level encoding, without accuracy loss.
Three training objectives: The suite compares masked language modeling (MLM), multi-task regression (MTR) against 99 RDKit physicochemical descriptors, and a hybrid dual-objective loss, clarifying when explicit supervision helps.
Nine released variants: Small (32M), base (114M), and large (337M) checkpoints are released for each objective on HuggingFace, supporting reproducible benchmarking and fine-tuning.
Open and commercially usable: Weights, code, and the pretraining corpus are released under CC BY 4.0, permitting commercial use.
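
The compression idea behind the k-mer tokenizer can be illustrated with a minimal Python sketch. It assumes greedy longest-match over a tiny hand-picked motif list. The `TOY_KMERS` list, the atom-level fallback regex, and both function names are hypothetical; this is not the released 405-token vocabulary or the authors' tokenizer code.

```python
import re

# Atom-level fallback: bracket atoms, two-letter halogens, ring-closure
# labels like %12, otherwise a single character. (Simplified assumption.)
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Hypothetical multi-character motifs, chosen only for illustration.
TOY_KMERS = ["C(=O)N", "C(=O)O"]


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def kmer_tokenize(smiles, kmers=TOY_KMERS):
    """Greedy longest-match: prefer a known motif, else one atom-level token."""
    by_length = sorted(kmers, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(smiles):
        match = next((k for k in by_length if smiles.startswith(k, i)), None)
        if match is None:
            match = ATOM_PATTERN.match(smiles, i).group()
        tokens.append(match)
        i += len(match)
    return tokens


# Ala-Ala dipeptide written as SMILES.
smiles = "C[C@H](N)C(=O)N[C@@H](C)C(=O)O"
atom_tokens = atom_tokenize(smiles)
kmer_tokens = kmer_tokenize(smiles)
reduction = 1 - len(kmer_tokens) / len(atom_tokens)
print(len(atom_tokens), len(kmer_tokens), f"{reduction:.0%}")
```

Mapping the amide and carboxyl motifs to single tokens shortens this toy input substantially. The reduction printed here depends entirely on the toy motif list and is not comparable to the 38% and 64% figures reported for the real vocabulary.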

Technical Details

PeptideCLM-2 uses BERT-style transformer encoders with rotary positional embeddings (RoPE), SwiGLU activations, and pre-layer normalization, trained with a 25% span-masking rate. The three model scales are 32M parameters (6 layers, hidden size 384, 6 heads), 114M (12 layers, hidden size 768, 12 heads), and 337M (24 layers, hidden size 1024, 16 heads). Pretraining draws on a composite corpus of roughly 108M drug-like molecules from PubChem, 9.6M peptide sequences from ESMAtlas, and roughly 50K lipids from LIPID MAPS, balanced per epoch so that small molecules do not swamp the peptides.

On downstream tasks the 337M model improves over baselines for membrane permeability (AUROC 0.830 vs 0.781), fibrillation propensity (AUROC 0.823 vs 0.579), and blood stability (MCC 0.609 vs 0.537), as well as for cell penetration, tumor homing, and antimicrobial activity. A notable scaling finding: at small scale, explicit physicochemical supervision (MTR) substantially outperforms MLM (R² ≈ 0.38 vs 0.13), but at 337M parameters pure MLM catches up with MTR (both R² ≈ 0.58), nearly double the R² of molecular-fingerprint baselines.
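
To make the 25% span-masking rate concrete, here is a minimal Python sketch of how contiguous spans of tokens might be hidden for a masked-language-modeling objective. Only the 25% rate comes from the description above. The `[MASK]` string, the `max_span` limit, the uniform span-length sampling, and the function name `span_mask` are assumptions for illustration and may differ from the authors' implementation.

```python
import random

MASK = "[MASK]"  # placeholder; the real vocabulary's mask token may differ


def span_mask(tokens, mask_rate=0.25, max_span=3, seed=None):
    """Mask contiguous spans until about mask_rate of the tokens are hidden.

    Returns (corrupted, labels): labels hold the original token at masked
    positions and None elsewhere, so a loss is computed only on masked slots.
    """
    rng = random.Random(seed)
    n = len(tokens)
    if n == 0:
        return [], []
    budget = max(1, round(n * mask_rate))  # number of positions to mask
    masked = set()
    while len(masked) < budget:
        # Never draw a span longer than the remaining budget.
        span = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(n)
        masked.update(range(start, min(start + span, n)))
    corrupted = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in masked else None for i, tok in enumerate(tokens)]
    return corrupted, labels


tokens = ["C", "[C@H]", "(", "N", ")", "C(=O)N",
          "[C@@H]", "(", "C", ")", "C(=O)O", "O"]
corrupted, labels = span_mask(tokens, seed=0)
print(corrupted)
```

With 12 input tokens the budget is 3 masked positions. Spans that overlap an earlier span or run off the end simply contribute fewer new positions, and the loop continues until the budget is met.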

Applications

PeptideCLM-2 is intended for therapeutic peptide engineering, where teams need property predictions for molecules that fall outside the reach of conventional protein models. Researchers can fine-tune the encoders to predict developability-relevant properties such as membrane diffusion, cell penetration, tumor homing, half-life, antimicrobial activity, and aggregation or fibrillation propensity, or use the learned embeddings as features for downstream screening and lead optimization. Because the models ingest SMILES directly, they accommodate cyclic and chemically modified peptides common in pharmaceutical pipelines.
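
One common way to turn per-token encoder outputs into a fixed-length feature vector for downstream screening is masked mean pooling. The sketch below shows that idea in plain Python. The source does not say how embeddings should be pooled, so the pooling choice, the function name `mean_pool`, and the toy numbers are all assumptions.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embedding vectors of real (non-padding) tokens.

    token_embeddings: list of equal-length float lists, one per token.
    attention_mask:   list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: three token vectors of width 2, the last one is padding.
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
features = mean_pool(embeddings, mask)
print(features)  # [2.0, 3.0]
```

The pooled vector can then be passed to any conventional classifier or regressor. Excluding padding positions keeps short and long molecules comparable.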

Impact

PeptideCLM-2 addresses a genuine blind spot between protein and small-molecule machine learning, providing an open, reproducibly benchmarked foundation for a therapeutic class of growing commercial importance. Its systematic comparison of self-supervised versus descriptor-supervised pretraining across scales offers practical guidance on when each strategy is worthwhile, a result relevant beyond peptides. As a preprint with released weights, code, and data, its long-term influence will depend on community adoption and independent validation; the model and dataset cards are currently concise, with the GitHub repository and preprint serving as the primary documentation.

Citation

Scaling SMILES-based chemical language models for therapeutic peptide engineering

Feller, A. L., et al. (2026) Scaling SMILES-based chemical language models for therapeutic peptide engineering. bioRxiv.

DOI: 10.64898/2026.01.06.697994

Recent citations

Papers that recently cited this model.

Sequence-Based Therapeutic Peptide Classification with Augmented Negative Sampling
R. Ellerbrock, Alessio Valentini, Alexander C. Paul, et al.
bioRxiv · Jun 2026
0


Citations

Total Citations: 1
Influential: 0
References: 41

GitHub

Stars: 10
Forks: 1
Open Issues: 0
Contributors: 1
Last Push: 19d ago
Language: Jupyter Notebook
License: MIT

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 79 (Open)
Usability (can I run it?): 72
Reproducibility (can I retrain it?): 92
Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository, Research Paper, HuggingFace Model, Dataset
