University of Texas at Austin / Novo Nordisk
SMILES-based chemical language models pretrained on 100M+ molecules to natively represent therapeutic peptide chemistry, including non-canonical residues.
Therapeutic peptides sit in an awkward middle ground for computational modeling: they offer the binding specificity of proteins alongside the chemical diversity of small molecules, but neither standard protein language models nor small-molecule chemical models handle them well. Protein models are typically restricted to the 20 canonical amino acids and cannot represent the cyclizations, conjugations, and non-canonical residues that define modern peptide drugs, while atom-level chemical models struggle with the length of polymer-like peptide sequences. PeptideCLM-2 was built to close this gap by learning directly from the SMILES representation of complete molecules, treating peptides as chemistry rather than as a fixed amino-acid alphabet.
Developed by researchers at the University of Texas at Austin (Integrative Biology) and Novo Nordisk's Molecular AI group, and released as a 2026 bioRxiv preprint, PeptideCLM-2 is a suite of BERT-style transformer encoders pretrained on more than 100 million molecules. It is the successor to the original PeptideCLM, expanding substantially on that work in data scale, architectural depth, and benchmarking. Where the first model demonstrated feasibility for cyclic peptides, version 2 adds systematic parameter-scaling studies and evaluation across diverse biological phenotypes.
PeptideCLM-2 uses BERT-style transformer encoders with rotary positional embeddings (RoPE), SwiGLU activations, and pre-layer normalization, trained with a 25% span-masking rate. The suite comes in three sizes: 32M parameters (6 layers, hidden size 384, 6 heads), 114M (12 layers, hidden size 768, 12 heads), and 337M (24 layers, hidden size 1024, 16 heads). Pretraining draws on a composite corpus of roughly 108M drug-like molecules from PubChem, 9.6M peptide sequences from ESMAtlas, and roughly 50K lipids from LIPID MAPS, balanced per epoch so that small molecules do not swamp the peptides. On downstream tasks the 337M model improves over baselines for membrane permeability (AUROC 0.830 vs 0.781), fibrillation propensity (AUROC 0.823 vs 0.579), and blood stability (MCC 0.609 vs 0.537), as well as for cell penetration, tumor homing, and antimicrobial activity. A notable scaling finding: at small scale, explicit physicochemical supervision through multi-task regression (MTR) substantially outperforms masked language modeling (MLM) (R² ≈ 0.38 vs 0.13), but at 337M parameters pure MLM converges with MTR (both R² ≈ 0.58), nearly double the R² of molecular-fingerprint baselines.
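To make the 25% span-masking objective concrete, the following is a minimal sketch of how contiguous spans of tokens might be hidden before the encoder is asked to reconstruct them. It is not the authors' implementation: the `[MASK]` symbol, the geometric span-length distribution with mean 3, the character-level tokenization, and the function name `span_mask` are all assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol; the real tokenizer's may differ


def span_mask(tokens, mask_rate=0.25, mean_span=3, seed=None):
    """Mask non-overlapping contiguous spans covering about mask_rate of tokens.

    Returns (masked_tokens, labels). labels holds the original token at each
    masked position and None everywhere else.
    """
    rng = random.Random(seed)
    n = len(tokens)
    budget = round(n * mask_rate)  # how many tokens to hide in total
    masked = list(tokens)
    labels = [None] * n
    covered = 0
    attempts = 0
    # The attempt cap guarantees termination even if placement keeps failing.
    while covered < budget and attempts < 10 * n:
        attempts += 1
        # Draw a geometric span length with the given mean (an assumption).
        length = 1
        while rng.random() > 1.0 / mean_span:
            length += 1
        length = min(length, budget - covered)  # never exceed the budget
        start = rng.randrange(0, n - length + 1)
        span = range(start, start + length)
        if any(labels[i] is not None for i in span):
            continue  # overlaps an earlier span; try another position
        for i in span:
            labels[i] = tokens[i]
            masked[i] = MASK
        covered += length
    return masked, labels


# Character-level split of a SMILES string, used here only as a stand-in
# for the model's real tokenizer.
example_tokens = list("NC(C)C(=O)NCC(=O)O")
masked_tokens, target_labels = span_mask(example_tokens, seed=0)
print(masked_tokens)
```

Masking whole spans rather than isolated tokens forces the model to reconstruct multi-atom fragments from context, which is one plausible reason to prefer it for long, polymer-like SMILES strings.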
PeptideCLM-2 is intended for therapeutic peptide engineering, where teams need property predictions for molecules that fall outside the reach of conventional protein models. Researchers can fine-tune the encoders to predict developability-relevant properties such as membrane diffusion, cell penetration, tumor homing, half-life, antimicrobial activity, and aggregation or fibrillation propensity, or use the learned embeddings as features for downstream screening and lead optimization. Because the models ingest SMILES directly, they accommodate cyclic and chemically modified peptides common in pharmaceutical pipelines.
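Using the learned embeddings as features usually means collapsing the encoder's per-token vectors into one fixed-length vector per molecule. The sketch below shows mean pooling over non-padding positions, a common choice but an assumption here, since the source does not say which pooling the authors use. The function name `mean_pool` and the toy inputs are hypothetical, and plain Python lists stand in for the tensors a real encoder would return.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors over positions where attention_mask is truthy.

    hidden_states: list of per-token vectors (lists of floats).
    attention_mask: list of 0/1 flags, with 0 marking padding tokens.
    """
    dim = len(hidden_states[0]) if hidden_states else 0
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vec):
                total[j] += value
    if count == 0:
        return total  # no real tokens: return the zero vector
    return [t / count for t in total]


# Toy example: three token vectors, the last of which is padding.
features = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(features)  # [2.0, 3.0]
```

The resulting vector can then be passed to any conventional classifier or regressor for screening, without fine-tuning the encoder itself.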
PeptideCLM-2 addresses a genuine blind spot between protein and small-molecule machine learning, providing an open, reproducibly benchmarked foundation for a therapeutic class of growing commercial importance. Its systematic comparison of self-supervised versus descriptor-supervised pretraining across scales offers practical guidance on when each strategy is worthwhile, a result relevant beyond peptides. As a preprint with released weights, code, and data, its long-term influence will depend on community adoption and independent validation; the model and dataset cards are currently concise, with the GitHub repository and preprint serving as the primary documentation.