bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Small molecule, Protein

PeptideCLM-2

University of Texas at Austin / Novo Nordisk

SMILES-based chemical language models pretrained on 100M+ molecules to natively represent therapeutic peptide chemistry, including non-canonical residues.

Released: April 2026

Therapeutic peptides sit in an awkward middle ground for computational modeling: they offer the binding specificity of proteins alongside the chemical diversity of small molecules, but neither standard protein language models nor small-molecule chemical models handle them well. Protein models are typically restricted to the 20 canonical amino acids and cannot represent the cyclizations, conjugations, and non-canonical residues that define modern peptide drugs, while atom-level chemical models struggle with the length of polymer-like peptide sequences. PeptideCLM-2 was built to close this gap by learning directly from the SMILES representation of complete molecules, treating peptides as chemistry rather than as a fixed amino-acid alphabet.

Developed by researchers at the University of Texas at Austin (Integrative Biology) and Novo Nordisk's Molecular AI group, and released as a 2026 bioRxiv preprint, PeptideCLM-2 is a suite of BERT-style transformer encoders pretrained on more than 100 million molecules. It is the successor to the original PeptideCLM, expanding substantially on that work in data scale, architectural depth, and benchmarking. Where the first model demonstrated feasibility for cyclic peptides, version 2 adds systematic parameter-scaling studies and evaluation across diverse biological phenotypes.

#Key Features

  • Native peptide chemistry: By tokenizing raw SMILES rather than residue letters, the models represent non-canonical amino acids, cyclic backbones, and conjugated payloads that canonical protein language models cannot encode.
  • Compressed k-mer tokenizer: A custom 405-token vocabulary (160 single-atom plus 245 k-mer tokens) maps recurring substructural motifs to single tokens, shortening sequences by 38% for small molecules and 64% for natural peptides versus atom-level encoding, without accuracy loss.
  • Three training objectives: The suite compares masked language modeling (MLM), multi-task regression (MTR) against 99 RDKit physicochemical descriptors, and a hybrid dual-objective loss, clarifying when explicit supervision helps.
  • Nine released variants: Small (32M), base (114M), and large (337M) checkpoints are released for each objective on HuggingFace, supporting reproducible benchmarking and fine-tuning.
  • Open and commercially usable: Weights, code, and the pretraining corpus are released under CC BY 4.0, permitting commercial use.
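The compressed k-mer tokenizer described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy vocabulary is hypothetical and is not the released 405-token vocabulary, greedy longest-match is one plausible way to apply k-mer tokens (the source does not state the matching rule), and the atom-level baseline is approximated by character count, which only holds for SMILES made of single-character symbols.

```python
from typing import List, Set

# Hypothetical toy vocabulary for illustration only; NOT the released vocabulary.
ATOM_TOKENS: Set[str] = {"C", "N", "O", "S", "(", ")", "=", "1", "2"}
KMER_TOKENS: Set[str] = {"NCC(=O)", "C(=O)O", "C(=O)N", "CC"}
MAX_KMER_LEN = max(len(t) for t in KMER_TOKENS)


def tokenize(smiles: str) -> List[str]:
    """Greedy longest-match: at each position take the longest known k-mer,
    falling back to a single-symbol token when no k-mer matches."""
    tokens: List[str] = []
    i = 0
    while i < len(smiles):
        match = None
        # Try candidate k-mers from longest to shortest (length >= 2).
        for length in range(min(MAX_KMER_LEN, len(smiles) - i), 1, -1):
            candidate = smiles[i:i + length]
            if candidate in KMER_TOKENS:
                match = candidate
                break
        if match is None:
            match = smiles[i]
            if match not in ATOM_TOKENS:
                raise ValueError(f"unknown symbol {match!r} at position {i}")
        tokens.append(match)
        i += len(match)
    return tokens


def compression(smiles: str) -> float:
    """Fraction by which k-mer tokenization shortens the symbol-level sequence."""
    return 1.0 - len(tokenize(smiles)) / len(smiles)


if __name__ == "__main__":
    gly_gly = "NCC(=O)NCC(=O)O"  # glycylglycine
    print(tokenize(gly_gly))
    print(f"{compression(gly_gly):.0%} shorter than one token per symbol")
```

The toy dipeptide compresses far more than the 38% and 64% figures quoted above because the vocabulary was hand-picked to fit it; a real vocabulary has to generalize across the whole corpus.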

#Technical Details

PeptideCLM-2 uses BERT-style transformer encoders with rotary positional embeddings (RoPE), SwiGLU activations, and pre-layer normalization, trained with a 25% span-masking rate. The suite covers three scales: 32M parameters (6 layers, hidden size 384, 6 heads), 114M (12 layers, hidden size 768, 12 heads), and 337M (24 layers, hidden size 1024, 16 heads).

Pretraining draws on a composite corpus of roughly 108M drug-like molecules from PubChem, 9.6M peptide sequences from ESMAtlas, and ~50K lipids from LIPID MAPS, balanced per epoch so that small molecules do not swamp the peptides.

On downstream tasks the 337M model improves over baselines for membrane permeability (AUROC 0.830 vs 0.781), fibrillation propensity (AUROC 0.823 vs 0.579), blood stability (MCC 0.609 vs 0.537), cell penetration, tumor homing, and antimicrobial activity. A notable scaling finding: at small scale, explicit physicochemical supervision (MTR) substantially outperforms MLM (R² ≈ 0.38 vs 0.13), but at 337M parameters pure MLM converges with MTR (both R² ≈ 0.58), nearly double the R² of molecular-fingerprint baselines.
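To make the 25% span-masking rate concrete, here is a minimal sketch of span masking over a token list. It is not the authors' implementation: the `span_mask` function, the `[MASK]` string, the `max_span` parameter, and the uniform span-length distribution are all assumptions, and the sketch omits refinements such as replacing some masked positions with random or unchanged tokens.

```python
import random
from typing import List, Set, Tuple

MASK = "[MASK]"  # assumed mask symbol, for illustration


def span_mask(tokens: List[str], mask_rate: float = 0.25, max_span: int = 3,
              seed: int = 0) -> Tuple[List[str], List[int]]:
    """Hide contiguous spans until round(len(tokens) * mask_rate) tokens are masked.

    Returns the corrupted sequence and the sorted indices of masked positions,
    which are the positions an MLM loss would be computed on.
    """
    rng = random.Random(seed)
    n = len(tokens)
    # Masking budget in tokens, clamped to [1, n] for non-empty input.
    budget = min(n, max(1, round(n * mask_rate))) if n else 0
    masked: Set[int] = set()
    while len(masked) < budget:
        # Never draw a span longer than the remaining budget, so we cannot overshoot.
        span = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, n - span + 1)
        masked.update(range(start, start + span))
    corrupted = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    return corrupted, sorted(masked)


if __name__ == "__main__":
    toks = list("ABCDEFGHIJKLMNOPQRST")  # 20 placeholder tokens
    corrupted, positions = span_mask(toks)
    print(corrupted)
    print(positions)
```

Masking contiguous spans rather than isolated tokens forces the encoder to reconstruct a substructure from its surroundings instead of from an immediately adjacent token.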

#Applications

PeptideCLM-2 is intended for therapeutic peptide engineering, where teams need property predictions for molecules that fall outside the reach of conventional protein models. Researchers can fine-tune the encoders to predict developability-relevant properties such as membrane diffusion, cell penetration, tumor homing, half-life, antimicrobial activity, and aggregation or fibrillation propensity, or use the learned embeddings as features for downstream screening and lead optimization. Because the models ingest SMILES directly, they accommodate cyclic and chemically modified peptides common in pharmaceutical pipelines.
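Using the learned embeddings as features usually means collapsing per-token encoder outputs into one fixed-length vector per molecule. The sketch below shows masked mean pooling in plain Python. Mean pooling is an assumption here, since the source does not say how embeddings are pooled, and the `mean_pool` helper and the toy vectors are hypothetical stand-ins for real encoder outputs.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]],
              attention_mask: List[int]) -> List[float]:
    """Average the embeddings of real tokens (mask == 1), ignoring padding."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token embedding")
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    # Per-dimension mean over the kept token vectors.
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" for three token positions; the last is padding.
    embeddings = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
    mask = [1, 1, 0]
    print(mean_pool(embeddings, mask))
```

The pooled vector can then be passed to any standard classifier or regressor for screening, which avoids fine-tuning the encoder itself.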

#Impact

PeptideCLM-2 addresses a genuine blind spot between protein and small-molecule machine learning, providing an open, reproducibly benchmarked foundation for a therapeutic class of growing commercial importance. Its systematic comparison of self-supervised versus descriptor-supervised pretraining across scales offers practical guidance on when each strategy is worthwhile, a result relevant beyond peptides. As a preprint with released weights, code, and data, its long-term influence will depend on community adoption and independent validation; the model and dataset cards are currently concise, with the GitHub repository and preprint serving as the primary documentation.

Tags

property_prediction, representation_learning, drug_discovery, transformer, bert, foundation_model, self_supervised, multi_task, therapeutic_peptides, proteomics