bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Small molecule

MolDeBERTa

Florida International University

A DeBERTaV2-based SMILES encoder pretrained on 123M PubChem molecules with physicochemical and structural-similarity objectives for molecular representation learning.

Released: February 2026

MolDeBERTa is a self-supervised molecular encoder that learns representations of small molecules from their SMILES strings, developed in Fahad Saeed's lab (SaeedLab) at Florida International University and posted to bioRxiv in early 2026. It treats molecular structure as a "language" and applies a transformer encoder to learn embeddings useful for a range of downstream chemistry tasks, joining a family of SMILES-based foundation models such as ChemBERTa and MolBERT.
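To make the "SMILES as a language" framing concrete, the sketch below splits a SMILES string into atom- and bond-level tokens with a regular expression of the kind widely used for SMILES language models. It is only an illustration: this page does not describe MolDeBERTa's actual tokenizer, and both the pattern and the `tokenize_smiles` helper are assumptions.

```python
import re
from typing import List

# Atom-level SMILES pattern of the kind commonly used for SMILES language
# models. Assumption: MolDeBERTa's real tokenizer may work differently.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens (hypothetical helper)."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Reject input containing characters the pattern silently skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize_smiles("CC(=O)O"))     # acetic acid
    print(tokenize_smiles("c1ccccc1Br"))  # bromobenzene
```

Two-letter elements such as `Br` and `Cl` are listed before the single-letter atoms so they are kept as one token, and bracket atoms such as `[NH4+]` stay whole.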

What distinguishes MolDeBERTa is its emphasis on baking physicochemical and structural priors directly into the learned latent space. Rather than relying solely on standard masked-token pretraining, it introduces additional objectives designed to make the embedding geometry reflect molecular properties and structural similarity, so that chemically similar molecules sit near one another in representation space.

The model is built on the DeBERTaV2 architecture and pretrained at scale on PubChem. Its released weights are available on Hugging Face, so it can be used directly as a backbone for molecular property prediction and related cheminformatics tasks.

#Key Features

  • DeBERTaV2 SMILES encoder: Adapts the disentangled-attention DeBERTaV2 transformer to encode molecular SMILES strings.
  • Property- and structure-aware pretraining: Three pretraining objectives embed inductive biases for physicochemical properties and structural similarity directly into the latent space.
  • Large-scale pretraining: Trained on 123 million molecules drawn from PubChem.
  • Public release: Pretrained weights (Hugging Face) and code (GitHub) are publicly available for fine-tuning and embedding extraction, under a non-commercial, no-derivatives license (CC-BY-NC-ND 4.0).
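A minimal sketch of embedding extraction follows, under stated assumptions. The repository ID `SaeedLab/moldeberta` is the one this page lists for the weights. That the checkpoint loads through the generic `AutoTokenizer` and `AutoModel` classes of the `transformers` library, and that mean pooling over tokens is a sensible way to get one vector per molecule, are assumptions that should be checked against the model card. The helper names are hypothetical.

```python
from typing import List, Sequence

# Repo ID as listed on this page. Assumption: it loads via AutoModel.
MODEL_ID = "SaeedLab/moldeberta"


def masked_mean_pool(token_vectors: Sequence[Sequence[float]],
                     attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (ignores padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_smiles(smiles_batch: Sequence[str]) -> List[List[float]]:
    """Return one mean-pooled embedding per SMILES string."""
    # Imported lazily so the pooling helper works without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    batch = tokenizer(list(smiles_batch), padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)

    return [
        masked_mean_pool(h.tolist(), m.tolist())
        for h, m in zip(hidden, batch["attention_mask"])
    ]


if __name__ == "__main__":
    # Downloads the weights on first use.
    vectors = embed_smiles(["CCO", "c1ccccc1"])
    print(len(vectors), len(vectors[0]))
```

Pooling is done in plain Python here for readability; for large batches the same masked average would normally be computed on tensors.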

#Technical Details

MolDeBERTa is a transformer encoder based on the DeBERTaV2 architecture, pretrained in a self-supervised manner on 123 million SMILES strings from PubChem. The authors introduce three novel pretraining objectives that inject strong inductive biases for molecular properties and structural similarity into the learned representation. Across nine downstream benchmarks, the preprint reports up to a 16% reduction in regression error and classification gains of up to 3.0 ROC-AUC points relative to prior approaches. The weights (Hugging Face, SaeedLab/moldeberta) and code (GitHub, pcdslab/MolDeBERTa) are released under CC-BY-NC-ND 4.0, a non-commercial, no-derivatives license. Parameter counts are not listed here; because the work is a recent preprint, confirm them against the model card and manuscript.
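The two headline numbers are different kinds of quantity: the regression figure is a relative reduction, while the classification figure is presumably an absolute difference on a 0 to 100 ROC-AUC scale. The sketch below shows how each would be computed. All input values are hypothetical and chosen only for illustration; they are not results from the preprint, and this page does not say which regression error metric was used.

```python
from typing import Sequence


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fractional drop in error relative to the baseline (0.16 means 16%)."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_point_gain(baseline_auc: float, new_auc: float) -> float:
    """Absolute gain in ROC-AUC points on a 0 to 100 scale."""
    return 100.0 * (new_auc - baseline_auc)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the preprint.
    print(relative_reduction(0.50, 0.42))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
    print(auc_point_gain(0.85, 0.88))
```

Keeping the two definitions apart matters when comparing models: a 16% relative reduction depends on the baseline error, whereas a 3.0-point ROC-AUC gain does not.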

#Applications

MolDeBERTa is intended for cheminformatics and drug-discovery researchers who need informative molecular representations. Its pretrained embeddings can be fine-tuned for property prediction (both regression and classification), used for similarity search and molecular clustering, or integrated as a featurization backbone in virtual screening and lead-optimization pipelines.
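As a rough illustration of embedding-based similarity search, the sketch below ranks a small library of molecules by cosine similarity to a query embedding. The two-dimensional vectors are made-up placeholders rather than real MolDeBERTa outputs, the function names are hypothetical, and cosine similarity is an assumed choice of metric that this page does not prescribe.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def nearest_neighbors(query: Sequence[float],
                      library: Dict[str, Sequence[float]],
                      k: int = 3) -> List[Tuple[str, float]]:
    """Return the k library entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Placeholder embeddings keyed by SMILES; real ones would come from the model.
    library = {
        "CCO": [1.0, 0.0],
        "CCN": [1.0, 1.0],
        "c1ccccc1": [0.0, 1.0],
    }
    for name, score in nearest_neighbors([1.0, 0.1], library, k=2):
        print(name, round(score, 3))
```

For libraries of millions of molecules, this exhaustive scan would typically be replaced by an approximate nearest-neighbor index.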

#Impact

MolDeBERTa adds a property- and structure-aware variant to the family of SMILES-based molecular foundation models and reports improvements over prior encoders across nine downstream benchmarks. Its publicly released weights lower the barrier to adoption for cheminformatics practitioners, although the non-commercial, no-derivatives license limits downstream reuse. Because the work is a recent preprint, the reported gains still await independent replication.

Tags

molecular_property_prediction, representation_learning, transformer, deberta, foundation_model, self_supervised, cheminformatics