Florida International University
A DeBERTaV2-based SMILES encoder pretrained on 123M PubChem molecules with physicochemical and structural-similarity objectives for molecular representation learning.
MolDeBERTa is a self-supervised molecular encoder that learns representations of small molecules from their SMILES strings, developed in Fahad Saeed's lab (SaeedLab) at Florida International University and posted to bioRxiv in early 2026. It treats molecular structure as a "language" and applies a transformer encoder to learn embeddings useful for a range of downstream chemistry tasks, joining a family of SMILES-based foundation models such as ChemBERTa and MolBERT.
What distinguishes MolDeBERTa is its emphasis on baking physicochemical and structural priors directly into the learned latent space. Rather than relying solely on standard masked-token pretraining, it introduces additional objectives designed to make the embedding geometry reflect molecular properties and structural similarity, so that chemically similar molecules sit near one another in representation space.
The model is built on the DeBERTaV2 architecture and pretrained at scale on PubChem. Its released weights are available on Hugging Face, which makes it directly usable as a backbone for molecular property prediction and related cheminformatics tasks.
MolDeBERTa is a transformer encoder based on the DeBERTaV2 architecture, pretrained in a self-supervised manner on 123 million PubChem molecules represented as SMILES strings. The authors introduce three novel pretraining objectives that inject strong inductive biases for molecular properties and structural similarity into the learned representation. On nine downstream benchmarks, the preprint reports up to a 16% reduction in regression error and classification gains of up to 3.0 ROC-AUC points relative to prior approaches. The model's weights (Hugging Face, SaeedLab/moldeberta) and code (GitHub, pcdslab/MolDeBERTa) are released under CC-BY-NC-ND 4.0, a non-commercial, no-derivatives license. Because the work is a recent preprint, exact parameter counts should be confirmed against the model card and manuscript.
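A minimal sketch of how the released checkpoint might be loaded to extract embeddings, assuming it is compatible with the generic `AutoTokenizer` and `AutoModel` classes of the `transformers` library. The repository id is copied from the text above and may not match the exact checkpoint name on the model card. The helper name `embed_smiles` is hypothetical, and mean pooling over non-padding tokens is an illustrative readout, not necessarily the one the authors use.

```python
# Assumed repository id, taken from the text above; check the model card
# for the exact checkpoint name before use.
MODEL_ID = "SaeedLab/moldeberta"


def embed_smiles(smiles, model_id=MODEL_ID):
    """Return one mean-pooled embedding per SMILES string (torch.Tensor)."""
    # Imported lazily so this file can be loaded without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(list(smiles), padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, dim)

    # Mean pooling over real (non-padding) tokens only.
    # This readout is an assumption, not the authors' documented choice.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

Under these assumptions, calling `embed_smiles(["CCO", "c1ccccc1"])` (ethanol and benzene) would download the checkpoint and return a tensor with one row per molecule.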
MolDeBERTa is intended for cheminformatics and drug-discovery researchers who need informative molecular representations. The pretrained encoder can be fine-tuned for property prediction (both regression and classification), and its embeddings can be used for similarity search and molecular clustering or integrated as a featurization backbone in virtual screening and lead-optimization pipelines. Because the weights are publicly available, none of these uses requires repeating the pretraining.
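The similarity-search use case can be sketched without the model itself: once every molecule has an embedding vector, ranking a library against a query reduces to cosine similarity. The three-dimensional vectors below are made-up stand-ins for real MolDeBERTa embeddings, and `cosine` and `rank_by_cosine` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(query, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors for illustration only, NOT real model outputs.
library = {
    "ethanol": [0.9, 0.1, 0.0],
    "propanol": [0.8, 0.2, 0.1],
    "benzene": [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]  # stand-in embedding for a small alcohol

ranking = rank_by_cosine(query, library)
print([name for name, _ in ranking])  # ['ethanol', 'propanol', 'benzene']
```

For libraries of realistic size, the same ranking would normally be computed with a vectorized library or an approximate nearest-neighbor index rather than a Python loop.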
MolDeBERTa contributes a property- and structure-aware variant to the SMILES-based molecular foundation model landscape, reporting improvements over prior encoders on a broad benchmark suite. Its openly released weights lower the barrier to adoption for cheminformatics practitioners, though the non-commercial, no-derivatives license limits downstream reuse. The reported gains come from a recent preprint and still await independent replication, but the public release makes the model straightforward to evaluate in practice.