MolDeBERTa

SMILES molecular encoder on a DeBERTaV2 backbone, pretrained on 123M PubChem molecules with physicochemical and structural-similarity objectives.

Released: February 2026

MolDeBERTa is a self-supervised molecular encoder that learns representations of small molecules from their SMILES strings, developed in Fahad Saeed's lab (SaeedLab) at Florida International University and posted to bioRxiv in early 2026. It treats molecular structure as a "language" and applies a transformer encoder to learn embeddings useful for a range of downstream chemistry tasks, joining a family of SMILES-based foundation models such as ChemBERTa and MolBERT.

What distinguishes MolDeBERTa is its emphasis on baking physicochemical and structural priors directly into the learned latent space. Rather than relying solely on standard masked-token pretraining, it introduces additional objectives designed to make the embedding geometry reflect molecular properties and structural similarity, so that chemically similar molecules sit near one another in representation space.
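
The source does not say which structural-similarity measure the pretraining objectives use. As a point of reference only, the sketch below computes the Tanimoto coefficient, a common cheminformatics measure of overlap between two molecular fingerprints. The fingerprints are represented here as plain sets of "on" bit indices, and the bit values are made up for illustration; nothing in this block is taken from the MolDeBERTa code.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient: |A intersect B| / |A union B| for two bit sets."""
    union = bits_a | bits_b
    if not union:
        # Two empty fingerprints: defined as 0.0 here; conventions differ.
        return 0.0
    return len(bits_a & bits_b) / len(union)


# Hypothetical fingerprints (indices of set bits), not real molecules.
fp_1 = {1, 2, 3}
fp_2 = {2, 3, 4}
similarity = tanimoto(fp_1, fp_2)  # 2 shared bits out of 4 distinct bits
```

A structure-aware latent space, in the sense described above, would be one where pairs with a high fingerprint similarity also tend to have nearby embeddings.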

The model is built on the DeBERTaV2 architecture and pretrained at scale on PubChem. Its weights are released on Hugging Face, which makes it directly usable as a backbone for molecular property prediction and related cheminformatics tasks.

Key Features

DeBERTaV2 SMILES encoder: Adapts the disentangled-attention DeBERTaV2 transformer to encode molecular SMILES strings.
Property- and structure-aware pretraining: Three pretraining objectives embed inductive biases for physicochemical properties and structural similarity directly into the latent space.
Large-scale pretraining: Trained on 123 million molecules drawn from PubChem.
Released weights: Pretrained weights are available on Hugging Face, supporting immediate fine-tuning and embedding extraction.
Public release: Weights and code are publicly available, though under a non-commercial, no-derivatives license (CC-BY-NC-ND 4.0).
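
A minimal sketch of embedding extraction, assuming the checkpoint loads through the generic `AutoTokenizer` and `AutoModel` classes of the Hugging Face `transformers` library. The repository identifier `SaeedLab/moldeberta` is taken from this page; the use of masked mean pooling as the readout and the helper names `mean_pool` and `embed_smiles` are assumptions for illustration, so the model card should be checked for the recommended loading and pooling procedure.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average token vectors over positions where attention_mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_smiles(smiles: str, model_id: str = "SaeedLab/moldeberta") -> List[float]:
    """Return one pooled embedding for a SMILES string.

    Assumptions (not confirmed by the source): the checkpoint loads via the
    Auto* classes, and mean pooling is an acceptable readout.
    """
    # Imported lazily so mean_pool stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    hidden = outputs.last_hidden_state[0].tolist()  # [tokens][hidden_dim]
    mask = inputs["attention_mask"][0].tolist()     # [tokens]
    return mean_pool(hidden, mask)
```

Calling `embed_smiles("CCO")` would download the weights on first use, so it requires network access and acceptance of the CC-BY-NC-ND 4.0 terms.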

Technical Details

MolDeBERTa is a transformer encoder based on the DeBERTaV2 architecture, pretrained in a self-supervised manner on 123 million SMILES strings from PubChem. The authors introduce three novel pretraining objectives that inject strong inductive biases for molecular properties and structural similarity into the learned representation. Across nine downstream benchmarks, the preprint reports up to a 16% reduction in regression error and classification gains of up to 3.0 ROC-AUC points relative to prior approaches. The weights (Hugging Face, SaeedLab/moldeberta) and code (GitHub, pcdslab/MolDeBERTa) are released under CC-BY-NC-ND 4.0, a non-commercial, no-derivatives license. Because the work is a recent preprint, exact parameter counts should be confirmed against the model card and manuscript.
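
To make the two headline figures concrete, the sketch below shows how such numbers are typically computed. It assumes the 16% figure is a relative reduction in an error metric such as RMSE or MAE, and that "ROC-AUC points" means a difference on a 0 to 100 scale; the preprint should be consulted for the exact definitions. The input values are invented for illustration and are not results from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Percentage by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


def roc_auc_point_gain(baseline_auc: float, new_auc: float) -> float:
    """Gain in ROC-AUC points, with AUC values given on a 0 to 1 scale."""
    return 100.0 * (new_auc - baseline_auc)


# Hypothetical numbers chosen only to match the scale of the reported gains.
print(round(relative_error_reduction(1.00, 0.84), 1))  # 16.0
print(round(roc_auc_point_gain(0.850, 0.880), 1))      # 3.0
```

Both figures are stated as "up to" values, so they describe the best case across the nine benchmarks rather than an average.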

Applications

MolDeBERTa is intended for cheminformatics and drug-discovery researchers who need informative molecular representations. Starting from the publicly available weights, the model can be fine-tuned for property prediction (both regression and classification), and its embeddings can be used for similarity search and molecular clustering or integrated as a featurization backbone in virtual screening and lead-optimization pipelines.
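
The similarity-search use case can be sketched as a nearest-neighbour lookup over precomputed embeddings. The example below ranks a small library by cosine similarity to a query vector. The three-dimensional vectors are toy stand-ins for real model embeddings, and the function names are hypothetical; cosine similarity is one reasonable choice of metric, not one prescribed by the source.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # zero vectors carry no direction
    return dot / (norm_a * norm_b)


def nearest_molecules(query: Sequence[float],
                      library: Dict[str, Sequence[float]],
                      k: int = 3) -> List[Tuple[str, float]]:
    """Return the k library entries most similar to the query embedding."""
    scored = [(smiles, cosine_similarity(query, vec))
              for smiles, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (hypothetical values).
library = {
    "CCO": [1.0, 0.0, 0.0],
    "CCN": [0.9, 0.1, 0.0],
    "c1ccccc1": [0.0, 1.0, 0.0],
}
hits = nearest_molecules([1.0, 0.0, 0.0], library, k=2)
```

For libraries of realistic size, the linear scan would normally be replaced by a vectorized or approximate nearest-neighbour index.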

Impact

MolDeBERTa contributes a property- and structure-aware variant to the SMILES-based molecular foundation model landscape, reporting improvements over prior encoders on a broad benchmark suite. Its openly released weights lower the barrier to adoption for cheminformatics practitioners, though the non-commercial, no-derivatives license limits downstream reuse. Because the work is a recent preprint, the reported gains await independent replication.

Citation

MolDeBERTa: Foundational Model for Physicochemical and Structural-Informed Molecular Representation Learning

Oliveira, G. B. d. & Saeed, F. (2026) MolDeBERTa: Foundational Model for Physicochemical and Structural-Informed Molecular Representation Learning. bioRxiv.

DOI: 10.64898/2026.02.15.706011

Citations

Total Citations: 0
Influential: 0
References: 31

GitHub

Stars: 4
Forks: 1
Open Issues: 0
Contributors: 1
Last Push: 1mo ago
Language: Python

HuggingFace

Downloads: 6
Likes: 0
Last Modified: 1mo ago
Pipeline: feature-extraction

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 25 (Closed)
Usability (can I run it?): 19
Reproducibility (can I retrain it?): 18

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository: pcdslab/MolDeBERTa
Research Paper: bioRxiv, DOI 10.64898/2026.02.15.706011
HuggingFace Model: SaeedLab/moldeberta
