Multimodal molecular foundation model fusing SELFIES sequences, 2D graphs, text descriptions, and knowledge-graph embeddings via contrastive pretraining for property prediction.
Molecular representation learning underpins much of modern computational drug discovery, where the goal is to predict properties such as toxicity, solubility, or bioactivity from a molecule's structure. Most existing models, however, rely on a single view of a molecule, typically a sequence notation like SMILES or a 2D structural graph, and therefore miss complementary information available in other modalities such as natural-language descriptions or curated biological knowledge graphs.
SELFormerMM, developed by the HUBioDataLab at Hacettepe University and posted as a preprint in March 2026, is a multimodal molecular foundation model that integrates four modalities: SELFIES sequence notations, 2D structural graphs, textual descriptions, and knowledge-graph-derived biological interaction data. It extends the earlier SELFormer chemical language model by aligning these modality-specific representations through contrastive pretraining on roughly three million molecules, producing embeddings that downstream models can use for property prediction.
By fusing chemical structure with biological context drawn from a knowledge graph, SELFormerMM aims to capture aspects of molecules that pure structure-based models overlook, and the authors report improved performance over single-modality alternatives on molecular property tasks.
SELFormerMM uses four modality-specific branches whose outputs are projected into a shared 768-dimensional space. The sequence branch is SELFormer (a RoBERTa-based model over SELFIES); the text branch uses frozen SciBERT embeddings; the structure branch uses frozen Uni-Mol features from 3D conformers; and the knowledge-graph branch uses DMGI embeddings derived from the CROssBARv2 biological knowledge graph. Non-linear MLP projection heads map each branch into the shared space, where a supervised contrastive (SINCERE) loss aligns the modalities during pretraining on approximately three million molecules. For downstream prediction, the multimodal embeddings are concatenated and fed to task heads covering binary classification, multilabel classification, and regression. The authors report gains over single-modality baselines on molecular property benchmarks. Code is licensed under GPL-3.0, and the preprint is released under CC-BY.
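The projection, alignment, and fusion steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the SELFormerMM implementation: the loss is an ordinary InfoNCE objective used as a simplified stand-in for SINCERE (which, as a supervised loss, also uses label information), and `init_head`, `project`, `info_nce`, `fuse`, the hidden size, the toy input sizes, and the temperature are hypothetical names and values chosen for readability. Only the 768-dimensional shared space and the concatenation of modality embeddings come from the description above.

```python
import math
import random

EMBED_DIM = 768  # shared embedding size stated in the text


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def init_head(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a hypothetical two-layer projection head."""
    w1 = [[rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
          for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden_dim)) for _ in range(hidden_dim)]
          for _ in range(out_dim)]
    return w1, w2


def project(x, head):
    """Non-linear projection: linear layer, ReLU, linear layer."""
    w1, w2 = head
    hidden = [max(0.0, sum(a * b for a, b in zip(row, x))) for row in w1]
    return [sum(a * b for a, b in zip(row, hidden)) for row in w2]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to positives[i],
    using every other item in the batch as a negative."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def fuse(*modality_embeddings):
    """Concatenate per-modality embeddings into one feature vector."""
    return [x for emb in modality_embeddings for x in emb]


# Toy demo with small dimensions: two modalities, batch of three molecules.
rng = random.Random(0)
seq_head = init_head(in_dim=6, hidden_dim=8, out_dim=4, rng=rng)
text_head = init_head(in_dim=5, hidden_dim=8, out_dim=4, rng=rng)
seq_batch = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(3)]
text_batch = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(3)]
z_seq = [project(x, seq_head) for x in seq_batch]
z_text = [project(x, text_head) for x in text_batch]

# Symmetric loss: sequence-to-text plus text-to-sequence.
loss = 0.5 * (info_nce(z_seq, z_text) + info_nce(z_text, z_seq))
print(f"symmetric contrastive loss on toy batch: {loss:.4f}")

# Four 768-d modality embeddings concatenate to a 3072-d downstream input.
fused = fuse(*[[0.0] * EMBED_DIM for _ in range(4)])
print(len(fused))
```

With four modalities, a pairwise loss like this would be applied across modality pairs; how the pairs are chosen and weighted in SELFormerMM is not specified here, so the sketch shows only one pair.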
SELFormerMM targets molecular property prediction tasks central to early-stage drug discovery, including ADMET-style endpoints, toxicity, and bioactivity classification and regression. Cheminformatics and machine-learning researchers can fine-tune the pretrained model or use its multimodal embeddings as features for their own predictors. The knowledge-graph component makes it particularly relevant when biological context, such as known interactions, is expected to inform a molecule's behavior beyond its chemical structure alone.
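As a rough illustration of the "embeddings as features" workflow, the sketch below fits a small binary logistic-regression head on precomputed vectors. Everything in it is an assumption made for illustration: the four-dimensional `toy_features` are placeholders rather than real SELFormerMM embeddings, and `train_logistic_head` and `predict_proba` are hypothetical helpers, not part of any released code. In practice the feature vectors would come from the pretrained model, and a standard machine-learning library would likely replace the hand-written training loop.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_head(features, labels, lr=0.5, epochs=200):
    """Fit a binary logistic head with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Placeholder data: toy vectors standing in for molecule embeddings,
# with made-up binary labels (e.g. active = 1, inactive = 0).
toy_features = [
    [1.0, 0.2, 0.0, 0.1],
    [0.9, 0.1, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.8],
    [0.1, 0.0, 0.9, 1.0],
]
toy_labels = [1, 1, 0, 0]

w, b = train_logistic_head(toy_features, toy_labels)
for x, y in zip(toy_features, toy_labels):
    print(y, round(predict_proba(x, w, b), 3))
```

Because the encoder stays fixed in this setup, only the small head is trained, which keeps experiments cheap; fine-tuning the pretrained model is the heavier alternative mentioned above.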
SELFormerMM contributes to a growing line of multimodal molecular models that move past single-view representations by fusing structure with text and curated biological knowledge. Its open code and released checkpoints make the work straightforward for the cheminformatics community to reproduce and extend. Because it is a recent preprint, its reported advantages over single-modality baselines still await independent benchmarking, and its reliance on several frozen external encoders means its performance is partly bounded by those upstream components.