IBM Research
Multi-view molecular foundation model that integrates graph, image, and text representations via late fusion for molecular property and target prediction.
BioMed Multi-View (also referred to by its architecture as MMELON — Multi-view Molecular Embedding with Late fusION) is a multi-modal molecular foundation model developed by IBM Research and released as a preprint in October 2024. The model addresses a fundamental representational challenge in computational chemistry: any single encoding of a molecule — whether as a text string, a 2D graph, or a 2D image — captures certain structural features well while losing others. Molecular graphs encode bond topology explicitly but discard the spatial intuitions that chemists use when looking at structural depictions. Molecular images preserve visual geometry but require alignment-robust feature extraction. SMILES strings enable language model pretraining on enormous corpora but are sensitive to canonicalization choices and lose graph-level symmetry information. MMELON combines all three views into a unified molecular encoder that is more robust across tasks than any individual view alone.
The architecture trains three single-view foundation models independently — a graph encoder, an image encoder, and a text encoder based on SMILES — each on up to 200 million molecules. These single-view models are then combined through a late fusion mechanism that produces a joint molecular embedding by aggregating view-specific representations rather than requiring deep fusion at every transformer layer. This design choice has practical advantages: each single-view model can be used independently when only one representation type is available, and the multi-view model can be ablated or extended with additional views (such as 3D conformers) without retraining from scratch. The model is validated on more than 120 downstream tasks spanning ADME properties, molecular solubility, and biological activity against protein targets, demonstrating that multi-view fusion consistently matches or exceeds the best-performing single view across diverse property types.
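The late-fusion step described above can be sketched in a few lines. Everything here is an illustrative assumption — the function names, the embedding dimension, and the use of softmax-normalized attention logits are stand-ins for whatever parameterization the released model actually uses — but it shows the core idea: each view contributes a fixed-size embedding, and fusion is a learned convex combination rather than deep cross-view interaction at every layer.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D array of logits
    e = np.exp(x - np.max(x))
    return e / e.sum()

def late_fusion(view_embeddings, attn_logits):
    """Aggregate view-specific embeddings into one joint embedding.

    view_embeddings: (n_views, d) array, one row per single-view encoder
                     (graph, image, text). Values here are toy data.
    attn_logits:     (n_views,) scores; in the real model these would be
                     learned, here they are fixed for illustration.
    """
    weights = softmax(attn_logits)        # convex weights over the views
    return weights @ view_embeddings      # (d,) fused molecular embedding

# Toy 4-dim embeddings standing in for the three encoder outputs.
views = np.array([
    [1.0, 0.0, 0.0, 0.0],   # graph view
    [0.0, 1.0, 0.0, 0.0],   # image view
    [0.0, 0.0, 1.0, 0.0],   # text (SMILES) view
])
fused = late_fusion(views, np.array([0.0, 0.0, 0.0]))
# Equal logits give equal weights, i.e. a simple mean of the three views.
```

Because fusion happens only at the embedding level, dropping a view (or adding a new one, such as a 3D conformer encoder) changes only the aggregation step, not the pretrained single-view backbones.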
A particularly compelling application reported in the paper is a large-scale screening of compounds against more than 100 G protein-coupled receptor (GPCR) targets, with a focus on 33 GPCRs associated with Alzheimer's disease. Using MMELON to identify high-affinity candidates for these disease-associated receptors, the team validated predictions through structure-based molecular docking and binding motif analysis, establishing a path from foundation model embeddings to experimentally testable drug leads.
MMELON's three view-specific encoders each use transformer-based architectures. The text encoder processes canonical SMILES strings using character-level tokenization with a transformer backbone pretrained with a masked language modeling objective on up to 200 million molecules from PubChem and ZINC. The graph encoder uses a graph transformer or graph attention network architecture with atom and bond features as node and edge inputs, also pretrained on a large-scale molecular graph dataset. The image encoder processes rasterized 2D structural depictions of molecules using a vision transformer or CNN backbone with self-supervised pretraining on molecular images. Late fusion aggregates representations from each view using learned attention weights or concatenation, producing the final 84-million-parameter MMELON encoder (released as biomed.sm.mv-te-84m on HuggingFace). During fine-tuning, task-specific heads are attached to the fused representation.
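To make the text encoder's input concrete, here is a minimal sketch of character-level SMILES tokenization. The function name, the vocabulary, and the handling of two-character element symbols are illustrative assumptions — the released model's actual tokenizer and special tokens are not reproduced here:

```python
# Minimal character-level SMILES tokenizer (illustrative only).
# Two-character element symbols such as Cl and Br are kept intact;
# every other character becomes its own token.
TWO_CHAR_SYMBOLS = ("Cl", "Br")

def tokenize_smiles(smiles):
    tokens = []
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in TWO_CHAR_SYMBOLS:
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens

# Aspirin's canonical SMILES splits into 21 tokens under this scheme.
tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
```

During pretraining, a masked language modeling objective would then hide a random subset of these tokens and train the transformer to recover them from context.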
Across the 120+ benchmark tasks, the multi-view model consistently performs at or above the level of the best single view, demonstrating that late fusion provides robustness without sacrificing peak performance. On molecular solubility tasks (ESOL, FreeSolv), the multi-view model achieves RMSE values comparable to the state of the art. On ADME property benchmarks including Caco-2 permeability and plasma protein binding, multi-view fusion outperforms each single view individually. The GPCR screening application identified candidate binders for 33 Alzheimer's-associated receptors out of the more than 100 GPCR targets screened, with structure-based validation confirming key binding motifs for multiple candidates.
BioMed Multi-View is designed for computational chemistry and drug discovery teams who need robust molecular representations that generalize across structurally diverse compound libraries and heterogeneous prediction tasks. In virtual screening campaigns, the multi-view encoder can generate embeddings for rapid nearest-neighbor search or machine learning-based activity prediction across large compound collections. For ADMET profiling — predicting absorption, distribution, metabolism, excretion, and toxicity — the multi-view representation provides richer features than single-view approaches, improving accuracy when labeled data is limited. Target engagement studies, particularly against challenging protein families like GPCRs, can leverage the GPCR screening pipeline demonstrated in the paper. The modular architecture also makes MMELON a useful component in multi-task drug discovery pipelines where different stages require different molecular featurizations — the same foundational representation can be shared across solubility prediction, toxicity filtering, and binding affinity estimation.
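The nearest-neighbor screening workflow mentioned above can be sketched with cosine similarity over precomputed embeddings. The random vectors below are stand-ins for MMELON embeddings, and the function name is hypothetical — the point is only the shape of the workflow: embed a library once, then rank it against any query compound:

```python
import numpy as np

def top_k_neighbors(query, library, k=3):
    """Rank library compounds by cosine similarity to a query embedding.

    query:   (d,) embedding of the query compound.
    library: (n, d) matrix of precomputed compound embeddings.
    Returns the indices and similarities of the k nearest compounds.
    """
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                       # cosine similarity to each compound
    order = np.argsort(-sims)[:k]        # indices of the k most similar
    return order, sims[order]

rng = np.random.default_rng(0)
library = rng.normal(size=(1000, 8))     # stand-in for MMELON embeddings
query = library[42] + 0.01 * rng.normal(size=8)  # near-duplicate of item 42
idx, sims = top_k_neighbors(query, library, k=3)
```

In a real campaign the library embeddings would be computed once with the encoder and indexed (e.g. with an approximate nearest-neighbor library) so that screening scales to millions of compounds.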
BioMed Multi-View contributes a principled solution to the molecular representation problem by demonstrating that late-fusion integration of independently pretrained view-specific encoders provides more consistent and robust downstream performance than any single view. This finding is relevant to the broader machine learning community working on multi-modal learning, as it validates late fusion as a viable alternative to expensive early-fusion or cross-attention approaches for modalities with heterogeneous structure. The HuggingFace model checkpoint (biomed.sm.mv-te-84m) and GitHub implementation are publicly released, supporting community adoption and comparison. MMELON is part of IBM Research's BioMedical Foundation Models (BMFM) initiative alongside MAMMAL, MoLFormer-XL, and BioMed Multi-Omic, positioning it within a coherent ecosystem of biological AI tools. The primary limitation of the current model is the absence of 3D conformational views, which limits accuracy on tasks where binding geometry is the dominant determinant of activity; adding conformer-based encoders to the late-fusion framework is a natural next step.