A unified multimodal molecular foundation model built on Fragment-SELFIES, generating molecules conditioned on properties, pharmacophores, protein sequences, and binding pockets.
Molexar is a multimodal molecular foundation model for drug design that unifies de novo generation, property-guided design, and target-conditioned generation within a single autoregressive decoder. It was developed by Haoyu Lin, Luhua Lai, Jianfeng Pei and colleagues at the Center for Quantitative Biology, Peking University, and released as a preprint in June 2026. The model addresses a persistent problem in generative chemistry: producing chemically valid, drug-like molecules that simultaneously satisfy heterogeneous design constraints, ranging from scalar physicochemical properties to the geometry of a protein binding pocket.
The model is built on Fragment-SELFIES, a novel molecular language introduced by the same group. Fragment-SELFIES combines BRICS retrosynthetic fragment decomposition with a SELFIES-style, validity-preserving token encoding, so that each fragment is represented by compact, chemically interpretable tokens plus non-atomic attachment placeholders. This avoids the large, corpus-specific fragment vocabularies that earlier fragment-based methods required while guaranteeing that generated strings decode back to valid RDKit molecules.
Molexar follows a two-stage recipe familiar from large language models: an autoregressive decoder is first pretrained to learn the molecular distribution, and the same decoder is then fine-tuned with supervision on condition–molecule pairs spanning four modalities. This lets a single, compact model serve as both an unconditional generator and an instruction-following, multi-condition design tool.
Molexar uses a Gemma2-style autoregressive decoder (RoPE positional encoding, grouped-query
attention, alternating sliding-window and full-attention layers, and logit softcapping) with
16 layers, a hidden size of 256, a 256-token context length, and a 127-token Fragment-SELFIES
vocabulary, for about 10.5M language-model parameters. The base checkpoint (molexar-10m-base) was
pretrained on a UniChem-derived corpus of roughly 135.8M Fragment-SELFIES records
(~33.9M molecule-condition rows); the molexar-10m-omni checkpoint adds supervised
fine-tuning on nine scalar properties, 1,032-d pharmacophore fingerprints, 1,152-d mean-pooled
ESMC-600M sequence embeddings, and 256-d GVP pocket features, with target-conditioned data
drawn from SAIR (573,463 pairs) and PLINDER (21,770 pairs) after identity filtering. Training
ran for 5 epochs on 8 H800 GPUs in bfloat16. On unconditional sampling the model reaches
1.0000 validity and 0.9997 uniqueness; on CrossDocked2020 target-conditioned generation it is
competitive, with reported pocket-conditioned mean Vina scores around -7.4 and ~53% high-affinity
hits, and it produces favorable safety and potency profiles on MolGenBench.
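
The architecture figures above can be sanity-checked with some simple arithmetic. The sketch below is not taken from the Molexar code. It records the stated hyperparameters, computes the size of the token embedding table, and derives the average parameter budget left for each decoder layer. It assumes tied input/output embeddings and bias-free linear layers (both are Gemma2 conventions, but the source does not confirm them for Molexar), and it rounds "about 10.5M" to exactly 10,500,000. The `layer_params` helper and the head counts, head dimension, and feed-forward width passed to it are hypothetical, since the text does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderSpec:
    # Values stated in the text above.
    n_layers: int = 16
    hidden: int = 256
    vocab: int = 127
    context: int = 256
    # "about 10.5M" in the text; rounded here for illustration only.
    total_params: int = 10_500_000


def embedding_params(spec: DecoderSpec) -> int:
    """Token embedding table size, assuming input/output embeddings are tied."""
    return spec.vocab * spec.hidden


def per_layer_budget(spec: DecoderSpec) -> float:
    """Average parameters available per decoder layer after the embedding."""
    return (spec.total_params - embedding_params(spec)) / spec.n_layers


def layer_params(hidden: int, n_heads: int, n_kv_heads: int,
                 head_dim: int, ffn_dim: int) -> int:
    """Parameter count of one bias-free decoder layer (hypothetical layout).

    Grouped-query attention: full-width Q and O projections, narrower K and V.
    Gated MLP: gate, up, and down projections. Four RMSNorm weight vectors.
    """
    q = hidden * n_heads * head_dim
    kv = 2 * hidden * n_kv_heads * head_dim
    o = n_heads * head_dim * hidden
    mlp = 3 * hidden * ffn_dim
    norms = 4 * hidden
    return q + kv + o + mlp + norms


spec = DecoderSpec()
emb = embedding_params(spec)
print(f"embedding table: {emb:,} params "
      f"({100 * emb / spec.total_params:.2f}% of total)")
print(f"average budget per layer: {per_layer_budget(spec):,.0f} params")

# Hypothetical layer shape, chosen only to show how the budget could be spent.
example = layer_params(hidden=256, n_heads=8, n_kv_heads=4,
                       head_dim=32, ffn_dim=576)
print(f"example layer: {example:,} params")
```

Under these assumptions the 127-token vocabulary accounts for well under 1% of the parameter count, so nearly all of the model's capacity sits in the decoder layers rather than in the embedding table.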
Molexar targets the early stages of small-molecule drug discovery. Medicinal chemists can use it for de novo ideation, fragment-constrained elaboration around a retained substructure, property-optimized library design, and structure-based generation against a known protein sequence or binding pocket. Because conditioning modalities are interchangeable within one model, the same checkpoint supports workflows ranging from ligand-based design (pharmacophore or property targets) to structure-based design (sequence or pocket targets), making it useful to both computational chemistry teams and structural-biology-driven discovery pipelines.
By demonstrating that a compact, ~10M-parameter decoder can deliver perfect chemical validity and competitive target-conditioned generation, Molexar challenges the assumption that multimodal molecular design requires very large models, lowering the compute barrier for generative drug design. Its accompanying Fragment-SELFIES language is a reusable contribution that other molecular language models could adopt independently to obtain validity guarantees without unwieldy fragment vocabularies. Molexar is a recently posted preprint with permissively licensed weights and code; its real-world adoption and independent benchmarking remain to be established, and the reported results have not yet undergone peer review.
Lin, H., et al. (2026) Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design. arXiv.
DOI: 10.48550/arXiv.2606.25865