Molexar is a multimodal molecular foundation model for drug design that unifies de novo generation, property-guided design, and target-conditioned generation within a single autoregressive decoder. It was developed by Haoyu Lin, Luhua Lai, Jianfeng Pei and colleagues at the Center for Quantitative Biology, Peking University, and released as a preprint in June 2026. The model addresses a persistent problem in generative chemistry: producing chemically valid, drug-like molecules that simultaneously satisfy diverse, heterogeneous design constraints — from scalar physicochemical properties to the geometry of a protein binding pocket.

The model is built on Fragment-SELFIES, a novel molecular language introduced by the same group. Fragment-SELFIES combines BRICS retrosynthetic fragment decomposition with a SELFIES-style, validity-preserving token encoding, so that each fragment is represented by compact, chemically interpretable tokens plus non-atomic attachment placeholders. This avoids the large, corpus-specific fragment vocabularies that earlier fragment-based methods required while guaranteeing that generated strings decode back to valid RDKit molecules.

Molexar follows a two-stage recipe familiar from large language models: an autoregressive decoder is first pretrained to learn the molecular distribution, then the same decoder undergoes supervised fine-tuning on condition–molecule pairs spanning four modalities. This lets a single, compact model serve as both an unconditional generator and an instruction-following, multi-condition design tool.

Key Features

Fragment-SELFIES representation: A fragment-aware molecular language with validity-preserving decoding, giving 100% chemical validity in unconditional and fragment-constrained generation without large fragment-ID vocabularies.
Unified multimodal conditioning: One decoder is fine-tuned across scalar molecular properties, 2D pharmacophore fingerprints, ESMC-derived protein-sequence embeddings, and GVP-encoded binding-pocket geometry.
Multi-property instruction following: The model honors single- and multi-property targets (e.g., molecular weight, LogP, QED, synthetic accessibility) simultaneously during generation.
Compact and efficient: At roughly 10.5M language-model parameters (about 14.8M total), Molexar matches or exceeds substantially larger models on its benchmarks.
Open weights and code: Both checkpoints, the training code, and the Fragment-SELFIES tooling are released under the permissive MIT license.
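One way to make the multi-property claim concrete is to score generated molecules against the requested targets. The following is a minimal sketch, not the authors' evaluation protocol: the function names `property_mae` and `hit_rate`, the property keys, the tolerances, and the toy numbers are all hypothetical. In practice the measured values would come from a cheminformatics toolkit rather than being typed in by hand.

```python
from typing import Dict, List


def property_mae(targets: Dict[str, float],
                 measured: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean absolute error between each requested target and the measured values."""
    if not measured:
        raise ValueError("need at least one generated molecule")
    return {
        name: sum(abs(m[name] - goal) for m in measured) / len(measured)
        for name, goal in targets.items()
    }


def hit_rate(targets: Dict[str, float],
             measured: List[Dict[str, float]],
             tolerances: Dict[str, float]) -> float:
    """Fraction of molecules that meet every target within its tolerance."""
    if not measured:
        raise ValueError("need at least one generated molecule")
    hits = sum(
        all(abs(m[name] - goal) <= tolerances[name]
            for name, goal in targets.items())
        for m in measured
    )
    return hits / len(measured)


# Toy values for illustration only; these are not model outputs.
targets = {"mol_wt": 350.0, "logp": 2.5}
tolerances = {"mol_wt": 15.0, "logp": 0.5}
measured = [
    {"mol_wt": 340.0, "logp": 2.0},  # within both tolerances
    {"mol_wt": 370.0, "logp": 3.0},  # molecular weight is off by 20
]

mae = property_mae(targets, measured)
rate = hit_rate(targets, measured, tolerances)
print(mae, rate)
```

Requiring every target to be met at once, as `hit_rate` does, is stricter than reporting each property's error separately, which is why both numbers are worth tracking for multi-property requests.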

Technical Details

Molexar uses a Gemma2-style autoregressive decoder (RoPE positional encoding, grouped-query attention, sliding-window/full-attention layers, and logit softcapping) with 16 layers, a 256-dimension hidden size, a 256-token context length, and a 127-token Fragment-SELFIES vocabulary, for about 10.5M LM parameters.

The base checkpoint (molexar-10m-base) was pretrained on a UniChem-derived corpus of roughly 135.8M Fragment-SELFIES records (~33.9M molecule-condition rows); the molexar-10m-omni checkpoint adds supervised fine-tuning on nine scalar properties, 1,032-d pharmacophore fingerprints, 1,152-d mean-pooled ESMC-600M sequence embeddings, and 256-d GVP pocket features, with target-conditioned data drawn from SAIR (573,463 pairs) and PLINDER (21,770 pairs) after identity filtering. Training ran for 5 epochs on 8 H800 GPUs in bfloat16.

On unconditional sampling the model reaches 1.0000 validity and 0.9997 uniqueness; on CrossDocked2020 target-conditioned generation it is competitive, with reported pocket-conditioned mean Vina scores around -7.4 and ~53% high-affinity hits, and it produces favorable safety and potency profiles on MolGenBench.
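For readers unfamiliar with the sampling metrics quoted above, here is a minimal sketch of how validity and uniqueness are commonly computed. It assumes the usual conventions (validity as the share of samples that decode to a molecule, uniqueness as the share of distinct molecules among the valid ones); the preprint's exact definitions may differ. The function name `sampling_metrics` and the toy inputs are hypothetical.

```python
from typing import Dict, Optional, Sequence


def sampling_metrics(samples: Sequence[Optional[str]]) -> Dict[str, float]:
    """Validity and uniqueness for a batch of generated molecules.

    Each entry is a canonical molecule string, or None if decoding failed.
    """
    total = len(samples)
    valid = [s for s in samples if s is not None]
    # Validity: fraction of all samples that decoded to a molecule.
    validity = len(valid) / total if total else 0.0
    # Uniqueness: fraction of distinct strings among the valid samples.
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return {"validity": validity, "uniqueness": uniqueness}


# Toy input, not model output: five decoded strings with one duplicate.
toy = ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "CCN"]
metrics = sampling_metrics(toy)
print(metrics)
```

Comparing strings only measures uniqueness correctly if every molecule has been canonicalized first, since the same structure can be written in several equivalent ways.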

Applications

Molexar targets the early stages of small-molecule drug discovery. Medicinal chemists can use it for de novo ideation, fragment-constrained elaboration around a retained substructure, property-optimized library design, and structure-based generation against a known protein sequence or binding pocket. Because conditioning modalities are interchangeable within one model, the same checkpoint supports workflows ranging from ligand-based design (pharmacophore or property targets) to structure-based design (sequence or pocket targets), making it useful to both computational chemistry teams and structural-biology-driven discovery pipelines.

Impact

By demonstrating that a compact, ~10M-parameter decoder can deliver perfect chemical validity and competitive target-conditioned generation, Molexar challenges the assumption that multimodal molecular design requires very large models, lowering the compute barrier for generative drug design. Its accompanying Fragment-SELFIES language is a reusable contribution that could be adopted independently by other molecular language models seeking validity guarantees without unwieldy fragment vocabularies. Because Molexar is a recently posted preprint, its real-world adoption and independent benchmarking remain to be established despite the permissively licensed weights and code, and the reported results have not yet undergone peer review.

Citation

Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design
