bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
Small molecule foundation models

Molexar

Peking University

A unified multimodal molecular foundation model built on Fragment-SELFIES, generating molecules conditioned on properties, pharmacophores, protein sequences, and binding pockets.

Released: June 2026
Parameters: 10.5 Million

Molexar is a multimodal molecular foundation model for drug design that unifies de novo generation, property-guided design, and target-conditioned generation within a single autoregressive decoder. It was developed by Haoyu Lin, Luhua Lai, Jianfeng Pei and colleagues at the Center for Quantitative Biology, Peking University, and released as a preprint in June 2026. The model addresses a persistent problem in generative chemistry: producing chemically valid, drug-like molecules that simultaneously satisfy diverse, heterogeneous design constraints — from scalar physicochemical properties to the geometry of a protein binding pocket.

The model is built on Fragment-SELFIES, a novel molecular language introduced by the same group. Fragment-SELFIES combines BRICS retrosynthetic fragment decomposition with a SELFIES-style, validity-preserving token encoding, so that each fragment is represented by compact, chemically interpretable tokens plus non-atomic attachment placeholders. This avoids the large, corpus-specific fragment vocabularies that earlier fragment-based methods required while guaranteeing that generated strings decode back to valid RDKit molecules.

Molexar follows a two-stage recipe familiar from large language models: an autoregressive decoder is first pretrained to learn the molecular distribution, and the same decoder is then fine-tuned with supervision on condition-molecule pairs spanning four modalities. This lets a single, compact model serve as both an unconditional generator and an instruction-following, multi-condition design tool.

#Key Features

  • Fragment-SELFIES representation: A fragment-aware molecular language with validity-preserving decoding, giving 100% chemical validity in unconditional and fragment-constrained generation without large fragment-ID vocabularies.
  • Unified multimodal conditioning: One decoder is fine-tuned across scalar molecular properties, 2D pharmacophore fingerprints, ESMC-derived protein-sequence embeddings, and GVP-encoded binding-pocket geometry.
  • Multi-property instruction following: The model honors single- and multi-property targets (e.g., molecular weight, LogP, QED, synthetic accessibility) simultaneously during generation.
  • Compact and efficient: At roughly 10.5M language-model parameters (about 14.8M total), Molexar matches or exceeds substantially larger models on its benchmarks.
  • Open weights and code: Both checkpoints, the training code, and the Fragment-SELFIES tooling are released under the permissive MIT license.
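
One way to quantify the multi-property claim above is the fraction of generated molecules whose properties all land within a tolerance of their targets at once. The sketch below is illustrative only: the function names, the tolerance scheme, and the property keys are assumptions rather than anything taken from the Molexar code, and property values are assumed to be precomputed elsewhere (for example with a cheminformatics toolkit).

```python
from typing import Dict, List, Tuple

# Hypothetical format: property name -> (target value, allowed absolute deviation).
Targets = Dict[str, Tuple[float, float]]


def meets_all_targets(props: Dict[str, float], targets: Targets) -> bool:
    """Return True only if every targeted property is present and within tolerance."""
    for name, (target, tolerance) in targets.items():
        if name not in props:
            return False
        if abs(props[name] - target) > tolerance:
            return False
    return True


def joint_success_rate(samples: List[Dict[str, float]], targets: Targets) -> float:
    """Fraction of samples satisfying all targets simultaneously (0.0 if empty)."""
    if not samples:
        return 0.0
    hits = sum(1 for props in samples if meets_all_targets(props, targets))
    return hits / len(samples)


if __name__ == "__main__":
    # Made-up property values for illustration; not results from the model.
    example_targets: Targets = {"mw": (350.0, 25.0), "logp": (2.5, 0.5)}
    example_samples = [
        {"mw": 340.0, "logp": 2.7},
        {"mw": 360.0, "logp": 3.4},
    ]
    print(joint_success_rate(example_samples, example_targets))
```

Scoring the joint condition rather than each property separately matters because per-property hit rates can look high even when few molecules satisfy every constraint together.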

#Technical Details

Molexar uses a Gemma2-style autoregressive decoder (RoPE positional encoding, grouped-query attention, sliding-window/full-attention layers, and logit softcapping) with 16 layers, a 256-dimension hidden size, a 256-token context length, and a 127-token Fragment-SELFIES vocabulary, for a total of about 10.5M language-model parameters.

The base checkpoint (molexar-10m-base) was pretrained on a UniChem-derived corpus of roughly 135.8M Fragment-SELFIES records (~33.9M molecule-condition rows). The molexar-10m-omni checkpoint adds supervised fine-tuning on nine scalar properties, 1,032-d pharmacophore fingerprints, 1,152-d mean-pooled ESMC-600M sequence embeddings, and 256-d GVP pocket features, with target-conditioned data drawn from SAIR (573,463 pairs) and PLINDER (21,770 pairs) after identity filtering. Training ran for 5 epochs on 8 H800 GPUs in bfloat16.

On unconditional sampling the model reaches 1.0000 validity and 0.9997 uniqueness. On CrossDocked2020 target-conditioned generation it is reported to be competitive, with a pocket-conditioned mean Vina score around -7.4 and ~53% high-affinity hits. The authors also report favorable safety and potency profiles on MolGenBench.
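
Two of the Gemma2-style components named above, logit softcapping and sliding-window versus full causal attention, can be illustrated in a few lines of plain Python. This is a minimal sketch of the general techniques, not Molexar's implementation: the cap value and window size used here are assumptions chosen for illustration, and the page does not state the values the model actually uses.

```python
import math
from typing import List, Optional


def softcap(logits: List[float], cap: float) -> List[float]:
    """Smoothly bound each logit to the open interval (-cap, cap).

    Uses cap * tanh(x / cap): near-identity for small |x|, saturating for large |x|.
    """
    return [cap * math.tanh(x / cap) for x in logits]


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Build an attention mask where mask[i][j] is True if query i may see key j.

    window=None gives full causal attention (all earlier positions plus itself).
    An integer window restricts each query to its most recent `window` positions,
    counting itself.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            is_past_or_self = j <= i
            in_window = window is None or (i - j) < window
            row.append(is_past_or_self and in_window)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Assumed cap of 30.0, for illustration only.
    print([round(x, 3) for x in softcap([-100.0, 0.0, 3.0, 100.0], cap=30.0)])
    # Assumed window of 2 tokens on a 4-token sequence, for illustration only.
    for row in causal_mask(4, window=2):
        print(["x" if visible else "." for visible in row])
```

Softcapping keeps extreme logits from dominating without the hard cutoff of clipping, and sliding-window layers reduce attention cost while full-attention layers retain access to the whole context.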

#Applications

Molexar targets the early stages of small-molecule drug discovery. Medicinal chemists can use it for de novo ideation, fragment-constrained elaboration around a retained substructure, property-optimized library design, and structure-based generation against a known protein sequence or binding pocket. Because conditioning modalities are interchangeable within one model, the same checkpoint supports workflows ranging from ligand-based design (pharmacophore or property targets) to structure-based design (sequence or pocket targets), making it useful to both computational chemistry teams and structural-biology-driven discovery pipelines.

#Impact

By demonstrating that a compact, ~10M-parameter decoder can deliver perfect chemical validity and competitive target-conditioned generation, Molexar challenges the assumption that multimodal molecular design requires very large models, lowering the compute barrier for generative drug design. Its accompanying Fragment-SELFIES language is a reusable contribution that could be adopted independently by other molecular language models seeking validity guarantees without unwieldy fragment vocabularies. As a recently posted preprint with permissively licensed weights and code, its real-world adoption and independent benchmarking remain to be established, and the reported results have not yet undergone peer review.

Citation

Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design

Preprint

Lin, H., et al. (2026) Molexar: A Unified Multimodal Molecular Foundation Model for Drug Design. arXiv.

DOI: 10.48550/arXiv.2606.25865

GitHub

Stars: 4
Forks: 1
Open Issues: 0
Contributors: 1
Last Push: 5d ago
Language: Python
License: MIT

HuggingFace

Downloads: 11
Likes: 1
Last Modified: 5d ago
Pipeline: text-generation

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 82 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 61
Model Openness Framework: Class III (Open Model)

Tags

de_novo_design, drug_design, drug_discovery, foundation_model, generative, molecular_generation, multimodal, transformer

Resources

  • GitHub Repository
  • GitHub Repository
  • Research Paper
  • Official Website
  • HuggingFace Model
  • HuggingFace Model