A retrieval-augmented foundation model for matched molecular pair transformations that proposes controllable, medicinal-chemistry-style analog edits.
MMPT-RAG is a retrieval-augmented foundation model for matched molecular pair (MMP) transformations: the local, single-edit modifications that medicinal chemists routinely use to design analogs of a lead compound. Existing machine-learning approaches to analog generation tend to fall into two camps: whole-molecule generative models, which offer limited control over where and how a molecule is edited, and MMP-style models, which have so far been small and trained in restricted settings. MMPT-RAG reframes the task as variable-to-variable generation, learning the substructure edits themselves from large-scale transformation data so that the model proposes chemically meaningful, localized changes.
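To make the variable-to-variable framing concrete, the sketch below stores each matched pair as a shared constant part plus a source and a target variable fragment, then keeps only the fragment-to-fragment edit. The class name `MatchedPair`, the helper `to_variable_examples`, and the fragment SMILES are hypothetical illustrations, not the preprint's data format; whether and how the model also sees the constant part is not stated here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchedPair:
    """One matched molecular pair, split into a shared core and the edited part."""
    constant: str         # shared scaffold; [*:1] marks the attachment point
    source_variable: str  # fragment present in the starting molecule
    target_variable: str  # fragment present in the analog


def to_variable_examples(pairs):
    """Drop the constant part and keep only (source, target) variable fragments."""
    return [(p.source_variable, p.target_variable) for p in pairs]


# Illustrative fragment SMILES (assumed for this sketch, not taken from the preprint).
pairs = [
    MatchedPair("[*:1]c1ccccc1", "[*:1]Cl", "[*:1]F"),
    MatchedPair("[*:1]c1ccncc1", "[*:1]Cl", "[*:1]F"),
    MatchedPair("[*:1]c1ccccc1", "[*:1]C", "[*:1]CC"),
]

examples = to_variable_examples(pairs)
edit_counts = Counter(examples)

for (source, target), count in edit_counts.most_common():
    print(f"{source} >> {target}  seen in {count} pair(s)")
```

In this toy set the chlorine-to-fluorine edit is counted twice because it occurs on two different cores, which is the sense in which the edit itself, rather than the whole molecule, becomes the unit being learned.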
The model was introduced in February 2026 by Bo Pan, Liang Zhao, and colleagues at Emory University as an arXiv preprint. Its defining feature is retrieval augmentation: rather than relying solely on parametric knowledge, MMPT-RAG retrieves external reference analogs and uses them as contextual guidance when generating a transformation. This lets the model condition each edit on relevant precedent, helping it recapitulate the kind of intuition an experienced medicinal chemist brings to analog design.
The authors report gains in diversity, novelty, and controllability across chemical and patent datasets. Because MMPT-RAG is a recent preprint, no weights or code have been released, and its reported results should be treated as preliminary until they are peer-reviewed or independently reproduced.
MMPT-RAG combines a foundation model trained on large-scale matched molecular pair transformations with a retrieval-augmented generation (RAG) layer. The generative core treats analog design as a variable-to-variable problem, mapping the variable region of a molecule to a transformed variable region, so edits remain local and controllable. At generation time, the retrieval component surfaces external reference analogs that serve as in-context guidance, steering the model toward precedented, chemically sensible edits. The authors evaluate the model on chemical and patent datasets and report improvements in diversity, novelty, and controllability relative to prior approaches. The preprint (CC BY 4.0) does not disclose a specific parameter count, and no public weights or code accompany it at the time of writing.
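The retrieve-then-condition flow described above can be sketched as follows. Since no code accompanies the preprint, everything here is an assumption made for illustration: the function names, the `>>` context format, and especially the similarity measure, which is a toy character-bigram Jaccard score standing in for whatever retriever the authors actually use.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query_variable, references, k=2):
    """Return the k reference edits whose source fragment is most similar to the query.

    Toy retriever (assumption): string-level similarity, not a chemical fingerprint.
    """
    query_grams = char_ngrams(query_variable)
    ranked = sorted(
        references,
        key=lambda ref: jaccard(query_grams, char_ngrams(ref[0])),
        reverse=True,
    )
    return ranked[:k]


def build_context(query_variable, retrieved):
    """Format retrieved edits as in-context guidance ahead of the query (hypothetical format)."""
    lines = [f"reference: {source} >> {target}" for source, target in retrieved]
    lines.append(f"query: {query_variable} >>")
    return "\n".join(lines)


# Illustrative reference edits as (source fragment, target fragment); not data from the preprint.
references = [
    ("[*:1]Br", "[*:1]F"),
    ("[*:1]Cl", "[*:1]F"),
    ("[*:1]C(F)(F)F", "[*:1]C"),
    ("[*:1]OC", "[*:1]O"),
]

retrieved = retrieve("[*:1]Cl", references, k=2)
context = build_context("[*:1]Cl", retrieved)
print(context)
```

A generative model would then complete the final `query:` line, conditioned on the retrieved precedents. In practice a chemistry-aware similarity (for example, a fingerprint-based score) would be a more plausible choice than string overlap, which is used here only to keep the sketch dependency-free.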
MMPT-RAG is aimed at lead optimization and analog design in drug discovery, where medicinal chemists iteratively make small structural edits to improve potency, selectivity, or ADMET properties. By proposing controllable, precedent-grounded MMP transformations, the model could assist computational chemists in enumerating high-quality analog ideas, prioritizing edits, and exploring chemical space around a hit or lead in a way that mirrors expert intuition.
MMPT-RAG sits at the intersection of two active trends: foundation models for molecular generation and retrieval augmentation for grounding generative systems in external knowledge. By bringing RAG to matched molecular pair editing, it offers a path toward more controllable, interpretable analog design than whole-molecule generators provide. Because it is a February 2026 preprint without released weights, its downstream influence and the robustness of its reported gains in diversity, novelty, and controllability await independent validation.