CoMole

Motif-aware graph diffusion model for controllable molecular generation that adapts to unseen properties by learning a lightweight task embedding.

Released: May 2026

CoMole (Controllable Molecular Generative Foundation Models) is a generative framework for controllable molecular design developed by Yihan Zhu, Yuhan Liu, Weijiang Li, Tengfei Luo, and Meng Jiang at the University of Notre Dame, released as a preprint in May 2026. It addresses a recurring obstacle in generative chemistry: most property-conditioned molecular generators must be retrained, or at least substantially fine-tuned, whenever a new target property is introduced. This makes them costly to adapt across the many objectives that arise in real drug-discovery and materials campaigns, where the set of properties of interest shifts from project to project.

CoMole reframes controllable generation as a transfer-learning problem. The generator is pretrained once on a corpus of molecules and then held fixed; adapting to a new, unseen property requires learning only a lightweight task embedding rather than updating the generator's weights. A single pretrained model can therefore serve as a reusable foundation that is steered toward novel objectives without retraining the expensive generative core, which lowers the marginal cost of adding a new control target.

The model is built as a motif-aware graph diffusion transformer, generating molecular graphs by denoising over chemically meaningful substructures (motifs) rather than working purely at the atom level. This motif awareness encourages chemically valid outputs and gives the generation process structural priors grounded in common chemical building blocks. CoMole sits in the landscape between general chemical language models and bespoke property-conditioned generators, aiming for broad transferability with strong controllability.

Key Features

Frozen-generator transfer: Adapts to unseen molecular properties by learning only a small task embedding while keeping the pretrained generator's weights fixed, avoiding costly generator retraining for each new objective.
Motif-aware graph diffusion: Generates molecular graphs through a diffusion process operating over chemically meaningful motifs, embedding structural priors that promote valid, realistic molecules.
Three-stage training pipeline: Combines self-supervised pretraining, supervised fine-tuning, and reinforcement-learning alignment to progressively shape the generator toward controllable, high-quality outputs.
Strong controllability across targets: Reported to rank first in controllability across nine property targets in the authors' evaluation, with a 48.2% reduction in mean absolute error relative to compared baselines.
High intrinsic validity: Produces molecules with greater than 0.94 validity without post-processing or validity-enforcing repair steps, indicating the motif-based formulation captures chemical constraints during generation.

Technical Details

CoMole is a graph diffusion transformer trained with a three-stage pipeline: pretraining on a molecular corpus, supervised fine-tuning, and a reinforcement-learning alignment phase. The pretraining corpus combines roughly 13,000 polymers with about 10,000 small molecules drawn from MoleculeNet, spanning both polymer and drug-discovery chemical space. This is a comparatively modest pretraining scale relative to billion-molecule chemical language models, so reported generalization should be read in that context. Controllability is achieved at inference by conditioning the frozen generator on a learned task embedding specific to each target property, which is the only component trained when adapting to a new property.
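The adaptation step described above, where only the task embedding is trained while the generator stays frozen, can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the "generator" is a fixed linear map rather than a graph diffusion transformer, and the names `FROZEN_WEIGHTS`, `generate_property`, and `fit_task_embedding` are hypothetical, not taken from the preprint or any released code.

```python
import random

# Hypothetical frozen weights standing in for a pretrained generator.
# The first NOISE_DIM weights act on sampling noise, the rest on the task embedding.
FROZEN_WEIGHTS = (0.5, -0.3, 0.8, 0.2, -0.6, 0.4)
NOISE_DIM = 3
EMBED_DIM = 3


def generate_property(weights, noise, task_embedding):
    """Toy 'generator': returns the property value of one generated sample."""
    inputs = list(noise) + list(task_embedding)
    return sum(w * x for w, x in zip(weights, inputs))


def fit_task_embedding(weights, target, steps=300, lr=0.05, seed=0):
    """Learn only the task embedding; the weights are read but never updated."""
    rng = random.Random(seed)
    embedding = [0.0] * EMBED_DIM
    embed_weights = weights[NOISE_DIM:]
    for _ in range(steps):
        noise = [rng.gauss(0.0, 0.1) for _ in range(NOISE_DIM)]
        error = generate_property(weights, noise, embedding) - target
        # Gradient of error**2 with respect to embedding[j] is 2 * error * embed_weights[j].
        embedding = [e - lr * 2.0 * error * w for e, w in zip(embedding, embed_weights)]
    return embedding


if __name__ == "__main__":
    emb = fit_task_embedding(FROZEN_WEIGHTS, target=1.5)
    print("learned embedding:", emb)
    print("property at zero noise:", generate_property(FROZEN_WEIGHTS, [0.0] * NOISE_DIM, emb))
```

The point of the sketch is the division of labor: the optimizer touches a three-element embedding, while the weights are stored in an immutable tuple and are identical before and after adaptation. In the real system the trainable embedding is likewise small relative to the generator, which is why adding a new property is described as cheap.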

In the authors' benchmarks, CoMole ranks first in controllability across nine property targets and reports a 48.2% mean-absolute-error reduction versus baselines, alongside validity exceeding 0.94 without post-processing. These figures are from the preprint and have not yet undergone peer review; independent reproduction would strengthen confidence in the comparisons.
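For readers who want to interpret these figures, the sketch below shows how such metrics are conventionally computed. It assumes the usual definitions (mean absolute error between achieved and requested property values, relative reduction against a baseline MAE, and validity as the fraction of generated molecules that pass a validity check); the preprint's exact protocol is not reproduced here, and the function names are hypothetical.

```python
def mean_absolute_error(achieved, requested):
    """Average absolute gap between achieved and requested property values."""
    if not achieved or len(achieved) != len(requested):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - r) for a, r in zip(achieved, requested)) / len(achieved)


def relative_reduction(baseline_mae, model_mae):
    """Fractional MAE reduction versus a baseline (0.482 would correspond to 48.2%)."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - model_mae) / baseline_mae


def validity(is_valid_flags):
    """Fraction of generated molecules flagged valid by some external checker."""
    if not is_valid_flags:
        raise ValueError("need at least one generated molecule")
    return sum(1 for flag in is_valid_flags if flag) / len(is_valid_flags)
```

Under these definitions, a 48.2% reduction means the model's MAE is 0.518 times the baseline's, and a validity above 0.94 means more than 94 of every 100 raw generated molecules pass the check without repair.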

Applications

CoMole targets computational chemists and materials scientists who need to generate molecules satisfying specific, and often changing, property constraints. Because adapting to a new objective requires training only a task embedding rather than the full generator, the framework is well suited to multi-objective drug-discovery pipelines and polymer design workflows where teams iterate over many properties. Practitioners can reuse one pretrained backbone across projects, lowering the engineering and compute overhead of standing up a new controllable generator each time a target changes.

Impact

CoMole contributes to a growing line of work on reusable generative foundation models for chemistry, emphasizing that controllability can be added cheaply through lightweight conditioning rather than repeated retraining of the generator. Its reported gains in controllability and validity suggest the fixed-backbone, motif-aware diffusion approach is a promising direction for property-steerable molecular design. Important caveats temper this: the work is a preprint; public weights, code, and Hugging Face artifacts have not been confirmed; the license is unstated; and the pretraining corpus is modest in scale. These factors make independent validation and open release important next steps for assessing the model's broader influence.

Citation

Controllable Molecular Generative Foundation Models

Preprint

Zhu, Y., Liu, Y., Li, W., Luo, T., and Jiang, M. (2026). Controllable Molecular Generative Foundation Models. arXiv.

DOI: 10.48550/arXiv.2605.15354


Citations

Total Citations: 0

Influential: 0

References: 29


Openness

bio.rodeo openness: Closed (low usability and reproducibility)

Score: 23 (Closed)

Usability (can I run it?): 15

Reproducibility (can I retrain it?): 18

Model Openness Framework: Unclassified (missing required components)

Resources

Research Paper
