bio.rodeo


Small molecule

CoMole

University of Notre Dame

A motif-aware graph diffusion transformer for controllable molecular generation that transfers to unseen properties by learning only lightweight task embeddings with the generator frozen.

Released: May 2026

CoMole (Controllable Molecular Generative Foundation Models) is a generative framework for controllable molecular design developed by Yihan Zhu, Yuhan Liu, Weijiang Li, Tengfei Luo, and Meng Jiang at the University of Notre Dame, released as a preprint in 2026. It addresses a recurring obstacle in generative chemistry: most property-conditioned molecular generators must be re-trained, or at least substantially fine-tuned, whenever a new target property is introduced. This makes them costly to adapt across the many objectives that arise in real drug-discovery and materials campaigns, where the set of properties of interest shifts from project to project.

CoMole reframes controllable generation as a transfer-learning problem. The generator is pretrained once on a corpus of molecules and then held fixed; adapting to a new, unseen property requires learning only a lightweight task embedding rather than updating the generator's weights. This fixed-backbone design lets a single pretrained model serve as a reusable foundation that can be steered toward novel objectives without re-training the expensive generative core, lowering the marginal cost of adding a new control target.
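The frozen-backbone idea can be illustrated with a deliberately tiny sketch. It is not CoMole's architecture: `frozen_generator`, `fit_task_embedding`, the weights, and the target value are all hypothetical stand-ins, and the "generator" here is just a fixed linear map whose scalar output plays the role of a property of the generated molecule. The point is only that gradient updates touch the task embedding and never the generator's weights.

```python
import random


def frozen_generator(weights, task_embedding):
    """Toy stand-in for a pretrained generator (hypothetical, not CoMole).

    Maps a task embedding to one scalar, standing in for a property
    of the molecule the generator would produce under that conditioning.
    """
    return sum(w * e for w, e in zip(weights, task_embedding))


def fit_task_embedding(weights, target, steps=200, lr=0.05, seed=0):
    """Learn only the task embedding by gradient descent on squared error.

    `weights` is read but never modified, mirroring a frozen backbone.
    """
    rng = random.Random(seed)
    embedding = [rng.uniform(-0.1, 0.1) for _ in weights]
    for _ in range(steps):
        error = frozen_generator(weights, embedding) - target
        # Gradient of 0.5 * error**2 with respect to embedding[i] is error * weights[i].
        embedding = [e - lr * error * w for e, w in zip(embedding, weights)]
    return embedding


# Illustrative values only; a tuple makes the "frozen" weights immutable.
weights = (0.8, -0.5, 0.3, 1.1)
snapshot = tuple(weights)

embedding = fit_task_embedding(weights, target=2.0)
achieved = frozen_generator(weights, embedding)

print("achieved property:", round(achieved, 6))
print("generator unchanged:", weights == snapshot)
```

In this toy setting the only trainable state is a vector with as many entries as the embedding dimension, which is the sense in which adding a new control target is cheap relative to retraining the generator.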

The model is built as a motif-aware graph diffusion transformer, generating molecular graphs by denoising over chemically meaningful substructures (motifs) rather than working purely at the atom level. This motif awareness encourages chemically valid outputs and gives the generation process structural priors grounded in common chemical building blocks. CoMole is positioned between general chemical language models and bespoke property-conditioned generators, aiming for broad transferability together with strong controllability.

#Key Features

  • Frozen-generator transfer: Adapts to unseen molecular properties by learning only a small task embedding while keeping the pretrained generator's weights fixed, avoiding costly generator re-training for each new objective.
  • Motif-aware graph diffusion: Generates molecular graphs through a diffusion process operating over chemically meaningful motifs, embedding structural priors that promote valid, realistic molecules.
  • Three-stage training pipeline: Combines self-supervised pretraining, supervised fine-tuning, and reinforcement-learning alignment to progressively shape the generator toward controllable, high-quality outputs.
  • Strong controllability across targets: Reported to rank first in controllability across nine property targets in the authors' evaluation, with a 48.2% reduction in mean absolute error relative to compared baselines.
  • High intrinsic validity: Produces molecules with greater than 0.94 validity without post-processing or validity-enforcing repair steps, indicating the motif-based formulation captures chemical constraints during generation.

#Technical Details

CoMole is a graph diffusion transformer trained with a three-stage pipeline: pretraining on a molecular corpus, supervised fine-tuning, and a reinforcement-learning alignment phase. The pretraining corpus combines roughly 13,000 polymers with about 10,000 small molecules drawn from MoleculeNet, spanning both polymer and drug-discovery chemical space. This is a modest pretraining scale compared with billion-molecule chemical language models, so reported generalization should be read in that context. Controllability is achieved at inference by conditioning the frozen generator on a learned task embedding specific to each target property; this embedding is the only component trained when adapting to a new property.

In the authors' benchmarks, CoMole ranks first in controllability across nine property targets and reports a 48.2% mean-absolute-error reduction versus baselines, alongside validity exceeding 0.94 without post-processing. These figures are from the preprint and have not yet undergone peer review; independent reproduction would strengthen confidence in the comparisons.
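For readers who want to see how figures of this kind are typically computed, the sketch below uses the common definitions: mean absolute error between requested and achieved property values, relative reduction against a baseline, and validity as the fraction of generated molecules that pass a validity check. This page does not state how the preprint aggregates across the nine targets or which validity check it uses, so those details are assumptions. All numbers are made up for illustration, and `looks_valid` is a placeholder predicate; a real pipeline would call a cheminformatics parser instead.

```python
def mean_absolute_error(targets, achieved):
    """Average absolute gap between requested and achieved property values."""
    if not targets or len(targets) != len(achieved):
        raise ValueError("targets and achieved must be non-empty and equal length")
    return sum(abs(t - a) for t, a in zip(targets, achieved)) / len(targets)


def relative_reduction(baseline_mae, model_mae):
    """Fractional MAE reduction versus a baseline (0.482 would mean 48.2%)."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return (baseline_mae - model_mae) / baseline_mae


def validity(molecules, is_valid):
    """Fraction of generated molecules accepted by the supplied checker."""
    if not molecules:
        raise ValueError("molecules must be non-empty")
    return sum(1 for m in molecules if is_valid(m)) / len(molecules)


def looks_valid(smiles):
    """Placeholder check (assumption): treats any non-empty string as valid."""
    return bool(smiles)


# Illustrative values only, not taken from the preprint.
targets = [1.0, 2.0, 3.0, 4.0]
baseline_out = [1.5, 2.5, 2.5, 4.5]
model_out = [1.25, 2.25, 2.75, 4.25]

baseline_mae = mean_absolute_error(targets, baseline_out)
model_mae = mean_absolute_error(targets, model_out)
reduction = relative_reduction(baseline_mae, model_mae)
valid_fraction = validity(["CCO", "c1ccccc1", "", "CC(=O)O"], looks_valid)

print(f"MAE reduction: {reduction:.1%}, validity: {valid_fraction:.2f}")
```

A relative reduction depends on which baselines are chosen and how per-target errors are averaged, which is one reason independent reproduction of the comparison matters.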

#Applications

CoMole targets computational chemists and materials scientists who need to generate molecules satisfying specific, and often changing, property constraints. Because adapting to a new objective requires training only a task embedding rather than the full generator, the framework is well suited to multi-objective drug-discovery pipelines and polymer design workflows where teams iterate over many properties. Practitioners can reuse one pretrained backbone across projects, lowering the engineering and compute overhead of standing up a new controllable generator each time a target changes.

#Impact

CoMole contributes to a growing line of work on reusable generative foundation models for chemistry, emphasizing that controllability can be added cheaply through lightweight conditioning rather than repeated retraining of the generator. Its reported gains in controllability and validity suggest the fixed-backbone, motif-aware diffusion approach is a promising direction for property-steerable molecular design. Important caveats temper this: the work is a preprint; public weights, code, and Hugging Face artifacts have not been confirmed; the license is unstated; and the pretraining corpus is modest in scale. These factors make independent validation and open release important next steps for assessing the model's broader influence.

Citation

Preprint

DOI: 10.48550/arXiv.2605.15354


Openness

Unclassified
Missing required components

Tags

de_novo_design, diffusion, drug_discovery, foundation_model, generative, graph_neural_network, molecular_generation, small_molecule, transfer_learning, transformer

Resources

Research Paper