A motif-aware graph diffusion transformer for controllable molecular generation that transfers to unseen properties by learning only lightweight task embeddings with the generator frozen.
CoMole (Controllable Molecular Generative Foundation Models) is a generative framework for controllable molecular design developed by Yihan Zhu, Yuhan Liu, Weijiang Li, Tengfei Luo, and Meng Jiang at the University of Notre Dame, released as a preprint in 2026. It addresses a recurring obstacle in generative chemistry: most property-conditioned molecular generators must be retrained, or at least substantially fine-tuned, whenever a new target property is introduced. That makes them costly to adapt in real drug-discovery and materials campaigns, where the set of properties of interest shifts from project to project.
CoMole reframes controllable generation as a transfer-learning problem. The generator is pretrained once on a corpus of molecules and then held fixed; adapting to a new, unseen property requires learning only a lightweight task embedding rather than updating the generator's weights. With this fixed-backbone design, a single pretrained model serves as a reusable foundation that can be steered toward novel objectives without retraining the expensive generative core, which lowers the marginal cost of adding a new control target.
The model is built as a motif-aware graph diffusion transformer: it generates molecular graphs by denoising over chemically meaningful substructures (motifs) rather than working purely at the atom level. This motif awareness encourages chemically valid outputs and gives the generation process structural priors grounded in common chemical building blocks. CoMole is positioned between general chemical language models and bespoke property-conditioned generators, aiming to combine broad transferability with strong controllability.
CoMole is trained with a three-stage pipeline: pretraining on a molecular corpus, supervised fine-tuning, and a reinforcement-learning alignment phase. The pretraining corpus combines roughly 13,000 polymers with about 10,000 small molecules drawn from MoleculeNet, spanning both polymer and drug-discovery chemical space. This is a modest pretraining scale compared with billion-molecule chemical language models, so reported generalization should be read in that context. At inference, controllability comes from conditioning the frozen generator on a learned task embedding specific to each target property. That embedding is the only component trained when adapting to a new property.
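The frozen-generator idea can be illustrated with a deliberately tiny sketch. It is not CoMole's architecture or training code: the "generator" here is a hypothetical linear map, and the names `frozen_generator`, `fit_task_embedding`, and all constants are assumptions made for illustration. The only point it demonstrates is that the generator weights stay untouched while a small task embedding is fitted to hit a target property value.

```python
import random


def frozen_generator(weights, latent, task_embedding):
    """Toy stand-in for a frozen generator (an assumption, not CoMole's model).

    Maps a latent vector shifted by the task embedding to one scalar "property".
    """
    return sum(w * (z + e) for w, z, e in zip(weights, latent, task_embedding))


def fit_task_embedding(weights, latents, target, dim, steps=200, lr=0.1):
    """Fit only the task embedding by gradient descent; weights are read-only."""
    embedding = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for latent in latents:
            error = frozen_generator(weights, latent, embedding) - target
            # The output is linear in the embedding, so the gradient of
            # 0.5 * error**2 with respect to embedding[i] is error * weights[i].
            for i in range(dim):
                grad[i] += error * weights[i] / len(latents)
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


def mean_abs_error(weights, latents, embedding, target):
    """Mean absolute gap between generated property values and the target."""
    gaps = [abs(frozen_generator(weights, z, embedding) - target) for z in latents]
    return sum(gaps) / len(gaps)


rng = random.Random(0)
DIM = 4
# A tuple makes the "frozen" weights immutable for the whole run.
frozen_weights = tuple(rng.uniform(-1.0, 1.0) for _ in range(DIM))
weights_snapshot = tuple(frozen_weights)
latents = [[rng.gauss(0.0, 0.1) for _ in range(DIM)] for _ in range(32)]
TARGET = 2.0  # hypothetical target property value

mae_before = mean_abs_error(frozen_weights, latents, [0.0] * DIM, TARGET)
task_embedding = fit_task_embedding(frozen_weights, latents, TARGET, DIM)
mae_after = mean_abs_error(frozen_weights, latents, task_embedding, TARGET)

print(f"MAE with zero embedding:   {mae_before:.3f}")
print(f"MAE with fitted embedding: {mae_after:.3f}")
```

In this toy setting the trainable state is a vector of length `DIM`, so adapting to another target means fitting another small vector against the same weights. The real model presumably conditions a diffusion transformer rather than a linear map, so the sketch should be read as an analogy only.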
In the authors' benchmarks, CoMole ranks first in controllability across nine property targets, with a reported 48.2% reduction in mean absolute error relative to baselines and validity above 0.94 without post-processing. These figures come from the preprint and have not yet undergone peer review; independent reproduction would strengthen confidence in the comparisons.
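For readers unfamiliar with how a relative error reduction is computed, the following sketch shows the arithmetic. The numbers are hypothetical and chosen for easy checking; they are not taken from the preprint, and the function names are illustrative. How the preprint aggregates the figure across its nine targets and its baselines is not specified here, so this shows only the single-target calculation.

```python
def mean_absolute_error(targets, achieved):
    """Average absolute gap between requested and achieved property values."""
    if not targets or len(targets) != len(achieved):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - a) for t, a in zip(targets, achieved)) / len(targets)


def relative_reduction(baseline_mae, model_mae):
    """Fraction by which model_mae is lower than baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - model_mae) / baseline_mae


# Hypothetical requested property values and what two generators achieved.
requested = [1.0, 2.0, 3.0, 4.0]
baseline_achieved = [1.5, 2.5, 2.5, 4.5]    # every gap is 0.5
model_achieved = [1.25, 2.25, 2.75, 4.25]   # every gap is 0.25

baseline_mae = mean_absolute_error(requested, baseline_achieved)
model_mae = mean_absolute_error(requested, model_achieved)
reduction = relative_reduction(baseline_mae, model_mae)

print(f"baseline MAE = {baseline_mae}, model MAE = {model_mae}")
print(f"relative reduction = {reduction:.1%}")
```

Under this definition, a 48.2% reduction would mean the model's error is a little more than half the baseline's.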
CoMole targets computational chemists and materials scientists who need to generate molecules satisfying specific, and often changing, property constraints. Because adapting to a new objective requires training only a task embedding rather than the full generator, the framework is well suited to multi-objective drug-discovery pipelines and polymer design workflows where teams iterate over many properties. Practitioners can reuse one pretrained backbone across projects, lowering the engineering and compute overhead of standing up a new controllable generator each time a target changes.
CoMole contributes to a growing line of work on reusable generative foundation models for chemistry, emphasizing that controllability can be added cheaply through lightweight conditioning rather than repeated retraining of the generator. Its reported gains in controllability and validity suggest that the fixed-backbone, motif-aware diffusion approach is a promising direction for property-steerable molecular design. Several caveats temper this: the work is a preprint; public weights, code, and Hugging Face artifacts have not been confirmed; the license is unstated; and the pretraining corpus is modest in scale. Independent validation and an open release are therefore important next steps for assessing the model's broader influence.