NTNU / KU Leuven / University of Surrey
A multimodal, domain-constrained foundation model that self-supervises on ~500K PubChem molecules to jointly predict nine thermophysical properties of small molecules.
MultiPUFFIN is a multimodal foundation model for predicting the thermophysical properties of small molecules, introduced as a 2026 preprint by Idelfonso B. R. Nogueira (Norwegian University of Science and Technology), Carine M. Rebello (KU Leuven), and Mumin Enis Leblebici and Erick Giovani Sperandio Nascimento (University of Surrey). It extends the PUFFIN line of work from single-property estimation to simultaneous multi-task prediction, targeting the kind of bulk physicochemical endpoints that drive chemical-engineering process design while remaining relevant to early-stage drug discovery.
The central problem MultiPUFFIN addresses is data scarcity. High-quality experimental measurements for properties such as boiling point, viscosity, or melting point are expensive and sparse, which limits purely supervised models. To work around this, the authors first pretrain a shared molecular backbone with self-supervised objectives on roughly 500,000 unlabeled molecules drawn from PubChem, then fine-tune a multi-task head on a much smaller labeled corpus. According to the authors, this two-stage recipe yields strong accuracy from a labeled set far smaller than the corpora used to pretrain SMILES-based language-model baselines such as ChemBERTa-2.
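When a multi-task head is fine-tuned on a corpus aggregated from many sources, most molecules are unlikely to carry a measurement for every property. The sketch below shows one common way to handle that situation: a masked mean-squared-error loss that skips missing labels. It is an illustration under that assumption, not the authors' training code; the function name `masked_multitask_mse` and the toy values are hypothetical.

```python
def masked_multitask_mse(predictions, targets):
    """Mean squared error over observed labels only.

    predictions: list of per-molecule lists, one float per property.
    targets: same shape, with None wherever no measurement exists.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if target is None:
                continue  # missing label: contributes nothing to the loss
            total += (pred - target) ** 2
            count += 1
    if count == 0:
        raise ValueError("batch contains no observed labels")
    return total / count


# Toy batch: two molecules, three properties, sparse labels (hypothetical values).
preds = [[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]]
labels = [[1.5, None, 3.0], [None, None, 0.0]]
loss = masked_multitask_mse(preds, labels)
```

In practice the properties live on very different scales, so each target would normally be standardized per property before a loss like this is applied. Otherwise the largest-valued endpoint dominates the gradient.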
The work is framed around thermophysics and computational chemical engineering rather than bioactivity prediction. Even so, the nine endpoints it covers, including aqueous solubility (log S), the octanol-water partition coefficient (log P), and hydration free energy, are routinely used in ADMET assessment and lead optimization, giving the model clear drug-discovery relevance.
MultiPUFFIN combines several modality-specific encoders (a 6-layer, 8-head SMILES transformer with 512-dimensional embeddings; a 4-layer graph convolutional network with 256-dimensional embeddings; and a SchNet-style 3D encoder with four interaction blocks) and fuses them through gated bidirectional cross-attention. The full model has roughly 28 million parameters. The supervised fine-tuning set comprises about 38,000 unique molecules (around 41,000 condition-resolved rows) aggregated from eleven public sources, including OPERA, NIST ThermoML, ECHA REACH, ChEMBL, AqSolDB, and FreeSolv. The authors report a mean test R² of 0.784 across the nine properties and state that MultiPUFFIN outperforms a fine-tuned ChemBERTa-2 on all nine endpoints, even though its labeled set is roughly 2,000x smaller than ChemBERTa-2's pretraining corpus. These results are from a preprint and have not yet been peer reviewed.
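To make the fusion step more concrete, the following is a minimal sketch of gated bidirectional cross-attention between two token sets. It assumes both modalities have already been projected to a common width, omits the learned query/key/value projections, and uses a simple sigmoid of a dot product as the gate. These are simplifications chosen for brevity, not details taken from the preprint, and every function name here is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, context):
    """Scaled dot-product attention: each query row attends over context rows.

    Learned Q/K/V projections are replaced by the identity to keep this short.
    """
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in context])
        attended.append(
            [sum(w * row[j] for w, row in zip(weights, context))
             for j in range(len(q))]
        )
    return attended


def gated_update(tokens, attended, gate_bias=0.0):
    """Blend each token with its attended counterpart via a scalar gate in (0, 1)."""
    fused = []
    for t, a in zip(tokens, attended):
        gate = 1.0 / (1.0 + math.exp(-(dot(t, a) + gate_bias)))
        fused.append([(1.0 - gate) * ti + gate * ai for ti, ai in zip(t, a)])
    return fused


def bidirectional_fusion(x, y):
    """Update x from y and y from x, both computed from the original inputs."""
    x_new = gated_update(x, cross_attend(x, y))
    y_new = gated_update(y, cross_attend(y, x))
    return x_new, y_new


# Toy inputs: two "SMILES" tokens and one "graph" token, width 2 (hypothetical).
x = [[1.0, 0.0], [0.0, 1.0]]
y = [[0.5, 0.5]]
x_new, y_new = bidirectional_fusion(x, y)
```

The gate lets each token decide how much information to take from the other modality. A gate near zero leaves the token unchanged, which is one way a model can fall back on a single modality when another is uninformative.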
MultiPUFFIN targets computational chemists, chemical and process engineers, and cheminformaticians who need fast estimates of thermophysical properties for solvent selection, process simulation, formulation, and environmental fate modeling. Because several of its endpoints (solubility, log P, and hydration free energy in particular) are standard ADMET and developability filters, the model is also applicable to early-stage drug-discovery triage, where screening large virtual libraries on physicochemical criteria is routine. The condition-aware design makes it especially suited to settings where temperature, pressure, or pH materially change the measured property.
MultiPUFFIN illustrates how self-supervised pretraining combined with multimodal representations and physics-informed constraints can deliver strong property prediction under severe label scarcity, a recurring bottleneck across chemistry and materials science. Its reported label efficiency relative to ChemBERTa-2 is a notable claim for the thermophysics setting, which has historically relied on group-contribution methods and narrow QSPR models. Important caveats temper the assessment: the work is an unreviewed preprint, the benchmark comparisons await independent reproduction, and no public code, trained weights, model card, or data card had been released at the time of writing, so the training corpus and results cannot yet be independently verified. An open release would substantially aid adoption and validation.