bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved. ~
Small molecule

MultiPUFFIN

NTNU / KU Leuven / University of Surrey

A multimodal, domain-constrained foundation model that is pretrained with self-supervision on ~500K PubChem molecules and then fine-tuned to jointly predict nine thermophysical properties of small molecules.

Released: March 2026
Parameters: ~28 million

MultiPUFFIN is a multimodal foundation model for predicting the thermophysical properties of small molecules, introduced as a 2026 preprint by Idelfonso B. R. Nogueira (Norwegian University of Science and Technology), Carine M. Rebello (KU Leuven), and Mumin Enis Leblebici and Erick Giovani Sperandio Nascimento (University of Surrey). It extends the PUFFIN line of work from single-property estimation to simultaneous multi-task prediction, targeting the kind of bulk physicochemical endpoints that drive chemical-engineering process design while remaining relevant to early-stage drug discovery.

The central problem MultiPUFFIN addresses is data scarcity. High-quality experimental measurements for properties such as boiling point, viscosity, or melting point are expensive and sparse, which limits purely supervised models. The authors therefore first pretrain a shared molecular backbone with self-supervised objectives on roughly 500,000 unlabeled molecules drawn from PubChem, then fine-tune a multi-task head on a much smaller labeled corpus. They report strong accuracy from a labeled set that is far smaller than the pretraining corpus of a text-pretrained baseline such as ChemBERTa-2.
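The source does not say which self-supervised objectives are used. As one common possibility, masked-token reconstruction on SMILES strings can be sketched as follows. This is a minimal illustration under that assumption, not the authors' pipeline: the tokenizer regex, the 15% mask rate, and the `[MASK]` symbol are all hypothetical choices.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption, not the authors' tokenizer):
# bracket atoms, two-letter halogens, organic-subset atoms, ring closures, bonds.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|%\d{2}|\d|[()=#\-+\\/:.~@*$])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown symbols."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return tokens


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of tokens; return the masked list and the targets.

    A model pretrained this way would be asked to recover `targets`
    from `masked`, which requires no experimental labels.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = mask_token
    return masked, targets


aspirin_tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")
masked, targets = mask_tokens(aspirin_tokens)
print(" ".join(masked))
print(targets)
```

The point of the sketch is only that the training signal comes from the molecule itself, which is what allows pretraining on unlabeled PubChem structures before the labeled fine-tuning stage.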

Its framing leans toward thermophysics and computational chemical engineering rather than bioactivity prediction. Even so, several of the nine endpoints it covers, notably aqueous solubility (log S), the octanol-water partition coefficient (log P), and hydration free energy, are routinely used in ADMET assessment and lead optimization, which gives the model clear drug-discovery relevance.

#Key Features

  • Self-supervised pretraining at scale: A shared backbone is pretrained on ~500,000 unlabeled PubChem molecules (filtered to 5–60 heavy atoms), reducing reliance on scarce labeled thermophysical data.
  • Multimodal molecular encoding: Each molecule is represented through SMILES text, a 2D molecular graph, and 3D conformer geometry, fused via bidirectional cross-modal attention so the three views inform one another.
  • Nine-property multi-task prediction: A single model jointly predicts log S, log P, hydration free energy, boiling point, vapor pressure, viscosity, melting point, flash point, and heat capacity.
  • Condition-aware refinement: Five refinement modules adjust predictions for experimental conditions (temperature, pH, pressure, polymorph, and measurement method), matching how thermophysical data are actually reported.
  • Domain-constrained prediction heads: Per-property heads incorporate physics-based and cheminformatics priors (e.g., thermophysical equations, Joback group contributions, RDKit fragments), constraining outputs to chemically plausible ranges.
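To make the idea of a domain-constrained head concrete, the sketch below combines a Joback group-contribution estimate of the normal boiling point with a bounded learned correction. The Joback increments are the published values for these groups; everything else (the `constrained_head` function, the tanh bounding, and the 75 K limit) is a hypothetical illustration, since the source does not describe how the priors enter the heads.

```python
import math

# Joback boiling-point increments in kelvin (published group-contribution values).
JOBACK_TB = {
    "-CH3": 23.58,
    "-CH2-": 22.88,
    ">C=O": 76.75,  # non-ring ketone
    "-OH": 92.88,   # alcohol
}


def joback_boiling_point(groups):
    """Joback estimate: Tb = 198.2 K + sum of (increment * group count)."""
    return 198.2 + sum(JOBACK_TB[g] * n for g, n in groups.items())


def constrained_head(groups, raw_residual, max_shift=75.0):
    """Hypothetical head: physics prior plus a tanh-bounded correction.

    `raw_residual` stands in for an unbounded network output. The tanh keeps
    the final prediction within +/- `max_shift` kelvin of the Joback prior,
    so the output cannot drift to an implausible value.
    """
    prior = joback_boiling_point(groups)
    return prior + max_shift * math.tanh(raw_residual)


acetone = {"-CH3": 2, ">C=O": 1}
prior = joback_boiling_point(acetone)
pred = constrained_head(acetone, raw_residual=0.1)
print(f"Joback prior: {prior:.2f} K, constrained prediction: {pred:.2f} K")
```

For acetone the prior evaluates to about 322.1 K. The design choice illustrated here is that the network only learns a correction to a physically motivated baseline, rather than the full property value from scratch.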

#Technical Details

MultiPUFFIN combines three modality-specific encoders: a 6-layer, 8-head SMILES transformer (512-dimensional), a 4-layer graph convolutional network (256-dimensional), and a SchNet-style 3D encoder with four interaction blocks. Their outputs are fused through gated bidirectional cross-attention, for roughly 28 million parameters in total. The supervised fine-tuning set comprises about 38,000 unique molecules (around 41,000 condition-resolved rows) aggregated from eleven public sources, including OPERA, NIST ThermoML, ECHA REACH, ChEMBL, AqSolDB, and FreeSolv. The authors report a mean test R² of 0.784 across the nine properties and state that MultiPUFFIN outperforms a fine-tuned ChemBERTa-2 on all nine endpoints, even though its labeled fine-tuning set is roughly 2,000x smaller than ChemBERTa-2's pretraining corpus. These results are from a preprint and have not yet been peer reviewed.
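The two headline numbers can be unpacked with a short sketch. It assumes the mean R² is an unweighted average of per-property R² values, which the source does not confirm, and it takes ChemBERTa-2's pretraining corpus to be about 77 million molecules, a figure that is not stated here and is treated as an assumption. The `toy` values are invented purely to exercise the function and are not MultiPUFFIN results.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_r2(per_property):
    """Unweighted (macro) average of per-property R² scores."""
    scores = {name: r2_score(t, p) for name, (t, p) in per_property.items()}
    return sum(scores.values()) / len(scores), scores


# Toy data for two properties (illustrative only, not reported results).
toy = {
    "log_p": ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]),
    "boiling_point_K": ([300.0, 350.0, 400.0], [310.0, 340.0, 405.0]),
}
macro, scores = mean_r2(toy)
print(f"per-property R2: {scores}, macro mean: {macro:.4f}")

# Label-efficiency ratio behind the "roughly 2,000x" claim.
CHEMBERTA2_PRETRAIN_MOLECULES = 77_000_000  # assumption, not from the source
MULTIPUFFIN_LABELED_MOLECULES = 38_000      # "about 38,000 unique molecules"
ratio = CHEMBERTA2_PRETRAIN_MOLECULES / MULTIPUFFIN_LABELED_MOLECULES
print(f"corpus size ratio: about {ratio:,.0f}x")
```

Under these assumptions the ratio comes out slightly above 2,000, consistent with the stated figure. Note that the comparison is between labeled fine-tuning molecules on one side and unlabeled pretraining molecules on the other.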

#Applications

MultiPUFFIN targets computational chemists, chemical and process engineers, and cheminformaticians who need fast estimates of thermophysical properties for solvent selection, process simulation, formulation, and environmental fate modeling. Because several of its endpoints (solubility, log P, and hydration free energy in particular) are standard ADMET and developability filters, the model is also applicable to early-stage drug-discovery triage, where screening large virtual libraries on physicochemical criteria is routine. The condition-aware design makes it especially suited to settings where temperature, pressure, or pH materially change the measured property.

#Impact

MultiPUFFIN illustrates how self-supervised pretraining combined with multimodal representations and physics-informed constraints can deliver strong property prediction under severe label scarcity, a recurring bottleneck across chemistry and materials science. Its reported label efficiency relative to ChemBERTa-2 is a notable claim for the thermophysics setting, which has historically relied on group-contribution methods and narrow QSPR models. Important caveats temper the assessment: the work is an unreviewed preprint, the benchmark comparisons await independent reproduction, and no public code, trained weights, model card, or data card had been released at the time of writing, so the training corpus and results cannot yet be independently verified. An open release would substantially aid adoption and validation.

Tags

molecular_property_prediction, drug_discovery, transformer, graph_neural_network, foundation_model, self_supervised, multimodal, multi_task, small_molecule