Genesis Molecular AI
Generative foundation model for protein-ligand 3D structure prediction using SO(3)-equivariant diffusion trained on large-scale physics-based synthetic data.
Pearl is a generative foundation model for predicting three-dimensional structures of protein-ligand complexes, developed by Genesis Molecular AI and released in October 2025. It is designed to address a central bottleneck in computational drug discovery: the accurate and physically valid placement of small-molecule ligands within protein binding sites, a task known as protein-ligand cofolding or pose prediction. Pearl competes directly with AlphaFold 3, Chai-1, and Boltz-1, targeting the same benchmarks and use cases while introducing distinct architectural and data-generation innovations.
The defining characteristic of Pearl is its use of large-scale physics-based synthetic training data to overcome the chronic shortage of experimentally determined protein-ligand co-crystal structures. The public Protein Data Bank contains roughly 200,000 ligand-bound structures, but this corpus is heavily biased toward well-studied target classes and covers only a fraction of chemical and protein sequence space relevant to modern drug discovery. Genesis addresses this gap by generating a synthetic dataset of 582,065 structures across 910 proteins using physics-based simulation with diverse virtual ligands — roughly tripling the available training signal while maintaining physical realism. The authors demonstrate that model performance scales predictably with synthetic dataset size, framing this as a "synthetic data scaling law" for biomolecular structure prediction.
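The scaling-law claim can be illustrated with a short sketch: a power law in dataset size is linear in log-log space, so an ordinary least-squares fit on the logs recovers the exponent. The data points below are invented for illustration and are not figures from the Pearl paper; only the final dataset size (582,065 structures) comes from the text.

```python
import numpy as np

# Hypothetical sketch of a "synthetic data scaling law": fit a power law
# success ≈ a * N^b to (dataset size, benchmark success) pairs. These
# success values are invented for illustration, NOT results from the paper.
sizes = np.array([50_000, 100_000, 200_000, 400_000, 582_065], dtype=float)
success = np.array([0.62, 0.68, 0.74, 0.80, 0.84])  # illustrative only

# A power law is linear in log-log space: log(success) = log(a) + b * log(N),
# so a degree-1 polynomial fit on the logs recovers the exponent b.
b, log_a = np.polyfit(np.log(sizes), np.log(success), deg=1)
pred_1m = np.exp(log_a) * 1_000_000 ** b  # extrapolated success at 1M structures

print(f"fitted exponent b = {b:.3f}, extrapolation to 1M structures: {pred_1m:.3f}")
```

A fitted exponent b > 0 with a good linear fit in log-log space is what "performance scales predictably with synthetic dataset size" would look like empirically; the actual functional form used by the authors may differ.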
Pearl achieves 85.2% success on the Runs N' Poses benchmark (RMSD < 2 Å and physically valid, best of 5 samples), representing a 14.5% relative improvement over AlphaFold 3, and 84.7% on PoseBusters under the same conditions (14.2% relative improvement). On a proprietary internal crystal structure dataset at the more demanding RMSD < 1 Å threshold, Pearl shows approximately a 3.6-fold improvement over Boltz-1x.
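The best-of-5 metric used throughout these benchmarks can be made concrete with a small sketch: a target counts as solved if any of its five sampled poses is both below the RMSD threshold and physically valid. The RMSD values and validity flags below are invented for illustration; the validity criteria themselves are those of tools like PoseBusters, not reimplemented here.

```python
import numpy as np

# Sketch of a best-of-k success metric: a target is solved if ANY of its k
# sampled poses has ligand RMSD below the threshold AND passes physical-
# validity checks. All values below are invented for illustration.
rmsd = np.array([            # shape (targets, 5 samples), in Ångström
    [0.8, 1.5, 3.2, 0.9, 2.1],
    [2.4, 2.9, 3.1, 2.6, 2.2],
    [1.1, 0.7, 1.9, 4.0, 0.6],
])
valid = np.array([           # per-sample physical-validity flag
    [True, True, False, True, True],
    [True, True, True, True, True],
    [True, False, True, True, True],
])

def best_of_k_success(rmsd, valid, threshold=2.0):
    # A sample counts only if it is both accurate and valid; a target is
    # solved if at least one of its samples counts.
    solved = ((rmsd < threshold) & valid).any(axis=1)
    return solved.mean()

print(f"best@5 success (RMSD < 2 Å): {best_of_k_success(rmsd, valid):.1%}")
```

In this toy example the second target fails (no sample under 2 Å), giving a 66.7% success rate; tightening the threshold to 1 Å, as in the internal benchmark, makes the criterion correspondingly harder.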
Pearl combines a trunk module with an SO(3)-equivariant diffusion module. The trunk uses lightweight triangle-multiplication layers for position-independent pairwise representation learning, capturing inter-residue and residue-ligand spatial relationships at lower computational cost than full attention over all atom pairs. The diffusion module consists of equivariant transformer blocks with gated nonlinearities on vector components; rotational and translational equivariance is achieved by combining architectural equivariance with data augmentation. Training uses a conservative bfloat16 mixed-precision strategy: computationally intensive trunk operations run in bf16, while numerically sensitive components (losses, coordinate projections, softmax operations) and all model weights remain in full fp32.
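The equivariance property the diffusion module targets can be checked numerically: for an SO(3)-equivariant map f over atom coordinates, rotating the input must rotate the output identically, i.e. f(X Rᵀ) = f(X) Rᵀ. The sketch below uses a toy equivariant function (a linear mix over the atom axis) as a stand-in; it is not Pearl's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(rng):
    # QR decomposition of a Gaussian matrix gives a random orthogonal matrix;
    # flip one column if needed so det(R) = +1 (a proper rotation).
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

n_atoms = 16
mix = rng.standard_normal((n_atoms, n_atoms))  # acts on the atom axis only

def f(coords):
    # Toy stand-in for an equivariant layer: mixing points linearly across
    # the atom axis commutes with any rotation applied to the coordinates.
    return mix @ coords

X = rng.standard_normal((n_atoms, 3))  # hypothetical atom coordinates
R = random_rotation(rng)
err = np.abs(f(X @ R.T) - f(X) @ R.T).max()
print(f"max equivariance error: {err:.2e}")  # numerical noise only
```

This kind of test is a standard sanity check for equivariant architectures; models that rely partly on data augmentation, as Pearl's diffusion module does, would satisfy it only approximately rather than to machine precision.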
Training data combines curated PDB structures, monomer distillation data from OpenFold and the AlphaFold Database, and the novel synthetic corpus (all derived from structures deposited before September 30, 2021). The paper does not disclose the total parameter count. Inference is accelerated by optimized kernels from cuEquivariance v0.6.0, providing a 10–80% inference speedup and a 15% training speedup on NVIDIA H100 and H200 GPUs. On the Runs N' Poses benchmark (best@5), Pearl scores 85.2% at RMSD < 2 Å and 70.0% at RMSD < 1 Å. On PoseBusters (best@5), it scores 84.7% at RMSD < 2 Å and 72.4% at RMSD < 1 Å. In pocket-conditional mode on PoseBusters, it reaches 86.7% and 72.2%, respectively.
Pearl is aimed at structure-based drug discovery pipelines that require accurate three-dimensional protein-ligand complex models. In early discovery, the unconditional cofolding mode predicts binding poses for novel targets from sequence and 2D ligand structure alone, supporting virtual screening and hit identification against understudied proteins. In lead optimization, the pocket-conditional mode leverages reference crystal structures or computationally predicted binding pockets to guide pose prediction with higher precision, relevant for fragment merging, scaffold hopping, and selectivity profiling across related targets. The model is accessible through Genesis Molecular AI's platform and is intended for integration into computational chemistry and drug design workflows where speed, physical validity, and accuracy of predicted poses are jointly important.
Pearl represents a technically credible challenge to AlphaFold 3 as the leading method for protein-ligand structure prediction, and its synthetic data approach offers a methodologically distinct path for improving model performance beyond what the experimental structural database alone can support. The demonstration of scaling laws for physics-based synthetic data in this domain is a notable conceptual contribution that may influence future model development well beyond Pearl itself. As a preprint released in October 2025, the work has not yet undergone peer review. Pearl is a proprietary model from a commercial company, with no open-source code or model weights published as of its release. Access is through Genesis Molecular AI's GEMS platform rather than as a freely downloadable resource, which limits its immediate utility for academic groups and contrasts with contemporaries such as Boltz-1 and Chai-1 that offer open weights.