A trainable, open-source reimplementation of AlphaFold2 that matches its accuracy and runs 3-5x faster, enabling mechanistic research into protein structure learning.
The original AlphaFold2 transformed protein structure prediction, but its closed training code prevented researchers from investigating how the model learns, testing its generalization limits, or adapting it to new tasks. OpenFold, developed by the AlQuraishi laboratory (the aqlaboratory GitHub organization) and collaborators and published in Nature Methods in 2024, addresses this gap by providing the first fully trainable, open-source reimplementation of the AlphaFold2 architecture.
OpenFold faithfully reproduces AlphaFold2's internal computations without modification, ensuring that mechanistic insights derived from studying OpenFold apply directly to the original model. Both the trained model weights and the OpenProteinSet training database are released under the permissive CC BY 4.0 license, hosted on the Registry of Open Data on AWS, making them freely accessible to the broader research community.
Beyond replication, the project has become an active platform for scientific discovery. By analyzing intermediate checkpoint structures during training, the OpenFold team revealed that the model learns protein geometry in a hierarchical, sequentially ordered fashion — a fundamental insight into how deep learning processes structural information.
OpenFold reproduces the AlphaFold2 architecture in full, including the Evoformer trunk and Structure Module, without any modifications to internal mathematical computations. Training uses OpenProteinSet, a curated database of protein multiple sequence alignments derived from large-scale homology searches against UniRef and the BFD database. This dataset is substantially larger than the training data described in the original AlphaFold2 paper and is itself publicly released.
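At the heart of the Evoformer trunk is the exchange of information between the MSA representation and the pair representation. One such operation, the outer-product mean, can be sketched schematically in NumPy; the shapes, projection dimensions, and function name here are illustrative stand-ins, not OpenFold's actual module configuration or learned weights.

```python
import numpy as np

def outer_product_mean(msa, w_a, w_b):
    """Schematic outer-product mean: project each MSA position into two
    small vectors, take their outer product for every residue pair, and
    average over sequences to produce a pair-representation update.

    msa  : (n_seq, n_res, c_m) MSA representation
    w_a, w_b : (c_m, c) projection matrices (toy stand-ins for learned weights)
    returns  : (n_res, n_res, c * c) pair-representation update
    """
    a = msa @ w_a  # (n_seq, n_res, c)
    b = msa @ w_b  # (n_seq, n_res, c)
    # Outer product over channel dims for every residue pair (i, j),
    # averaged over the sequence dimension s.
    outer = np.einsum("sic,sjd->ijcd", a, b) / msa.shape[0]
    n_res = msa.shape[1]
    return outer.reshape(n_res, n_res, -1)

rng = np.random.default_rng(0)
n_seq, n_res, c_m, c = 8, 16, 32, 4
msa = rng.normal(size=(n_seq, n_res, c_m))
w_a = rng.normal(size=(c_m, c))
w_b = rng.normal(size=(c_m, c))
pair_update = outer_product_mean(msa, w_a, w_b)
print(pair_update.shape)  # (16, 16, 16)
```

This is the step that lets per-sequence evolutionary signal accumulate into residue-pair features, which the triangle-attention and triangle-multiplication layers then refine.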
Performance optimizations include improved attention kernel implementations and memory-efficient data loading, which together account for the 3-5x inference speedup relative to the reference codebase. In training benchmarks, the model reaches a median lDDT-Cα of approximately 0.9 on CAMEO test sets, consistent with AlphaFold2's performance. Generalization studies using structurally stratified subsampling showed that OpenFold maintains strong accuracy even when trained on as few as 1,000 experimental structures, and that it tolerates near-complete removal of entire secondary-structure classes from the training set, a degree of robustness that was not previously appreciated.
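For readers unfamiliar with the benchmark metric, lDDT-Cα scores how well a prediction preserves the reference structure's local Cα-Cα distances. A minimal sketch, using the commonly cited definition (15 Å inclusion radius, tolerance thresholds of 0.5/1/2/4 Å); the paper's exact evaluation pipeline may differ in details such as pair symmetrization and per-residue averaging.

```python
import numpy as np

def lddt_ca(pred, ref, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Minimal lDDT-Cα: the fraction of reference Cα-Cα distances
    (between distinct residues, shorter than `cutoff`) that the
    prediction reproduces within each tolerance threshold, averaged
    over the thresholds.

    pred, ref : (n_res, 3) Cα coordinates
    returns   : score in [0, 1]; 1.0 means all local distances preserved
    """
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    d_pred = np.linalg.norm(pred[:, None] - pred[None, :], axis=-1)
    n = len(ref)
    # Pairs scored: distinct residues within the inclusion radius in the reference.
    mask = (d_ref < cutoff) & ~np.eye(n, dtype=bool)
    diff = np.abs(d_ref - d_pred)[mask]
    return float(np.mean([(diff < t).mean() for t in thresholds]))

# A structure compared against itself preserves every distance exactly.
coords = np.random.default_rng(1).normal(size=(10, 3)) * 5.0
print(lddt_ca(coords, coords))  # 1.0
```

Because the score is built from internal distances, it is invariant to rigid translation and rotation of the prediction, which is why no superposition step is needed before evaluating.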
OpenFold is primarily used by researchers who need to retrain, fine-tune, or systematically study a state-of-the-art structure prediction model. It integrates naturally into protein design workflows: predicted structures from OpenFold can be passed directly to sequence design tools such as ProteinMPNN to close the design-prediction loop. Industry adopters including Novo Nordisk, Outpace Bio, Cyrus Biotechnology, and Bayer Crop Science have used the OpenFold framework to adapt protein structure prediction to proprietary datasets and specialized tasks such as enzyme engineering, cell therapy design, and agrochemical target identification. The OpenFold Consortium has extended this foundation to OpenFold3, which adds support for predicting structures of protein-nucleic acid and protein-small molecule complexes, directly targeting drug discovery applications.
OpenFold meaningfully expanded what the structural biology community can do with AlphaFold2-class models. By releasing training code and data under a permissive license, the project lowered the barrier to building task-specific variants of the model and enabled mechanistic studies that are impossible with closed implementations. The finding that protein folding is learned hierarchically, with spatial dimensions acquired sequentially during training, is a concrete scientific contribution beyond mere replication. The project also established a template for open, reproducible reimplementation of high-impact biological AI systems. An important limitation is that OpenFold inherits the scope of AlphaFold2: it predicts single-chain or homo-oligomeric structures from sequence and MSA, and does not natively handle small molecules or post-translational modifications. Users requiring those capabilities should consider OpenFold3 or AlphaFold3.
Ahdritz G, Bouatta N, Floristean C, et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods. 2024;21(8):1514-1524.
DOI: 10.1038/s41592-024-02272-z