HelixFold-Single

MSA-free protein structure prediction that replaces multiple sequence alignments with a protein language model pre-trained on billions of sequences.

Released: October 2023

HelixFold-Single is a protein structure prediction method developed by Baidu's PaddleHelix team on the PaddlePaddle deep learning framework. It eliminates the multiple sequence alignment (MSA) step, which is a computational bottleneck in pipelines like AlphaFold2. Rather than searching protein databases for homologous sequences, a step that typically takes tens of minutes per query, HelixFold-Single substitutes MSA-derived evolutionary signals with representations from a large-scale protein language model (PLM) trained on hundreds of millions of primary sequences via self-supervised learning. The resulting end-to-end system predicts 3D atomic coordinates from a single amino acid sequence in seconds.

The paper was posted as an arXiv preprint in July 2022 and published in Nature Machine Intelligence in October 2023 under the title "A method for multiple-sequence-alignment-free protein structure prediction using a protein language model." The work is part of PaddlePaddle's broader PaddleHelix bio-computing platform, which also includes HelixFold, an efficient AlphaFold2 reimplementation in the PaddlePaddle deep learning framework.

HelixFold-Single sits alongside contemporaneous MSA-free approaches such as ESMFold (Meta AI) and OmegaFold, all of which demonstrated that protein language models can serve as viable alternatives to MSA-based evolutionary information. Its distinctive contribution is the three-component architectural design that integrates a pre-trained PLM with AlphaFold2's geometric learning machinery in a single differentiable pipeline.

Key Features

MSA-free inference: Eliminates database search entirely, reducing per-protein prediction time from tens of minutes to seconds and making large-scale proteome-wide predictions practical.
Three-component architecture: Combines a PLM Base (masked language model), an Adaptor module that bridges sequence representations, and a Geometric Modeling module derived from AlphaFold2's structure and IPA layers.
Self-supervised pre-training: The protein language model is trained with a masked language modeling objective on hundreds of millions of primary sequences, learning co-evolutionary patterns without requiring paired structural data.
Competitive accuracy on well-sampled proteins: Achieves accuracy comparable to AlphaFold2 with MSA input and outperforms RoseTTAFold with MSA input on CAMEO benchmark targets with large homologous families.
End-to-end differentiable: The full pipeline from primary sequence to 3D coordinates is trainable jointly, allowing the PLM representations and structure module to co-adapt during fine-tuning.
PaddlePaddle ecosystem integration: Implemented in PaddlePaddle and distributed as part of PaddleHelix, providing a unified framework for running multiple HelixFold variants.
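
The masked language modeling objective named in the feature list can be illustrated with a minimal sketch: hide a random subset of residues and keep the originals as prediction targets. The `mask_sequence` helper, the `#` mask symbol, and the example sequence are illustrative assumptions, not the PaddleHelix implementation, and the sketch omits refinements such as replacing some selected positions with random residues.

```python
import random

MASK = "#"  # hypothetical mask symbol; real tokenizers use a dedicated token id


def mask_sequence(seq, rate=0.15, seed=0):
    """Return (masked sequence, {position: original residue}) for MLM training."""
    if not seq:
        return "", {}
    rng = random.Random(seed)
    # Mask a fixed fraction of positions, but always at least one.
    n_mask = max(1, round(len(seq) * rate))
    positions = rng.sample(range(len(seq)), n_mask)
    targets = {i: seq[i] for i in sorted(positions)}
    masked = "".join(MASK if i in targets else aa for i, aa in enumerate(seq))
    return masked, targets


seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence
masked, targets = mask_sequence(seq)
print(masked)
print(targets)
```

During pre-training the model sees only `masked` and is scored on how well it recovers the residues in `targets`, which is why no structural labels are needed at this stage.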

Technical Details

HelixFold-Single is structured as three sequential modules. The PLM Base is a large-scale transformer trained on primary protein sequences using masked language modeling, where 15% of residues are randomly masked during pre-training on hundreds of millions of sequences drawn from public protein databases. The Adaptor is a learned projection layer that maps PLM token representations into the pair and single representations expected by AlphaFold2's downstream modules. The Geometric Modeling component adopts AlphaFold2's structure module with Invariant Point Attention (IPA), which places residues in 3D space as rigid-body frames and predicts backbone and side-chain torsion angles.
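
To make the Adaptor's role concrete, the following sketch maps per-residue PLM representations to a single representation (one vector per residue) and a pair representation (one vector per residue pair). It is a sketch under stated assumptions: the dimensions, the random weights, and the outer-sum construction of the pair tensor (the pattern AlphaFold2 uses for its input pair embedding) are chosen for illustration and are not taken from the HelixFold-Single code.

```python
import random


def linear(x, w):
    """Multiply an (L x d_in) matrix by a (d_in x d_out) weight matrix."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*w)]
            for row in x]


def random_weights(d_in, d_out, rng):
    """Stand-in for learned projection weights (hypothetical values)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]


def adaptor(plm_repr, w_single, w_left, w_right):
    """Project PLM token representations into single and pair representations."""
    single = linear(plm_repr, w_single)  # shape: L x c_s
    left = linear(plm_repr, w_left)      # shape: L x c_z
    right = linear(plm_repr, w_right)    # shape: L x c_z
    # Outer sum: pair[i][j] = left[i] + right[j], shape L x L x c_z.
    pair = [[[a + b for a, b in zip(li, rj)] for rj in right] for li in left]
    return single, pair


rng = random.Random(0)
L, d_plm, c_s, c_z = 6, 8, 4, 3  # toy sizes; real models are far larger
plm_repr = [[rng.gauss(0, 1) for _ in range(d_plm)] for _ in range(L)]
single, pair = adaptor(
    plm_repr,
    random_weights(d_plm, c_s, rng),
    random_weights(d_plm, c_z, rng),
    random_weights(d_plm, c_z, rng),
)
print(len(single), len(single[0]))             # L, c_s
print(len(pair), len(pair[0]), len(pair[0][0]))  # L, L, c_z
```

The point of the sketch is the shape contract: whatever the PLM produces, the downstream geometric module receives an L x c_s single representation and an L x L x c_z pair representation.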

Training proceeds in two stages: self-supervised pre-training of the PLM on sequence data alone, followed by supervised fine-tuning of the full pipeline on Protein Data Bank (PDB) structures. On CAMEO benchmarks, HelixFold-Single achieves accuracy comparable to AlphaFold2 (with full MSA) on targets with large homologous families, and outperforms RoseTTAFold (with MSA) on those same targets. Performance degrades on proteins that are genuinely sequence-unique (orphan proteins with few homologs), reflecting the PLM's reliance on having learned co-evolutionary signals from related sequences in its training corpus. The model is implemented in PaddlePaddle and requires CUDA 11.2 and cuDNN 8.x for GPU inference.

Applications

HelixFold-Single is well suited for scenarios where prediction throughput matters more than maximally accurate individual structures. High-throughput drug discovery campaigns that screen thousands of protein variants benefit from its second-scale inference per sequence. Proteome-wide structural annotation projects — where running full MSA searches for every entry in a large proteome would be prohibitive — can use HelixFold-Single to generate rapid baseline structures. The model is also useful in environments with limited computational infrastructure, since removing the MSA search step eliminates the need for terabyte-scale sequence databases and the cluster resources needed to search them. For orphan proteins that lack homologs, HelixFold-Single provides a prediction path that is not blocked by an empty MSA, though accuracy on such targets is lower than for well-sampled protein families.
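
A back-of-envelope calculation shows why the throughput difference matters at proteome scale. The figures below (20,000 sequences, 20 minutes per MSA-based prediction, 5 seconds per single-sequence prediction, one worker) are assumptions picked to match the "tens of minutes" versus "seconds" range described above, not measured benchmarks.

```python
def total_hours(n_sequences, seconds_per_sequence, workers=1):
    """Wall-clock hours to process n_sequences at a fixed per-sequence cost."""
    return n_sequences * seconds_per_sequence / workers / 3600


n = 20_000                           # assumed proteome size
msa_hours = total_hours(n, 20 * 60)  # assumed 20 min per MSA-based prediction
single_hours = total_hours(n, 5)     # assumed 5 s per single-sequence prediction
speedup = msa_hours / single_hours

print(f"MSA-based:       {msa_hours:,.0f} h")
print(f"Single-sequence: {single_hours:,.1f} h")
print(f"Speedup:         {speedup:.0f}x")
```

Under these assumptions the MSA-based run takes roughly nine months of single-worker time while the single-sequence run finishes in a little over a day, and the ratio scales linearly with the assumed per-sequence costs.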

Impact

HelixFold-Single contributed to a growing body of work — alongside ESMFold and OmegaFold — that demonstrated the viability of protein language models as drop-in replacements for MSA-based evolutionary information in structure prediction. Its publication in Nature Machine Intelligence in 2023 provided peer-reviewed validation of the approach and concrete benchmark comparisons against AlphaFold2 and RoseTTAFold. The model is distributed within PaddlePaddle's open-source PaddleHelix platform, giving the research community an accessible implementation. A recognized limitation is that the accuracy advantage of full MSA-based methods re-emerges for proteins with sparse homolog coverage, meaning HelixFold-Single is best understood as a speed-accuracy tradeoff rather than a universal replacement for MSA-based pipelines. The broader MSA-free paradigm it helped establish continues to influence subsequent single-sequence prediction models and protein design workflows that require rapid structural context.

Citation

A method for multiple-sequence-alignment-free protein structure prediction using a protein language model

Fang, X., Wang, F., Liu, L., He, J., Lin, D., Xiang, Y., Zhang, X., Wu, H., Li, H., & Song, L. (2023). A method for multiple-sequence-alignment-free protein structure prediction using a protein language model. Nature Machine Intelligence, 5, 1341-1348.

DOI: 10.1038/s42256-023-00721-6

Recent citations

Papers that recently cited this model.

Electrochemical biosensing of glypican-3: Critical insights for circulating tumor cell biomarker analysis
Chih-Lung Wang, Ramadhass Keerthika Devi, Chih-Che Lin, et al.
Journal of Electroanalytical Chemistry · Oct 2026
0
DeepSSInter: Protein-protein contact prediction with a structure-aware protein language model.
Derek Huang, Jiamin Lv, Xuan Yao, et al.
Protein Science · Jun 2026
0
pLM-Guided Inverse Folding for Antibody Sequence Design
Valentin Noske, Felix Koulischer, Kathleen Marchal, et al.
bioRxiv · Jun 2026
0

Top citations

The most-cited papers that cite this model.

Machine learning in preclinical drug discovery
Denise B. Catacutan, Jeremie Alexander, Autumn Arnold, et al.
Nature Chemical Biology · Jul 2024
153
Easy and accurate protein structure prediction using ColabFold
Gyuri Kim, Sewon Lee, E. Levy, et al.
Nature Protocols · Oct 2024
123
Antimicrobial resistance crisis: could artificial intelligence be the solution?
Guangyu Liu, Dan Yu, Meimei Fan, et al.
Military Medical Research · Jan 2024
98 · Influential
Large language models for medicine: a survey
Yanxin Zheng, Wensheng Gan, Zefeng Chen, et al.
International Journal of Machine Learning and Cybernetics · May 2024
80
A survey of geometric graph neural networks: data structures, models and applications
Jiaqi Han, Jiacheng Cen, Liming Wu, et al.
Frontiers of Computer Science · Mar 2024
72

Citations

Total Citations: 91
Influential: 4
References: 46

GitHub

Stars: 1.1K
Forks: 227
Open Issues: 75
Contributors: 28
Last Push: 3mo ago
Language: Python

Fields of citing research

Computer Science: 91%
Biology: 78%
Medicine: 67%
Chemistry: 18%
Engineering: 7%
Materials Science: 6%
Environmental Science: 3%
Mathematics: 1%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Overall score: 12 (Closed)
Usability (can I run it?): 11
Reproducibility (can I retrain it?): 14
Model Openness Framework: Unclassified (restrictive license on core components)
