bio.rodeo

Protein

HelixFold-Single

PaddlePaddle

MSA-free protein structure prediction that replaces multiple sequence alignments with a protein language model pre-trained on billions of sequences.

Released: 2023

Overview

HelixFold-Single is a protein structure prediction method developed by PaddlePaddle that eliminates the multiple sequence alignment (MSA) step, a computational bottleneck in pipelines like AlphaFold2. Rather than searching protein databases for homologous sequences, a step that typically takes tens of minutes per query, HelixFold-Single replaces MSA-derived evolutionary signals with representations from a large-scale protein language model (PLM) trained on hundreds of millions of primary sequences via self-supervised learning. The resulting end-to-end system predicts 3D atomic coordinates from a single amino acid sequence in seconds.

The paper was posted as an arXiv preprint in July 2022 and published in Nature Machine Intelligence in October 2023 under the title "A method for multiple-sequence-alignment-free protein structure prediction using a protein language model." The work is part of PaddlePaddle's broader PaddleHelix bio-computing platform, which also includes HelixFold, an efficient AlphaFold2 reimplementation in the PaddlePaddle deep learning framework.

HelixFold-Single sits alongside contemporaneous MSA-free approaches such as ESMFold (Meta AI) and OmegaFold, all of which demonstrated that protein language models can serve as viable alternatives to MSA-based evolutionary information. Its distinctive contribution is the three-component architectural design that integrates a pre-trained PLM with AlphaFold2's geometric learning machinery in a single differentiable pipeline.

Key Features

  • MSA-free inference: Eliminates database search entirely, reducing per-protein prediction time from tens of minutes to seconds and making large-scale proteome-wide predictions practical.
  • Three-component architecture: Combines a PLM Base (masked language model), an Adaptor module that maps PLM representations into the single and pair representations the structure layers expect, and a Geometric Modeling module derived from AlphaFold2's structure module and its Invariant Point Attention (IPA) layers.
  • Self-supervised pre-training: The protein language model is trained with a masked language modeling objective on hundreds of millions of primary sequences, learning co-evolutionary patterns without requiring paired structural data.
  • Competitive accuracy on well-sampled proteins: Achieves accuracy comparable to AlphaFold2 with MSA input and outperforms RoseTTAFold with MSA input on CAMEO benchmark targets with large homologous families.
  • End-to-end differentiable: The full pipeline from primary sequence to 3D coordinates is trainable jointly, allowing the PLM representations and structure module to co-adapt during fine-tuning.
  • PaddlePaddle ecosystem integration: Implemented in PaddlePaddle and distributed as part of PaddleHelix, providing a unified framework for running multiple HelixFold variants.
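
The three-component data flow listed above can be sketched with toy stand-ins. This is only an illustration of how shapes move through the pipeline, not the PaddleHelix implementation: the function names (`plm_base`, `adaptor`, `geometric_modeling`, `predict`), the embedding size, the random embeddings, and the way the pair representation is built by concatenating residue vectors are all assumptions made for this example.

```python
import random

def plm_base(sequence, dim=8, seed=0):
    """Stand-in for the PLM Base: returns one vector per residue.
    A real PLM would compute these with a transformer; here they are random."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in sequence]

def adaptor(token_reprs):
    """Toy Adaptor (assumed design): the single representation is a copy of the
    token vectors, and the pair entry (i, j) concatenates vectors i and j."""
    single = [list(vec) for vec in token_reprs]
    pair = [[vi + vj for vj in token_reprs] for vi in token_reprs]
    return single, pair

def geometric_modeling(single, pair):
    """Placeholder for the structure module: emits one (x, y, z) per residue.
    Residues are laid out on a straight line 3.8 apart, purely as dummy output."""
    return [(3.8 * i, 0.0, 0.0) for i in range(len(single))]

def predict(sequence):
    """Sequence in, coordinates out: PLM Base -> Adaptor -> Geometric Modeling."""
    single, pair = adaptor(plm_base(sequence))
    return geometric_modeling(single, pair)

coords = predict("MKTAYIAKQR")
print(len(coords), "residues with placeholder coordinates")
```

The point of the sketch is the interface: a sequence of length L yields an L x d single representation and an L x L pair representation, and no database search appears anywhere in the call chain.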

Technical Details

HelixFold-Single is structured as three sequential modules. The PLM Base is a large-scale transformer trained on primary protein sequences using masked language modeling, where 15% of residues are randomly masked during pre-training on hundreds of millions of sequences drawn from public protein databases. The Adaptor is a learned projection layer that maps PLM token representations into the pair and single representations expected by AlphaFold2's downstream modules. The Geometric Modeling component adopts AlphaFold2's structure module with Invariant Point Attention (IPA), which places residues in 3D space as rigid-body frames and predicts backbone and side-chain torsion angles.
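
The masking step of the pre-training objective described above can be illustrated in a few lines. This is a minimal sketch, not the model's actual data pipeline: the `<mask>` token name and the `mask_residues` helper are hypothetical, and it masks an exact 15% of positions, whereas real masked language model recipes often sample positions independently and sometimes substitute random residues instead of a mask token.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical token name, chosen for this example

def mask_residues(sequence, mask_rate=0.15, seed=None):
    """Replace roughly `mask_rate` of residues with MASK_TOKEN.

    Returns the masked token list and a dict mapping each masked
    position to the original residue the model would have to recover.
    """
    if not sequence:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets

sequence = "MKTAYIAKQRQISFVKSHFS"  # 20 residues, so 3 are masked at 15%
tokens, targets = mask_residues(sequence, seed=0)
print(" ".join(tokens))
print(targets)
```

During pre-training the model sees only `tokens` and is scored on how well it predicts the residues stored in `targets`, which is how it can pick up co-evolutionary regularities from sequence data alone.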

Training proceeds in two stages: self-supervised pre-training of the PLM on sequence data alone, followed by supervised fine-tuning of the full pipeline on Protein Data Bank (PDB) structures. On CAMEO benchmarks, HelixFold-Single achieves accuracy comparable to AlphaFold2 (with full MSA) on targets with large homologous families, and outperforms RoseTTAFold (with MSA) on those same targets. Performance degrades on proteins that are genuinely sequence-unique (orphan proteins with few homologs), reflecting the PLM's reliance on having learned co-evolutionary signals from related sequences in its training corpus. The model is implemented in PaddlePaddle and requires CUDA 11.2 and cuDNN 8.x for GPU inference.

Applications

HelixFold-Single is well suited to scenarios where prediction throughput matters more than maximally accurate individual structures. High-throughput drug discovery campaigns that screen thousands of protein variants benefit from inference times of seconds per sequence. Proteome-wide structural annotation projects, where running a full MSA search for every entry in a large proteome would be prohibitive, can use HelixFold-Single to generate rapid baseline structures. The model is also useful in environments with limited computational infrastructure, since removing the MSA search step eliminates the need for terabyte-scale sequence databases and the cluster resources needed to search them. For orphan proteins that lack homologs, HelixFold-Single provides a prediction path that is not blocked by an empty MSA, though accuracy on such targets is lower than for well-sampled protein families.
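
A back-of-the-envelope calculation shows why the throughput argument matters at proteome scale. The specific numbers below are assumptions chosen only to match the orders of magnitude quoted in this profile ("tens of minutes" per MSA-based query, "seconds" per single-sequence prediction); they are not measured benchmarks, and the proteome size of 20,000 sequences is likewise illustrative.

```python
def wall_clock_hours(n_proteins, seconds_per_protein, n_workers=1):
    """Total wall-clock time in hours, assuming perfectly parallel workers."""
    return n_proteins * seconds_per_protein / n_workers / 3600.0

# Illustrative assumptions, not benchmark results.
N_PROTEINS = 20_000        # assumed proteome size
MSA_SECONDS = 20 * 60      # assumed MSA-based cost: 20 minutes per protein
SINGLE_SEQ_SECONDS = 5     # assumed MSA-free cost: 5 seconds per protein

msa_hours = wall_clock_hours(N_PROTEINS, MSA_SECONDS)
single_hours = wall_clock_hours(N_PROTEINS, SINGLE_SEQ_SECONDS)

print(f"MSA-based pipeline:   {msa_hours:,.0f} hours")
print(f"Single-sequence model: {single_hours:,.1f} hours")
print(f"Speedup:               {msa_hours / single_hours:.0f}x")
```

Under these assumed figures the gap is roughly 6,667 hours versus about 28 hours on one worker, a 240-fold difference. The ratio depends only on the per-protein times, so it holds regardless of proteome size or worker count.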

Impact

HelixFold-Single contributed to a growing body of work — alongside ESMFold and OmegaFold — that demonstrated the viability of protein language models as drop-in replacements for MSA-based evolutionary information in structure prediction. Its publication in Nature Machine Intelligence in 2023 provided peer-reviewed validation of the approach and concrete benchmark comparisons against AlphaFold2 and RoseTTAFold. The model is distributed within PaddlePaddle's open-source PaddleHelix platform, giving the research community an accessible implementation. A recognized limitation is that the accuracy advantage of full MSA-based methods re-emerges for proteins with sparse homolog coverage, meaning HelixFold-Single is best understood as a speed-accuracy tradeoff rather than a universal replacement for MSA-based pipelines. The broader MSA-free paradigm it helped establish continues to influence subsequent single-sequence prediction models and protein design workflows that require rapid structural context.

Citation

A method for multiple-sequence-alignment-free protein structure prediction using a protein language model

Fang, X., Wang, F., Liu, L., He, J., Lin, D., Xiang, Y., Zhang, X., Wu, H., Li, H., & Song, L. (2023). A method for multiple-sequence-alignment-free protein structure prediction using a protein language model. Nature Machine Intelligence, 5, 1341-1348.

DOI: 10.1038/s42256-023-00721-6

Metrics

GitHub

Stars: 1.1K
Forks: 226
Open Issues: 75
Contributors: 28
Last Push: 25d ago
Language: Python

Citations

Total Citations: 78
Influential: 3
References: 46

Tags

structure prediction, foundation model, single sequence

Resources

GitHub Repository, Research Paper, Official Website