Google DeepMind
AlphaFold-derived model from Google DeepMind that predicts missense variant pathogenicity across the entire human proteome with AuROC 0.940 on ClinVar.
AlphaMissense is a computational model developed by Google DeepMind and published in Science in September 2023, designed to predict whether missense variants — single amino acid changes in a protein sequence — are likely to be pathogenic or benign. Missense variants are among the most common genetic differences between individuals, and determining which ones contribute to disease has been a longstanding challenge in clinical genetics and functional genomics. Most missense variants are classified as "variants of uncertain significance" (VUS) in clinical databases, creating diagnostic uncertainty for patients and physicians. AlphaMissense addresses this gap at proteome scale.
The model adapts the architecture of AlphaFold 2, the landmark protein structure prediction system, repurposing its learned representations of protein sequence and structure to score the functional consequences of amino acid substitutions. Rather than predicting a new 3D structure for each variant, AlphaMissense uses the structural context embedded in AlphaFold's representations, alongside evolutionary signals derived from multiple sequence alignments, to estimate the probability that a given substitution disrupts protein function. The key innovation is that structural and evolutionary information, jointly encoded in the representations learned by the AlphaFold network, provides a rich substrate for variant effect prediction without requiring explicit structure prediction at inference time.
AlphaMissense was released with a freely available precomputed database covering all 71 million possible missense variants (single amino acid substitutions reachable by a single-nucleotide change) across 19,233 canonical human proteins. Predictions are also integrated into the EMBL-EBI AlphaFold Protein Structure Database and are accessible through the Ensembl Variant Effect Predictor, enabling direct use in clinical genomics workflows.
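As a minimal sketch of how such a precomputed table might be queried, the Python snippet below parses a small tab-separated excerpt and looks up one substitution. The column names (`uniprot_id`, `protein_variant`, `am_pathogenicity`, `am_class`), the accession `P00000`, and the rows themselves are illustrative assumptions, not real entries; check the header of the actual release file before relying on them.

```python
import csv
import io

# Hypothetical excerpt; accession, variants, and values are made up.
EXAMPLE_TSV = """uniprot_id\tprotein_variant\tam_pathogenicity\tam_class
P00000\tA2V\t0.0812\tlikely_benign
P00000\tA2D\t0.9134\tlikely_pathogenic
P00000\tG3S\t0.4410\tambiguous
"""


def load_scores(handle):
    """Index rows by (protein accession, variant) for constant-time lookup."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["uniprot_id"], row["protein_variant"]): (
            float(row["am_pathogenicity"]),
            row["am_class"],
        )
        for row in reader
    }


scores = load_scores(io.StringIO(EXAMPLE_TSV))
score, label = scores[("P00000", "A2D")]
print(score, label)
```

Because every substitution is scored ahead of time, a lookup like this needs no model inference; for the full table, a streaming reader or an indexed file would be more practical than an in-memory dictionary.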
AlphaMissense closely follows the AlphaFold 2 architecture, inheriting its 93-million-parameter network built around the Evoformer stack and Structure Module. During the first training stage, the model is optimized for protein structure prediction identically to AlphaFold 2, but with an increased weight on the masked MSA reconstruction loss, which strengthens the network's sensitivity to individual sequence positions. In the second stage, the model is fine-tuned on a variant pathogenicity objective derived from human and primate population frequency databases following the PrimateAI approach: variants frequently observed across primate species are used as benign examples, while variants absent from population data are treated as putative pathogenic examples. Both structure prediction and variant classification objectives are jointly optimized during fine-tuning, preserving the structural representations while adding discriminative power for functional annotation.
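The weak-labeling idea behind the fine-tuning stage can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the function name `weak_label`, the frequency cutoff, the choice to leave rare-but-observed variants unlabeled, and the example frequencies are all hypothetical.

```python
from typing import Optional

# Hypothetical cutoff for "frequently observed"; the real criterion is not stated here.
COMMON_FREQ_THRESHOLD = 1e-3


def weak_label(allele_frequency: Optional[float]) -> Optional[int]:
    """Return 0 (benign proxy), 1 (pathogenic proxy), or None (unused).

    allele_frequency is None when the variant is absent from population data.
    """
    if allele_frequency is None:
        return 1  # never observed: treated as putatively pathogenic
    if allele_frequency >= COMMON_FREQ_THRESHOLD:
        return 0  # commonly observed: treated as benign
    return None  # rare but observed: left unlabeled in this sketch


# Made-up variants and frequencies for illustration.
examples = {"A2V": 0.02, "A2D": None, "G3S": 1e-5}
labels = {variant: weak_label(freq) for variant, freq in examples.items()}
print(labels)
```

The point of the sketch is that no clinical annotation enters the labels: both classes are proxies derived from population observation alone, which is why the resulting scores can reflect selection effects as well as clinical pathogenicity.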
At inference, the model takes a protein sequence and its MSA as input and outputs a continuous pathogenicity score for a queried amino acid substitution without re-computing a full 3D structure. On the ClinVar benchmark of 18,924 variants (9,462 pathogenic and 9,462 benign from 999 proteins), AlphaMissense achieves an area under the receiver operating characteristic curve (AuROC) of 0.940, outperforming prior methods including EVE (0.911) and CADD. At the 90% precision cutoff, 32% of all possible human missense variants (22.8 million) are classified as likely pathogenic and 57% (40.9 million) as likely benign, leaving approximately 11% in an ambiguous intermediate range.
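To make the evaluation metric concrete, the sketch below computes AuROC from its rank interpretation: the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counted as one half. The scores are toy values, not ClinVar data, and `auroc` is a hypothetical helper name.

```python
from itertools import product


def auroc(pathogenic_scores, benign_scores):
    """Fraction of (pathogenic, benign) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p, b in product(pathogenic_scores, benign_scores):
        if p > b:
            wins += 1.0
        elif p == b:
            wins += 0.5
    return wins / (len(pathogenic_scores) * len(benign_scores))


# Toy scores for illustration only.
pathogenic = [0.95, 0.80, 0.60, 0.30]
benign = [0.10, 0.20, 0.35, 0.05]

print(auroc(pathogenic, benign))
```

The pairwise form is quadratic in the number of variants, so a sort-based implementation would be preferable on a benchmark with thousands of variants per class. A balanced benchmark such as the one above (9,462 + 9,462 = 18,924 variants) keeps the metric from being dominated by either class.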
AlphaMissense is directly applicable to clinical variant interpretation, where it can prioritize VUS candidates from exome and genome sequencing for follow-up functional studies or support diagnostic decision-making for rare disease patients. Researchers in functional genomics use the precomputed database as a fast prior for designing mutagenesis screens, for instance by targeting the subset of variants predicted pathogenic and validating their AlphaMissense scores with deep mutational scanning or cell-based assays. Drug developers use the pathogenicity landscape to identify constitutively activating or loss-of-function mutations in disease-relevant proteins, informing target validation and understanding of disease-associated alleles. Population geneticists can overlay AlphaMissense scores onto genome-wide association study (GWAS) signals or linkage disequilibrium blocks to annotate likely functional variants within associated loci. Integration with the Ensembl VEP means that AlphaMissense predictions are available as a standard annotation layer in the genomic analysis pipelines used by clinical and academic sequencing centers worldwide.
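The VUS prioritization use case can be sketched in a few lines of Python. Everything here is hypothetical: the `ScoredVariant` type, the `prioritize` helper, the gene and variant names, and the `min_score` default of 0.5, which is an arbitrary illustrative cutoff and not a published threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredVariant:
    gene: str
    protein_variant: str
    score: float  # pathogenicity score in [0, 1]


def prioritize(variants, min_score=0.5, top_n=None):
    """Keep variants at or above min_score, highest score first."""
    kept = sorted(
        (v for v in variants if v.score >= min_score),
        key=lambda v: v.score,
        reverse=True,
    )
    return kept if top_n is None else kept[:top_n]


# Made-up VUS list for illustration.
vus = [
    ScoredVariant("GENE1", "R10W", 0.97),
    ScoredVariant("GENE2", "A45T", 0.12),
    ScoredVariant("GENE3", "L88P", 0.71),
]

for v in prioritize(vus):
    print(v.gene, v.protein_variant, v.score)
```

A ranking like this only orders candidates for follow-up; the scores are one line of computational evidence and would normally be combined with segregation, functional, and phenotype data before any clinical conclusion is drawn.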
AlphaMissense's publication in Science attracted immediate attention from both the clinical genetics and computational biology communities, as it offered a principled, proteome-wide solution to the VUS problem that had resisted resolution for decades. The model's transfer of structural learning to variant effect prediction demonstrated that the AlphaFold representations encode functional information far beyond what is needed for structure alone, opening a conceptual direction for adapting foundation models to downstream annotation tasks. Critically, the free availability of the precomputed database — with no per-query computational cost — lowers barriers for clinical laboratories that lack access to GPU infrastructure. Limitations include the fact that AlphaMissense does not predict the structural consequences of individual substitutions (it does not generate new 3D coordinates per variant), and its training signal from population frequency data can conflate pathogenicity with selection effects unrelated to clinical disease. Performance is also reduced for understudied protein families with sparse MSAs, where the evolutionary representation is less informative. Nonetheless, AlphaMissense is widely cited as one of the most useful general-purpose tools for proteome-scale missense annotation, and its integration into community databases has accelerated its adoption across both academic and clinical genomics.