Stony Brook University
AlphaFold fine-tuned via OpenFold on 944 high-resolution MHC-peptide crystal structures, achieving median peptide RMSD of 0.65 Å on held-out complexes.
MHC-Fine is a specialized variant of AlphaFold, fine-tuned exclusively on high-resolution MHC-peptide crystal structures to improve structural predictions for major histocompatibility complex (MHC) molecules with bound peptides. It was developed by Ernest Glukhov, Dmytro Kalitin, Darya Stepanenko, Yimin Zhu, Thu Nguyen, George Jones, Carlos Simmerling, Julie C. Mitchell, Sandor Vajda, Ken A. Dill, Dzmitry Padhorny, and Dima Kozakov at Stony Brook University, with collaborators from Oak Ridge National Laboratory and Boston University. The work was first posted as a bioRxiv preprint in November 2023 and subsequently published in Biophysical Journal in 2024.
The MHC-peptide system poses a particularly demanding structural prediction challenge. MHC molecules present peptide fragments for immune surveillance, and the precise geometry of how a peptide sits within the MHC binding groove — including its backbone conformation, side-chain orientations, and anchor residue interactions — determines whether a T cell receptor will recognize the complex. AlphaFold 2 and AlphaFold-Multimer are trained on the broad diversity of the Protein Data Bank, which provides general structural competence but insufficient specialization for the stereotyped, groove-filling geometry of peptide-MHC interactions. MHC-Fine directly addresses this gap through domain-specific fine-tuning on a curated structural dataset, sharpening AlphaFold's predictions for this immunologically important class of complexes.
A key implementation choice distinguishes MHC-Fine from other AlphaFold fine-tuning approaches: rather than modifying the original JAX-based AlphaFold codebase directly, the developers built on OpenFold — a PyTorch reimplementation of AlphaFold that supports efficient gradient-based fine-tuning through standard deep learning frameworks. This choice provides substantially more flexibility for training modifications, learning rate scheduling, and integration with the broader PyTorch ecosystem, and makes the training procedure more accessible to researchers without JAX expertise.
MHC-Fine uses OpenFold, a memory-efficient PyTorch reproduction of AlphaFold 2, as its training foundation. The final training dataset consists of 944 high-resolution MHC-peptide crystal structures collected from the Protein Data Bank, filtered by resolution and cleaned to remove redundant and low-quality entries. The dataset covers MHC class I and class II complexes across multiple human HLA alleles and selected non-human species. Fine-tuning starts from the pretrained AlphaFold 2 weights and applies supervised learning on the curated dataset, using the same structure prediction objectives as the original AlphaFold training so that the model concentrates on reproducing the peptide-MHC binding geometry.
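The curation step described above (a resolution filter followed by redundancy removal) can be illustrated with a minimal sketch. The source does not state the resolution cutoff or the exact redundancy criterion, so the `max_resolution` parameter, the `Entry` record, the placeholder identifiers, and the rule of keeping one best-resolution structure per (MHC sequence, peptide) pair are all assumptions made for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one MHC-peptide crystal structure."""
    pdb_id: str        # placeholder identifier, not a real PDB entry
    resolution: float  # crystallographic resolution in angstroms
    mhc_seq: str       # MHC chain sequence (or an allele label)
    peptide: str       # bound peptide sequence


def curate(entries, max_resolution):
    """Keep entries at or below the cutoff, one per (MHC, peptide) pair.

    Assumption: redundancy means an identical MHC sequence and peptide;
    among duplicates, the best-resolved structure is retained.
    """
    best = {}
    for entry in entries:
        if entry.resolution > max_resolution:
            continue  # fails the resolution filter
        key = (entry.mhc_seq, entry.peptide)
        if key not in best or entry.resolution < best[key].resolution:
            best[key] = entry
    return sorted(best.values(), key=lambda e: e.pdb_id)


if __name__ == "__main__":
    demo = [
        Entry("AAAA", 1.8, "MHCSEQ1", "SIINFEKL"),
        Entry("BBBB", 1.5, "MHCSEQ1", "SIINFEKL"),   # duplicate pair, better resolved
        Entry("CCCC", 3.2, "MHCSEQ2", "GILGFVFTL"),  # too coarse for a 2.5 cutoff
    ]
    for kept in curate(demo, max_resolution=2.5):
        print(kept.pdb_id, kept.resolution)
```

A real pipeline would likely cluster by sequence identity and inspect structure quality metrics beyond resolution; the exact-match key here is only the simplest stand-in.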
Evaluation on held-out MHC-peptide complexes uses the Cα RMSD between predicted and experimental peptide conformations as the primary accuracy metric, with pLDDT scores as an additional proxy for prediction confidence. The median peptide RMSD of 0.65 Å on the test set compares favorably to two competing methods: PANDORA, which builds homology models from templates in a structural database, and AlphaFold-Multimer, the standard approach for multi-chain complex prediction, which lacks specialization for the peptide-groove interaction geometry. The improvement is most pronounced for peptides with unusual sequence motifs and for alleles with limited structural templates, where AlphaFold-Multimer's general training is insufficient to place the peptide backbone correctly.
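As a rough illustration of the metric, the sketch below computes a peptide Cα RMSD with NumPy and then takes the median over a test set. The source does not specify the superposition protocol, so aligning the predicted complex onto the experimental one using the MHC Cα atoms before measuring the peptide deviation is an assumption, and the function names `kabsch` and `peptide_ca_rmsd` are hypothetical.

```python
import numpy as np


def kabsch(mobile, target):
    """Return rotation R and translation t that best map mobile onto target.

    Both arrays have shape (N, 3) with matched atom order. The fitted
    coordinates are obtained as `mobile @ R.T + t`.
    """
    mob_center = mobile.mean(axis=0)
    tgt_center = target.mean(axis=0)
    covariance = (mobile - mob_center).T @ (target - tgt_center)
    u, _, vt = np.linalg.svd(covariance)
    # Guard against an improper rotation (reflection).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    translation = tgt_center - rotation @ mob_center
    return rotation, translation


def peptide_ca_rmsd(pred_mhc, ref_mhc, pred_pep, ref_pep):
    """Peptide C-alpha RMSD after superposing on the MHC C-alpha atoms.

    Assumption: superposition uses the MHC chain only, so the result
    reflects how the peptide sits relative to the binding groove.
    """
    rotation, translation = kabsch(pred_mhc, ref_mhc)
    moved = pred_pep @ rotation.T + translation
    squared = np.sum((moved - ref_pep) ** 2, axis=1)
    return float(np.sqrt(squared.mean()))


def median_rmsd(per_complex_rmsds):
    """Median over per-complex RMSD values, the summary statistic reported."""
    return float(np.median(per_complex_rmsds))
```

Superposing on the peptide itself would give a smaller number that ignores placement within the groove, which is why the choice of alignment atoms should be stated alongside any reported RMSD.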
MHC-Fine is directly applicable in computational immunology workflows focused on structural accuracy of MHC-peptide complexes. Vaccine designers modeling how specific peptides from pathogen proteins engage different HLA alleles in target populations can use MHC-Fine to generate higher-fidelity structural models than standard AlphaFold-Multimer provides. Cancer immunotherapy researchers identifying neoantigen candidates can use MHC-Fine predictions to assess structural plausibility of candidate peptides in patient-specific HLA alleles, complementing sequence-based affinity predictions. Structural biologists using computational models to guide experimental mutagenesis — identifying residues in the peptide or MHC allele that alter binding geometry — benefit from the improved peptide RMSD accuracy. For researchers studying the molecular basis of alloreactivity, transplant rejection, or autoimmune antigen presentation, MHC-Fine enables more reliable structural hypotheses about which peptide-MHC combinations are structurally compatible. The multi-species training also makes MHC-Fine useful for veterinary immunology research where non-human MHC systems are studied.
MHC-Fine is a clear example of how domain-specific fine-tuning of a general-purpose structure predictor on a high-quality, task-relevant dataset can improve accuracy beyond what broad training achieves. The choice to build on OpenFold in PyTorch rather than the original JAX AlphaFold codebase is noteworthy as a practical contribution: it demonstrates that the OpenFold ecosystem is a viable platform for production-quality fine-tuning workflows, potentially lowering the barrier for future domain-specific AlphaFold adaptations. The gain over AlphaFold-Multimer that brings the median peptide RMSD to 0.65 Å, while modest in absolute terms, is meaningful for the MHC field, where differences of fractions of an angstrom in anchor residue positioning can determine whether a peptide is presented or rejected. One limitation is dataset size: 944 structures is sufficient for fine-tuning but may not capture the full diversity of the human HLA supertype landscape, and alleles with few or no crystal structures in the PDB will benefit less from the fine-tuning. The model also inherits AlphaFold's computational requirements and does not natively score binding affinities, so it must be combined with sequence-based affinity predictors for comprehensive peptide prioritization.