University College London / Technical University of Munich
A 251M-parameter autoregressive protein-family language model for zero-shot variant fitness prediction and homology-guided protein design, with fully open code and data.
ProFam is an open-source family of protein-family language models (pfLMs) for zero-shot variant fitness prediction and homology-guided protein design, introduced in a December 2025 bioRxiv preprint by Wells, Hawkins-Hooker, Bordin, Orengo, Paige and colleagues at University College London together with Heinzinger, Rost and Dallago at the Technical University of Munich. Its flagship checkpoint, ProFam-1, is a 251-million-parameter autoregressive Transformer.
Most protein language models are either trained on individual sequences (like ESM) or rely on pre-computed multiple sequence alignments (MSAs), as the MSA Transformer and EVE do. ProFam takes a different route: it performs next-token prediction across concatenated, unaligned sets of homologous sequences drawn from the same protein family. This lets the model learn evolutionary couplings and conservation patterns directly from raw homologs, without the computational cost of building an MSA for each query at inference time. A query sequence can be scored or generated conditioned on a context of related sequences, combining the alignment-free convenience of single-sequence models with the evolutionary signal of alignment-based ones.
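The scoring idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the ProFam API: the `|` separator, the helper names, and above all `toy_next_token_log_prob`, a smoothed residue-frequency model standing in for a trained Transformer. It shows only the shape of the computation: concatenate unaligned homologs into a context, sum the conditional log-probabilities of the query's residues, and compare a variant against the wild type.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SEP = "|"  # hypothetical separator between unaligned homologs


def build_prompt(homologs):
    """Concatenate unaligned homologs into one context string."""
    return SEP.join(homologs) + SEP


def toy_next_token_log_prob(prefix, token):
    """Toy stand-in for a trained network: add-one-smoothed residue
    frequency over the prefix. A real pfLM would run a Transformer here."""
    counts = Counter(ch for ch in prefix if ch != SEP)
    total = sum(counts.values())
    return math.log((counts[token] + 1) / (total + len(AMINO_ACIDS)))


def sequence_log_likelihood(context, query, log_prob=toy_next_token_log_prob):
    """Sum log p(x_i | context, x_<i) over the residues of the query."""
    prefix, total = context, 0.0
    for token in query:
        total += log_prob(prefix, token)
        prefix += token
    return total


def variant_score(homologs, wild_type, variant):
    """Log-likelihood ratio of variant vs. wild type under the same context."""
    context = build_prompt(homologs)
    return (sequence_log_likelihood(context, variant)
            - sequence_log_likelihood(context, wild_type))


# Made-up sequences, for illustration only.
homologs = ["MKTAYIAKQR", "MKSAYIAKQ", "MRTAYLAKQRG"]
wild_type = "MKTAYIAKQR"
print(variant_score(homologs, wild_type, "MKTAYIAKQK"))  # R10K substitution
print(variant_score(homologs, wild_type, "MKTAYIAKQW"))  # R10W substitution
```

Because the homologs are never aligned, the same function handles insertions and deletions: the variant may simply have a different length from the wild type.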
The work is positioned as a fully reproducible, openly licensed alternative to closed protein-design systems: the model weights, training and inference pipelines, and the large-scale training corpus (ProFam Atlas) are all released publicly, lowering the barrier for academic groups to train and adapt family-conditioned models.
ProFam-1 is a 251M-parameter decoder-only Transformer trained with a protein-family language modeling (pfLM) objective over the ProFam Atlas corpus. ProFam Atlas bundles roughly 40 million protein families (~481 million sequences, ~84.6 GB) assembled from four complementary sources: ~2.3M FoldSeek structure-based families (~30M sequences) from AlphaFold DB clusters, ~37M MSA-based families (~246M sequences) from OpenProteinSet/OpenFold, ~38,000 CATH FunFam functional domain families (~765K sequences), and ~205M UniRef90 singleton sequences. Families are presented as concatenated, unaligned sequence sets so the model learns evolutionary structure end-to-end. On ProteinGym, ProFam-1 attains Spearman ρ ≈ 0.47 (substitutions) and ≈ 0.53 (indels), placing it among competitive zero-shot fitness predictors despite its modest parameter count. The released code supports both CPU-only and FlashAttention-2 inference, with a Python API and CLI for scoring and generation.
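For readers unfamiliar with the metric, the sketch below shows how a Spearman ρ between model scores and measured fitness can be computed for a single assay. It is a plain-Python illustration with made-up numbers, not the benchmark's own evaluation code, and the function names are hypothetical; reported benchmark figures of this kind are generally aggregates over many assays.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(predicted, measured):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Made-up values for five variants of one hypothetical assay.
model_scores = [-2.1, -0.3, 0.4, -1.2, 0.9]
assay_fitness = [0.10, 0.55, 0.60, 0.70, 0.95]
print(spearman_rho(model_scores, assay_fitness))
```

Since only ranks matter, the metric rewards a model for ordering variants correctly even when its scores are on a different scale from the assay readout.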
ProFam supports protein engineers and computational biologists who need to rank candidate mutations or design new variants before committing to wet-lab screening. Its zero-shot fitness scoring is useful for prioritizing substitutions and indels in directed-evolution campaigns, while homology-guided generation can propose diverse, structurally plausible sequences that respect a family's evolutionary constraints. Because the full training stack and ProFam Atlas dataset are open, the model also serves as a practical foundation for researchers wanting to retrain or fine-tune family-conditioned models on their own protein families or tasks.
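A prioritization workflow of this kind can be sketched as follows. The sketch enumerates every single-residue substitution of a wild-type sequence and keeps the top-scoring ones. The names `single_substitutions`, `rank_variants` and `placeholder_score` are hypothetical, and `placeholder_score` (fraction of hydrophobic residues) is a deliberately meaningless stand-in; in practice `score_fn` would wrap a real zero-shot fitness predictor.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(wild_type):
    """Yield (label, sequence) for every single-residue substitution."""
    for pos, original in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa != original:
                label = f"{original}{pos + 1}{aa}"  # e.g. "K2A", 1-based position
                yield label, wild_type[:pos] + aa + wild_type[pos + 1:]


def rank_variants(wild_type, score_fn, top_k=5):
    """Score all single substitutions and return the top_k as (label, score)."""
    scored = [(score_fn(seq), label) for label, seq in single_substitutions(wild_type)]
    scored.sort(reverse=True)
    return [(label, score) for score, label in scored[:top_k]]


def placeholder_score(sequence):
    """Hypothetical stand-in for a model score: fraction of hydrophobic residues."""
    return sum(ch in "AILMFVW" for ch in sequence) / len(sequence)


for label, score in rank_variants("MKT", placeholder_score, top_k=3):
    print(label, round(score, 3))
```

A sequence of length L has 19·L single substitutions, so exhaustive scoring stays cheap for single mutants; double mutants and indels grow much faster and usually call for sampling or a staged search.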
ProFam contributes a reproducible, openly licensed pathway into family-conditioned protein modeling, an area otherwise dominated by single-sequence and closed-source systems. By reaching competitive ProteinGym performance with a relatively small 251M-parameter model while releasing weights, code, and a curated ~481M-sequence training corpus, it lowers the entry cost for academic protein-design research and provides a transparent baseline for studying how homologous context improves fitness prediction and generation. Because it is a recent preprint, its long-term adoption and independent benchmarking remain to be established, and the released Hugging Face model card is currently minimal.
Wells, J., et al. (2025) ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design. bioRxiv.
DOI: 10.64898/2025.12.19.695431