ProFam

University College London / Technical University of Munich

Protein-family language model trained on unaligned homolog sets for zero-shot variant fitness prediction and design. ProFam-1 holds 251M parameters.

Released: December 2025

Parameters: 251 Million

ProFam is an open-source family of protein-family language models (pfLMs) for zero-shot variant fitness prediction and homology-guided protein design, introduced in a December 2025 bioRxiv preprint by Wells, Hawkins-Hooker, Bordin, Orengo, Paige and colleagues at University College London together with Heinzinger, Rost and Dallago at the Technical University of Munich. Its flagship checkpoint, ProFam-1, is a 251-million-parameter autoregressive Transformer.

Most protein language models either model individual sequences (like ESM) or depend on pre-computed multiple sequence alignments (MSAs) as input (like the MSA Transformer or EVE). ProFam takes a different route: it performs next-token prediction across concatenated, unaligned sets of homologous sequences drawn from the same protein family. This lets the model learn evolutionary couplings and conservation patterns directly from raw homologs, without the computational cost of building an MSA for each query. A query sequence can be scored or generated conditioned on a context of related sequences, combining the strengths of single-sequence and alignment-based approaches.
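To make the input format concrete, the sketch below shows one way a family context and a query could be flattened into a single string for next-token prediction. The special tokens (`[start]`, `[SEP]`, `[end]`) and the function name `build_family_prompt` are hypothetical; the tokens and layout used by the released ProFam code may differ.

```python
from typing import List

# Hypothetical special tokens; the real ProFam vocabulary may differ.
BOS, SEP, EOS = "[start]", "[SEP]", "[end]"


def build_family_prompt(homologs: List[str], query: str = "") -> str:
    """Concatenate unaligned homologs, then an optional query, into one string.

    Gap characters are stripped so that no alignment information is carried
    over: the model only ever sees raw, unaligned sequences.
    """
    cleaned = [seq.replace("-", "").upper() for seq in homologs]
    parts = [BOS] + [seq + SEP for seq in cleaned if seq]
    if query:
        # Scoring mode: the query is appended and closed with an end token.
        parts.append(query.upper() + EOS)
    # With an empty query the prompt ends after the last separator, which is
    # where an autoregressive model would start generating a new family member.
    return "".join(parts)


prompt = build_family_prompt(["MKT-AY", "mktayi"], query="MKSAY")
print(prompt)  # [start]MKTAY[SEP]MKTAYI[SEP]MKSAY[end]
```

Leaving `query` empty gives a prompt suited to generation, while passing a query gives the string whose query tokens would be scored conditioned on the preceding homologs.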

The work is positioned as a fully reproducible, openly licensed alternative to closed protein-design systems: the model weights, training and inference pipelines, and the large-scale training corpus (ProFam Atlas) are all released publicly, lowering the barrier for academic groups to train and adapt family-conditioned models.

Key Features

Family-conditioned autoregressive modeling: ProFam predicts the next token across concatenated, unaligned sequences from a protein family, capturing covariance and conservation without requiring an MSA at inference time.
Zero-shot fitness prediction: On the ProteinGym benchmark, ProFam-1 is competitive with state-of-the-art models, reaching Spearman correlations of approximately 0.47 for substitutions and 0.53 for indels.
Homology-guided generation: Conditioned on family context, the model generates diverse sequences with high predicted structural similarity while preserving residue conservation and covariation patterns.
Fully open stack: Weights (Hugging Face), training and inference code (GitHub, MIT license), and the ProFam Atlas dataset (Zenodo, CC-BY-4.0) are all released for reproducible research.
Variant scoring via log-likelihood: Mutations and indels are scored using model likelihoods relative to family context, enabling both single-point and insertion/deletion effect prediction.
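The log-likelihood scoring described in the last feature can be sketched as a difference between the conditional log-likelihood of the variant and that of the wild type, given the same family context. The code below is a minimal illustration, not the ProFam implementation: `toy_log_likelihood` is a stand-in that uses smoothed residue frequencies from the context, and the optional per-residue normalization for indels is an assumption rather than something the source confirms.

```python
import math
from collections import Counter
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_log_likelihood(seq: str, context: List[str]) -> float:
    """Stand-in for a family-conditioned model (NOT ProFam).

    Scores each residue by its add-one-smoothed frequency in the context.
    A real pfLM would return the sum of next-token log-probabilities instead.
    """
    counts = Counter("".join(context))
    total = sum(counts[a] for a in AMINO_ACIDS) + len(AMINO_ACIDS)
    return sum(math.log((counts[a] + 1) / total) for a in seq)


def variant_score(
    wild_type: str,
    variant: str,
    context: List[str],
    log_likelihood: Callable[[str, List[str]], float] = toy_log_likelihood,
    length_normalize: bool = False,
) -> float:
    """Log-likelihood ratio of variant versus wild type under a shared context.

    Works for substitutions and indels alike, because whole sequences are
    scored. length_normalize divides by sequence length, one possible way to
    compare sequences of different lengths (an assumption, see lead-in).
    """
    ll_var = log_likelihood(variant, context)
    ll_wt = log_likelihood(wild_type, context)
    if length_normalize:
        ll_var /= max(len(variant), 1)
        ll_wt /= max(len(wild_type), 1)
    return ll_var - ll_wt


family = ["MKTAY", "MKTAY", "MKSAY"]
print(variant_score("MKTAY", "MKTAW", family))  # negative: W is unseen in the family
print(variant_score("MKTAY", "MKTY", family, length_normalize=True))  # deletion
```

A positive score means the model finds the variant more plausible than the wild type in the context of the family; a negative score suggests a likely deleterious change.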

Technical Details

ProFam-1 is a 251M-parameter decoder-only Transformer trained with a protein-family language modeling (pfLM) objective over the ProFam Atlas corpus. ProFam Atlas bundles roughly 40 million protein families (~481 million sequences, ~84.6 GB) assembled from four complementary sources: ~2.3M FoldSeek structure-based families (~30M sequences) from AlphaFold DB clusters, ~37M MSA-based families (~246M sequences) from OpenProteinSet/OpenFold, ~38,000 CATH FunFam functional domain families (~765K sequences), and ~205M UniRef90 singleton sequences. Families are presented as concatenated, unaligned sequence sets so the model learns evolutionary structure end-to-end. On ProteinGym, ProFam-1 attains Spearman ρ ≈ 0.47 (substitutions) and ≈ 0.53 (indels), placing it among competitive zero-shot fitness predictors despite its modest parameter count. The released code supports both CPU-only and FlashAttention-2 inference, with a Python API and CLI for scoring and generation.
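The Spearman ρ values quoted above measure how well the ranking of model scores agrees with the ranking of experimentally measured fitness. As a reference for what that metric computes, here is a small pure-Python sketch for a single assay; the function names are illustrative, and benchmark pipelines such as ProteinGym typically compute the correlation per assay and then aggregate across assays.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Made-up numbers for illustration only, not ProteinGym data.
model_scores = [-2.1, -0.3, -1.2, 0.4]
measured_fitness = [0.1, 0.8, 0.9, 1.0]
rho = spearman(model_scores, measured_fitness)
```

Because only ranks matter, the metric is insensitive to the scale of the model's log-likelihood scores, which is why it suits zero-shot predictors that are never calibrated against assay units.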

Applications

ProFam supports protein engineers and computational biologists who need to rank candidate mutations or design new variants without wet-lab screening. Its zero-shot fitness scoring is useful for prioritizing substitutions and indels in directed-evolution campaigns, while homology-guided generation enables proposing diverse, structurally plausible sequences that respect a family's evolutionary constraints. Because the full training stack and ProFam Atlas dataset are open, the model also serves as a practical foundation for researchers wanting to retrain or fine-tune family-conditioned models on their own protein families or tasks.
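As a rough illustration of the mutation-prioritization workflow, the sketch below enumerates every single-residue substitution of a sequence and keeps the top-scoring candidates. The scoring function is passed in as a callable; `placeholder_score` is a deliberately trivial, hypothetical stand-in and would be replaced by a family-conditioned model score in practice.

```python
from typing import Callable, Iterator, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitutions(wild_type: str) -> Iterator[Tuple[str, str]]:
    """Yield (label, sequence) for each single substitution, e.g. ('K2W', 'MWT')."""
    for pos, wt_res in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa != wt_res:
                label = f"{wt_res}{pos + 1}{aa}"  # 1-based position, standard notation
                yield label, wild_type[:pos] + aa + wild_type[pos + 1:]


def rank_variants(
    wild_type: str,
    score_fn: Callable[[str], float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score all single substitutions and return the top_k, best first."""
    scored = [(label, score_fn(seq)) for label, seq in single_substitutions(wild_type)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def placeholder_score(seq: str) -> float:
    """Hypothetical scorer used only to make the example runnable."""
    return float(seq.count("W"))


top = rank_variants("MKT", placeholder_score, top_k=3)
print(top)  # [('M1W', 1.0), ('K2W', 1.0), ('T3W', 1.0)]
```

A sequence of length L yields 19 × L single substitutions, so exhaustive scoring stays cheap for single mutants; combinatorial libraries would need sampling or a search strategy instead.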

Impact

ProFam contributes a reproducible, openly licensed pathway into family-conditioned protein modeling, an area otherwise dominated by single-sequence and closed-source systems. By reaching competitive ProteinGym performance with a relatively small 251M-parameter model while releasing weights, code, and a curated ~481M-sequence training corpus, it lowers the entry cost for academic protein-design research and provides a transparent baseline for studying how homologous context improves fitness prediction and generation. Because it is a recent preprint, its long-term adoption and independent benchmarking remain to be established, and the released Hugging Face model card is currently minimal.

Citation

ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design

Wells, J., et al. (2025) ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design. bioRxiv.

DOI: 10.64898/2025.12.19.695431

Recent citations

Papers that recently cited this model.

ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
Michele Garibbo, Gerard Boxó, Filippo Stocco, et al.
bioRxiv · Jun 2026
0 · Influential

Multi-modal feature learning to prioritize ADCs with favorable half-life in mice
Xiang-Wei Zhu, Khushboo A. Jani, Wenjia Gu, et al.
mAbs · Apr 2026
0

Top citations

The most-cited papers that cite this model.

Multi-modal feature learning to prioritize ADCs with favorable half-life in mice
Xiang-Wei Zhu, Khushboo A. Jani, Wenjia Gu, et al.
mAbs · Apr 2026
0

ProtGPT3: an Open-source family of Promptable and Aligned Protein Language Models
Michele Garibbo, Gerard Boxó, Filippo Stocco, et al.
bioRxiv · Jun 2026
0 · Influential

Citations

Total Citations: 2

Influential: 1

References: 53

GitHub

Stars: 58

Forks: 15

Open Issues: 11

Contributors: 7

Last Push: 2mo ago

Language: Python

License: MIT

HuggingFace

Downloads: 0

Likes: 2

Last Modified: 4mo ago

Fields of citing research

Biology: 50%
Computer Science: 50%
Medicine: 50%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

Overall: 86 (Open)

Usability (can I run it?): 91

Reproducibility (can I retrain it?): 85

Model Openness Framework

Unclassified

Missing required components

Resources

GitHub Repository · Research Paper · HuggingFace Model · Demo · Dataset
