bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.

ProFam

University College London / Technical University of Munich

A 251M-parameter autoregressive protein-family language model for zero-shot variant fitness prediction and homology-guided protein design, with fully open code and data.

Released: December 2025
Parameters: 251 Million

ProFam is an open-source family of protein-family language models (pfLMs) for zero-shot variant fitness prediction and homology-guided protein design, introduced in a December 2025 bioRxiv preprint by Wells, Hawkins-Hooker, Bordin, Orengo, Paige and colleagues at University College London together with Heinzinger, Rost and Dallago at the Technical University of Munich. Its flagship checkpoint, ProFam-1, is a 251-million-parameter autoregressive Transformer.

Most protein language models either operate on individual sequences (like ESM) or depend on pre-computed multiple sequence alignments (MSAs) that must also be built at inference time (like the MSA Transformer or EVE). ProFam takes a different route: it performs next-token prediction across concatenated, unaligned sets of homologous sequences drawn from the same protein family. This lets the model learn evolutionary couplings and conservation patterns directly from raw homologs, without the computational cost of building MSAs at inference. A query sequence can be scored or generated conditioned on a context of related sequences, blending the strengths of single-sequence and alignment-based approaches.
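To make the family-conditioned setup concrete, here is a minimal sketch of how unaligned homologs and a query might be laid out as one token stream for next-token prediction. The separator and end tokens (`<sep>`, `<eos>`), the one-token-per-residue tokenization, and both function names are assumptions made for illustration; ProFam's actual vocabulary and document layout may differ.

```python
from typing import List, Tuple

SEP = "<sep>"  # hypothetical separator token; the real vocabulary may differ
EOS = "<eos>"  # hypothetical end-of-document token


def build_family_document(homologs: List[str], query: str) -> List[str]:
    """Concatenate unaligned homologs, then the query, into one token list."""
    tokens: List[str] = []
    for seq in homologs:
        tokens.extend(seq)  # one token per residue, no gap characters
        tokens.append(SEP)
    tokens.extend(query)
    tokens.append(EOS)
    return tokens


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Each example is (prefix, next token), as in any causal language model."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Two homologs of different lengths followed by a query; nothing is aligned.
doc = build_family_document(["MKTAY", "MKSAYL"], "MKTA")
pairs = next_token_pairs(doc)
```

Because the homologs are simply concatenated, sequences of different lengths need no gap padding, and every query residue is predicted with the full family context in its prefix.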

The work is positioned as a fully reproducible, openly licensed alternative to closed protein-design systems: the model weights, training and inference pipelines, and the large-scale training corpus (ProFam Atlas) are all released publicly, lowering the barrier for academic groups to train and adapt family-conditioned models.

#Key Features

  • Family-conditioned autoregressive modeling: ProFam predicts the next token across concatenated, unaligned sequences from a protein family, capturing covariance and conservation without requiring an MSA at inference time.
  • Zero-shot fitness prediction: On the ProteinGym benchmark, ProFam-1 is competitive with state-of-the-art models, reaching Spearman correlations of approximately 0.47 for substitutions and 0.53 for indels.
  • Homology-guided generation: Conditioned on family context, the model generates diverse sequences with high predicted structural similarity while preserving residue conservation and covariation patterns.
  • Fully open stack: Weights (Hugging Face), training and inference code (GitHub, MIT license), and the ProFam Atlas dataset (Zenodo, CC-BY-4.0) are all released for reproducible research.
  • Variant scoring via log-likelihood: Mutations and indels are scored using model likelihoods relative to family context, enabling both single-point and insertion/deletion effect prediction.
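The log-likelihood scoring described in the last bullet can be sketched as a log-likelihood ratio between a variant and its wild type, both conditioned on the same family context. The scorer below is a deliberately crude stand-in (smoothed residue frequencies from the context), not ProFam's model or API; `toy_log_likelihood` and `variant_score` are hypothetical names.

```python
import math
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_log_likelihood(seq: str, context: List[str]) -> float:
    """Stand-in for a family-conditioned model: smoothed residue frequencies.

    A real pfLM would return the sum of next-token log-probabilities of `seq`
    given the concatenated context; this toy version ignores residue order.
    """
    counts = Counter("".join(context))
    total = sum(counts.values()) + len(AMINO_ACIDS)  # add-one smoothing
    return sum(math.log((counts[aa] + 1) / total) for aa in seq)


def variant_score(wild_type: str, variant: str, context: List[str]) -> float:
    """Log-likelihood ratio: positive means the variant looks more family-like."""
    return toy_log_likelihood(variant, context) - toy_log_likelihood(wild_type, context)


context = ["MKTAYL", "MKTAYI", "MKSAYL"]
wild_type = "MKTAYL"
substitution = "MKWAYL"  # T3W: tryptophan never appears in the context
deletion = "MKAYL"       # deletion of T3; sequences need not share a length

sub_score = variant_score(wild_type, substitution, context)
del_score = variant_score(wild_type, deletion, context)
```

Since whole sequences are scored rather than aligned positions, the same ratio covers substitutions and indels. Note that with this toy scorer a deletion always raises the likelihood because it removes a term from the sum; how ProFam handles length effects is not stated here, so any normalization would be an additional assumption.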

#Technical Details

ProFam-1 is a 251M-parameter decoder-only Transformer trained with a protein-family language modeling (pfLM) objective over the ProFam Atlas corpus. ProFam Atlas bundles roughly 40 million protein families (~481 million sequences, ~84.6 GB) assembled from four complementary sources:

  • ~2.3M FoldSeek structure-based families (~30M sequences) from AlphaFold DB clusters
  • ~37M MSA-based families (~246M sequences) from OpenProteinSet/OpenFold
  • ~38,000 CATH FunFam functional domain families (~765K sequences)
  • ~205M UniRef90 singleton sequences

Families are presented as concatenated, unaligned sequence sets so the model learns evolutionary structure end-to-end. On ProteinGym, ProFam-1 attains Spearman ρ ≈ 0.47 (substitutions) and ≈ 0.53 (indels), placing it among competitive zero-shot fitness predictors despite its modest parameter count. The released code supports both CPU-only and FlashAttention-2 inference, with a Python API and CLI for scoring and generation.

#Applications

ProFam supports protein engineers and computational biologists who need to rank candidate mutations or design new variants without wet-lab screening. Its zero-shot fitness scoring is useful for prioritizing substitutions and indels in directed-evolution campaigns, while homology-guided generation can propose diverse, structurally plausible sequences that respect a family's evolutionary constraints. Because the full training stack and ProFam Atlas dataset are open, the model also serves as a practical foundation for researchers who want to retrain or fine-tune family-conditioned models on their own protein families or tasks.

#Impact

ProFam contributes a reproducible, openly licensed pathway into family-conditioned protein modeling, an area otherwise dominated by single-sequence and closed-source systems. By reaching competitive ProteinGym performance with a relatively small 251M-parameter model while releasing weights, code, and a curated ~481M-sequence training corpus, it lowers the entry cost for academic protein-design research and provides a transparent baseline for studying how homologous context improves fitness prediction and generation. Because it is a recent preprint, its long-term adoption and independent benchmarking remain to be established, and the released Hugging Face model card is currently minimal.

Citation

Wells, J., et al. (2025) ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design. bioRxiv.

DOI: 10.64898/2025.12.19.695431

Recent citations

Papers that recently cited this model.

  • Multi-modal feature learning to prioritize ADCs with favorable half-life in mice

    Xiang-Wei Zhu, Khushboo A. Jani, Wenjia Gu, et al.

    mAbs · Apr 2026

    Citations: 0

Citations

Total Citations: 1
Influential: 0
References: 53

GitHub

Stars: 58
Forks: 14
Open Issues: 8
Contributors: 7
Last Push: 15d ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 2
Last Modified: 3mo ago

Fields of citing research

  • Medicine: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall score: 86 (Open)
Usability (can I run it?): 91
Reproducibility (can I retrain it?): 85
Model Openness Framework: Unclassified (missing required components)

Tags

de_novo_design, generative, language_model, protein_design, proteomics, transformer, variant_effect_prediction, zero_shot

Resources

  • GitHub Repository
  • Research Paper
  • HuggingFace Model
  • Demo
  • Dataset