sm_protgpt2

University of Naples Federico II / University of Bern

Three fixed ProtGPT2 fine-tunes specialized for metalloprotein generation, trained on ProteinMPNN-derived synthetic sequences.

Released: May 2026

sm_protgpt2 is a family of three fine-tuned protein language models that adapt the general-purpose generative model ProtGPT2 toward a specific functional class: metalloproteins, the roughly one-third of all proteins that depend on bound metal ions for catalysis, electron transfer, or structural stability. It was developed by Giulia Peteani and Thomas Lemmin at the University of Bern together with Gianmattia Sgueglia and Marco Chino at the University of Naples Federico II, and released as a bioRxiv preprint in May 2026.

The work addresses a recurring limitation of sequence-only generative protein models: while base ProtGPT2 samples broadly across natural protein space, its unconditional outputs only rarely reproduce the precise, geometrically constrained residue arrangements that coordinate metal ions. The authors' central idea is to steer generation using structure-derived synthetic data. Rather than fine-tuning on the limited and biased set of natural metalloprotein sequences, they use ProteinMPNN to generate large numbers of synthetic sequences consistent with experimentally determined metalloprotein backbones, then fine-tune ProtGPT2 on this curated synthetic corpus.

The result is three distributed checkpoints — alpha, beta, and gamma — that differ in the size and diversity of their fine-tuning sets (ranging from roughly 1,000 to 10,000 synthetic sequences). Each is a fixed model that generates metalloprotein-like sequences directly at inference, with no further training required by the user.

Key Features

Metalloprotein-specialized generation: All three variants are tuned to emit sequences enriched for canonical metal-binding motifs, increasing the yield of plausible metalloprotein candidates relative to unconditional ProtGPT2 sampling.
Structure-derived synthetic training data: Fine-tuning sequences are generated by ProteinMPNN from real metalloprotein backbones, transferring structural geometry into a sequence-only model without requiring native metalloprotein sequence collections.
Three ready-to-use checkpoints: The alpha, beta, and gamma variants span a range of training-set sizes and diversities (~1,000–10,000 sequences), letting users trade off specialization against sequence diversity.
No user re-training required: Each model is a fixed, downloadable checkpoint that produces metalloprotein-like sequences at inference time, lowering the barrier for non-specialist labs.
Large gains in motif recovery: Canonical metal-binding motif recovery rises from 43% with base ProtGPT2 to 91% with the fine-tuned models, the headline quantitative result of the study.
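
The source does not spell out how motif recovery is scored, so the following is only a rough sketch of one way such a statistic could be computed: scan each generated sequence with a few regular expressions for well-known metal-binding patterns and report the fraction with at least one hit. The pattern set (simplified forms of the C2H2 zinc finger, HExxH, and paired CxxC motifs), the function names, and the toy sequences are illustrative assumptions, not the authors' evaluation code.

```python
import re

# Simplified, illustrative patterns (assumptions, not the paper's motif set).
MOTIFS = {
    "C2H2_zinc_finger": re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H"),
    "HExxH_zinc_protease": re.compile(r"HE.{2}H"),
    "CxxC_pair": re.compile(r"C.{2}C.{5,40}C.{2}C"),
}


def find_motifs(sequence):
    """Return the names of all patterns that match anywhere in the sequence."""
    seq = sequence.upper()
    return [name for name, pattern in MOTIFS.items() if pattern.search(seq)]


def motif_recovery(sequences):
    """Fraction of sequences containing at least one motif (0.0 for empty input)."""
    if not sequences:
        return 0.0
    hits = sum(1 for s in sequences if find_motifs(s))
    return hits / len(sequences)


# Toy inputs, made up for illustration only.
toy = [
    "MKTAYHELGHALGLSHS",            # contains an HExxH stretch
    "YKCPECGKSFSQKSNLQKHQRTHTGEK",  # fits the simplified C2H2 pattern
    "MAAAAGGGSSSLLLKKK",            # no cysteine or histidine at all
]

for seq in toy:
    print(seq, find_motifs(seq))
print("recovery:", motif_recovery(toy))
```

A pattern match of this kind only shows that the right residues occur at plausible spacings; it says nothing about whether the folded protein positions them to coordinate a metal ion.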

Technical Details

sm_protgpt2 inherits the architecture of ProtGPT2: a 738-million-parameter, decoder-only (GPT-2 Large style) autoregressive transformer operating on a byte-pair-encoded amino-acid vocabulary. The contribution is the fine-tuning recipe rather than a new architecture. Starting from native metalloprotein structures, the authors apply ProteinMPNN inverse folding to sample synthetic sequences designed to fold to those backbones, building three training sets of increasing size and diversity. ProtGPT2 is then fine-tuned on each set to produce the alpha, beta, and gamma checkpoints. Generated sequences are evaluated for recovery of canonical metal-coordinating motifs, where the fine-tuned models reach 91% recovery versus 43% for the unmodified base model.

Code and the synthetic training data are archived on Zenodo (DOI 10.5281/zenodo.18672158), and the three weight checkpoints are distributed on HuggingFace under the Apache 2.0 license. At the time of writing the HuggingFace model cards are empty placeholders, and no standalone data card accompanies the Zenodo deposit.
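
As a hedged illustration of the data-preparation step between inverse folding and fine-tuning, the sketch below converts FASTA-formatted designs into a plain-text training file. It assumes the convention described on the base ProtGPT2 model card, where each FASTA header is replaced by the `<|endoftext|>` token and sequences stay line-wrapped. Whether the sm_protgpt2 authors prepared their corpus exactly this way is not stated in the source, and the function names and example records are hypothetical.

```python
END_TOKEN = "<|endoftext|>"  # record separator used by the base ProtGPT2 tokenizer


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_training_text(fasta_text, width=60):
    """Replace each FASTA header with END_TOKEN and wrap sequences at `width`."""
    records, seen = [], set()
    for _, seq in parse_fasta(fasta_text):
        seq = seq.upper()
        if not seq or seq in seen:  # drop empty and exact-duplicate sequences
            continue
        seen.add(seq)
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(END_TOKEN + "\n" + "\n".join(lines))
    if not records:
        return ""
    return "\n".join(records) + "\n"


# Made-up records standing in for inverse-folding output.
example = ">design_1\nMKT\nAYH\n>design_2\nMKTAYH\n>design_3\nGSHM\n"
print(to_training_text(example, width=4))
```

Exact-duplicate removal is included here because inverse folding at low sampling temperature can return the same sequence repeatedly; any further curation, such as identity clustering, is left out of the sketch.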

Applications

The models are aimed at researchers designing or engineering metalloproteins — including metalloenzymes for catalysis, electron-transfer proteins, and metal-sensing or metal-sequestering scaffolds — who want a fast, sequence-level source of candidates biased toward metal-binding competence. Because the checkpoints generate directly at inference, they fit naturally at the front of a design funnel: sample many sequences, then filter with structure prediction, docking, or experimental screening. The approach also serves as a template for specializing general protein language models toward other functional classes that are under-represented or geometrically constrained in natural sequence databases.
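
The front of such a funnel might look like the sketch below: sample from a checkpoint, clean the raw text, and discard obviously unusable sequences before any expensive structure prediction. This is a minimal sketch under stated assumptions. The sampling settings are those suggested on the base ProtGPT2 model card and have not been checked against the fine-tuned checkpoints. The length window and helper names are arbitrary choices for illustration. The repository identifier is left as a parameter because the source does not list the checkpoint IDs.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_output(generated_text):
    """Strip the separator token and all whitespace from one generated record."""
    text = generated_text.replace("<|endoftext|>", "")
    return "".join(text.split()).upper()


def prefilter(sequences, min_len=50, max_len=300):
    """Keep unique sequences of standard residues within a length window."""
    kept, seen = [], set()
    for seq in sequences:
        if seq in seen or not (min_len <= len(seq) <= max_len):
            continue
        if set(seq) <= STANDARD_AA:
            seen.add(seq)
            kept.append(seq)
    return kept


def sample_candidates(model_id, n=10, max_length=100):
    """Sample sequences from a checkpoint; needs `transformers` and a backend."""
    # Imported lazily so the helpers above work without the library installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator(
        "<|endoftext|>",
        max_length=max_length,   # counted in tokens, not residues
        do_sample=True,
        top_k=950,               # settings suggested for base ProtGPT2 (assumed to carry over)
        repetition_penalty=1.2,
        num_return_sequences=n,
        eos_token_id=0,
    )
    return [clean_output(o["generated_text"]) for o in outputs]


# Offline demonstration of the helpers on made-up raw outputs.
raw = ["<|endoftext|>\nMKTA\nYHEL\n", "<|endoftext|>\nMKTAYHEL", "MKXAYHEL", "MK"]
print(prefilter([clean_output(r) for r in raw], min_len=5, max_len=20))
```

Calling `sample_candidates` with the identifier of one of the three checkpoints, then passing the result through `prefilter`, would yield the candidate pool that downstream structure prediction or screening operates on.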

Impact

sm_protgpt2 illustrates a practical and generalizable strategy: using structure-based inverse folding to manufacture synthetic training data that redirects a pretrained sequence model toward a functional niche it otherwise samples poorly. The reported jump in metal-binding motif recovery (43% to 91%) is a concrete demonstration that targeted synthetic fine-tuning can substantially improve functional relevance without retraining a foundation model from scratch. As a 2026 preprint with openly licensed weights but minimal model documentation, its real-world adoption and experimental validation remain to be established; the headline metric is a sequence-level motif statistic rather than wet-lab confirmation of metal binding, so claims should be read as computational evidence pending further validation.

Citation

Structure-derived synthetic sequences guide a protein language model toward metalloproteins

Peteani, G., et al. (2026) Structure-derived synthetic sequences guide a protein language model toward metalloproteins. bioRxiv.

DOI: 10.64898/2026.04.30.722007

Citations

Total Citations: 0

Influential: 0

References: 0

HuggingFace

Downloads: 2

Likes: 0

Last Modified: 2y ago

Pipeline: text-generation

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

38 · Closed

Usability (can I run it?): 51

Reproducibility (can I retrain it?): 30

Model Openness Framework: Unclassified (missing required components)
