University of Naples Federico II / University of Bern
Three fixed ProtGPT2 fine-tunes specialized for metalloprotein generation, trained on ProteinMPNN-derived synthetic sequences.
sm_protgpt2 is a family of three fine-tuned protein language models that adapt the general-purpose generative model ProtGPT2 toward a specific functional class: metalloproteins, the roughly one-third of all proteins that depend on bound metal ions for catalysis, electron transfer, or structural stability. It was developed by Giulia Peteani and Thomas Lemmin at the University of Bern together with Gianmattia Sgueglia and Marco Chino at the University of Naples Federico II, and released as a bioRxiv preprint in May 2026.
The work addresses a recurring limitation of sequence-only generative protein models: while base ProtGPT2 samples broadly across natural protein space, its unconditional outputs only rarely reproduce the precise, geometrically constrained residue arrangements that coordinate metal ions. The authors' central idea is to steer generation using structure-derived synthetic data. Rather than fine-tuning on the limited and biased set of natural metalloprotein sequences, they use ProteinMPNN to generate large numbers of synthetic sequences consistent with experimentally determined metalloprotein backbones, then fine-tune ProtGPT2 on this curated synthetic corpus.
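The corpus-building step can be pictured with a small sketch. It assumes the ProteinMPNN samples are already available as plain strings, and it writes them in the layout that the base ProtGPT2 model card describes for fine-tuning: each sequence is preceded by `<|endoftext|>` and wrapped at 60 residues per line. Whether the authors used exactly this layout is not stated here, and the function names `deduplicate` and `to_training_text` are hypothetical.

```python
from typing import Iterable, List

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
EOS_TOKEN = "<|endoftext|>"  # separator token of GPT-2-style tokenizers


def deduplicate(sequences: Iterable[str]) -> List[str]:
    """Upper-case, drop sequences with non-canonical residues, remove exact duplicates."""
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()
        if not seq or set(seq) - VALID_RESIDUES:
            continue  # empty or contains a character outside the 20 canonical residues
        if seq not in seen:
            seen.add(seq)
            kept.append(seq)  # first-seen order is preserved
    return kept


def to_training_text(sequences: Iterable[str], width: int = 60) -> str:
    """Format sequences as separator-delimited records wrapped at `width` residues."""
    records = []
    for seq in deduplicate(sequences):
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(EOS_TOKEN + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"
```

The resulting text could be written to a file and passed to a standard causal-language-model fine-tuning script. Exact-duplicate removal is only the simplest form of curation; the curation the authors actually applied may differ.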
The result is three distributed checkpoints (alpha, beta, and gamma) that differ in the size and diversity of their fine-tuning sets, which range from roughly 1,000 to 10,000 synthetic sequences. Each is a fixed model that generates metalloprotein-like sequences directly at inference, with no further training required by the user.
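As a rough illustration of inference-time use, the sketch below samples sequences with the Hugging Face `transformers` text-generation pipeline. The sampling settings are the ones suggested for base ProtGPT2 and are assumed, not confirmed, to suit the fine-tuned checkpoints. The default `model_id` is the base model `nferruz/ProtGPT2`, because the repository names of the alpha, beta, and gamma checkpoints are not given here. The helper names are hypothetical.

```python
from typing import List

EOS_TOKEN = "<|endoftext|>"


def clean_generated(text: str) -> str:
    """Strip the separator token and line breaks from one generated record."""
    return text.replace(EOS_TOKEN, "").replace("\n", "").strip()


def sample_sequences(model_id: str = "nferruz/ProtGPT2",
                     n: int = 10,
                     max_length: int = 100) -> List[str]:
    """Sample n sequences; max_length counts BPE tokens, not residues."""
    # Imported lazily so clean_generated works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_id)
    outputs = generator(
        EOS_TOKEN,                 # start each sample at a sequence boundary
        max_length=max_length,
        do_sample=True,
        top_k=950,                 # settings suggested for base ProtGPT2 (assumed here)
        repetition_penalty=1.2,
        num_return_sequences=n,
        eos_token_id=0,
    )
    return [clean_generated(o["generated_text"]) for o in outputs]
```

Calling `sample_sequences(model_id=...)` with the identifier of one of the fine-tuned checkpoints would, under these assumptions, return cleaned single-line amino-acid strings ready for downstream filtering.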
sm_protgpt2 inherits the architecture of ProtGPT2: a 738-million-parameter, decoder-only autoregressive transformer in the style of GPT-2 Large, operating on a byte-pair-encoded amino-acid vocabulary. The contribution is the fine-tuning recipe rather than a new architecture. Starting from native metalloprotein structures, the authors apply ProteinMPNN inverse folding to sample synthetic sequences designed to fold to those backbones, building three training sets of increasing size and diversity. ProtGPT2 is then fine-tuned on each set to produce the alpha, beta, and gamma checkpoints. Generated sequences are evaluated for recovery of canonical metal-coordinating motifs; the authors report 91% recovery for the fine-tuned models versus 43% for the unmodified base model. Code and the synthetic training data are archived on Zenodo (DOI: 10.5281/zenodo.18672158), and the three weight checkpoints are distributed on Hugging Face under the Apache 2.0 license. At the time of writing, the Hugging Face model cards are empty placeholders, and no standalone data card accompanies the Zenodo deposit.
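To make the motif-recovery metric concrete, here is a minimal sketch that scores a batch of sequences by the fraction containing at least one metal-binding motif. This is one plausible reading of "recovery"; the exact definition and the motif set the authors used are not given here. The two regular expressions are simplified illustrations (an HExxH zinc-site pattern and a paired CxxC pattern), not the paper's motif list, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Illustrative patterns only; "." matches any single residue.
ILLUSTRATIVE_MOTIFS: Dict[str, str] = {
    "HExxH": r"HE..H",                # zinc-binding motif seen in many metalloproteases
    "CxxC_pair": r"C..C.{2,40}C..C",  # two CxxC units, as in some Cys4 metal sites
}


def find_motifs(seq: str, motifs: Dict[str, str] = ILLUSTRATIVE_MOTIFS) -> List[str]:
    """Return the names of all motifs that occur at least once in seq."""
    return [name for name, pattern in motifs.items() if re.search(pattern, seq)]


def motif_recovery(sequences: Iterable[str],
                   motifs: Dict[str, str] = ILLUSTRATIVE_MOTIFS) -> float:
    """Fraction of sequences containing at least one motif (0.0 for an empty batch)."""
    seqs = list(sequences)
    if not seqs:
        return 0.0
    hits = sum(1 for seq in seqs if find_motifs(seq, motifs))
    return hits / len(seqs)
```

Running `motif_recovery` on samples from the base model and from a fine-tuned checkpoint would give a like-for-like comparison under this simplified definition. A sequence-pattern match of this kind says nothing about whether the residues are positioned to coordinate a metal in three dimensions.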
The models are aimed at researchers designing or engineering metalloproteins (including metalloenzymes for catalysis, electron-transfer proteins, and metal-sensing or metal-sequestering scaffolds) who want a fast, sequence-level source of candidates biased toward metal-binding competence. Because the checkpoints generate directly at inference, they fit naturally at the front of a design funnel: sample many sequences, then filter with structure prediction, docking, or experimental screening. The approach also serves as a template for specializing general protein language models toward other functional classes that are under-represented or geometrically constrained in natural sequence databases.
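The first stage of such a funnel is often a cheap sequence-level pre-filter that runs before any structure prediction. The sketch below is one hypothetical version: it keeps unique sequences within a length window, rejects non-canonical residues and low-complexity outputs, and requires a minimum number of residues that commonly act as metal ligands (His, Cys, Asp, Glu). All thresholds and names are illustrative assumptions and are not taken from the preprint.

```python
from collections import Counter
from typing import Iterable, List

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")
LIGAND_RESIDUES = "HCDE"  # side chains that commonly coordinate metal ions


def passes_prefilter(seq: str,
                     min_len: int = 50,
                     max_len: int = 500,
                     min_ligands: int = 3,
                     max_single_fraction: float = 0.3) -> bool:
    """Cheap sanity checks applied before expensive structure-based filtering."""
    if not seq or not (min_len <= len(seq) <= max_len):
        return False
    if set(seq) - CANONICAL:
        return False  # contains a non-canonical character
    counts = Counter(seq)
    if max(counts.values()) / len(seq) > max_single_fraction:
        return False  # dominated by one residue: likely low complexity
    ligands = sum(counts[r] for r in LIGAND_RESIDUES)
    return ligands >= min_ligands


def prefilter(candidates: Iterable[str], **thresholds) -> List[str]:
    """Return unique candidates that pass the checks, in first-seen order."""
    seen = set()
    kept = []
    for seq in candidates:
        if seq not in seen and passes_prefilter(seq, **thresholds):
            seen.add(seq)
            kept.append(seq)
    return kept
```

Survivors of `prefilter` would then go to the costlier stages named above (structure prediction, docking, or experimental screening). Because the filter is deliberately permissive, it is better suited to removing obvious failures than to ranking candidates.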
sm_protgpt2 illustrates a practical and generalizable strategy: using structure-based inverse folding to manufacture synthetic training data that redirects a pretrained sequence model toward a functional niche it otherwise samples poorly. The reported jump in metal-binding motif recovery (43% to 91%) is a concrete demonstration that targeted synthetic fine-tuning can substantially improve functional relevance without retraining a foundation model from scratch. The work is a 2026 preprint with openly licensed weights but minimal model documentation, so its real-world adoption and experimental validation remain to be established. The headline metric is a sequence-level motif statistic rather than wet-lab confirmation of metal binding, and the claims should be read as computational evidence pending further validation.
Peteani, G., et al. (2026) Structure-derived synthetic sequences guide a protein language model toward metalloproteins. bioRxiv.
DOI: 10.64898/2026.04.30.722007