bio.rodeo
Protein

ReprogBERT

IBM

Reprograms a frozen English BERT model for antibody CDR sequence infilling via learnable cross-domain projection matrices, without training a new protein language model.

Released: 2023

Overview

ReprogBERT is a computational antibody design system developed at IBM that uses model reprogramming to adapt a pretrained English BERT model for antibody sequence infilling — without modifying the base model's parameters at all. Rather than training a dedicated protein language model from scratch, the approach introduces a thin learnable layer that translates between the linguistic representations encoded in English BERT and the amino acid space of antibody sequences. This cross-domain transfer strategy allows effective generation of complementarity-determining region (CDR) sequences with comparatively little domain-specific training data.

The core problem ReprogBERT addresses is CDR sequence design. CDRs are the hypervariable loops that determine antibody binding specificity, and generating diverse yet structurally valid CDR variants is a central challenge in therapeutic antibody engineering. Most generative approaches either require large volumes of labeled protein data or depend on computationally expensive structure-based methods. ReprogBERT sidesteps both constraints by repurposing the rich contextual representations already present in a general-purpose language model, bridging language and protein domains through learned projection matrices rather than retraining.

The method was published at the International Conference on Machine Learning (ICML) 2023, where it demonstrated that model reprogramming — a technique more often applied in audio and image domains — could be effective in the protein sequence context.

Key Features

  • Frozen backbone: The English BERT model is kept entirely frozen during training; only two small projection matrices and a set of amino acid embeddings are learned, drastically reducing the number of trainable parameters.
  • Cross-domain projection: Theta and gamma projection matrices map amino acid embeddings into BERT's English embedding space for the forward pass, then map BERT's output representations back into the protein domain, enabling the base model to reason over protein sequences without being retrained.
  • Sequence diversity: ReprogBERT generates CDR sequences with more than a two-fold increase in diversity compared to baseline models while maintaining structural validity and low perplexity as measured by ProGen2.
  • Structure-free generation: The model operates on sequences alone, requiring no protein structure information during inference, which makes it fast and broadly applicable even for targets lacking experimental structures.
  • Low-resource adaptability: Cross-language knowledge transfer allows the model to perform well on antibody infilling tasks even when domain-specific training data is scarce.

Technical Details

ReprogBERT is built on a pretrained English BERT encoder whose weights remain fixed throughout training. Two learnable projection matrices are introduced at the embedding and output layers: the theta matrix projects amino acid embeddings into BERT's English token embedding space, while the gamma matrix projects BERT's output hidden states back into the amino acid probability space. The 20 standard amino acid embeddings are also learned from scratch. This architecture means the vast majority of parameters in the system are frozen, with training focused entirely on the three small learnable components.
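To make the wiring concrete, the following is a minimal PyTorch sketch of the reprogramming setup described above, not the authors' implementation: the class name, the choice of bert-base-uncased, the amino acid embedding size, and the extra mask token index are assumptions made for illustration.

import torch.nn as nn
from transformers import BertModel  # Hugging Face pretrained English BERT

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues; index 20 reserved for a mask token
AA_VOCAB = len(AMINO_ACIDS) + 1

class ReprogrammedBERT(nn.Module):
    # Only the amino acid embeddings and the theta/gamma projections receive gradients;
    # the English BERT backbone stays frozen throughout training.
    def __init__(self, bert_name="bert-base-uncased", aa_dim=64):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        for p in self.bert.parameters():
            p.requires_grad = False                            # freeze the entire backbone

        hidden = self.bert.config.hidden_size                  # 768 for bert-base
        self.aa_embed = nn.Embedding(AA_VOCAB, aa_dim)         # learned amino acid embeddings
        self.theta = nn.Linear(aa_dim, hidden, bias=False)     # protein space -> English embedding space
        self.gamma = nn.Linear(hidden, AA_VOCAB, bias=False)   # BERT hidden states -> amino acid logits

    def forward(self, aa_ids, attention_mask=None):
        x = self.theta(self.aa_embed(aa_ids))                  # project residues into BERT's input space
        h = self.bert(inputs_embeds=x, attention_mask=attention_mask).last_hidden_state
        return self.gamma(h)                                   # per-position amino acid logits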

The model is trained on antibody sequences from the Structural Antibody Database (SAbDab) using a masked infilling objective in which CDR residues are masked and must be predicted from surrounding heavy- and light-chain context. Evaluation compares ReprogBERT against ProtBERT (a protein-pretrained BERT model fine-tuned on the same task) and a variant of English BERT where word embeddings are replaced with amino acid embeddings directly. ReprogBERT is assessed on amino acid recovery rate (AAR), sequence diversity (DIV), and sequence naturalness as measured by ProGen2 perplexity, as well as computational structural consistency via structure prediction tools.
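As a rough illustration of the training signal and the AAR metric, the sketch below masks CDR positions, scores the reconstruction loss only on those positions, and computes recovery as the fraction of masked residues predicted correctly. The helper names and the MASK_ID constant are hypothetical and follow the same assumptions as the architecture sketch above.

import torch.nn.functional as F

MASK_ID = 20  # hypothetical index of the mask token (after the 20 standard residues)

def infilling_loss(model, aa_ids, cdr_mask):
    # Masked CDR infilling: CDR positions are replaced by the mask token and the
    # model must reconstruct them from the surrounding heavy-/light-chain context.
    inputs = aa_ids.clone()
    inputs[cdr_mask] = MASK_ID
    logits = model(inputs)                                       # (batch, length, vocab)
    return F.cross_entropy(logits[cdr_mask], aa_ids[cdr_mask])   # loss only on masked positions

def amino_acid_recovery(logits, aa_ids, cdr_mask):
    # AAR: fraction of masked CDR residues whose argmax prediction matches the native residue.
    preds = logits.argmax(dim=-1)
    return (preds[cdr_mask] == aa_ids[cdr_mask]).float().mean().item()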

Applications

ReprogBERT is most directly applicable to therapeutic antibody design, where the ability to generate structurally diverse CDR variants from limited training data is practically valuable. Research teams can use the model for in silico CDR diversification and optimization, rapidly generating large panels of candidate sequences for downstream experimental screening. The sequence-only design and low computational overhead make ReprogBERT well suited for high-throughput campaigns against novel therapeutic targets, including emerging infectious disease antigens where annotated sequence data may be sparse. The model has also been applied to virus neutralization studies, where CDR sequences with enhanced neutralization capability can be generated and prioritized before wet-lab validation.
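For illustration, a candidate panel could be drawn from the model's per-position distributions along the lines of the hypothetical helper below; the sampling temperature, panel size, and any downstream ranking or filtering are design choices outside the scope of this sketch.

import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_ID = 20  # hypothetical mask-token index, as in the sketches above

@torch.no_grad()
def sample_cdr_variants(model, aa_ids, cdr_mask, n_samples=100, temperature=1.0):
    # Draw a panel of candidate CDR sequences by sampling the masked positions
    # from the model's per-position amino acid distributions.
    inputs = aa_ids.clone()
    inputs[cdr_mask] = MASK_ID                               # mask out the CDR to be redesigned
    logits = model(inputs.unsqueeze(0))[0]                   # (length, vocab)
    probs = torch.softmax(logits[cdr_mask] / temperature, dim=-1)
    probs[:, MASK_ID] = 0.0                                  # never emit the mask token itself

    panel = []
    for _ in range(n_samples):
        draw = torch.multinomial(probs, num_samples=1).squeeze(-1)
        panel.append("".join(AMINO_ACIDS[i] for i in draw.tolist()))
    return panel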

Impact

ReprogBERT demonstrates that model reprogramming — an idea primarily developed in computer vision and speech — is a viable strategy for protein sequence generation, broadening the toolkit available to computational biologists who work with limited domain-specific data. By keeping the base model frozen and learning only a small cross-domain interface, the approach substantially reduces the data and compute requirements compared to training a protein language model from scratch. The method's ICML 2023 publication helped establish model reprogramming as a direction worth exploring in the protein ML community. A notable limitation is that ReprogBERT is a sequence-only model and does not explicitly optimize for structural or biophysical properties such as thermostability or expression yield; generated sequences still require experimental validation to confirm binding affinity and functional activity.

Citation

Reprogramming Pretrained Language Models for Antibody Sequence Infilling

Preprint

Melnyk, I., Chenthamarakshan, V., Chen, P. Y., Das, P., Dhurandhar, A., Padhi, I., & Das, D. (2023). Reprogramming Pretrained Language Models for Antibody Sequence Infilling. In Proceedings of the International Conference on Machine Learning (ICML).

DOI: 10.48550/arXiv.2210.07144

Metrics

GitHub

Stars: 24
Forks: 7
Open Issues: 5
Contributors: 1
Last Push: 7 mo ago
Language: Python
License: Apache-2.0

Citations

Total Citations: 11
Influential: 0
References: 55

Tags

foundation model, language model, antibody

Resources

GitHub Repository
Research Paper