ReprogBERT is a computational antibody design system developed at IBM that uses model reprogramming to adapt a pretrained English BERT model for antibody sequence infilling — without modifying any of the base model's parameters. Rather than training a dedicated protein language model from scratch, the approach introduces a thin learnable layer that translates between the linguistic representations encoded in English BERT and the amino acid space of antibody sequences. This cross-domain transfer strategy allows effective generation of complementarity-determining region (CDR) sequences with comparatively little domain-specific training data.
The core problem ReprogBERT addresses is CDR sequence design. CDRs are the hypervariable loops that determine antibody binding specificity, and generating diverse yet structurally valid CDR variants is a central challenge in therapeutic antibody engineering. Most generative approaches either require large volumes of labeled protein data or depend on computationally expensive structure-based methods. ReprogBERT sidesteps both constraints by repurposing the rich contextual representations already present in a general-purpose language model, bridging language and protein domains through learned projection matrices rather than retraining.
The method was published at the International Conference on Machine Learning (ICML) 2023, where it demonstrated that model reprogramming — a technique more often applied in audio and image domains — could be effective in the protein sequence context.
ReprogBERT is built on a pretrained English BERT encoder whose weights remain fixed throughout training. Two learnable projection matrices are introduced at the embedding and output layers: the theta matrix projects amino acid embeddings into BERT's English token embedding space, while the gamma matrix projects BERT's output hidden states back into the amino acid probability space. The 20 standard amino acid embeddings are also learned from scratch. This architecture means the vast majority of parameters in the system are frozen, with training focused entirely on the three small learnable components.
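The frozen-encoder-plus-learnable-interface arrangement can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: a small `nn.TransformerEncoder` stands in for the frozen English BERT, both embedding spaces share a single toy dimensionality (in ReprogBERT itself, theta maps into BERT's much larger English token-embedding space), and all class names and sizes below are hypothetical.

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative only; BERT-base uses hidden size 768
# and a ~30K WordPiece vocabulary).
NUM_AA = 20        # standard amino acids
MASK_ID = NUM_AA   # extra token id for masked CDR positions
D_MODEL = 64       # hidden size of the frozen encoder

class ReprogrammedEncoder(nn.Module):
    """Minimal model-reprogramming sketch: the pretrained encoder stays
    frozen; only the amino acid embeddings and the two projection
    matrices (theta in, gamma out) receive gradient updates."""

    def __init__(self, frozen_encoder: nn.Module):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():   # freeze the pretrained encoder
            p.requires_grad = False
        # The three learnable components described above.
        self.aa_embed = nn.Embedding(NUM_AA + 1, D_MODEL)     # +1 for [MASK]
        self.theta = nn.Linear(D_MODEL, D_MODEL, bias=False)  # into encoder input space
        self.gamma = nn.Linear(D_MODEL, NUM_AA, bias=False)   # back to aa logits

    def forward(self, aa_ids: torch.Tensor) -> torch.Tensor:
        x = self.theta(self.aa_embed(aa_ids))  # amino acids -> frozen model's space
        h = self.encoder(x)                    # frozen contextual encoding
        return self.gamma(h)                   # per-position amino acid logits

# Stand-in for the frozen English BERT encoder.
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
frozen = nn.TransformerEncoder(layer, num_layers=2)
model = ReprogrammedEncoder(frozen)

seqs = torch.randint(0, NUM_AA + 1, (2, 32))   # batch of amino acid / mask ids
logits = model(seqs)                           # shape (2, 32, NUM_AA)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

Comparing `trainable` to `total` makes the key property concrete: the encoder dominates the parameter count, yet contributes nothing to the gradient computation, so training cost scales with the small cross-domain interface.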
The model is trained on antibody sequences from the Structural Antibody Database (SAbDab) using a masked infilling objective in which CDR residues are masked and must be predicted from the surrounding heavy- and light-chain context. Evaluation compares ReprogBERT against ProtBERT (a protein-pretrained BERT model fine-tuned on the same task) and a variant of English BERT in which the word embeddings are replaced directly with amino acid embeddings. ReprogBERT is assessed on amino acid recovery rate (AAR), sequence diversity (DIV), and sequence naturalness as measured by ProGen2 perplexity, as well as structural consistency evaluated in silico with structure prediction tools.
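Both the masked-infilling objective and the AAR metric are straightforward to express. The sketch below uses hypothetical helper names (not from the paper): the loss is cross-entropy restricted to masked CDR positions, and AAR is the fraction of those positions where the argmax prediction recovers the true residue.

```python
import torch
import torch.nn.functional as F

def cdr_infilling_loss(logits, targets, cdr_mask):
    """Cross-entropy over masked CDR positions only.

    logits:   (B, L, 20) per-position amino acid logits
    targets:  (B, L)     ground-truth amino acid ids
    cdr_mask: (B, L)     boolean, True at masked CDR residues
    """
    return F.cross_entropy(logits[cdr_mask], targets[cdr_mask])

def amino_acid_recovery(logits, targets, cdr_mask):
    """AAR: fraction of masked CDR positions whose argmax prediction
    matches the ground-truth residue."""
    pred = logits.argmax(dim=-1)
    return (pred[cdr_mask] == targets[cdr_mask]).float().mean().item()

# Toy sanity check with an oracle that always scores the true residue highest.
B, L = 2, 16
targets = torch.randint(0, 20, (B, L))
cdr_mask = torch.zeros(B, L, dtype=torch.bool)
cdr_mask[:, 5:9] = True                         # pretend residues 5-8 form a CDR loop
logits = F.one_hot(targets, 20).float() * 10.0  # one-hot oracle logits
loss = cdr_infilling_loss(logits, targets, cdr_mask)
aar = amino_acid_recovery(logits, targets, cdr_mask)   # 1.0 by construction
```

Restricting the loss to `cdr_mask` mirrors the infilling setup: framework-region residues serve only as conditioning context, and the model is never penalized on positions it was given.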
ReprogBERT is most directly applicable to therapeutic antibody design, where the ability to generate structurally diverse CDR variants from limited training data is practically valuable. Research teams can use the model for in silico CDR diversification and optimization, rapidly generating large panels of candidate sequences for downstream experimental screening. The sequence-only design and low computational overhead make ReprogBERT well suited for high-throughput campaigns against novel therapeutic targets, including emerging infectious disease antigens where annotated sequence data may be sparse. The model has also been applied to virus neutralization studies, where CDR sequences with enhanced neutralization capability can be generated and prioritized before wet-lab validation.
ReprogBERT demonstrates that model reprogramming — an idea primarily developed in computer vision and speech — is a viable strategy for protein sequence generation, broadening the toolkit available to computational biologists who work with limited domain-specific data. By keeping the base model frozen and learning only a small cross-domain interface, the approach substantially reduces the data and compute requirements compared to training a protein language model from scratch. The method's ICML 2023 publication helped establish model reprogramming as a direction worth exploring in the protein ML community. A notable limitation is that ReprogBERT is a sequence-only model and does not explicitly optimize for structural or biophysical properties such as thermostability or expression yield; generated sequences still require experimental validation to confirm binding affinity and functional activity.
Melnyk, I., Chenthamarakshan, V., Chen, P. Y., Das, P., Dhurandhar, A., Padhi, I., & Das, D. (2023). Reprogramming Pretrained Language Models for Antibody Sequence Infilling. In International Conference on Machine Learning (ICML 2023).
DOI: 10.48550/arXiv.2210.07144