A 7B-parameter protein language model built on LLaMA-2 that performs both protein sequence generation and superfamily classification in a unified framework.
ProLLaMA is a 7-billion-parameter protein large language model developed by PKU-YuanGroup at Peking University that addresses a persistent gap in the protein language modeling landscape: existing models tend to specialize in either sequence understanding or sequence generation, but rarely both. By adapting the LLaMA-2 general-purpose language model to the protein domain through a two-stage continual training procedure, ProLLaMA achieves strong performance on both protein language understanding (PLU) and protein language generation (PLG) tasks within a single unified framework. The work was released as a preprint in February 2024 and subsequently published in IEEE Transactions on Artificial Intelligence in 2025.
The key methodological innovation is the Evolutionary Protein Generation Framework (EPGF), a test-time computation strategy that constrains the model's generative outputs to be biologically plausible. Standard autoregressive protein generation can produce statistically likely sequences that nonetheless violate physical and evolutionary constraints; EPGF addresses this by combining a multi-dimensional scorer, a hierarchical decoding strategy, and a probabilistic-biophysical joint selection mechanism that collectively guide sampling toward sequences with favorable structural and functional properties.
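The concrete scorer, decoding hierarchy, and selection rule are specified in the paper; as a rough illustration of the joint-selection idea only, the Python sketch below re-ranks candidate sequences by a weighted combination of length-normalized model log-probability and a toy single-dimension "biophysical" score (mean Kyte-Doolittle hydropathy). The function names, weighting, and scorer are hypothetical simplifications, not the EPGF implementation.

```python
# Illustrative sketch only: re-rank candidate sequences by a joint
# probabilistic-biophysical criterion. Not the actual EPGF code.

# Kyte-Doolittle hydropathy values, used here as a toy one-dimensional scorer.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def biophysical_score(seq: str) -> float:
    """Toy stand-in for EPGF's multi-dimensional scorer: penalize sequences
    whose mean hydropathy is far from a mildly hydrophilic average."""
    mean_h = sum(HYDROPATHY.get(a, 0.0) for a in seq) / max(len(seq), 1)
    return -abs(mean_h + 0.4)

def joint_select(candidates: list[tuple[str, float]], alpha: float = 0.5) -> str:
    """Sketch of probabilistic-biophysical joint selection: score each
    (sequence, total log-probability) pair by a weighted sum of
    length-normalized log-probability and the biophysical score."""
    def joint(pair: tuple[str, float]) -> float:
        seq, logprob = pair
        return alpha * (logprob / max(len(seq), 1)) + (1 - alpha) * biophysical_score(seq)
    return max(candidates, key=joint)[0]

# Two hypothetical candidates with made-up language-model log-probabilities.
print(joint_select([("MKTAYIAKQR", -12.3), ("MLLLLLLLLL", -9.8)]))
```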
ProLLaMA is trained on a large instruction dataset containing approximately 13 million samples spanning over 11,000 protein superfamily annotations drawn from established classification databases. This breadth of functional annotation enables the model to condition generation on specific superfamily designations — a capability useful for targeted protein design — while the same representations underpin its classification performance.
Generation can be conditioned on a target superfamily with instructions of the form "[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>", enabling directed generation of sequences with prescribed structural and functional characteristics.

ProLLaMA is initialized from LLaMA-2-7B, a 7-billion-parameter autoregressive transformer, and trained in two stages. In Stage 1, the model undergoes continual pretraining on the UniRef50 database of non-redundant protein sequences, adapting the general language model's tokenization and representations to the amino acid alphabet while retaining natural language capabilities. In Stage 2, instruction tuning on approximately 13 million multi-task instruction samples covering superfamily prediction and sequence generation instills task-specific behavior; training uses bfloat16 precision with CUDA 11.7. The model is served via the HuggingFace Transformers library and employs a generation configuration with temperature 0.2, top-k 40, and top-p 0.9 to balance diversity and coherence.
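A minimal serving sketch with HuggingFace Transformers, using the sampling parameters quoted above, might look as follows. The Hub repository id, maximum generation length, and device placement are assumptions for illustration; confirm the released checkpoint names (ProLLaMA and ProLLaMA_Stage_1) against the project page.

```python
# Minimal sketch, not the authors' reference script: load ProLLaMA from the
# HuggingFace Hub and sample with temperature 0.2, top-k 40, top-p 0.9.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GreatCaptainNemo/ProLLaMA"  # assumed Hub id; verify on the project page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Instruction format taken from the generation example above.
prompt = "[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,   # assumed length budget
    do_sample=True,
    temperature=0.2,
    top_k=40,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```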
On benchmarks, ProLLaMA achieves a 67.1% exact match rate on superfamily classification tasks. The EPGF generation strategy yields a +4.3% improvement in biophysical quality scores and a +14.5% improvement in structural metrics relative to baseline autoregressive decoding, demonstrating that the post-hoc biological constraint mechanism meaningfully improves output quality beyond what the base language model produces on its own. Model weights for both the Stage 1 pretrained variant (ProLLaMA_Stage_1) and the full instruction-tuned model are released on the HuggingFace Hub under an Apache 2.0 license.
ProLLaMA is well-suited for researchers who need a single model to alternate between protein function annotation and sequence design. Computational biologists can use its classification interface to assign superfamily labels to novel sequences from metagenomic datasets or directed evolution experiments. Protein engineers can leverage the conditional generation interface to produce candidate sequences constrained to a target structural scaffold or functional family, then filter outputs using downstream structure prediction tools such as ESMFold or AlphaFold 2. The instruction-following format also makes ProLLaMA accessible to wet-lab scientists who are more comfortable with natural-language queries than with bespoke bioinformatics scripting.
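One way to wire the generation mode into such a design loop is to score candidate sequences with ESMFold and keep only high-confidence folds. The sketch below uses the ESMFold port in HuggingFace Transformers; the candidate sequences, the 0.7 pLDDT cutoff, and the assumption that the port returns pLDDT on a 0-1 scale are all illustrative choices to verify before use.

```python
# Sketch of the post-generation filtering step described above: score candidates
# with ESMFold's predicted confidence and keep those above a threshold.
import torch
from transformers import AutoTokenizer, EsmForProteinFolding

fold_tok = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
folder = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")
folder.eval()

def mean_plddt(sequence: str) -> float:
    """Average pLDDT over residues and atoms (a coarse confidence summary)."""
    inputs = fold_tok([sequence], return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = folder(**inputs)
    return out.plddt.mean().item()

# Hypothetical candidates emitted by ProLLaMA's generation interface.
candidates = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
    "MLSRAVCGTSRQLAPALGYLGSRQ",
]

# 0.7 assumes a 0-1 pLDDT scale; rescale the cutoff if values come back in 0-100.
kept = [seq for seq in candidates if mean_plddt(seq) > 0.7]
print(kept)
```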
ProLLaMA demonstrates that general-purpose large language model architectures can be effectively adapted to the protein domain with relatively straightforward continual training, and that instruction tuning — a technique standard in NLP — transfers meaningfully to protein sequence tasks. Its unified PLU/PLG design anticipates the direction the field has taken toward more general protein foundation models capable of multi-task reasoning. A key limitation is that the model operates purely at the sequence level and does not model 3D structure directly; generated sequences must be validated by separate structure prediction tools to assess folding plausibility. The 7B parameter scale also places it at the lower end of modern LLM capacity, and future work may explore whether larger models or richer structural supervision further improve generation quality and functional accuracy.
Lv, L., Lin, Z., Li, H., Liu, Y., Cui, J., Chen, C. Y. C., Yuan, L., & Tian, Y. (2024). ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing. IEEE Transactions on Artificial Intelligence.
arXiv preprint DOI: 10.48550/arXiv.2402.16445