A 7B-parameter protein language model built on LLaMA-2 that performs both protein sequence generation and superfamily classification in a unified framework.
ProLLaMA is a 7-billion-parameter protein large language model developed by PKU-YuanGroup at Peking University that addresses a persistent gap in the protein language modeling landscape: existing models tend to specialize in either sequence understanding or sequence generation, but rarely both. By adapting the LLaMA-2 general-purpose language model to the protein domain through a two-stage continual training procedure, ProLLaMA achieves strong performance on both protein language understanding (PLU) and protein language generation (PLG) tasks within a single unified framework. The work was released as a preprint in February 2024 and subsequently published in IEEE Transactions on Artificial Intelligence in 2025.
The key methodological innovation is the Evolutionary Protein Generation Framework (EPGF), a test-time computation strategy that constrains the model's generative outputs to be biologically plausible. Standard autoregressive protein generation can produce statistically likely sequences that nonetheless violate physical and evolutionary constraints; EPGF addresses this by combining a multi-dimensional scorer, a hierarchical decoding strategy, and a probabilistic-biophysical joint selection mechanism that collectively guide sampling toward sequences with favorable structural and functional properties.
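The concrete scorer, decoding hierarchy, and selection rule are specified in the paper; as a rough illustration of the joint-selection idea only, the Python sketch below re-ranks candidate sequences by a weighted combination of length-normalized model log-probability and a toy single-dimension "biophysical" score (mean Kyte-Doolittle hydropathy). The function names, weighting, and scorer are hypothetical simplifications, not the EPGF implementation.

```python
# Illustrative sketch only: re-rank candidate sequences by a joint
# probabilistic-biophysical criterion. Not the actual EPGF code.

# Kyte-Doolittle hydropathy values, used here as a toy one-dimensional scorer.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def biophysical_score(seq: str) -> float:
    """Toy stand-in for EPGF's multi-dimensional scorer: penalize sequences
    whose mean hydropathy is far from a mildly hydrophilic average."""
    mean_h = sum(HYDROPATHY.get(a, 0.0) for a in seq) / max(len(seq), 1)
    return -abs(mean_h + 0.4)

def joint_select(candidates: list[tuple[str, float]], alpha: float = 0.5) -> str:
    """Sketch of probabilistic-biophysical joint selection: score each
    (sequence, total log-probability) pair by a weighted sum of
    length-normalized log-probability and the biophysical score."""
    def joint(pair: tuple[str, float]) -> float:
        seq, logprob = pair
        return alpha * (logprob / max(len(seq), 1)) + (1 - alpha) * biophysical_score(seq)
    return max(candidates, key=joint)[0]

# Two hypothetical candidates with made-up language-model log-probabilities.
print(joint_select([("MKTAYIAKQR", -12.3), ("MLLLLLLLLL", -9.8)]))
```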
ProLLaMA is trained on a large instruction dataset containing approximately 13 million samples spanning over 11,000 protein superfamily annotations drawn from established classification databases. This breadth of functional annotation enables the model to condition generation on specific superfamily designations — a capability useful for targeted protein design — while the same representations underpin its classification performance.
Generation can be conditioned on a target superfamily with instructions of the form "[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>", enabling directed generation of sequences with prescribed structural and functional characteristics.

ProLLaMA is initialized from LLaMA-2-7B, a 7-billion-parameter autoregressive transformer, and trained in two stages. In Stage 1, the model undergoes continual pretraining on the UniRef50 database of non-redundant protein sequences, adapting the general language model's tokenization and representations to the amino acid alphabet while retaining natural language capabilities. In Stage 2, instruction tuning on approximately 13 million multi-task instruction samples covering superfamily prediction and sequence generation instills task-specific behavior; training uses bfloat16 precision with CUDA 11.7. The model is served via the HuggingFace Transformers library and employs a generation configuration with temperature 0.2, top-k 40, and top-p 0.9 to balance diversity and coherence.
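A minimal serving sketch with HuggingFace Transformers, using the sampling parameters quoted above, might look as follows. The Hub repository id, maximum generation length, and device placement are assumptions for illustration; confirm the released checkpoint names (ProLLaMA and ProLLaMA_Stage_1) against the project page.

```python
# Minimal sketch, not the authors' reference script: load ProLLaMA from the
# HuggingFace Hub and sample with temperature 0.2, top-k 40, top-p 0.9.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GreatCaptainNemo/ProLLaMA"  # assumed Hub id; verify on the project page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Instruction format taken from the generation example above.
prompt = "[Generate by superfamily] Superfamily=<Ankyrin repeat-containing domain superfamily>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,   # assumed length budget
    do_sample=True,
    temperature=0.2,
    top_k=40,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```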
On benchmarks, ProLLaMA achieves a 67.1% exact match rate on superfamily classification tasks. The EPGF generation strategy yields a +4.3% improvement in biophysical quality scores and a +14.5% improvement in structural metrics relative to baseline autoregressive decoding, demonstrating that the post-hoc biological constraint mechanism meaningfully improves output quality beyond what the base language model produces on its own. Model weights for both the Stage 1 pretrained variant (ProLLaMA_Stage_1) and the full instruction-tuned model are released on the HuggingFace Hub under an Apache 2.0 license.
ProLLaMA is well-suited for researchers who need a single model to alternate between protein function annotation and sequence design. Computational biologists can use its classification interface to assign superfamily labels to novel sequences from metagenomic datasets or directed evolution experiments. Protein engineers can leverage the conditional generation interface to produce candidate sequences constrained to a target structural scaffold or functional family, then filter outputs using downstream structure prediction tools such as ESMFold or AlphaFold 2. The instruction-following format also makes ProLLaMA accessible to wet-lab scientists who are more comfortable with natural-language queries than with bespoke bioinformatics scripting.
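One way to wire the generation mode into such a design loop is to score candidate sequences with ESMFold and keep only high-confidence folds. The sketch below uses the ESMFold port in HuggingFace Transformers; the candidate sequences, the 0.7 pLDDT cutoff, and the assumption that the port returns pLDDT on a 0-1 scale are all illustrative choices to verify before use.

```python
# Sketch of the post-generation filtering step described above: score candidates
# with ESMFold's predicted confidence and keep those above a threshold.
import torch
from transformers import AutoTokenizer, EsmForProteinFolding

fold_tok = AutoTokenizer.from_pretrained("facebook/esmfold_v1")
folder = EsmForProteinFolding.from_pretrained("facebook/esmfold_v1")
folder.eval()

def mean_plddt(sequence: str) -> float:
    """Average pLDDT over residues and atoms (a coarse confidence summary)."""
    inputs = fold_tok([sequence], return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = folder(**inputs)
    return out.plddt.mean().item()

# Hypothetical candidates emitted by ProLLaMA's generation interface.
candidates = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
    "MLSRAVCGTSRQLAPALGYLGSRQ",
]

# 0.7 assumes a 0-1 pLDDT scale; rescale the cutoff if values come back in 0-100.
kept = [seq for seq in candidates if mean_plddt(seq) > 0.7]
print(kept)
```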
ProLLaMA demonstrates that general-purpose large language model architectures can be effectively adapted to the protein domain with relatively straightforward continual training, and that instruction tuning — a technique standard in NLP — transfers meaningfully to protein sequence tasks. Its unified PLU/PLG design anticipates the direction the field has taken toward more general protein foundation models capable of multi-task reasoning. A key limitation is that the model operates purely at the sequence level and does not model 3D structure directly; generated sequences must be validated by separate structure prediction tools to assess folding plausibility. The 7B parameter scale also places it at the lower end of modern LLM capacity, and future work may explore whether larger models or richer structural supervision further improve generation quality and functional accuracy.
Lv, L., Lin, Z., Li, H., Liu, Y., Cui, J., Chen, C. Y. C., Yuan, L., & Tian, Y. (2024). ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing. IEEE Transactions on Artificial Intelligence.
arXiv preprint DOI: 10.48550/arXiv.2402.16445