Harvard Medical School / Kempner Institute
Multimodal foundation model integrating protein sequence, structure, and natural language to model and generate protein phenotypes across scales.
ProCyon is a multimodal protein foundation model that unifies protein sequence, protein structure, and natural language in a single generative system, enabling researchers to ask open-ended questions about protein function and to retrieve proteins using free-form text. Where most protein language models produce fixed embeddings or predictions over a closed label set, ProCyon treats protein characterization as an instruction-following problem: a user can interleave one or more protein inputs inside a textual prompt and receive a natural-language answer, a generated phenotype description, or a ranked set of candidate proteins.
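The interleaving of protein inputs with text can be pictured as a prompt containing slots that are filled by proteins in order. The following is a minimal sketch of that idea only: the `<|protein|>` marker, the `interleave` function, and the toy sequence fragments are hypothetical illustrations, not ProCyon's actual prompt format or API.

```python
PROTEIN_SLOT = "<|protein|>"  # hypothetical placeholder, not ProCyon's real token


def interleave(prompt: str, proteins: list[str]) -> list[tuple[str, str]]:
    """Split a prompt on protein slots and pair each slot with a sequence.

    Returns an ordered list of ("text", ...) and ("protein", ...) segments.
    """
    parts = prompt.split(PROTEIN_SLOT)
    n_slots = len(parts) - 1
    if n_slots != len(proteins):
        raise ValueError(
            f"prompt has {n_slots} slots but {len(proteins)} proteins were given"
        )
    segments = []
    for i, text in enumerate(parts):
        if text:  # skip empty text between adjacent slots
            segments.append(("text", text))
        if i < n_slots:  # every part except the last is followed by a slot
            segments.append(("protein", proteins[i]))
    return segments


segments = interleave(
    "Describe the function of <|protein|> and how it interacts with <|protein|>.",
    ["MKTAYIAKQR", "MSDNGPQNQR"],  # toy sequence fragments for illustration
)
```

In a real multimodal model the text segments would be tokenized as usual while each protein segment would be handled by a protein encoder; the sketch stops at the segmentation step.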
The model was developed by the Zitnik Lab (Marinka Zitnik's group in the Department of Biomedical Informatics) at Harvard Medical School together with the Kempner Institute, and was released as a bioRxiv preprint in December 2024. It targets a long-standing bottleneck in functional genomics: the vast majority of proteins across organisms remain poorly annotated, and existing tools either require predefined ontologies or cannot reason jointly over molecular sequence and the descriptive text in which biological knowledge is recorded.
ProCyon's central contribution is breadth of phenotype coverage combined with zero-shot generalization. It spans five knowledge domains—molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions—and is designed to operate on proteins and phenotypes that were never seen during training.
ProCyon is an approximately 11-billion-parameter model that couples a large language model backbone with protein representation learning. The flagship ProCyon-Full and the binding-specialized ProCyon-Bind use a LLaMA-3-8B backbone, while ProCyon-Split—the benchmarking variant—uses LLaMA-2-7B; protein inputs are encoded with ESM2 and projected into the language model's token space. Training uses ProCyon-Instruct, a dataset of roughly 33.9 million protein-phenotype instructions assembled from 677,154 protein-phenotype pairs, covering 48,920 unique phenotypes across 56,753 proteins, domains, and peptides spanning the five knowledge domains. On a benchmark of fourteen biologically relevant tasks, ProCyon outperformed both single-modality and multimodal baselines, including ESM3, and generalized to thousands of held-out phenotypes. The model weights are distributed on Hugging Face but depend on gated access to the LLaMA backbone, making deployment conditionally open.
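Projecting a protein encoding into the language model's token space can be illustrated with a single linear map whose output replaces a placeholder position in the prompt's embedding sequence. This is a sketch under stated assumptions: the source says only that ESM2 encodings are projected into token space, so the single linear layer, the one-vector-per-protein layout, the toy dimensions, and all function names here are hypothetical.

```python
import random

random.seed(0)

PROTEIN_DIM = 6  # toy stand-in for the protein encoder's embedding width
TOKEN_DIM = 4    # toy stand-in for the language model's hidden size


def make_projection(in_dim, out_dim):
    """Random (out_dim x in_dim) weights and zero bias; a trained model would learn these."""
    weights = [[random.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(embedding, weights, bias):
    """Apply y = W x + b to map a protein embedding into token space."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]


def splice(token_embeddings, slot_index, protein_token):
    """Return a copy of the sequence with the placeholder position replaced."""
    out = list(token_embeddings)
    out[slot_index] = protein_token
    return out


weights, bias = make_projection(PROTEIN_DIM, TOKEN_DIM)
protein_embedding = [random.random() for _ in range(PROTEIN_DIM)]  # pretend encoder output
prompt_embeddings = [[0.0] * TOKEN_DIM for _ in range(5)]          # pretend text token embeddings

protein_token = project(protein_embedding, weights, bias)
mixed = splice(prompt_embeddings, slot_index=2, protein_token=protein_token)
```

After the splice, the language model would see a sequence of vectors of uniform width in which one position carries protein information rather than a text token.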
ProCyon supports functional annotation of understudied proteins, hypothesis generation in disease biology, and target discovery in drug development. Concrete use cases demonstrated by the authors include generating phenotype descriptions for poorly characterized human proteins, retrieving candidate proteins that match a drug's mechanism of action, identifying drug-binding protein domains, and predicting the functional consequences of mutations. Computational biologists, proteomics researchers, and pharmaceutical teams can query the model in natural language rather than mapping problems onto a fixed ontology.
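Retrieval tasks such as finding proteins that match a drug's mechanism of action reduce, in general terms, to scoring every candidate against a query and sorting by score. The sketch below uses cosine similarity over toy vectors as a generic stand-in; the source does not describe ProCyon's actual scoring function, and the protein names, vectors, and `rank_proteins` helper are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_proteins(query_vec, protein_vecs, top_k=3):
    """Score each candidate against the query and return the top_k, best first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in protein_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; in practice these would come from the model's encoders.
query = [1.0, 0.0, 1.0]  # pretend embedding of a free-text mechanism description
candidates = {
    "protein_A": [0.9, 0.1, 0.8],
    "protein_B": [0.0, 1.0, 0.0],
    "protein_C": [0.5, 0.5, 0.5],
}
ranking = rank_proteins(query, candidates, top_k=2)
```

Here `ranking` holds the two best-scoring candidates as (name, score) pairs, which is the shape of output a researcher would inspect when shortlisting targets.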
ProCyon extends the protein foundation model paradigm beyond embeddings and structure prediction toward open-ended, language-grounded reasoning about protein phenotypes, positioning natural language as a unifying interface across heterogeneous biological knowledge. By releasing the model, the 33.9-million-instruction ProCyon-Instruct dataset, and evaluation code under an MIT license, the authors provide a reusable substrate for protein-and-text multimodal research. Its main practical limitation is the dependence on gated LLaMA weights, which adds an access step relative to fully unrestricted releases. As a preprint, its benchmark claims also await peer review and broader independent validation.
Queen, O., et al. (2025) ProCyon: A multimodal foundation model for protein phenotypes. bioRxiv.
DOI: 10.1101/2024.12.10.627665