ProCyon

Harvard Medical School / Kempner Institute

Multimodal foundation model integrating protein sequence, structure, and natural language to model and generate protein phenotypes across scales.

Released: December 2024

Parameters: 11 Billion

ProCyon is a multimodal protein foundation model that unifies protein sequence, protein structure, and natural language in a single generative system, enabling researchers to ask open-ended questions about protein function and to retrieve proteins using free-form text. Where most protein language models produce fixed embeddings or predictions over a closed label set, ProCyon treats protein characterization as an instruction-following problem: a user can interleave one or more protein inputs inside a textual prompt and receive a natural-language answer, a generated phenotype description, or a ranked set of candidate proteins.

The model was developed by the Zitnik Lab (Marinka Zitnik's group, part of the Department of Biomedical Informatics) at Harvard Medical School together with the Kempner Institute, and released as a bioRxiv preprint in December 2024. It targets a long-standing bottleneck in functional genomics: the vast majority of proteins across organisms remain poorly annotated, and existing tools either require predefined ontologies or cannot reason jointly over molecular sequence and the descriptive text in which biological knowledge is recorded.

ProCyon's central contribution is breadth of phenotype coverage combined with zero-shot generalization. It spans five knowledge domains—molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions—and is designed to operate on proteins and phenotypes that were never seen during training.

Key Features

Multimodal instruction following: Accepts protein sequence and structure interleaved within natural-language prompts, so a single model handles diverse queries without task-specific heads or fixed output vocabularies.
Zero-shot phenotype generation: Produces free-form text descriptions ("captions") of protein function and maintains strong performance on 3,250 completely unseen phenotypes.
Text-based protein retrieval: Ranks and retrieves proteins from natural-language descriptions, including queries phrased as a drug's mechanism of action.
Cross-domain reasoning: Combines evidence across molecular function, disease, and interaction domains to characterize poorly annotated human proteins.
Open release with variants: Code, the training dataset, and intermediate checkpoints are released, with model variants tuned for different use cases.
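
The interleaving described under "Multimodal instruction following" can be pictured with a minimal sketch. It does not use ProCyon's actual API: the `<protein>` placeholder, the `Segment` type, and the `interleave` helper are hypothetical names chosen for illustration, and the sequences are made-up toy strings.

```python
from dataclasses import dataclass

# Hypothetical placeholder marking where a protein is spliced into the prompt.
PROTEIN_SLOT = "<protein>"


@dataclass
class Segment:
    kind: str     # "text" or "protein"
    content: str  # raw text, or an amino-acid sequence


def interleave(prompt: str, proteins: list[str]) -> list[Segment]:
    """Split a prompt on protein slots and pair each slot with a sequence."""
    pieces = prompt.split(PROTEIN_SLOT)
    if len(pieces) - 1 != len(proteins):
        raise ValueError("number of slots must match number of proteins")
    segments: list[Segment] = []
    for i, text in enumerate(pieces):
        if text:
            segments.append(Segment("text", text))
        if i < len(proteins):
            segments.append(Segment("protein", proteins[i]))
    return segments


# Toy sequences, not real proteins.
segments = interleave(
    "Describe the function of <protein> and how it interacts with <protein>.",
    ["MKTAYIAKQR", "MSDNELLQAV"],
)
for seg in segments:
    print(seg.kind, repr(seg.content))
```

In a real multimodal model, each "protein" segment would be passed to a protein encoder and each "text" segment to the tokenizer. The sketch only shows how one prompt can carry several protein inputs in order.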

Technical Details

ProCyon is an approximately 11-billion-parameter model that couples a large language model backbone with protein representation learning. The flagship ProCyon-Full and the binding-specialized ProCyon-Bind use a LLaMA-3-8B backbone, while ProCyon-Split—the benchmarking variant—uses LLaMA-2-7B; protein inputs are encoded with ESM2 and projected into the language model's token space. Training uses ProCyon-Instruct, a dataset of roughly 33.9 million protein-phenotype instructions assembled from 677,154 protein-phenotype pairs, covering 48,920 unique phenotypes across 56,753 proteins, domains, and peptides spanning the five knowledge domains. On a benchmark of fourteen biologically relevant tasks, ProCyon outperformed both single-modality and multimodal baselines, including ESM3, and generalized to thousands of held-out phenotypes. The model weights are distributed on Hugging Face but depend on gated access to the LLaMA backbone, making deployment conditionally open.
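
As a rough illustration of what "projected into the language model's token space" can mean, the sketch below maps a protein embedding through a single linear layer and places the result between text token embeddings. The single-layer projection, the tiny dimensions, and the random weights are all assumptions made for illustration. The source does not specify the projection architecture, and this is not ProCyon's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

PROTEIN_DIM = 8  # illustrative; stands in for the protein encoder width
LLM_DIM = 16     # illustrative; stands in for the language model hidden size

# Stand-in for a per-protein embedding produced by a protein encoder.
protein_embedding = rng.normal(size=(PROTEIN_DIM,))

# Projection weights (random here, learned in practice) mapping encoder
# space to the language model's token-embedding space.
W = rng.normal(scale=PROTEIN_DIM ** -0.5, size=(PROTEIN_DIM, LLM_DIM))
b = np.zeros(LLM_DIM)
protein_token = protein_embedding @ W + b

# Stand-ins for embedded text tokens before and after the protein slot.
text_before = rng.normal(size=(3, LLM_DIM))
text_after = rng.normal(size=(2, LLM_DIM))

# The projected protein occupies one position in the input sequence.
inputs = np.vstack([text_before, protein_token[None, :], text_after])
print(inputs.shape)  # (6, 16)
```

The point of the design is that the language model sees the protein as one more "token" with the same width as its text embeddings, so no task-specific head is needed downstream.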

Applications

ProCyon supports functional annotation of understudied proteins, hypothesis generation in disease biology, and target discovery in drug development. Concrete use cases demonstrated by the authors include generating phenotype descriptions for poorly characterized human proteins, retrieving candidate proteins that match a drug's mechanism of action, identifying drug-binding protein domains, and predicting the functional consequences of mutations. Computational biologists, proteomics researchers, and pharmaceutical teams can query the model in natural language rather than mapping problems onto a fixed ontology.
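
Text-based retrieval of the kind described above can be sketched as ranking candidate proteins by similarity to a query embedding. Cosine similarity is a common generic choice and is an assumption here, since the source does not state ProCyon's scoring function. The `rank_proteins` helper, the protein names, and the two-dimensional vectors are hypothetical toy values.

```python
import numpy as np


def rank_proteins(query_vec, protein_vecs, names, top_k=3):
    """Rank proteins by cosine similarity to a text-query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    p = protein_vecs / np.linalg.norm(protein_vecs, axis=1, keepdims=True)
    scores = p @ q                        # cosine similarity per protein
    order = np.argsort(-scores)[:top_k]   # highest score first
    return [(names[i], float(scores[i])) for i in order]


# Toy embeddings; in practice these would come from the model.
names = ["protein_A", "protein_B", "protein_C"]
protein_vecs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query_vec = np.array([0.0, 2.0])  # e.g. an embedded mechanism-of-action query

ranked = rank_proteins(query_vec, protein_vecs, names)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

With these toy vectors the query points along the second axis, so `protein_C` ranks first, followed by `protein_B` and `protein_A`.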

Impact

ProCyon extends the protein foundation model paradigm beyond embeddings and structure prediction toward open-ended, language-grounded reasoning about protein phenotypes, positioning natural language as a unifying interface across heterogeneous biological knowledge. By releasing the model, the 33.9-million-instruction ProCyon-Instruct dataset, and evaluation code under an MIT license, the authors provide a reusable substrate for protein-and-text multimodal research. Its main practical limitation is the dependence on gated LLaMA weights, which adds an access step relative to fully unrestricted releases. As a preprint, its benchmark claims also await peer review and broader independent validation.

Citation

ProCyon: A multimodal foundation model for protein phenotypes

Preprint

Queen, O., et al. (2024) ProCyon: A multimodal foundation model for protein phenotypes. bioRxiv.

DOI: 10.1101/2024.12.10.627665

Recent citations

Papers that recently cited this model.

How Post-Training Shapes Biological Reasoning Models
Lukas Fesser, Hanlin Zhang, Michelle M. Li, et al.
Jun 2026 · Citations: 0

The convergence of AI-driven engineering biology and emerging technologies advancing globally networked autonomous biofoundries.
Ryan R Cochrane, L. V. dos Santos, Yizhi Cai
Current Opinion in Biotechnology · Jun 2026 · Citations: 0

Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation
Nabin Giri, Steven Farrell, Kristofer E. Bouchard
May 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
Qizhi Pei, Lijun Wu, Kaiyuan Gao, et al.
arXiv.org · Mar 2024 · Citations: 27

ATOMICA: Learning Universal Representations of Intermolecular Interactions
Ada Fang, Zaixi Zhang, Andrew Zhou, et al.
Citations: 9

PFMBench: Protein Foundation Model Benchmark
Zhangyang Gao, Hao Wang, Cheng Tan, et al.
arXiv.org · Jun 2025 · Citations: 6

Medea: An omics AI agent for therapeutic discovery
Pengwei Sui, Michelle M. Li, Shanghua Gao, et al.
bioRxiv · Jan 2026 · Citations: 5

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning
Tianle Zhang, Wanlong Fang, Jonathan Woo, et al.
arXiv.org · Sep 2025 · Citations: 5

Citations

Total Citations: 13

Influential: 0

References: 0

GitHub

Stars: 60

Forks: 15

Open Issues: 0

Contributors: 4

Last Push: 7mo ago

Language: Python

License: MIT

HuggingFace

Downloads: 0

Likes: 11

Last Modified: 1y ago

Fields of citing research

Biology: 91%
Computer Science: 91%
Medicine: 45%
Chemistry: 27%
Engineering: 9%
Environmental Science: 9%
Materials Science: 9%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible

83 · Open

Usability (can I run it?): 71

Reproducibility (can I retrain it?): 95

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository · Research Paper · Official Website · HuggingFace Model · Dataset
