bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein foundation models
Protein · Language model · Small molecule

ProCyon

Harvard Medical School / Kempner Institute

Multimodal foundation model integrating protein sequence, structure, and natural language to model and generate protein phenotypes across scales.

Released: December 2024
Parameters: approximately 11 billion

ProCyon is a multimodal protein foundation model that unifies protein sequence, protein structure, and natural language in a single generative system, enabling researchers to ask open-ended questions about protein function and to retrieve proteins using free-form text. Where most protein language models produce fixed embeddings or predictions over a closed label set, ProCyon treats protein characterization as an instruction-following problem: a user can interleave one or more protein inputs inside a textual prompt and receive a natural-language answer, a generated phenotype description, or a ranked set of candidate proteins.
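The interleaving of proteins and text described above can be pictured as a sequence of typed segments. The sketch below is illustrative only: the `<protein>` placeholder, the `TextSegment` and `ProteinSegment` classes, and the `interleave` helper are hypothetical names chosen for this example, not part of ProCyon's released code, and the amino-acid strings are made-up toy inputs.

```python
from dataclasses import dataclass

# Hypothetical placeholder marking where a protein is spliced into the prompt.
PROTEIN_SLOT = "<protein>"


@dataclass
class TextSegment:
    text: str


@dataclass
class ProteinSegment:
    sequence: str  # amino-acid sequence in one-letter codes


def interleave(prompt: str, proteins: list[str]) -> list:
    """Split the prompt on PROTEIN_SLOT and fill each slot with a protein, in order."""
    parts = prompt.split(PROTEIN_SLOT)
    if len(parts) - 1 != len(proteins):
        raise ValueError("number of protein slots must match number of proteins")
    segments = []
    for i, part in enumerate(parts):
        if part:  # skip empty text between adjacent slots
            segments.append(TextSegment(part))
        if i < len(proteins):
            segments.append(ProteinSegment(proteins[i]))
    return segments


# Toy usage: two proteins inside one natural-language question.
segments = interleave(
    "Describe the function of <protein> and how it interacts with <protein>.",
    ["MKTAYIAKQR", "GSHMLEDPVA"],
)
for seg in segments:
    print(type(seg).__name__, "->", seg)
```

In a real multimodal model, each text segment would be tokenized as usual and each protein segment would be handed to a protein encoder; this sketch only shows the ordering logic.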

The model was developed by the Zitnik Lab (Marinka Zitnik's group, part of the Department of Biomedical Informatics) at Harvard Medical School together with the Kempner Institute, and released as a bioRxiv preprint in December 2024. It targets a long-standing bottleneck in functional genomics: the vast majority of proteins across organisms remain poorly annotated, and existing tools either require predefined ontologies or cannot reason jointly over molecular sequence and the descriptive text in which biological knowledge is recorded.

ProCyon's central contribution is breadth of phenotype coverage combined with zero-shot generalization. It spans five knowledge domains—molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions—and is designed to operate on proteins and phenotypes that were never seen during training.

#Key Features

  • Multimodal instruction following: Accepts protein sequence and structure interleaved within natural-language prompts, so a single model handles diverse queries without task-specific heads or fixed output vocabularies.
  • Zero-shot phenotype generation: Produces free-form text descriptions ("captions") of protein function and maintains strong performance on 3,250 completely unseen phenotypes.
  • Text-based protein retrieval: Ranks and retrieves proteins from natural-language descriptions, including queries phrased as a drug's mechanism of action.
  • Cross-domain reasoning: Combines evidence across molecular function, disease, and interaction domains to characterize poorly annotated human proteins.
  • Open release with variants: Code, the training dataset, and intermediate checkpoints are released, with model variants tuned for different use cases.
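To make the text-based retrieval feature concrete, here is a minimal sketch of ranking proteins against a query by cosine similarity of embeddings. This is a generic embedding-retrieval technique, not a description of how ProCyon scores candidates, which this page does not specify. The vectors, the protein names, and the `rank_proteins` helper are all hypothetical toy values.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_proteins(query_vec, protein_vecs, top_k=3):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in protein_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; in practice these would come from the model's encoders.
query = [1.0, 0.0, 0.5]  # stands in for an embedded text query
candidates = {
    "protein_A": [0.9, 0.1, 0.4],
    "protein_B": [0.0, 1.0, 0.0],
    "protein_C": [0.5, 0.5, 0.5],
}

ranking = rank_proteins(query, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the candidate most aligned with the query (`protein_A`) ranks first and the orthogonal one (`protein_B`) ranks last.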

#Technical Details

ProCyon is an approximately 11-billion-parameter model that couples a large language model backbone with protein representation learning. The flagship ProCyon-Full and the binding-specialized ProCyon-Bind use a LLaMA-3-8B backbone, while ProCyon-Split, the benchmarking variant, uses LLaMA-2-7B. Protein inputs are encoded with ESM2 and projected into the language model's token space. Training uses ProCyon-Instruct, a dataset of roughly 33.9 million protein-phenotype instructions assembled from 677,154 protein-phenotype pairs, covering 48,920 unique phenotypes across 56,753 proteins, domains, and peptides spanning the five knowledge domains. On a benchmark of fourteen biologically relevant tasks, ProCyon outperformed both single-modality and multimodal baselines, including ESM3, and generalized to thousands of held-out phenotypes. The model weights are distributed on Hugging Face but depend on gated access to the LLaMA backbone, so the release is only conditionally open: users must obtain LLaMA access before they can run the model.
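The idea of projecting protein encodings into the language model's token space can be sketched as follows. This is a toy illustration under stated assumptions: it assumes one pooled embedding per protein and a single linear projection, whereas the actual projector in ProCyon may differ; the dimensions (4 and 6) are deliberately tiny stand-ins for real encoder and backbone widths; and `make_projection`, `project`, and `splice` are hypothetical helper names. Plain Python lists are used so the example runs with the standard library alone; a real implementation would use a tensor library.

```python
import random


def make_projection(d_protein, d_model, seed=0):
    """Random stand-in for a learned d_protein x d_model projection matrix."""
    rng = random.Random(seed)
    scale = d_protein ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(d_protein)]


def project(protein_embedding, weights):
    """Map a protein embedding into the language model's embedding width."""
    d_model = len(weights[0])
    return [
        sum(protein_embedding[i] * weights[i][j] for i in range(len(protein_embedding)))
        for j in range(d_model)
    ]


def splice(token_embeddings, slot_index, protein_token):
    """Return a copy of the sequence with the placeholder row replaced by the protein token."""
    return token_embeddings[:slot_index] + [protein_token] + token_embeddings[slot_index + 1:]


D_PROTEIN, D_MODEL = 4, 6  # toy sizes, not the real model dimensions
weights = make_projection(D_PROTEIN, D_MODEL)

protein_embedding = [0.1, -0.2, 0.3, 0.4]  # stands in for a pooled encoder output
protein_token = project(protein_embedding, weights)

# Five text-token embeddings (all zeros here); position 2 is the protein placeholder.
text_tokens = [[0.0] * D_MODEL for _ in range(5)]
lm_input = splice(text_tokens, 2, protein_token)

print(len(lm_input), "tokens of width", len(lm_input[2]))
```

After splicing, the protein occupies one position in the sequence with the same width as every text token, which is what lets a language model backbone attend over proteins and words together.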

#Applications

ProCyon supports functional annotation of understudied proteins, hypothesis generation in disease biology, and target discovery in drug development. Concrete use cases demonstrated by the authors include generating phenotype descriptions for poorly characterized human proteins, retrieving candidate proteins that match a drug's mechanism of action, identifying drug-binding protein domains, and predicting the functional consequences of mutations. Computational biologists, proteomics researchers, and pharmaceutical teams can query the model in natural language rather than mapping problems onto a fixed ontology.

#Impact

ProCyon extends the protein foundation model paradigm beyond embeddings and structure prediction toward open-ended, language-grounded reasoning about protein phenotypes, positioning natural language as a unifying interface across heterogeneous biological knowledge. By releasing the model, the 33.9-million-instruction ProCyon-Instruct dataset, and evaluation code under an MIT license, the authors provide a reusable substrate for protein-and-text multimodal research. Its main practical limitation is the dependence on gated LLaMA weights, which adds an access step relative to fully unrestricted releases. In addition, because the work is a preprint, its benchmark claims await peer review and broader independent validation.

Citation

ProCyon: A multimodal foundation model for protein phenotypes

Preprint

Queen, O., et al. (2024) ProCyon: A multimodal foundation model for protein phenotypes. bioRxiv.

DOI: 10.1101/2024.12.10.627665

Recent citations

Papers that recently cited this model.

  • How Post-Training Shapes Biological Reasoning Models
    Lukas Fesser, Hanlin Zhang, Michelle M. Li, et al.
    Jun 2026 · Citations: 0
  • The convergence of AI-driven engineering biology and emerging technologies advancing globally networked autonomous biofoundries
    Ryan R Cochrane, L. V. dos Santos, Yizhi Cai
    Current Opinion in Biotechnology · Jun 2026 · Citations: 0
  • Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation
    Nabin Giri, Steven Farrell, Kristofer E. Bouchard
    May 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

  • Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
    Qizhi Pei, Lijun Wu, Kaiyuan Gao, et al.
    arXiv.org · Mar 2024 · Citations: 27
  • ATOMICA: Learning Universal Representations of Intermolecular Interactions
    Ada Fang, Zaixi Zhang, Andrew Zhou, et al.
    Citations: 9
  • PFMBench: Protein Foundation Model Benchmark
    Zhangyang Gao, Hao Wang, Cheng Tan, et al.
    arXiv.org · Jun 2025 · Citations: 6
  • Medea: An omics AI agent for therapeutic discovery
    Pengwei Sui, Michelle M. Li, Shanghua Gao, et al.
    bioRxiv · Jan 2026 · Citations: 5
  • Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning
    Tianle Zhang, Wanlong Fang, Jonathan Woo, et al.
    arXiv.org · Sep 2025 · Citations: 5

Citations

Total Citations: 13
Influential: 0
References: 0

GitHub

Stars: 60
Forks: 15
Open Issues: 0
Contributors: 4
Last Push: 7mo ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 11
Last Modified: 1y ago

Fields of citing research

  • Biology: 91%
  • Computer Science: 91%
  • Medicine: 45%
  • Chemistry: 27%
  • Engineering: 9%
  • Environmental Science: 9%
  • Materials Science: 9%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 83 (Open)
Usability (can I run it?): 71
Reproducibility (can I retrain it?): 95
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

drug_discovery, foundation_model, multimodal, phenotype_generation, protein_retrieval, proteomics, transformer, variant_effect_prediction, zero_shot

Resources

GitHub Repository · Research Paper · Official Website · HuggingFace Model · Dataset