bio.rodeo

Protein

CodeFP

PharMolix Inc. / Tsinghua University

A co-generative protein language model that jointly decodes sequence and structure tokens from GO functional annotations for de novo functional protein design.

Released: May 2026

CodeFP is a co-generative protein language model for de novo functional protein design, developed by researchers at PharMolix Inc. in collaboration with Tsinghua University's Institute for AI Industry Research (AIR). The central problem it addresses is that generating a protein with a desired biological function is only useful if the resulting sequence also folds into a stable, viable structure. Many function-conditioned generators optimize sequence likelihood without an explicit structural objective, producing candidates that satisfy functional intent but fold poorly. CodeFP instead conditions generation directly on Gene Ontology (GO) functional annotations and decodes both the amino acid sequence and a structural representation of the same protein at once.

The model's key idea is joint, or "co-generative," decoding: at each step it predicts sequence tokens alongside structure tokens, so functional specification and structural viability are satisfied simultaneously rather than in separate stages. This contrasts with two-stage pipelines that first design a backbone and then thread a sequence through it, and with sequence-only function-conditioned generators that have no direct handle on foldability. By treating structure as a co-equal output channel, CodeFP aims to keep designs both on-function and physically plausible.
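The control flow of joint decoding can be sketched in a few lines. This is a toy illustration only, not CodeFP's implementation: `toy_step`, `co_generate`, and the 16-entry structure vocabulary are hypothetical names and values, and random sampling stands in for the two output distributions a trained model would produce. The GO identifier in the usage note is just an example input.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STRUCT_VOCAB_SIZE = 16  # hypothetical; the real structure vocabulary size is not given here


def toy_step(go_terms, seq_prefix, struct_prefix, rng):
    """Stand-in for one model forward pass.

    A trained model would condition on the GO terms and both prefixes;
    this toy version ignores them and samples uniformly at random.
    """
    residue = rng.choice(AMINO_ACIDS)
    struct_token = rng.randrange(STRUCT_VOCAB_SIZE)
    return residue, struct_token


def co_generate(go_terms, length, seed=0):
    """Emit one residue and one structure token per step, in lockstep."""
    rng = random.Random(seed)
    seq, struct = [], []
    for _ in range(length):
        residue, struct_token = toy_step(go_terms, seq, struct, rng)
        seq.append(residue)          # sequence channel
        struct.append(struct_token)  # structure channel, same position
    return "".join(seq), struct
```

Calling `co_generate(["GO:0003824"], 12)` returns a 12-residue string and a list of 12 structure-token ids. The point of the sketch is the loop shape: both channels advance together, so neither is produced as a later, separate stage.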

CodeFP was introduced in a 2026 arXiv preprint (arXiv:2605.00948) by Xinrui Chen, Yizhen Luo, Siqi Fan, and Zaiqing Nie, with support from China's National Key R&D Program. It sits within the active landscape of function-conditioned protein generators alongside models such as ProGen2, Chroma, CFP-Gen, Pinal, and ProteoGAN.

#Key Features

  • Co-generative sequence and structure decoding: The model simultaneously emits amino acid sequence tokens and structure tokens for the same protein, enforcing functional and structural-viability constraints jointly rather than sequentially.
  • GO-annotation conditioning: Generation is steered by Gene Ontology functional annotations, allowing users to specify a target function and obtain candidate proteins designed to satisfy it.
  • Structure-token representation: Structural outputs are expressed as discrete tokens derived from a structure-aware tokenizer, letting a single language-model-style decoder handle both modalities in one vocabulary.
  • Foldability-aware generation: Because structural viability is optimized during decoding, the authors report improved foldability over sequence-only baselines, with 80.65% of generated designs reaching pLDDT > 70.
  • Zero-shot functional design: CodeFP is evaluated as a fixed checkpoint on zero-shot generative benchmarks, generating function-conditioned proteins without task-specific fine-tuning.
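One common way to let a single decoder handle two modalities in one vocabulary is to give each modality its own id range. The sketch below assumes that layout purely for illustration; the offsets, the 8-entry structure codebook, and the function names are hypothetical and are not taken from CodeFP.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NUM_STRUCT_TOKENS = 8  # hypothetical codebook size, kept small for readability

# Sequence tokens occupy ids 0..19; structure tokens follow at an offset.
STRUCT_OFFSET = len(AMINO_ACIDS)
VOCAB_SIZE = STRUCT_OFFSET + NUM_STRUCT_TOKENS


def encode_residue(aa):
    """Map an amino acid letter to its id in the shared vocabulary."""
    return AMINO_ACIDS.index(aa)


def encode_struct(code):
    """Map a structure codebook index to its id in the shared vocabulary."""
    if not 0 <= code < NUM_STRUCT_TOKENS:
        raise ValueError(f"structure code out of range: {code}")
    return STRUCT_OFFSET + code


def decode(token_id):
    """Recover (modality, value) from a shared-vocabulary id."""
    if 0 <= token_id < STRUCT_OFFSET:
        return ("seq", AMINO_ACIDS[token_id])
    if STRUCT_OFFSET <= token_id < VOCAB_SIZE:
        return ("struct", token_id - STRUCT_OFFSET)
    raise ValueError(f"token id out of range: {token_id}")
```

With this layout the decoder's output layer has `VOCAB_SIZE` entries, and the id range alone tells downstream code which modality a token belongs to.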

#Technical Details

CodeFP is trained on 103.9K protein entries drawn from SwissProt and InterPro, spanning 375 GO terms, with paired structure tokens. The structural supervision is sourced from DPLM-2 together with experimental and predicted structures from the Protein Data Bank (PDB) and the AlphaFold Database, giving the model aligned sequence–structure–function triples to learn co-generation. The architecture follows a transformer language-model design that decodes interleaved sequence and structure tokens conditioned on GO annotations.
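To make the idea of an aligned triple concrete, the following sketch flattens one record into a GO-conditioned, interleaved token stream and recovers the two channels again. It rests on assumptions not confirmed by the source: one structure token per residue, residue-before-structure ordering, GO terms as a prefix, and the `Triple`, `interleave`, and `deinterleave` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Triple:
    go_terms: List[str]          # functional condition
    sequence: str                # amino acid sequence
    structure_tokens: List[int]  # assumed: one discrete token per residue


def interleave(t: Triple) -> List[str]:
    """Build [GO prefix] + [res, struct, res, struct, ...]."""
    if len(t.sequence) != len(t.structure_tokens):
        raise ValueError("sequence and structure tokens must align per residue")
    tokens = [f"<{go}>" for go in t.go_terms]
    for residue, code in zip(t.sequence, t.structure_tokens):
        tokens.append(residue)
        tokens.append(f"<s{code}>")
    return tokens


def deinterleave(tokens: List[str], n_go: int) -> Tuple[str, List[int]]:
    """Split an interleaved stream back into sequence and structure channels."""
    body = tokens[n_go:]
    sequence = "".join(body[0::2])
    structure = [int(tok[2:-1]) for tok in body[1::2]]  # strip "<s" and ">"
    return sequence, structure
```

For example, `interleave(Triple(["GO:0003824"], "MKV", [5, 2, 7]))` yields `["<GO:0003824>", "M", "<s5>", "K", "<s2>", "V", "<s7>"]`, and `deinterleave` on that list with `n_go=1` returns `("MKV", [5, 2, 7])`.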

Evaluated as a fixed checkpoint on zero-shot generative benchmarks against ProGen2, Chroma, CFP-Gen, Pinal, and ProteoGAN, CodeFP is reported to improve functional consistency by 6.1% and foldability by 3.2% over the compared baselines, with 80.65% of designs achieving pLDDT > 70. These results position the co-generative approach as competitive with established function-conditioned and structure-based generators on both functional-fidelity and structural-quality axes.
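The pLDDT > 70 rate is a simple pass fraction over generated designs. A minimal sketch of that arithmetic follows, assuming each score is the mean pLDDT (on the usual 0 to 100 scale) of one design as computed by some structure predictor; the function name and the per-design averaging are assumptions, since the exact evaluation protocol is not described here.

```python
def plddt_pass_rate(plddt_scores, threshold=70.0):
    """Fraction of designs whose pLDDT is strictly above the threshold."""
    scores = list(plddt_scores)
    if not scores:
        raise ValueError("need at least one score")
    passed = sum(1 for s in scores if s > threshold)
    return passed / len(scores)
```

For the four scores `[65.0, 71.0, 90.0, 70.0]`, two are strictly above 70, so the function returns `0.5`. A return value of `0.8065` would correspond to the reported 80.65% rate.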

#Applications

CodeFP targets de novo functional protein design where a researcher specifies a desired biological function via GO annotations and needs candidate proteins that are both on-function and likely to fold. This is useful for synthetic biologists and protein engineers screening novel enzymes or functional scaffolds, particularly when no close natural template exists. Because the model jointly optimizes foldability, downstream wet-lab pipelines may spend less effort filtering out structurally implausible candidates before experimental testing. Its comparison against ProGen2, Chroma, and Pinal places it as a candidate generator within the same workflows those models serve.

#Impact

CodeFP contributes to the trend of unifying functional specification and structural viability in a single generative pass, rather than handling them as separate optimization problems. Its reported gains in functional consistency and foldability over established baselines suggest co-generative decoding is a promising direction for function-driven design. Several caveats temper this assessment: CodeFP is an arXiv preprint that has not yet undergone peer review, and its benchmark results are computational, with no wet-lab validation reported. As of this writing, no public code repository, model weights, or Hugging Face release has been confirmed, and the license is unknown, which limits independent reproduction and adoption. Independent verification of these results and an open release would strengthen the model's standing in the field.

Citation

Preprint

DOI: 10.48550/arXiv.2605.00948

Openness

Unclassified
Missing required components

Tags

de_novo_design, generative, language_model, protein_design, transformer

Resources

Research Paper