bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Protein foundation models

PLAID

UC Berkeley / Genentech

Latent diffusion model for controllable all-atom protein generation that co-designs sequence and structure while training on sequences alone.

Released: December 2024
Parameters: 2 Billion

PLAID (Protein Latent Induced Diffusion) is a generative model for designing proteins that produces both amino-acid sequence and all-atom 3D structure in a single sampling process. It was developed by Amy X. Lu and collaborators at UC Berkeley and Genentech (with co-authors including Nathan Frey and Pieter Abbeel) and released as a preprint in December 2024. The work tackles a long-standing tension in protein generative modeling: structure-based diffusion models require expensive experimentally solved structures for training, while sequence-only language models do not directly yield 3D coordinates or side-chain placements.

PLAID's central idea is to run latent diffusion over the internal representation space of a pretrained structure predictor rather than over raw coordinates or tokens. Because that latent space already encodes sequence and structure jointly, PLAID can be trained using only protein sequences, which are orders of magnitude more abundant than solved structures, yet still generate full all-atom outputs by decoding samples back through the structure predictor. This sidesteps the data bottleneck that constrains many backbone-generation methods.
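To make the idea of diffusing in a latent space concrete, the following is a minimal sketch of the forward (noising) half of a diffusion process applied to a latent vector instead of atomic coordinates. It assumes a standard DDPM-style linear noise schedule; the description above does not specify PLAID's actual schedule or parameterization, and `linear_alpha_bars`, `noise_latent`, and the toy latent values are illustrative names and numbers, not part of the PLAID codebase.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_latent(latent, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = [signal * x + sigma * e for x, e in zip(latent, eps)]
    return noisy, eps


# Toy stand-in for one compressed latent vector of a protein (hypothetical values).
rng = random.Random(0)
toy_latent = [0.5, -1.2, 0.3, 0.9]
alpha_bars = linear_alpha_bars(num_steps=1000)

# Early timestep: the latent is almost untouched. Late timestep: almost pure noise.
slightly_noisy, _ = noise_latent(toy_latent, alpha_bars[0], rng)
mostly_noise, _ = noise_latent(toy_latent, alpha_bars[-1], rng)
```

In a setup like this, a denoising network would be trained to recover `eps` from the noisy latent, and sampling would run the process in reverse before handing the clean latent to a decoder that emits sequence and structure.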

The model supports conditional generation guided by biological function and taxonomy, making it a controllable design tool rather than only an unconditional sampler. It sits alongside all-atom design approaches such as Protpardelle and RFdiffusionAA, but is distinguished by its sequence-only training signal and its function/organism conditioning.

#Key Features

  • Sequence-only training, all-atom output: PLAID learns from sequence databases alone but generates complete all-atom structures by sampling in the latent space of the ESMFold structure predictor, avoiding the need for solved structures during training.
  • Compressed latent diffusion: Diffusion is performed over the compact latent produced by the CHEAP autoencoder (a separate, companion component) rather than over high-dimensional coordinates, which keeps training and sampling tractable.
  • Function and organism conditioning: Classifier-free guidance on Gene Ontology (GO) function terms and organism taxonomy lets users steer generation toward desired biological properties.
  • Experimental validation: The authors report wet-lab characterization of generated heme-binding proteins, providing evidence that conditioning produces functionally relevant designs rather than only plausible structures.
  • Open weights and code: Both a 2B-parameter and a 100M-parameter checkpoint are released under an MIT license, with the diffusion weights hosted on HuggingFace.

#Technical Details

PLAID is a latent diffusion model operating over the shared sequence-structure representation derived from ESMFold, compressed by the CHEAP autoencoder. Two model sizes are released: a 2-billion-parameter variant and a 100-million-parameter variant. Training uses only protein sequences, with the structural decoder providing the bridge to all-atom coordinates at inference time. Conditional generation is implemented via classifier-free guidance over GO function indices and organism taxonomy indices, and the pipeline can determine protein length automatically. The released code requires a custom OpenFold fork and the companion CHEAP latent autoencoder, with model caches handled automatically.
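Classifier-free guidance, as described above, can be sketched in a few lines. The example below is a generic illustration of the technique, not PLAID's code: `NULL_INDEX`, `drop_condition`, and `guided_prediction` are hypothetical names, and the convention used here (unconditional prediction plus a scaled difference) is one of two common formulations.

```python
import random

NULL_INDEX = -1  # hypothetical "no condition" index standing in for a GO term or organism


def drop_condition(index, drop_prob, rng):
    """Training time: randomly replace a condition index with the null index,
    so one network learns both conditional and unconditional predictions."""
    return NULL_INDEX if rng.random() < drop_prob else index


def guided_prediction(uncond, cond, guidance_scale):
    """Sampling time: uncond + scale * (cond - uncond), applied elementwise.
    scale = 0 ignores the condition, scale = 1 is plain conditional sampling,
    and scale > 1 extrapolates further toward the condition."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy denoiser outputs for the same noisy latent, with and without a condition.
uncond_out = [0.0, 1.0]
cond_out = [1.0, 3.0]

no_guidance = guided_prediction(uncond_out, cond_out, 0.0)
plain_conditional = guided_prediction(uncond_out, cond_out, 1.0)
strong_guidance = guided_prediction(uncond_out, cond_out, 2.0)
```

With two conditioning signals (function and organism), either index could be dropped independently during training, which would allow sampling with one, both, or neither condition; whether PLAID drops them independently is an assumption here.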

#Applications

PLAID is aimed at protein engineers and computational biologists who need to generate novel candidate proteins with targeted function or taxonomic context—for example, proposing enzymes or binding proteins associated with a particular GO annotation. Because it emits all-atom structures alongside sequences, downstream users can immediately inspect side-chain geometry, dock cofactors, or filter candidates structurally before ordering genes for wet-lab testing, as demonstrated for heme-binding designs.
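The structural filtering step mentioned above might look like the following sketch. Everything in it is hypothetical: the `Candidate` record, the `mean_confidence` field (assumed to be a 0 to 100 structure-confidence score such as a mean pLDDT), and the thresholds are placeholders for whatever metrics a given pipeline actually reports.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    sequence: str
    mean_confidence: float  # hypothetical 0-100 structure-confidence score


def shortlist(candidates, min_confidence=70.0, max_length=300):
    """Keep candidates that are confident enough and short enough to order,
    ranked from most to least confident."""
    kept = [
        c for c in candidates
        if c.mean_confidence >= min_confidence and len(c.sequence) <= max_length
    ]
    return sorted(kept, key=lambda c: c.mean_confidence, reverse=True)


# Toy candidates with placeholder sequences (not real designs).
generated = [
    Candidate("design_a", "M" * 120, 85.0),
    Candidate("design_b", "M" * 450, 92.0),  # too long
    Candidate("design_c", "M" * 200, 55.0),  # low confidence
    Candidate("design_d", "M" * 90, 78.5),
]

selected = shortlist(generated)
selected_names = [c.name for c in selected]
```

A real workflow would likely add function-specific checks, for example inspecting the geometry of a putative cofactor-binding site, before committing to gene synthesis.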

#Impact

PLAID demonstrates that controllable all-atom protein generation can be driven primarily by abundant sequence data, lowering the structural-data barrier that limits many diffusion approaches. Its function- and organism-conditioned sampling, paired with open 2B and 100M checkpoints and experimental validation, makes it a practical reference point for the growing class of all-atom generative protein models. As a preprint with released weights, its long-term influence will depend on independent benchmarking and peer review, but it contributes a notable design pattern: diffusing in a learned sequence-structure latent rather than over explicit coordinates.

GitHub

Stars: 127
Forks: 14
Open Issues: 0
Contributors: 2
Last Push: 1y ago
Language: Python
License: MIT

HuggingFace

Downloads: 0
Likes: 4
Last Modified: 1y ago

Openness

bio.rodeo openness: Fully open · usable and reproducible
Overall: 77 (Open)
Usability (can I run it?): 90
Reproducibility (can I retrain it?): 62
Model Openness Framework: Unclassified (missing required components)

Tags

protein_design, de_novo_design, structure_prediction, diffusion, transformer, generative, self_supervised

Resources

  • GitHub Repository
  • Research Paper
  • HuggingFace Model