bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Single-cell · Spatial omics

PGL (Portraying Gene Language)

Seoul National University Hospital

LLM-based generative model that synthesizes complete single-cell RNA-seq profiles from tissue and disease metadata alone, treating cells as gene-expression token sequences.

Released: January 2026

PGL (Portraying Gene Language) is a generative model from the Department of Nuclear Medicine at Seoul National University Hospital that reframes single-cell transcriptomics as a language-generation problem. Each cell is represented as a long sequence of gene-expression tokens, and a large language model is trained to generate these sequences, in effect writing out a cell's transcriptome the way a language model writes text. The key capability that follows is conditional synthesis: PGL can produce complete, realistic single-cell RNA-seq profiles from metadata alone, such as a tissue type and disease context, without requiring a matched reference dataset.

This positions PGL differently from most single-cell foundation models (such as scGPT or Geneformer), which primarily learn representations of existing cells for downstream annotation and prediction. PGL is instead generative at the level of whole transcriptomes, aiming to synthesize biologically coherent cells de novo. The generated cells are not merely plausible in isolation: the authors report that they recapitulate dataset-specific transcriptomic structure, align with known cancer-subtype biology, and mix coherently with real single-cell datasets.

Posted to bioRxiv in January 2026, PGL explores whether the language-modeling paradigm that transformed natural-language generation can be applied to the "language" of gene expression, with single-cell and spatial transcriptomics as the proving ground.

#Key Features

  • Cells as token sequences: Represents each cell as a long ordered sequence of gene-expression tokens, casting transcriptome generation as next-token language modeling.
  • Metadata-conditioned synthesis: Generates complete scRNA-seq profiles from descriptive metadata such as tissue and disease, without a matched single-cell reference.
  • Biologically faithful output: Generated cells recapitulate dataset-specific structure and align with established cancer-subtype biology.
  • Integrates with real data: Synthetic cells mix coherently with real single-cell datasets rather than forming separable artifacts.
  • Reference-free spatial mapping: Generated cells serve as effective references for spatial transcriptomics, enabling cell-type mapping where no matched single-cell atlas exists.
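The first two features can be made concrete with a minimal sketch of how a cell might be serialized into a metadata-prefixed token sequence. The preprint's actual tokenization is not described here, so everything below is an assumption for illustration: the rank-by-expression ordering, the binned `<expr:n>` tokens, the `<tissue:...>` and `<disease:...>` prefix tokens, and the function names `serialize_cell` and `conditioning_prompt` are all hypothetical.

```python
import math
from typing import Dict, List


def conditioning_prompt(tissue: str, disease: str) -> List[str]:
    """Metadata-only prefix a generative model would be asked to continue.

    Hypothetical token format, not taken from the PGL preprint.
    """
    return [f"<tissue:{tissue}>", f"<disease:{disease}>", "<cell>"]


def serialize_cell(
    expression: Dict[str, float],
    tissue: str,
    disease: str,
    n_bins: int = 5,
    max_genes: int = 2000,
) -> List[str]:
    """Turn one cell into a token sequence (assumed scheme, for illustration).

    Expressed genes are ordered from highest to lowest expression, and each
    gene token is followed by a token for its binned relative expression.
    """
    # Drop unexpressed genes; break ties by gene name so output is deterministic.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    expressed = expressed[:max_genes]

    tokens = conditioning_prompt(tissue, disease)
    if expressed:
        top = expressed[0][1]  # highest expression in this cell
        for gene, value in expressed:
            # Map expression relative to the top gene into bins 1..n_bins.
            level = max(1, math.ceil(n_bins * value / top))
            tokens.append(gene)
            tokens.append(f"<expr:{level}>")
    tokens.append("</cell>")
    return tokens


cell = {"EPCAM": 8.0, "KRT18": 4.0, "PTPRC": 0.0, "MKI67": 1.0}
print(serialize_cell(cell, "breast", "carcinoma", n_bins=4))
```

Under this scheme, training pairs each real cell's full sequence with its metadata prefix, and inference supplies only the output of `conditioning_prompt` and lets the model generate the gene and expression tokens that follow.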

#Technical Details

PGL is built on the large-language-model paradigm: a cell's transcriptome is serialized into a sequence of gene-expression tokens, and a transformer-based language model is trained to generate these sequences, conditioned on metadata describing the cell's tissue and disease context. At inference time the model synthesizes full single-cell expression profiles for a requested context. The reported evaluations focus on whether generated cells reproduce real transcriptomic structure: recapitulating cancer-subtype biology, mixing with real datasets in shared embeddings, and functioning as references for spatial cell-type mapping. Details of the language-model backbone (parameter count and base architecture) are not specified in the preprint, which is a notable gap for assessing the model's scale.
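One common way to quantify whether synthetic cells "mix" with real ones in a shared embedding is a nearest-neighbour check: for each synthetic cell, look at its k closest neighbours and measure what fraction are real. The sketch below illustrates that idea only. It is not the metric reported in the preprint, and the function name `knn_mixing_score` and the toy coordinates are hypothetical.

```python
import math
from typing import List, Sequence


def knn_mixing_score(
    real: Sequence[Sequence[float]],
    synthetic: Sequence[Sequence[float]],
    k: int = 3,
) -> float:
    """Mean fraction of real cells among each synthetic cell's k nearest neighbours.

    Values near 0 mean synthetic cells cluster apart from real ones; values
    approaching the overall share of real cells suggest good mixing.
    """
    if not synthetic:
        raise ValueError("need at least one synthetic cell")
    # Pool all embedded cells, tagging each with whether it is real.
    points = [(tuple(p), True) for p in real] + [(tuple(p), False) for p in synthetic]

    fractions: List[float] = []
    for i, (point, is_real) in enumerate(points):
        if is_real:
            continue  # only synthetic cells are scored
        # Distances to every other cell, excluding the cell itself.
        neighbours = [
            (math.dist(point, other), other_is_real)
            for j, (other, other_is_real) in enumerate(points)
            if j != i
        ]
        neighbours.sort(key=lambda pair: pair[0])
        nearest = neighbours[:k]
        fractions.append(sum(1 for _, r in nearest if r) / len(nearest))
    return sum(fractions) / len(fractions)


# Toy 2-D embedding: four real cells on a unit square.
real_cells = [(0, 0), (1, 0), (0, 1), (1, 1)]
mixed = [(0.5, 0.5), (0.4, 0.6)]                      # sits among the real cells
separate = [(10, 10), (10, 11), (11, 10), (11, 11)]   # forms its own cluster

print(knn_mixing_score(real_cells, mixed))
print(knn_mixing_score(real_cells, separate))
```

A score like this depends on k and on the ratio of real to synthetic cells, so it is best read against the expected fraction of real cells in the pooled data rather than as an absolute number.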

#Applications

PGL is aimed at researchers who need single-cell references for tissues, diseases, or conditions where real atlases are sparse or unavailable. By synthesizing context-specific cells from metadata, it can supply reference populations for spatial-transcriptomics cell-type mapping without a matched scRNA-seq dataset, augment datasets for rare conditions, and support in-silico exploration of disease-associated cell states. Its demonstrated alignment with cancer-subtype biology makes oncology a natural early application area, where matched single-cell data are often limited.
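To show how a set of generated cells could stand in for a single-cell atlas in spatial cell-type mapping, here is a deliberately simplified sketch: it averages reference cells into one centroid per cell type and assigns each spatial spot to the most similar centroid by cosine similarity. This is an assumed stand-in technique, not the mapping or deconvolution method used in the preprint, and the function names and toy values are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def centroids(
    reference_cells: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, List[float]]:
    """Average expression profile per cell type in the reference set."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for profile, label in zip(reference_cells, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
        sums[label] = [s + x for s, x in zip(sums[label], profile)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_spots(
    spots: Sequence[Sequence[float]],
    reference_cells: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[str]:
    """Assign each spatial spot the cell type of its most similar centroid."""
    cents = centroids(reference_cells, labels)
    return [max(cents, key=lambda lab: cosine(spot, cents[lab])) for spot in spots]


# Toy reference over three genes; in practice these would be generated cells.
reference = [[9, 1, 0], [8, 2, 0], [0, 1, 9], [1, 0, 8]]
cell_types = ["tumor", "tumor", "immune", "immune"]
spatial_spots = [[7, 1, 1], [0, 2, 6]]

print(map_spots(spatial_spots, reference, cell_types))
```

Real spatial spots often contain several cell types, so practical pipelines estimate mixture proportions per spot rather than a single label. The single-label assignment here only illustrates where a synthetic reference would plug in.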

#Impact

PGL extends the generative language-modeling paradigm into whole-transcriptome synthesis, a direction that could reduce dependence on costly single-cell experiments for reference generation. Its ability to produce cells that integrate with real data and enable reference-free spatial mapping is its most consequential contribution. Adoption is constrained, however, by limited openness: the preprint is released under an All-Rights-Reserved (no-derivatives) license, no code or weights are publicly available, and the size of the underlying language model is unstated. These gaps limit independent reproduction and scrutiny of generated-data fidelity, a concern that is especially salient for synthetic biological data.

Openness

bio.rodeo openness: 2 · Closed (low usability and reproducibility)
Usability (can I run it?): 1
Reproducibility (can I retrain it?): 0 · not reproducible
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

data_generation, gene_expression, cell_type_annotation, transformer, language_model, generative, foundation_model, transcriptomics, cancer

Resources

Research Paper