bio.rodeo

Small molecule · Single-cell

GEMGen

Westlake University / Microsoft Research Asia

A language model that generates small-molecule structures directly from transcriptomic phenotypes — gene up/down-regulation signatures — for phenotype-driven drug discovery.

Released: January 2026

Most computational drug discovery begins from a molecular target — a specific protein to inhibit or activate. But many therapeutic goals are defined instead by a desired biological phenotype: a pattern of gene expression changes that reverses a disease state. GEMGen tackles this phenotype-first problem directly, generating small molecules that are predicted to induce a specified transcriptomic signature, without requiring a known target.

GEMGen is a generative language model that takes a text-based description of a transcriptomic phenotype — sets of up- and down-regulated genes — and produces candidate small-molecule structures expected to elicit that gene-expression response. It was developed by researchers at Westlake University together with collaborators at Microsoft Research Asia, and released as a bioRxiv preprint in January 2026. By framing molecule design as conditional generation from a phenotype "prompt," GEMGen connects the large-scale chemical-perturbation data generated by transcriptomic screens to the practical task of proposing new chemical matter.

The model sits at the intersection of single-cell/bulk transcriptomics and small-molecule generative design, and is part of a growing class of methods that treat gene-expression signatures as a controllable target for molecular generation rather than as a downstream readout.

Key Features

  • Phenotype-to-molecule generation: Generates small molecules conditioned on a transcriptomic signature expressed as up- and down-regulated gene sets, rather than on a single protein target.
  • Trained on chemical-perturbation data: Learns the mapping between molecular structure and gene-expression response from large-scale chemical-perturbation transcriptomic datasets.
  • Zero-shot transfer to genetic perturbations: Generalizes from chemical perturbation signatures to genetic-perturbation signatures without retraining, broadening the range of phenotypes it can address.
  • Disease-model applicability: Demonstrated on a fibrosis disease model, proposing molecules aimed at reversing a disease-associated expression state.
  • Novel chemotype discovery: Identified structurally novel KEAP1 inhibitors, illustrating the model's ability to propose chemical matter beyond known scaffolds.
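The source does not say how structural novelty was assessed. One common cheminformatics heuristic is to compare a candidate's fingerprint against known actives with Tanimoto similarity and flag candidates whose best match falls below a cutoff. The sketch below illustrates that idea on toy fingerprints represented as sets of integer feature IDs. The function names, the 0.4 cutoff, and the toy data are all assumptions for illustration. A real workflow would compute fingerprints with a cheminformatics toolkit.

```python
from typing import Iterable, List, Set


def tanimoto(a: Iterable[int], b: Iterable[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(set_a & set_b) / len(union)


def max_similarity(candidate: Set[int], known: List[Set[int]]) -> float:
    """Highest similarity of the candidate to any known active (0.0 if none)."""
    return max((tanimoto(candidate, k) for k in known), default=0.0)


def is_novel(candidate: Set[int], known: List[Set[int]], cutoff: float = 0.4) -> bool:
    """Hypothetical novelty rule: no known active is at or above the cutoff."""
    return max_similarity(candidate, known) < cutoff


# Toy fingerprints (made-up feature IDs, not real molecules).
known_actives = [{1, 2, 9, 10}, {20, 21}]
candidate_a = {1, 2, 3, 4}   # shares 2 of 6 features with the first active
candidate_b = {1, 2, 9}      # shares 3 of 4 features with the first active

print(is_novel(candidate_a, known_actives))
print(is_novel(candidate_b, known_actives))
```

The cutoff is a judgment call that depends on the fingerprint type, so it is exposed as a parameter rather than fixed.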

Technical Details

GEMGen is a large language model that operates over text-based representations of both transcriptomic phenotypes (gene up/down-regulation sets) and molecular structures, casting molecule design as conditional sequence generation. It is trained on large-scale chemical-perturbation transcriptomic data linking compounds to their induced gene-expression changes, allowing it to learn how molecular features translate into phenotypic responses. The authors report zero-shot transfer to genetic-perturbation signatures — a distinct data modality from the chemical perturbations used in training — and application to a fibrosis disease model, where the model generates candidate molecules for a target expression state. As a case study, GEMGen produced structurally novel inhibitors of KEAP1, a regulator of the NRF2 oxidative-stress pathway. The preprint is released under an all-rights-reserved license, and no public code or model weights accompany it at the time of writing.
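To make the idea of a text-based phenotype "prompt" concrete, the sketch below turns per-gene log2 fold changes into up- and down-regulated gene lists and serializes them as a single string. The prompt template, the threshold, the `top_k` limit, and the toy values are all hypothetical. The source does not describe GEMGen's actual input format, and no code has been released.

```python
from typing import Dict, List, Tuple


def split_signature(
    log2fc: Dict[str, float], threshold: float = 1.0, top_k: int = 50
) -> Tuple[List[str], List[str]]:
    """Split log2 fold changes into up/down gene lists, strongest effect first."""
    up = sorted(
        (g for g, v in log2fc.items() if v >= threshold),
        key=lambda g: log2fc[g],
        reverse=True,
    )
    down = sorted(
        (g for g, v in log2fc.items() if v <= -threshold),
        key=lambda g: log2fc[g],
    )
    return up[:top_k], down[:top_k]


def format_phenotype_prompt(up: List[str], down: List[str]) -> str:
    """Serialize the gene sets as text. This template is an assumption."""
    return "UP: " + " ".join(up) + " | DOWN: " + " ".join(down) + " | MOLECULE:"


# Toy signature with made-up fold changes, for illustration only.
toy_log2fc = {
    "HMOX1": 2.4,
    "NQO1": 1.8,
    "GCLM": 1.1,
    "COL1A1": -1.6,
    "ACTA2": -2.0,
    "GAPDH": 0.1,  # below threshold, so it is left out of both lists
}

up_genes, down_genes = split_signature(toy_log2fc)
prompt = format_phenotype_prompt(up_genes, down_genes)
print(prompt)
```

In a conditional-generation setup of this kind, a string like `prompt` would be the conditioning context and the model would continue it with a molecule string such as SMILES. Whether GEMGen uses SMILES or another molecular text encoding is not stated in the source.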

Applications

GEMGen is intended for drug-discovery researchers pursuing phenotype-driven programs, where the goal is to reverse or induce a transcriptomic state rather than to hit a predefined target. Potential uses include proposing starting chemical matter for diseases characterized primarily by expression signatures (such as fibrosis), exploring molecules that mimic the effect of a genetic perturbation, and generating novel scaffolds against targets implicated by a gene-expression analysis. Because it requires only a phenotype description as input, it can complement target-based design in settings where the mechanism is incompletely understood.
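One simple way to express "reverse a disease state" as a phenotype input is to swap the directions of the disease signature: genes up in disease become the desired down set, and vice versa. The sketch below shows that construction. It is an assumption about how a target signature might be built, not a description of the authors' procedure, and the function name and toy gene sets are hypothetical.

```python
from typing import Iterable, List, Tuple


def reversal_target(
    disease_up: Iterable[str], disease_down: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Build a target signature that is the directional opposite of a disease one.

    Genes listed in both directions are ambiguous and are dropped.
    Returns (target_up, target_down), each sorted for reproducibility.
    """
    up, down = set(disease_up), set(disease_down)
    ambiguous = up & down
    target_up = sorted(down - ambiguous)    # restore what disease suppressed
    target_down = sorted(up - ambiguous)    # suppress what disease induced
    return target_up, target_down


# Toy fibrosis-like signature (illustrative gene sets, not from the source).
disease_up = ["COL1A1", "ACTA2", "FN1", "TGFB1"]
disease_down = ["PPARG", "TGFB1"]  # TGFB1 appears in both, so it is dropped

target_up, target_down = reversal_target(disease_up, disease_down)
print("target up:", target_up)
print("target down:", target_down)
```

To mimic a genetic perturbation rather than reverse a disease, the perturbation's own up and down sets would be passed through unchanged.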

Impact

GEMGen contributes to a shift toward phenotype-centric generative drug design, demonstrating that a language model can bridge transcriptomic signatures and chemical structure and even transfer across chemical and genetic perturbation modalities. Its discovery of novel KEAP1 inhibitors provides a concrete example of the approach yielding non-obvious chemical matter. As a 2026 preprint, its results await peer review and experimental validation, and the restrictive license together with the absence of released code or weights currently limits independent reproduction and adoption.

Openness

bio.rodeo openness: Closed · low usability and reproducibility
Score: 9 (Closed)
Usability (can I run it?): 7
Reproducibility (can I retrain it?): 10
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

molecule_generation, drug_discovery, de_novo_design, transformer, language_model, generative, zero_shot, transcriptomics

Resources

Research Paper