GEMGen

Westlake University / Microsoft Research Asia

Generative language model for phenotype-driven drug discovery, proposing small-molecule structures from up- and down-regulated gene signatures.

Released: January 2026

Most computational drug discovery begins from a molecular target — a specific protein to inhibit or activate. But many therapeutic goals are defined instead by a desired biological phenotype: a pattern of gene expression changes that reverses a disease state. GEMGen tackles this phenotype-first problem directly, generating small molecules that are predicted to induce a specified transcriptomic signature, without requiring a known target.

GEMGen is a generative language model that takes a text-based description of a transcriptomic phenotype (sets of up- and down-regulated genes) and produces candidate small-molecule structures expected to elicit that gene-expression response. It was developed by researchers at Westlake University with collaborators at Microsoft Research Asia and released as a bioRxiv preprint in January 2026. By framing molecule design as conditional generation from a phenotype "prompt," GEMGen connects the large-scale chemical-perturbation data produced by transcriptomic screens to the practical task of proposing new chemical matter.

The model sits at the intersection of single-cell/bulk transcriptomics and small-molecule generative design, and is part of a growing class of methods that treat gene-expression signatures as a controllable target for molecular generation rather than as a downstream readout.

Key Features

Phenotype-to-molecule generation: Generates small molecules conditioned on a transcriptomic signature expressed as up- and down-regulated gene sets, rather than on a single protein target.
Trained on chemical-perturbation data: Learns the mapping between molecular structure and gene-expression response from large-scale chemical-perturbation transcriptomic datasets.
Zero-shot transfer to genetic perturbations: Generalizes from chemical perturbation signatures to genetic-perturbation signatures without retraining, broadening the range of phenotypes it can address.
Disease-model applicability: Demonstrated on a fibrosis disease model, proposing molecules aimed at reversing a disease-associated expression state.
Novel chemotype discovery: Identified structurally novel KEAP1 inhibitors, illustrating the model's ability to propose chemical matter beyond known scaffolds.

Technical Details

GEMGen is a large language model that operates over text-based representations of both transcriptomic phenotypes (gene up/down-regulation sets) and molecular structures, casting molecule design as conditional sequence generation. It is trained on large-scale chemical-perturbation transcriptomic data that link compounds to the gene-expression changes they induce, so it can learn how molecular features translate into phenotypic responses. The authors report zero-shot transfer to genetic-perturbation signatures, a data modality distinct from the chemical perturbations used in training, and application to a fibrosis disease model, where the model generates candidate molecules for a target expression state. As a case study, GEMGen produced structurally novel inhibitors of KEAP1, a regulator of the NRF2 oxidative-stress pathway. The preprint is posted with all rights reserved, and no public code or model weights accompany it at the time of writing.
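To make the idea of a text-based phenotype "prompt" concrete, here is a minimal Python sketch of how up- and down-regulated gene sets could be serialized into a single string for a conditional language model. The template, the function name `build_phenotype_prompt`, and the choice of SMILES as the molecular text format are all assumptions for illustration. This summary does not describe GEMGen's actual input format, and no code has been released.

```python
from typing import Iterable


def build_phenotype_prompt(up_genes: Iterable[str], down_genes: Iterable[str]) -> str:
    """Serialize up/down gene sets into a text prompt (hypothetical template)."""
    # Strip whitespace, drop empty entries, de-duplicate, and sort so the
    # same signature always yields the same prompt.
    up = sorted({g.strip() for g in up_genes if g.strip()})
    down = sorted({g.strip() for g in down_genes if g.strip()})

    # A gene cannot be both up- and down-regulated in one signature.
    overlap = sorted(set(up) & set(down))
    if overlap:
        raise ValueError(f"genes listed as both up and down: {overlap}")

    # Assumed layout: two gene lists followed by a cue for the molecule string.
    return (
        f"Up-regulated genes: {', '.join(up)}\n"
        f"Down-regulated genes: {', '.join(down)}\n"
        "Molecule (SMILES):"
    )


# Illustrative gene symbols only; not a signature from the preprint.
prompt = build_phenotype_prompt(["NQO1", "HMOX1", "GCLM"], ["COL1A1", "ACTA2"])
print(prompt)
```

Sorting and de-duplicating the gene lists is a design choice of this sketch: it makes the prompt independent of the order in which genes were supplied. Whether gene order carries information in the real model is not stated in the source.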

Applications

GEMGen is intended for drug-discovery researchers pursuing phenotype-driven programs, where the goal is to reverse or induce a transcriptomic state rather than to hit a predefined target. Potential uses include proposing starting chemical matter for diseases characterized primarily by expression signatures (such as fibrosis), exploring molecules that mimic the effect of a genetic perturbation, and generating novel scaffolds against targets implicated by a gene-expression analysis. Because it requires only a phenotype description as input, it can complement target-based design in settings where the mechanism is incompletely understood.
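One way to read "reversing a transcriptomic state" is that the query signature is the disease signature with its directions swapped: genes elevated in disease become the genes to down-regulate, and vice versa. The Python sketch below illustrates that reading only. The `Signature` type and `reversal_query` helper are hypothetical, and the source does not say how GEMGen's authors construct reversal queries.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Signature:
    """A transcriptomic signature as two disjoint gene sets."""
    up: FrozenSet[str]
    down: FrozenSet[str]


def reversal_query(disease: Signature) -> Signature:
    """Swap directions: genes raised in disease become the ones to lower."""
    return Signature(up=disease.down, down=disease.up)


# Illustrative gene symbols only; not data from the preprint.
disease = Signature(up=frozenset({"COL1A1", "ACTA2"}), down=frozenset({"NQO1"}))
query = reversal_query(disease)
print(sorted(query.up), sorted(query.down))  # ['NQO1'] ['ACTA2', 'COL1A1']
```

Applying `reversal_query` twice returns the original signature, which is a quick sanity check that the swap loses no information.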

Impact

GEMGen contributes to a shift toward phenotype-centric generative drug design, demonstrating that a language model can bridge transcriptomic signatures and chemical structure and even transfer across chemical and genetic perturbation modalities. Its discovery of novel KEAP1 inhibitors provides a concrete example of the approach yielding non-obvious chemical matter. As a 2026 preprint, its results await peer review and experimental validation, and the restrictive license together with the absence of released code or weights currently limits independent reproduction and adoption.

Citation

Phenotype-Guided In Silico Molecular Generation Using Large Language Models

Jiang, Q., et al. (2026) Phenotype-Guided In Silico Molecular Generation Using Large Language Models. bioRxiv.

DOI: 10.64898/2026.01.03.697483

Recent citations

Papers that recently cited this model.

Unbalanced Perturbation Dynamics For Cell Fate Design
Qiangwei Peng, Yuchuan Wang, Jianzhen Li, et al.
bioRxiv · Jul 2026 · 0 citations

PhAME: Phenotype-Aware Molecular Editing via Latent Diffusion
Lukasz Janisiów, Sebastian Musiał, Bartosz Zieliński, et al.
May 2026 · 0 citations

Top citations

The most-cited papers that cite this model.

Unbalanced Perturbation Dynamics For Cell Fate Design
Qiangwei Peng, Yuchuan Wang, Jianzhen Li, et al.
bioRxiv · Jul 2026 · 0 citations

PhAME: Phenotype-Aware Molecular Editing via Latent Diffusion
Lukasz Janisiów, Sebastian Musiał, Bartosz Zieliński, et al.
May 2026 · 0 citations

Citations

Total citations: 2
Influential: 0
References: 3

Fields of citing research

Computer Science: 100%
Medicine: 50%
Biology: 50%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

9 · Closed

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 10

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

Research Paper
