Westlake University / Microsoft Research Asia
A language model that generates small-molecule structures directly from transcriptomic phenotypes — gene up/down-regulation signatures — for phenotype-driven drug discovery.
Most computational drug discovery begins from a molecular target — a specific protein to inhibit or activate. But many therapeutic goals are defined instead by a desired biological phenotype: a pattern of gene expression changes that reverses a disease state. GEMGen tackles this phenotype-first problem directly, generating small molecules that are predicted to induce a specified transcriptomic signature, without requiring a known target.
GEMGen is a generative language model that takes a text-based description of a transcriptomic phenotype — sets of up- and down-regulated genes — and produces candidate small-molecule structures expected to elicit that gene-expression response. It was developed by researchers at Westlake University together with collaborators at Microsoft Research Asia, and released as a bioRxiv preprint in January 2026. By framing molecule design as conditional generation from a phenotype "prompt," GEMGen connects the large-scale chemical-perturbation data generated by transcriptomic screens to the practical task of proposing new chemical matter.
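To make the idea of a phenotype "prompt" concrete, the sketch below turns a table of log2 fold changes into up- and down-regulated gene sets and serializes them as text. The prompt layout (`UP:` / `DOWN:` / `MOLECULE:`), the fold-change threshold, the `top_k` cutoff, and both function names are assumptions made for illustration; the format GEMGen actually uses is not described here, and the expression values are invented.

```python
from typing import Dict, List, Tuple


def split_signature(
    log2fc: Dict[str, float],
    threshold: float = 1.0,  # assumed cutoff, not taken from the preprint
    top_k: int = 50,         # assumed cap on genes per direction
) -> Tuple[List[str], List[str]]:
    """Split a gene -> log2 fold-change map into ranked up/down gene lists."""
    # Up-regulated genes, strongest increase first.
    up = sorted(
        (g for g, v in log2fc.items() if v >= threshold),
        key=lambda g: log2fc[g],
        reverse=True,
    )
    # Down-regulated genes, strongest decrease first.
    down = sorted(
        (g for g, v in log2fc.items() if v <= -threshold),
        key=lambda g: log2fc[g],
    )
    return up[:top_k], down[:top_k]


def format_phenotype_prompt(up: List[str], down: List[str]) -> str:
    """Serialize the gene sets as text (hypothetical prompt layout)."""
    return f"UP: {', '.join(up)}\nDOWN: {', '.join(down)}\nMOLECULE:"


# Illustrative values only; these are not measurements from any dataset.
log2fc = {"HMOX1": 2.3, "NQO1": 1.8, "COL1A1": -1.5, "ACTA2": -2.1, "GAPDH": 0.1}
up_genes, down_genes = split_signature(log2fc)
prompt = format_phenotype_prompt(up_genes, down_genes)
print(prompt)
```

In a setup like this, a conditional generator would be asked to continue the text after `MOLECULE:` with a molecular string such as SMILES. Genes below the threshold (here `GAPDH`) are left out of the prompt.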
The model sits at the intersection of single-cell/bulk transcriptomics and small-molecule generative design, and is part of a growing class of methods that treat gene-expression signatures as a controllable target for molecular generation rather than as a downstream readout.
GEMGen is a large language model that operates over text-based representations of both transcriptomic phenotypes (sets of up- and down-regulated genes) and molecular structures, casting molecule design as conditional sequence generation. It is trained on large-scale chemical-perturbation transcriptomic data that link compounds to the gene-expression changes they induce, so that it can learn how molecular features translate into phenotypic responses. The authors report zero-shot transfer to genetic-perturbation signatures, a data modality distinct from the chemical perturbations used in training, and an application to a fibrosis disease model, in which the model generates candidate molecules for a target expression state. In a case study, the authors report that GEMGen produced structurally novel inhibitors of KEAP1, a regulator of the NRF2 oxidative-stress pathway. The preprint is released under an all-rights-reserved license, and no public code or model weights accompany it at the time of writing.
GEMGen is intended for drug-discovery researchers pursuing phenotype-driven programs, where the goal is to reverse or induce a transcriptomic state rather than to hit a predefined target. Potential uses include proposing starting chemical matter for diseases characterized primarily by expression signatures (such as fibrosis), exploring molecules that mimic the effect of a genetic perturbation, and generating novel scaffolds against targets implicated by a gene-expression analysis. Because it requires only a phenotype description as input, it can complement target-based design in settings where the mechanism is incompletely understood.
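The "reverse a transcriptomic state" use case can be sketched as a small data-structure exercise: given a disease signature, the reversal target swaps its up- and down-regulated sets. The `Signature` class, `reversal_target`, and the `agreement` score below are hypothetical names introduced for illustration. The score is a simple directional-overlap measure and is not the evaluation metric used in the preprint.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Signature:
    """A transcriptomic phenotype as sets of up- and down-regulated genes."""
    up: FrozenSet[str]
    down: FrozenSet[str]

    def __post_init__(self) -> None:
        # A gene cannot be both up- and down-regulated in one signature.
        overlap = self.up & self.down
        if overlap:
            raise ValueError(f"genes in both sets: {sorted(overlap)}")


def reversal_target(disease: Signature) -> Signature:
    """Desired signature that opposes the disease state: swap up and down."""
    return Signature(up=disease.down, down=disease.up)


def agreement(target: Signature, observed: Signature) -> float:
    """Illustrative score in [-1, 1]: (matches - mismatches) / target size."""
    total = len(target.up) + len(target.down)
    if total == 0:
        return 0.0
    matches = len(target.up & observed.up) + len(target.down & observed.down)
    mismatches = len(target.up & observed.down) + len(target.down & observed.up)
    return (matches - mismatches) / total


# Illustrative gene sets only; not taken from any dataset.
disease = Signature(up=frozenset({"COL1A1", "ACTA2"}), down=frozenset({"NQO1"}))
target = reversal_target(disease)
observed = Signature(up=frozenset({"NQO1"}), down=frozenset({"COL1A1"}))
score = agreement(target, observed)
print(f"agreement with reversal target: {score:.2f}")
```

A score of this kind could be used to compare a desired signature with the measured response to a candidate compound. Genes absent from the observed signature count neither for nor against.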
GEMGen contributes to a shift toward phenotype-centric generative drug design, demonstrating that a language model can bridge transcriptomic signatures and chemical structure and can transfer across chemical and genetic perturbation modalities. The reported KEAP1 inhibitors offer a concrete example of the approach yielding non-obvious chemical matter. As a 2026 preprint, its results await peer review and experimental validation, and the restrictive license, together with the absence of released code or weights, currently limits independent reproduction and adoption.