Genomic language model trained on metagenomic scaffolds that learns protein co-regulation and function by modeling gene context and operon structure.
gLM (genomic Language Model) is a transformer-based model trained on millions of unlabeled metagenomic scaffolds to learn the functional and regulatory relationships encoded in the arrangement of genes across microbial genomes. Rather than modeling individual proteins in isolation, gLM treats a stretch of genomic sequence as an ordered sentence of genes and learns to predict masked members of that sentence from their neighbors — capturing the contextual logic that governs how genes are co-expressed and co-regulated in nature.
The core insight motivating gLM is that prokaryotic genomes are not random collections of genes: functionally related genes cluster into operons, are subject to shared regulatory control, and exhibit non-random co-occurrence patterns across diverse microbial lineages. By training on the largest and most ecologically diverse genomic corpus available — microbial metagenomes drawn from the ocean, soil, and human gut — gLM is exposed to an enormous range of genomic contexts that collectively encode the regulatory syntax of microbial life.
The model was developed by Yunha Hwang, Andre L. Cornman, Elizabeth H. Kellogg, Sergey Ovchinnikov (MIT), and Peter R. Girguis (Harvard University), and was published in Nature Communications in April 2024. It represents one of the first large-scale attempts to apply masked language modeling at the level of whole genomic neighborhoods rather than individual protein sequences.
gLM is a transformer architecture trained under a masked language modeling (MLM) objective analogous to BERT, but operating at the gene rather than the nucleotide or amino acid level. Each token in the input sequence represents a single gene, embedded via a pre-trained protein language model (pLM). The model learns to reconstruct masked gene embeddings from the surrounding genomic context window. Training data consists of millions of metagenomic scaffolds drawn from publicly available environmental sequencing datasets, providing a highly diverse corpus of prokaryotic genome organization.
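The gene-level MLM setup can be sketched in a few lines. This is a minimal illustration, not the published architecture: the single untrained attention layer, the zero-vector mask token, and the mask fraction are all illustrative assumptions; only the token-per-gene framing and the masked-reconstruction objective come from the description above.

```python
# Minimal sketch of gLM-style gene-level masked language modeling.
# Illustrative assumptions: one random attention layer stands in for the
# full transformer, and zeros stand in for the mask token.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 1280        # per-gene pLM embedding dimensionality
N_GENES = 15          # genes on one metagenomic scaffold
MASK_FRACTION = 0.15  # fraction of gene tokens hidden during training

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """One attention pass: every gene attends to its scaffold neighbors."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[1]))
    return weights @ v, weights

# 1. A scaffold is an ordered sequence of per-gene pLM embeddings.
scaffold = rng.normal(size=(N_GENES, EMB_DIM))

# 2. Mask a random subset of gene tokens, as in MLM training.
mask_idx = rng.choice(N_GENES, size=max(1, int(N_GENES * MASK_FRACTION)),
                      replace=False)
masked = scaffold.copy()
masked[mask_idx] = 0.0  # mask token (zeros here for simplicity)

# 3. A random, untrained attention layer produces contextualized outputs;
#    training would push outputs at masked positions toward the true embeddings.
w_q, w_k, w_v = (rng.normal(size=(EMB_DIM, EMB_DIM), scale=EMB_DIM ** -0.5)
                 for _ in range(3))
contextual, attn = self_attention(masked, w_q, w_k, w_v)

# 4. The MLM loss compares reconstructed and true embeddings at masked positions.
loss = np.mean((contextual[mask_idx] - scaffold[mask_idx]) ** 2)
print(contextual.shape)
```

The key design point this illustrates is that the vocabulary is continuous: tokens are pLM embedding vectors rather than discrete IDs, so the objective is regression toward the masked embedding rather than a softmax over a fixed token set.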
At inference, gLM produces 1,280-dimensional contextualized protein embeddings for each gene in a scaffold. These embeddings can be directly used as input features for downstream classifiers or analyzed via attention weight extraction to identify co-regulated gene modules. The reliance on pLM embeddings as input rather than raw sequences means gLM builds on prior sequence-level representations and focuses its learned capacity on the inter-gene relationship structure. No explicit regulatory annotation is used during training; all learned regulatory structure emerges from the statistical patterns present in the metagenomic corpus.
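One way the attention-weight extraction mentioned above could feed downstream analysis is to treat strong symmetrized attention between genes as edges and read off connected components as candidate co-regulated modules. The attention matrix, the symmetrization step, and the 0.2 threshold below are illustrative assumptions, not values or procedures from the paper.

```python
# Sketch: turning a per-scaffold attention map into candidate
# co-regulated gene modules. Threshold and synthetic data are
# illustrative assumptions, not the published gLM pipeline.
import numpy as np

def modules_from_attention(attn, threshold=0.2):
    """Group genes whose symmetrized attention exceeds a threshold
    into connected components (candidate co-regulated modules)."""
    n = attn.shape[0]
    sym = (attn + attn.T) / 2.0      # attention is directed; symmetrize it
    adj = sym >= threshold           # boolean adjacency between genes
    seen, modules = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:                 # depth-first search over the graph
            g = stack.pop()
            if g in seen:
                continue
            seen.add(g)
            comp.append(g)
            stack.extend(j for j in range(n) if adj[g, j] and j not in seen)
        modules.append(sorted(comp))
    return modules

# Synthetic attention: genes 0-2 attend strongly to each other (an
# operon-like block), genes 3-4 form a second block, gene 5 is isolated.
attn = np.full((6, 6), 0.02)
attn[:3, :3] = 0.3
attn[3:5, 3:5] = 0.4
np.fill_diagonal(attn, 0.0)

print(modules_from_attention(attn))  # → [[0, 1, 2], [3, 4], [5]]
```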
gLM is well suited for researchers working in microbial genomics, metagenomics, and microbial ecology who need to infer gene function or regulatory relationships from sequencing data alone, without relying on curated databases or cultured reference genomes. Specific applications include automated operon boundary prediction in novel metagenome-assembled genomes, enzyme function annotation for genes from understudied organisms, and contig-level taxonomic classification in mixed-community samples. The model is also applicable to paralog disambiguation, that is, distinguishing functionally divergent copies of related genes by their genomic neighbors rather than by sequence similarity alone. Researchers studying horizontal gene transfer or biosynthetic gene clusters may also benefit from gLM's ability to detect co-regulation signals embedded in genomic arrangement.
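As a concrete sense of how contextualized embeddings might support operon boundary calling, one simple heuristic is to flag a boundary wherever adjacent genes' contextual embeddings are dissimilar. This heuristic, the cosine measure, and the 0.5 threshold are hypothetical illustrations, not the method used in the gLM paper.

```python
# Hypothetical heuristic: call an operon boundary between adjacent genes
# whose contextualized embeddings have low cosine similarity. Threshold
# and synthetic embeddings are illustrative assumptions only.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def candidate_boundaries(embeddings, min_similarity=0.5):
    """Return indices i where a boundary is called between gene i and i+1."""
    return [i for i in range(len(embeddings) - 1)
            if cosine(embeddings[i], embeddings[i + 1]) < min_similarity]

# Synthetic scaffold: genes 0-2 point in one embedding direction
# (one putative operon), genes 3-4 in the opposite direction.
rng = np.random.default_rng(1)
base_a, base_b = np.ones(8), -np.ones(8)
emb = np.stack([base_a + rng.normal(scale=0.1, size=8) for _ in range(3)] +
               [base_b + rng.normal(scale=0.1, size=8) for _ in range(2)])
print(candidate_boundaries(emb))  # → [2]
```

In practice a real pipeline would calibrate the threshold against known operons and combine the similarity signal with strand and intergenic-distance features.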
gLM demonstrates that the regulatory syntax of prokaryotic genomes can be learned in an entirely unsupervised manner from raw metagenomic data, without any curated operon annotations or functional labels during training. This establishes a clear path toward genomic foundation models that operate at the scale of gene neighborhoods rather than individual sequences, complementing protein language models like ESM-2 that focus on single-protein representations. The paper's attention analysis showing emergent operon recovery is a notable validation that biologically meaningful structure is encoded in the model's learned representations. A current limitation is that gLM was developed for prokaryotic genomes with their compact, polycistronic organization; direct extension to eukaryotic genomes, which lack operons and have far more complex regulatory architectures, would require substantial adaptation. The publicly available model checkpoint and training code on GitHub lower the barrier for adoption by the microbial genomics community.
Hwang, Y., Cornman, A. L., Kellogg, E. H., Ovchinnikov, S. & Girguis, P. R. (2024) Genomic language model predicts protein co-regulation and function. Nature Communications.
DOI: 10.1038/s41467-024-46947-9