DNA & Gene

GenSLM

Argonne National Laboratory

Genome-scale language model trained on prokaryotic gene sequences and SARS-CoV-2 genomes to reveal viral evolutionary dynamics and identify emerging variants of concern.

Released: 2022

Overview

GenSLM (Genome-Scale Language Models) is a family of large language models developed by researchers at Argonne National Laboratory and collaborating institutions to learn the evolutionary and functional grammar of biological genomes at scale. Published as a preprint in October 2022 and subsequently in The International Journal of High Performance Computing Applications, GenSLM won the 2022 ACM Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research, one of the most prestigious awards in supercomputing. The prize recognized its demonstration that large language models trained on viral genome sequences can recapitulate and anticipate viral evolutionary dynamics, including the emergence of SARS-CoV-2 variants of concern such as Delta and Omicron.

The core scientific motivation for GenSLM was to determine whether the same transformer-based language modeling approach that enabled GPT and BERT to learn human language could learn the "language" of genomes: the combinatorial rules by which nucleotide and codon sequences encode biological function and evolutionary fitness. Rather than working at the protein sequence level (as in ESM) or on specific functional predictions (as in Enformer), GenSLM operates directly on full-genome sequences, treating each genome as a "sentence" whose statistical regularities reflect the evolutionary pressures that shaped it. This approach enables the model to learn population-level evolutionary patterns across thousands of viral genomes without explicit annotations.

The model architecture and training represent a landmark in the application of large-scale supercomputing to biological sequence modeling. Pre-training used 110 million prokaryotic gene sequences from the Joint Genome Institute IMG database to learn general genome-scale representations, followed by fine-tuning on 1.5 million complete SARS-CoV-2 genome sequences deposited in GISAID during the first year of the COVID-19 pandemic. The resulting model, when evaluated retrospectively, was able to identify the precursors of the Delta and Omicron variants in its latent space before those variants achieved global dominance — suggesting the model learned something meaningful about the fitness landscape of the virus.
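
To make the two-stage recipe concrete, here is a minimal PyTorch sketch of the fine-tuning stage, in which a causal language model pretrained on prokaryotic genes is adapted to tokenized SARS-CoV-2 genomes. The single-device loop and the `model`/`sarscov2_ids` placeholders are illustrative assumptions; the actual GenSLM training ran as large distributed jobs on supercomputing hardware.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def finetune(model, sarscov2_ids, epochs=1, lr=1e-5, batch_size=8):
    """Fine-tune a pretrained causal LM on tokenized viral genomes.

    Assumptions (placeholders, not GenSLM's actual API):
      - `model(ids)` returns next-token logits of shape (B, L, vocab)
      - `sarscov2_ids` is a LongTensor of genomes padded to fixed length,
        with pad token ID 0
    """
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(sarscov2_ids),
                        batch_size=batch_size, shuffle=True)
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)  # assume ID 0 = <pad>
    model.train()
    for _ in range(epochs):
        for (batch,) in loader:
            logits = model(batch[:, :-1])  # predict each next token
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           batch[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```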

Key Features

  • Gordon Bell Prize-winning scale: Trained using the Argonne Leadership Computing Facility's Polaris supercomputer, NVIDIA's Selene system, and Cerebras CS-2 wafer-scale chips — enabling training at unprecedented speed and scale for biological sequence models.
  • Hierarchical pretraining strategy: Pre-trained on 110 million prokaryotic gene sequences to learn general genomic representations, then fine-tuned on 1.5 million SARS-CoV-2 complete genomes, enabling rapid adaptation to viral evolutionary modeling with substantially less target-domain data.
  • Variant of concern identification: Latent space representations of viral genomes cluster by variant lineage, and the model's attention patterns and embeddings anticipate the fitness advantage of emerging lineages, providing a potential early-warning system for pandemic preparedness.
  • Codon-level tokenization: Genomes are tokenized at the codon level (triplets of nucleotides) rather than individual bases, reflecting the natural unit of protein-coding information and enabling the model to learn reading-frame-aware representations of coding sequences (a minimal tokenization sketch follows this list).
  • Scalable to multiple pathogen families: While demonstrated on SARS-CoV-2, the general architecture and training approach are applicable to influenza, HIV, and other RNA viruses whose evolutionary dynamics are similarly driven by selection on genomic sequence.
  • Interpretable evolutionary representations: UMAP visualizations of genome embeddings show clear clustering by variant lineage and temporal ordering that mirrors known evolutionary history, providing qualitative validation of the learned representations.
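
As a concrete illustration of codon-level tokenization, the sketch below splits an in-frame coding sequence into codon tokens over a 64-codon vocabulary. The special tokens and ID layout here are assumptions for illustration, not GenSLM's published vocabulary.

```python
from itertools import product

# Toy codon vocabulary: 64 codons plus a few special tokens.
# The special-token names and ID layout are illustrative, not GenSLM's.
SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 codons
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + CODONS)}

def tokenize_codons(seq: str) -> list[int]:
    """Split a coding sequence into in-frame codons and map to IDs.

    Trailing bases that do not complete a codon are dropped, and any
    codon containing an ambiguous base (e.g. 'N') maps to <unk>.
    """
    seq = seq.upper()
    ids = [VOCAB["<bos>"]]
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i : i + 3]
        ids.append(VOCAB.get(codon, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids

print(tokenize_codons("ATGGCGTTTNAA"))  # ATG, GCG, TTT, then NAA -> <unk>
```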

Technical Details

GenSLM builds on transformer language model architectures, leveraging GPT-style autoregressive and BERT-style masked language modeling designs scaled to genome-length sequences. The models were trained at multiple parameter scales, from tens of millions to billions of parameters, to characterize scaling behavior for genomic sequence data. Genome sequences were tokenized at the codon level (64 possible codons plus special tokens) for coding regions, and at the nucleotide level for regulatory and intergenic regions, with sequence lengths accommodating full-length viral genomes of approximately 30,000 nucleotides for SARS-CoV-2.

Pre-training used a language modeling objective on gene-level sequences from the IMG/M metagenome database, comprising approximately 110 million gene sequences spanning diverse bacterial and archaeal lineages. Fine-tuning used complete SARS-CoV-2 genome sequences from GISAID, filtered for high-quality assemblies.

The training infrastructure spanned multiple leadership-class supercomputing systems: ALCF's Polaris supercomputer (an NVIDIA A100 GPU-based system), NVIDIA's Selene cluster, and Cerebras CS-2 systems in the ALCF AI Testbed. The CS-2's wafer-scale architecture removes inter-chip memory bandwidth bottlenecks, enabling training throughput not achievable on conventional GPU clusters. In held-out validation, latent representations derived from genomes sequenced in 2020 showed prospective separation of the precursors of the Delta (B.1.617.2) and Omicron (B.1.1.529) lineages, a property not achievable by sequence alignment alone.
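
The qualitative analysis described above, embedding genomes and inspecting lineage structure in a low-dimensional projection, might be sketched as follows. The mean-pooling step and the `model.hidden_states` accessor are placeholders we assume for illustration; umap-learn provides the 2-D projection.

```python
import numpy as np
import torch
import umap  # pip install umap-learn

@torch.no_grad()
def embed_genomes(model, token_batches):
    """Mean-pool final hidden states into one vector per genome.

    Assumes `model.hidden_states(ids)` returns (batch, seq_len, dim);
    that method name is a placeholder for whatever your model exposes.
    """
    vecs = []
    for ids in token_batches:
        h = model.hidden_states(ids)           # (B, L, D), placeholder API
        vecs.append(h.mean(dim=1).cpu().numpy())
    return np.concatenate(vecs)

def project_2d(embeddings):
    """Project genome embeddings to 2-D for lineage-cluster inspection."""
    return umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
```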

Applications

GenSLM's primary demonstrated application is pandemic surveillance and variant monitoring for SARS-CoV-2 and, by extension, other rapidly evolving RNA viruses. By encoding viral genomes as dense vectors in a learned latent space, the model provides a compact representation of genomic diversity that can be used to cluster emerging lineages, track evolutionary trajectories, and potentially flag sequences with unusual fitness signatures before they achieve global spread. This capability is directly relevant to pandemic preparedness: traditional phylogenetic methods require a substantial number of sampled sequences before a new lineage can be established, whereas model-based approaches can potentially identify anomalous sequences earlier in the evolutionary process. Beyond COVID-19 surveillance, GenSLM demonstrates the feasibility of training large genome-scale language models at supercomputing facilities, establishing protocols for distributed training of biological sequence models that are now being applied to bacterial, archaeal, and eukaryotic genomes. The codon-level tokenization and genome-scale modeling paradigm established by GenSLM also informs the design of subsequent foundation models for microbial genomics.
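
One hedged way to operationalize flagging sequences with unusual signatures is to score each genome's average per-token negative log-likelihood under the model and mark cohort outliers. This z-score scheme is our illustration of the idea, not a protocol from the GenSLM paper.

```python
import torch

@torch.no_grad()
def neg_log_likelihood(model, ids):
    """Average per-token negative log-likelihood of one tokenized genome.

    Assumes `model(ids)` returns next-token logits of shape (1, L, vocab);
    `ids` is a LongTensor of shape (1, L).
    """
    logits = model(ids[:, :-1])
    logp = torch.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return -tok_logp.mean().item()

def flag_outliers(model, genomes, z_thresh=3.0):
    """Flag genomes whose NLL deviates strongly from the cohort mean."""
    scores = torch.tensor([neg_log_likelihood(model, g) for g in genomes])
    z = (scores - scores.mean()) / scores.std()
    return [i for i, zi in enumerate(z.tolist()) if abs(zi) > z_thresh]
```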

Impact

GenSLM's receipt of the 2022 Gordon Bell Special Prize brought significant attention to the intersection of high-performance computing and biological sequence modeling, demonstrating to the broader supercomputing community that large language model training infrastructure built for NLP can be directly repurposed for genomic applications with scientifically meaningful results. The publication established a rigorous retrospective validation framework for evaluating genomic language model predictions (comparing model-learned representations to known variant emergence timelines) that has been adopted in subsequent work on evolutionary sequence modeling. The collaboration between Argonne National Laboratory, NVIDIA, Cerebras Systems, and multiple academic institutions provided a template for public-private partnerships in AI for science that has influenced subsequent DOE-funded genomic AI initiatives. Key limitations include the model's focus on short RNA viral genomes (SARS-CoV-2 at ~30 kb), which are substantially smaller and more tractable than eukaryotic chromosomes, and the current inability to attribute variant fitness predictions to specific sequence features in a mechanistically interpretable way.

Tags

variant effect prediction, gene expression, transformer, foundation model, language model, genomics, metagenomics

Resources

GitHub Repository
Research Paper