Beijing Zhongguancun Academy / Mila / Université de Montréal / University of Science and Technology of China / HEC Montréal
A family of autoregressive genomic foundation models that reconcile k-mer tokenization with single-nucleotide resolution at contexts up to 98k bp.
GENERator-v2 is a family of autoregressive genomic language models developed by GenerTeam, a collaboration spanning Beijing Zhongguancun Academy, Mila and Université de Montréal, the University of Science and Technology of China (USTC), and HEC Montréal. Released as a bioRxiv preprint in January 2026, it is the successor to GENERator (v1), a long-context generative genomic foundation model from the same group.
The central problem GENERator-v2 addresses is a long-standing tension in genomic language modeling. Single-nucleotide tokenization gives fine-grained resolution but produces very long sequences that are expensive to model, whereas k-mer tokenization compresses sequences for efficiency but blurs the model's ability to reason at the level of individual bases. GENERator-v2 keeps the efficiency of coarse k-mer tokenization (a 6-mer vocabulary) while recovering true single-nucleotide resolution through a training-time reformulation of the loss, allowing it to score variants and generate sequences base by base despite operating on multi-nucleotide tokens.
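To make the efficiency side of this trade-off concrete, here is a minimal sketch of non-overlapping 6-mer tokenization. It is an illustration under stated assumptions, not the released tokenizer: the lexicographic vocabulary order, the names `VOCAB`, `trim_to_multiple`, and `tokenize`, and the choice to trim trailing bases are all hypothetical.

```python
from itertools import product

K = 6
ALPHABET = "ACGT"

# Hypothetical vocabulary: all 4**6 = 4096 possible 6-mers in lexicographic order.
# The real tokenizer's id assignment and special tokens may differ.
VOCAB = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def trim_to_multiple(seq, k=K):
    """Drop trailing bases so that len(seq) is a multiple of k."""
    return seq[: len(seq) - len(seq) % k]


def tokenize(seq, k=K):
    """Split seq into non-overlapping k-mers and map each one to an integer id."""
    if len(seq) % k != 0:
        raise ValueError(f"sequence length {len(seq)} is not a multiple of {k}")
    return [VOCAB[seq[i : i + k]] for i in range(0, len(seq), k)]


if __name__ == "__main__":
    raw = "ACGTACGTACGTAC"          # 14 bases
    seq = trim_to_multiple(raw)     # 12 bases remain
    print(len(seq), tokenize(seq))  # 12 bases become 2 tokens
```

Each token covers six bases, so a sequence of N bases becomes N/6 tokens. The cost is that a single-base change alters the whole token, which is the loss of resolution described above.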
The release includes domain-specialized variants for eukaryotic and prokaryotic genomes, which together span the tree of life. The models are evaluated in a training-free (zero-shot) setting and after task-specific fine-tuning. On key generative and probabilistic benchmarks GENERator-v2 matches or exceeds Evo2 while being substantially more efficient at inference.
GENERator-v2 uses a LLaMA-style decoder-only transformer with a 6-mer tokenizer, so input lengths must be multiples of six. Four base checkpoints are released on HuggingFace: eukaryote and prokaryote variants at approximately 1.2B and 3B parameters each. Two techniques, abbreviated FNS and GCP, underpin the approach: FNS marginalizes k-mer output logits into nucleotide-level probabilities so the model can be supervised and queried at single-base resolution, while GCP restructures eukaryotic pretraining data to concentrate functional signal. Across generative and probabilistic evaluations, GENERator-v2 consistently improves over the original GENERator and reaches performance comparable to or better than Evo2, at substantially lower inference cost. The HuggingFace model repositories carry MIT-licensed code and substantive model cards with architecture, tokenization, and usage details; the preprint is distributed under CC BY 4.0.
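The idea of marginalizing k-mer logits into nucleotide-level probabilities can be sketched as follows. This is only a plain per-position marginal over a 4096-way 6-mer distribution, written under the assumption of a lexicographically ordered vocabulary; the exact FNS formulation in the preprint may differ (for example in how it conditions on earlier bases within a token). The function names and the log-ratio variant score are hypothetical.

```python
import math
from itertools import product

K = 6
ALPHABET = "ACGT"
# Assumed vocabulary order: all 6-mers in lexicographic order (4**6 = 4096 entries).
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nucleotide_marginals(kmer_logits):
    """Return K dicts; entry j maps each base to P(base at position j of the token)."""
    probs = softmax(kmer_logits)
    marginals = [dict.fromkeys(ALPHABET, 0.0) for _ in range(K)]
    for kmer, p in zip(KMERS, probs):
        # Every 6-mer contributes its probability to the base it carries at each position.
        for j, base in enumerate(kmer):
            marginals[j][base] += p
    return marginals


def variant_log_ratio(kmer_logits, position, ref, alt):
    """Hypothetical variant score: log P(alt) - log P(ref) at one position in the token."""
    m = nucleotide_marginals(kmer_logits)[position]
    return math.log(m[alt]) - math.log(m[ref])


if __name__ == "__main__":
    # Toy logits that strongly favor the 6-mer "ACGTAC".
    logits = [0.0] * len(KMERS)
    logits[KMERS.index("ACGTAC")] = 20.0
    print(nucleotide_marginals(logits)[2])
    print(variant_log_ratio(logits, position=2, ref="G", alt="T"))
```

With toy logits that favor "ACGTAC", the marginal at position 2 concentrates on G, and the log-ratio for a G-to-T substitution is strongly negative. With real model outputs, `kmer_logits` would be the 4096 vocabulary logits for the token that covers the variant.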
GENERator-v2 supports genomic researchers working on variant effect prediction, regulatory and gene-centric sequence analysis, and de novo genomic sequence generation across both eukaryotic and prokaryotic systems. Its training-free in-context learning makes it usable for functional prediction tasks without labeled fine-tuning data, while its long context and efficient inference suit whole-locus and multi-gene analyses. Fine-tuned variants extend it to specialized downstream genomics benchmarks.
GENERator-v2 advances genomic language modeling by demonstrating that coarse k-mer tokenization and single-nucleotide resolution are not mutually exclusive, offering an efficiency-resolution trade-off competitive with single-nucleotide models such as Evo2 at lower inference cost. By open-sourcing four eukaryote and prokaryote checkpoints with documented model cards under a permissive code license, the GenerTeam release lowers the barrier for the community to apply and extend long-context genomic foundation models across domains of life.
Li, Q., et al. (2026) GENERator-v2: Reconciling Coarse Tokenization with Single-Nucleotide Resolution in Genomic Language Modeling. bioRxiv.
DOI: 10.64898/2026.01.27.702015