GENERator-v2

Beijing Zhongguancun Academy / Mila / Université de Montréal / University of Science and Technology of China / HEC Montréal

Family of autoregressive genomic foundation models that reconcile k-mer tokenization with single-nucleotide resolution at contexts up to 98k bp.

Released: January 2026

GENERator-v2 is a family of autoregressive genomic language models developed by GenerTeam, a collaboration spanning Beijing Zhongguancun Academy, Mila and Université de Montréal, the University of Science and Technology of China (USTC), and HEC Montréal. Released as a bioRxiv preprint in January 2026, it is the successor to GENERator (v1), a long-context generative genomic foundation model from the same group.

The central problem GENERator-v2 addresses is a long-standing tension in genomic language modeling. Single-nucleotide tokenization gives fine-grained resolution but produces very long sequences that are expensive to model, whereas k-mer tokenization compresses sequences for efficiency but blurs the model's ability to reason at the level of individual bases. GENERator-v2 keeps the efficiency of coarse k-mer tokenization (a 6-mer vocabulary) while recovering true single-nucleotide resolution through a training-time reformulation of the loss, allowing it to score variants and generate sequences base by base despite operating on multi-nucleotide tokens.
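The efficiency side of this trade-off can be made concrete with a minimal sketch of non-overlapping 6-mer tokenization. The vocabulary below (all 4^6 combinations of A/C/G/T in lexicographic order, with no special tokens) and the function name `tokenize_kmers` are assumptions for illustration, not the released tokenizer. The last line treats "98k" as 98,000 bp purely for arithmetic.

```python
from itertools import product

K = 6
NUCLEOTIDES = "ACGT"

# Hypothetical vocabulary: every 6-mer over A/C/G/T, in lexicographic order.
# The released tokenizer may order ids differently and add special tokens.
VOCAB = {"".join(p): i for i, p in enumerate(product(NUCLEOTIDES, repeat=K))}


def tokenize_kmers(seq: str, k: int = K) -> list[int]:
    """Split seq into non-overlapping k-mers and map each one to an id."""
    if len(seq) % k != 0:
        raise ValueError(f"sequence length {len(seq)} is not a multiple of {k}")
    return [VOCAB[seq[i:i + k]] for i in range(0, len(seq), k)]


seq = "ACGTAC" * 4
ids = tokenize_kmers(seq)

print(len(VOCAB))          # 4096 possible 6-mers
print(len(seq), len(ids))  # 24 4  (six times fewer tokens than bases)
print(98_000 // K)         # 16333 tokens for a 98,000 bp context
```

Each token stands for six bases at once, which is why a k-mer model needs some extra mechanism to answer questions about a single base.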

The release includes domain-specialized variants for both eukaryotic and prokaryotic genomes, spanning the tree of life, and is evaluated both in a training-free (zero-shot) setting and after task-specific fine-tuning. On key generative and probabilistic benchmarks it matches or exceeds Evo2 while being substantially more efficient at inference.

Key Features

Factorized Nucleotide Supervision (FNS): Replaces discrete k-mer supervision with per-nucleotide likelihoods derived from k-mer logits via probability marginalization, recovering single-nucleotide resolution without abandoning efficient k-mer tokenization.
Genome Compression Pretraining (GCP): A data-construction strategy that concatenates gene-centric and regulatory regions while discarding large stretches of low-information background, applied selectively to the eukaryotic models to counter the sparsity of functional signal in eukaryotic genomes.
Tokenization-shift augmentation: Cycles through all k possible token-boundary offsets during training, reducing the model's sensitivity to arbitrary k-mer frame alignment.
Long-context modeling: Supports genomic contexts up to roughly 98k base pairs, with eukaryote- and prokaryote-specialized checkpoints.
Training-free and fine-tuned use: Performs functional in-context learning zero-shot, and can be fine-tuned for domain-specific downstream tasks.
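Tokenization-shift augmentation, as described in the list above, can be sketched as producing one view of a sequence per token-boundary offset. This is an illustrative sketch only: the function name `shifted_views` and the choice to trim each view to a multiple of k are assumptions, not details taken from the release.

```python
def shifted_views(seq: str, k: int = 6) -> list[str]:
    """Return k views of seq, one per token-boundary offset 0..k-1.

    Each view is trimmed on the right so that its length is a multiple
    of k and it can be split into whole k-mers.
    """
    views = []
    for offset in range(k):
        usable = (len(seq) - offset) // k * k  # largest multiple of k that fits
        views.append(seq[offset:offset + usable])
    return views


seq = "ACGTACGTACGTAC"  # 14 bases
for offset, view in enumerate(shifted_views(seq)):
    print(offset, len(view), view)
```

A training loop could cycle through these views (for example, picking offset `step % k`), so that the same underlying bases are seen under every k-mer framing.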

Technical Details

GENERator-v2 uses a LLaMA-style decoder-only transformer with a 6-mer tokenizer, so input lengths in nucleotides must be multiples of six. Four base checkpoints are released on HuggingFace: eukaryote and prokaryote variants, each at approximately 1.2B and 3B parameters. FNS marginalizes k-mer output logits into nucleotide-level probabilities so the model can be supervised and queried at single-base resolution, while GCP restructures eukaryotic pretraining data to concentrate functional signal. Across generative and probabilistic evaluations, GENERator-v2 consistently improves over the original GENERator and performs comparably to or better than Evo2 at substantially lower inference cost. The HuggingFace model repositories carry MIT-licensed code and detailed model cards covering architecture, tokenization, and usage; the preprint is distributed under CC BY 4.0.
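The marginalization idea behind FNS can be illustrated with a small pure-Python sketch: a softmax over all 4^k k-mer logits is summed into one A/C/G/T distribution per position within the k-mer. The loss shown (a sum of per-position negative log-likelihoods), the lexicographic k-mer ordering, and the names `marginalize` and `factorized_nll` are assumptions for illustration; the preprint's exact formulation may differ.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginalize(kmer_logits, k):
    """Turn 4**k k-mer logits into k per-position nucleotide distributions."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    if len(kmer_logits) != len(kmers):
        raise ValueError(f"expected {len(kmers)} logits, got {len(kmer_logits)}")
    probs = softmax(kmer_logits)
    marginals = [dict.fromkeys(NUCLEOTIDES, 0.0) for _ in range(k)]
    for kmer, p in zip(kmers, probs):
        for pos, base in enumerate(kmer):
            # Add this k-mer's probability to the base it carries at pos.
            marginals[pos][base] += p
    return marginals


def factorized_nll(kmer_logits, target_kmer):
    """Assumed loss: sum of per-position negative log-likelihoods."""
    marginals = marginalize(kmer_logits, len(target_kmer))
    return -sum(math.log(marginals[pos][base])
                for pos, base in enumerate(target_kmer))


# Toy example with k=2 (16 logits) to keep the numbers readable.
uniform = [0.0] * 16
print(round(factorized_nll(uniform, "AC"), 4))  # 2.7726, i.e. 2 * ln(4)
```

The same functions work for k=6 with 4,096 logits; a practical implementation would vectorize the sums rather than loop in Python.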

Applications

GENERator-v2 supports genomic researchers working on variant effect prediction, regulatory and gene-centric sequence analysis, and de novo genomic sequence generation across both eukaryotic and prokaryotic systems. Its training-free in-context learning makes it usable for functional prediction tasks without labeled fine-tuning data, while its long context and efficient inference suit whole-locus and multi-gene analyses. Fine-tuned variants extend it to specialized downstream genomics benchmarks.
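For variant effect prediction, one common zero-shot convention is a log-likelihood ratio between the alternate and reference base at the variant position. The sketch below assumes a per-position nucleotide distribution is already available from the model; the function name `variant_llr` and the probabilities are made up for illustration, and this scoring rule is not confirmed to be the one used in the preprint.

```python
import math


def variant_llr(base_probs: dict[str, float], ref: str, alt: str) -> float:
    """Log-likelihood ratio log P(alt) - log P(ref) at a single position.

    Negative values mean the model finds the alternate base less likely
    than the reference base in this context.
    """
    for base in (ref, alt):
        if base not in base_probs:
            raise KeyError(f"no probability for base {base!r}")
    return math.log(base_probs[alt]) - math.log(base_probs[ref])


# Made-up distribution at one position, for illustration only.
probs = {"A": 0.70, "C": 0.10, "G": 0.15, "T": 0.05}
score = variant_llr(probs, ref="A", alt="G")
print(score < 0)  # True: G is less likely than the reference A here
```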

Impact

GENERator-v2 advances genomic language modeling by demonstrating that coarse k-mer tokenization and single-nucleotide resolution are not mutually exclusive: it is competitive with single-nucleotide models such as Evo2 while costing less at inference. By open-sourcing four eukaryote and prokaryote checkpoints with documented model cards under a permissive code license, GenerTeam lowers the barrier for the community to apply and extend long-context genomic foundation models across domains of life.

Citation

GENERator-v2: Reconciling Coarse Tokenization with Single-Nucleotide Resolution in Genomic Language Modeling

Li, Q., et al. (2026) GENERator-v2: Reconciling Coarse Tokenization with Single-Nucleotide Resolution in Genomic Language Modeling. bioRxiv.

DOI: 10.64898/2026.01.27.702015

Recent citations

Papers that recently cited this model.

Carbon: Decoding the Language of Life
Loubna Ben Allal, Qiuyi Li, Maurizio Fiusco, et al.
bioRxiv · May 2026
Influential citations: 0

Citations

Total citations: 1
Influential: 0
References: 44

GitHub

Stars: 460
Forks: 75
Open issues: 0
Contributors: 4
Last push: 13 days ago
Language: Python
License: MIT

Fields of citing research

Share of papers citing this model, by field.

Biology: 100%
Computer Science: 100%

Openness

bio.rodeo openness: Fully open (usable and reproducible)
Overall score: 86 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 70
Model Openness Framework: Class II (Open Tooling)

Resources

GitHub Repository
Research Paper
HuggingFace Model
