vir2vec

Pan-viral genomic language model producing fixed genome-level embeddings of viral DNA and RNA, reused across classification tasks without retraining.

Released: December 2025

Parameters: 422 Million

vir2vec is a pan-viral genomic language model that produces fixed-length, genome-level embeddings of viral DNA and RNA sequences, designed so that a single frozen representation can be reused across many downstream classification tasks without task-specific re-training. It was developed by a team led by Simone Marini at the University of Florida (with collaborators at the University of Pavia and other institutions) and released as a bioRxiv preprint in December 2025. The model targets a persistent gap in viral genomics: most foundation models are trained on human or general microbial sequence and underperform on the extreme diversity, short genomes, and rapid evolution that characterize viruses.

The core idea is continual pretraining. Rather than training from scratch, the authors take Mistral-DNA, a decoder-only genomic language model, and continue pretraining it on a large, curated pan-viral corpus, yielding a model attuned to viral sequence statistics across many families. The resulting embeddings are then fed to simple downstream classifiers (logistic regression, SVM, random forest), making the representation immediately usable by groups without large compute budgets.
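To make the "compute once, reuse everywhere" workflow concrete, here is a minimal sketch in pure Python. It is not the authors' pipeline: a nearest-centroid rule stands in for the logistic regression, SVM, or random forest classifiers named above, the 3-dimensional vectors are made-up stand-ins for real genome embeddings, and every function and variable name is hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(embeddings, labels):
    """Average the embeddings of each class into one centroid per label."""
    groups = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        groups[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vector))


# Toy stand-ins for frozen genome embeddings (assumed values, not model output).
embeddings = [
    [0.9, 0.1, 0.9],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.8],
    [0.2, 0.8, 0.2],
]

# Two hypothetical tasks reuse the same precomputed vectors with different labels.
genome_type_labels = ["DNA", "DNA", "RNA", "RNA"]
host_labels = ["human", "avian", "human", "avian"]

type_model = fit_centroids(embeddings, genome_type_labels)
host_model = fit_centroids(embeddings, host_labels)

# A new genome is embedded once, then scored by both lightweight classifiers.
query = [0.85, 0.15, 0.2]
print(predict(type_model, query), predict(host_model, query))
```

The point of the design is that only the small per-task classifier is fitted; the embedding table is never recomputed when a new task is added.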

Alongside the model, the authors introduce vGUE (Viral Genome Understanding Evaluation), a standardized benchmark for viral representation learning spanning eight heterogeneous tasks, addressing the field's lack of a common yardstick for comparing viral genome embeddings.

Key Features

Reusable frozen embeddings: The model exposes 4,096-dimensional genome-level vectors (via max-pooling over token embeddings) that are computed once and reused across tasks, so downstream classifiers train on precomputed features rather than fine-tuning the backbone.
Continual pretraining strategy: vir2vec adapts the general-purpose Mistral-DNA model to viral sequence space rather than training from scratch, transferring genomic priors while specializing on viral diversity.
Broad viral coverage: Training spans 565,747 complete genomes across 295 viral species drawn from NCBI Virus, BV-BRC, GISAID, LANL HIV database, and HBVdb, covering both DNA and RNA viruses.
vGUE benchmark: A unified evaluation suite of eight tasks, from organism-level discrimination to fine-grained SARS-CoV-2 lineage typing and HIV-1 tropism prediction, enabling apples-to-apples comparison of viral embeddings.
Strong relative performance: vir2vec attains the highest balanced accuracy on seven of eight vGUE tasks, outperforming both a human-trained genomic model and a virus-specific baseline.
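The max-pooling step mentioned in the first feature can be illustrated with a short sketch. It assumes the per-token hidden states are already available as a list of equal-length vectors; which layer they come from and how padding is handled are not stated here, so neither is modeled. The function name and toy values are hypothetical.

```python
def max_pool(token_embeddings):
    """Collapse a (tokens x hidden_dim) matrix into one hidden_dim vector
    by taking the per-dimension maximum across all tokens."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    return [max(column) for column in zip(*token_embeddings)]


# Three tokens with a toy hidden size of 4 (the model's is 4,096).
token_embeddings = [
    [0.1, -0.5, 0.3, 0.0],
    [0.7, -0.2, -0.1, 0.4],
    [-0.3, 0.6, 0.2, 0.1],
]

genome_embedding = max_pool(token_embeddings)
print(genome_embedding)  # [0.7, 0.6, 0.3, 0.4]
```

Because the maximum is taken per dimension, the output length depends only on the hidden size, not on the number of tokens, which is what makes the embedding fixed-length regardless of genome size.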

Technical Details

vir2vec is a 422-million-parameter, decoder-only transformer built on the Mistral/Mixtral architecture, using mixture-of-experts feed-forward layers, grouped-query attention, and sliding-window attention for efficient long-sequence processing. Sequences are tokenized with a byte-pair encoding scheme adapted to DNA (following the DNABERT-2 tokenizer). The autoregressive decoder is used as a frozen encoder by extracting hidden representations and max-pooling them into a single 4,096-dimensional genome embedding.

The training corpus (565,747 genomes, 295 species) was quality-filtered to under 1% ambiguous bases with no runs exceeding 20 consecutive Ns, split 70/30 at the species level, with SARS-CoV-2 and Alphainfluenzavirus down-sampled to 100,000 genomes each to limit over-representation.

On the eight vGUE tasks, vir2vec reaches balanced accuracies including 0.98 (virus vs non-virus genomes), 0.97 (virus vs human reads), 0.96 (DNA vs RNA), 0.84 (host prediction), and 1.00 (HIV-1 vs HIV-2), beating the Mistral-DNA-138M human baseline and a ModernBERT-DNA-37M virus-specific baseline on seven of eight tasks with statistically significant margins after Holm correction.
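The corpus quality filter described above (under 1% ambiguous bases, no run of more than 20 consecutive Ns) could be checked roughly as follows. This is a sketch under stated assumptions, not the authors' code: "ambiguous" is taken here to mean any character other than A, C, G, T, or U, and the function and parameter names are hypothetical.

```python
import re


def passes_quality_filter(sequence, max_ambiguous_fraction=0.01, max_n_run=20):
    """Return True if a genome meets both thresholds.

    Assumption: any symbol outside A/C/G/T/U counts as ambiguous.
    """
    seq = sequence.upper()
    if not seq:
        return False

    # Fraction of ambiguous symbols must be strictly under the threshold.
    ambiguous = sum(1 for base in seq if base not in "ACGTU")
    if ambiguous / len(seq) >= max_ambiguous_fraction:
        return False

    # The longest stretch of consecutive Ns must not exceed max_n_run.
    longest_n_run = max((len(run) for run in re.findall(r"N+", seq)), default=0)
    return longest_n_run <= max_n_run


clean = "ACGT" * 1000                # 4,000 unambiguous bases
print(passes_quality_filter(clean))                      # True
print(passes_quality_filter(clean + "N" * 21))           # False: run of 21 Ns
print(passes_quality_filter("ACGT" * 10 + "N"))          # False: about 2.4% ambiguous
```

Note that the two checks are independent: a long genome can stay under the 1% budget and still fail on a single run of 21 Ns, as the second call shows.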

Applications

vir2vec supports a range of viral genomics workflows where a single embedding can power many classifiers: metagenomic virus identification, separating viral reads from human or bacterial background, distinguishing DNA from RNA viruses, predicting host range, differentiating closely related species (e.g., HIV-1 vs HIV-2), typing SARS-CoV-2 lineages, and detecting phenotypic signals such as HIV-1 brain-versus-plasma tropism. Because downstream tasks use lightweight classifiers over precomputed embeddings, surveillance, clinical virology, and evolutionary research groups can adapt the model with modest compute. Model weights are gated, requiring an institutional email, a description of intended use, and an associated IRB protocol number.

Impact

vir2vec contributes both a viral-specialized foundation model and vGUE, a standardized benchmark that the viral genomics field has lacked, giving future viral representation-learning methods a common evaluation framework. By showing that continual pretraining on a curated pan-viral corpus outperforms both human-genome and narrower virus-specific baselines across most tasks, it makes the case for domain-adapted genomic models in pathogen surveillance. The authors deliberately restrict the work to discriminative applications and gate access, noting that generative genome-scale viral models carry inherent biosafety risks and warrant ethical oversight. As a December 2025 preprint, its downstream adoption is still emerging and benchmark comparisons remain to be validated through peer review.

Citation

vir2vec: A Viral Genome-Wide Viral Embedding

Rancati, S., et al. (2025) vir2vec: A Viral Genome-Wide Viral Embedding. bioRxiv.

DOI: 10.64898/2025.12.12.693901

Recent citations

Papers that recently cited this model.

Classification of SARS-CoV-2 Variants through The Epistatical Circos Plots with Convolutional Neural Networks
Bo Jing, Kai Zhang, Hongpu Zeng, et al.
Jan 2026
0

Top citations

The most-cited papers that cite this model.

Classification of SARS-CoV-2 Variants through The Epistatical Circos Plots with Convolutional Neural Networks
Bo Jing, Kai Zhang, Hongpu Zeng, et al.
Jan 2026
0

Citations

Total Citations: 1
Influential: 0
References: 38

GitHub

Stars: 3
Forks: 0
Open Issues: 0
Contributors: 1
Last Push: 7mo ago
Language: Jupyter Notebook

HuggingFace

Downloads: 411
Likes: 0
Last Modified: 1mo ago
Pipeline: feature-extraction

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Openness score: 53 (Partial)
Usability (can I run it?): 54
Reproducibility (can I retrain it?): 46
Model Openness Framework: Unclassified (restrictive license on core components)

Resources

GitHub Repository | Research Paper | HuggingFace Model
