bio.rodeo

vir2vec

University of Florida

A 422M-parameter pan-viral genomic language model that produces fixed genome-level embeddings reused across viral classification tasks without re-training.

Released: December 2025
Parameters: 422 Million

vir2vec is a pan-viral genomic language model that produces fixed-length, genome-level embeddings of viral DNA and RNA sequences, designed so that a single frozen representation can be reused across many downstream classification tasks without task-specific re-training. It was developed by a team led by Simone Marini at the University of Florida (with collaborators at the University of Pavia and other institutions) and released as a bioRxiv preprint in December 2025. The model targets a persistent gap in viral genomics: most foundation models are trained on human or general microbial sequences, and they underperform on the extreme diversity, short genomes, and rapid evolution that characterize viruses.

The core idea is continual pretraining. Rather than training from scratch, the authors take Mistral-DNA — a decoder-only genomic language model — and continue pretraining it on a large, curated pan-viral corpus, yielding a model attuned to viral sequence statistics across many families. The resulting embeddings are then fed to simple downstream classifiers (logistic regression, SVM, random forest), making the representation immediately usable by groups without large compute budgets.
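
The "embed once, classify many times" workflow described above can be sketched as follows. This is only an illustration under stated assumptions: `embed_genome` is a hypothetical stand-in that returns base-composition fractions (it is not the vir2vec API), the sequences and labels are made up, and a nearest-centroid rule is used as a dependency-free substitute for the logistic regression, SVM, or random forest classifiers named in the text.

```python
from collections import defaultdict
from math import dist

def embed_genome(seq):
    """Hypothetical stand-in for the frozen model: base-composition fractions.
    A real pipeline would return the learned genome-level vector instead."""
    seq = seq.upper()
    n = max(len(seq), 1)
    return [seq.count(base) / n for base in "ACGT"]

def fit_centroids(features, labels):
    """Nearest-centroid 'training': average the feature vectors of each class."""
    grouped = defaultdict(list)
    for vec, label in zip(features, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }

def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))

# Toy sequences (illustrative only, not real viral genomes).
genomes = {
    "g1": "ATATATATGC",
    "g2": "ATTATAATAC",
    "g3": "GCGCGGCCAT",
    "g4": "GGCCGCGCTA",
}

# Step 1: embed every genome once and cache the vectors.
cache = {name: embed_genome(seq) for name, seq in genomes.items()}

# Step 2: reuse the same cached vectors for several label sets (toy tasks).
tasks = {
    "composition": {"g1": "AT-rich", "g2": "AT-rich", "g3": "GC-rich", "g4": "GC-rich"},
    "toy_host": {"g1": "host_x", "g2": "host_x", "g3": "host_y", "g4": "host_y"},
}
names = list(cache)
features = [cache[name] for name in names]
models = {
    task: fit_centroids(features, [labels[name] for name in names])
    for task, labels in tasks.items()
}

# Step 3: a new genome is embedded once and scored by every task model.
query = embed_genome("ATATTTAAGC")
predictions = {task: predict(model, query) for task, model in models.items()}
print(predictions)  # {'composition': 'AT-rich', 'toy_host': 'host_x'}
```

The backbone is never updated in this pattern; only the small per-task models are fit, which is why the cost of adding a new task stays low once the embeddings are cached.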

Alongside the model, the authors introduce vGUE (Viral Genome Understanding Evaluation), a standardized benchmark for viral representation learning spanning eight heterogeneous tasks, addressing the field's lack of a common yardstick for comparing viral genome embeddings.
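
Results on vGUE are reported on this page as balanced accuracy, which is the mean of per-class recall and therefore does not reward a classifier for simply favoring the majority class. A minimal sketch of the metric, using made-up labels rather than any vGUE data:

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    support = Counter(y_true)  # number of true examples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)

# Made-up, imbalanced example: 8 "virus" items and 2 "human" items.
y_true = ["virus"] * 8 + ["human"] * 2
y_pred = ["virus"] * 8 + ["virus", "human"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Plain accuracy looks high here because the majority class dominates, while balanced accuracy exposes that only half of the minority class was recovered.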

Key Features

  • Reusable frozen embeddings: The model exposes 4,096-dimensional genome-level vectors (via max-pooling over token embeddings) that are computed once and reused across tasks, so downstream classifiers train on precomputed features rather than fine-tuning the backbone.
  • Continual pretraining strategy: vir2vec adapts the general-purpose Mistral-DNA model to viral sequence space rather than training from scratch, transferring genomic priors while specializing on viral diversity.
  • Broad viral coverage: Training spans 565,747 complete genomes across 295 viral species drawn from NCBI Virus, BV-BRC, GISAID, the LANL HIV database, and HBVdb, covering both DNA and RNA viruses.
  • vGUE benchmark: A unified evaluation suite of eight tasks, from organism-level discrimination to fine-grained SARS-CoV-2 lineage typing and HIV-1 tropism prediction, enabling apples-to-apples comparison of viral embeddings.
  • Strong relative performance: vir2vec attains the highest balanced accuracy on seven of eight vGUE tasks, outperforming both a human-trained genomic model and a virus-specific baseline.
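
Max-pooling, mentioned in the first bullet, is what turns a variable number of token embeddings into one fixed-length vector. The sketch below uses toy 3-dimensional vectors in place of the 4,096-dimensional ones; the numbers and variable names are illustrative assumptions, not model output.

```python
def max_pool(token_embeddings):
    """Element-wise maximum over the token axis: (n_tokens, dim) -> (dim,)."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    dim = len(token_embeddings[0])
    if any(len(vec) != dim for vec in token_embeddings):
        raise ValueError("all token embeddings must have the same dimension")
    # zip(*rows) iterates over columns, i.e. one embedding dimension at a time.
    return [max(column) for column in zip(*token_embeddings)]

# Two toy "genomes" with different token counts but the same embedding width.
short_genome = [[0.1, -0.5, 0.3], [0.4, 0.2, -0.1]]
long_genome = [[0.0, 0.9, -0.2], [0.7, -0.3, 0.1], [-0.4, 0.5, 0.6], [0.2, 0.1, 0.0]]

print(max_pool(short_genome))  # [0.4, 0.2, 0.3]
print(max_pool(long_genome))   # [0.7, 0.9, 0.6]
```

Both inputs pool to a vector of the same width, so genomes of very different lengths can share one downstream classifier.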

Technical Details

vir2vec is a 422-million-parameter, decoder-only transformer built on the Mistral/Mixtral architecture, using mixture-of-experts feed-forward layers, grouped-query attention, and sliding-window attention for efficient long-sequence processing. Sequences are tokenized with a byte-pair encoding scheme adapted to DNA (following the DNABERT-2 tokenizer). The autoregressive decoder is used as a frozen encoder by extracting hidden representations and max-pooling them into a single 4,096-dimensional genome embedding.

The training corpus (565,747 genomes, 295 species) was quality-filtered to under 1% ambiguous bases with no runs exceeding 20 consecutive Ns, and split 70/30 at the species level. SARS-CoV-2 and Alphainfluenzavirus were down-sampled to 100,000 genomes each to limit over-representation.

On the eight vGUE tasks, vir2vec reaches balanced accuracies including 0.98 (virus vs non-virus genomes), 0.97 (virus vs human reads), 0.96 (DNA vs RNA), 0.84 (host prediction), and 1.00 (HIV-1 vs HIV-2). It beats the Mistral-DNA-138M human baseline and a ModernBERT-DNA-37M virus-specific baseline on seven of eight tasks, with statistically significant margins after Holm correction.
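
The corpus quality filter described above (under 1% ambiguous bases, no run of more than 20 consecutive Ns) can be approximated in a few lines. This is a sketch, not the authors' pipeline: the function name and parameters are hypothetical, and treating every character other than A, C, G, T, or U as ambiguous is an assumption.

```python
import re

def passes_quality_filter(seq, max_ambiguous_frac=0.01, max_n_run=20):
    """Keep a genome with < 1% ambiguous bases and no run of Ns longer than 20."""
    seq = seq.upper()
    if not seq:
        return False
    # Assumption: anything other than A, C, G, T or U counts as ambiguous.
    ambiguous = sum(1 for base in seq if base not in "ACGTU")
    if ambiguous / len(seq) >= max_ambiguous_frac:
        return False
    # Reject any stretch of more than max_n_run consecutive Ns.
    longest_n_run = max((len(run) for run in re.findall(r"N+", seq)), default=0)
    return longest_n_run <= max_n_run

# Synthetic sequences, for illustration only.
clean = "ACGT" * 1000                   # no ambiguous bases
few_ns = "ACGT" * 1000 + "N" * 10       # ~0.25% ambiguous, short run
long_run = "ACGT" * 1000 + "N" * 25     # ~0.62% ambiguous, but a 25-N run
too_many = "ACGT" * 100 + "N" * 10      # ~2.4% ambiguous

for name, seq in [("clean", clean), ("few_ns", few_ns),
                  ("long_run", long_run), ("too_many", too_many)]:
    print(name, passes_quality_filter(seq))
```

The two thresholds are checked independently, so a genome can fail on a single long gap even when its overall ambiguity rate is low.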

Applications

vir2vec supports a range of viral genomics workflows where a single embedding can power many classifiers: metagenomic virus identification, separating viral reads from human or bacterial background, distinguishing DNA from RNA viruses, predicting host range, differentiating closely related species (e.g., HIV-1 vs HIV-2), typing SARS-CoV-2 lineages, and detecting phenotypic signals such as HIV-1 brain-versus-plasma tropism. Because downstream tasks use lightweight classifiers over precomputed embeddings, surveillance, clinical virology, and evolutionary research groups can adapt the model with modest compute. Model weights are gated, requiring an institutional email, a description of intended use, and an associated IRB protocol number.

Impact

vir2vec contributes both a viral-specialized foundation model and vGUE, a standardized benchmark that the viral genomics field has lacked, giving future viral representation-learning methods a common evaluation framework. By showing that continual pretraining on a curated pan-viral corpus outperforms both human-genome and narrower virus-specific baselines across most tasks, it makes the case for domain-adapted genomic models in pathogen surveillance. The authors deliberately restrict the work to discriminative applications and gate access, noting that generative genome-scale viral models carry inherent biosafety risks and warrant ethical oversight. Because the work is a December 2025 preprint, downstream adoption is still emerging and the benchmark comparisons have yet to be validated through peer review.

Citation

DOI: 10.64898/2025.12.12.693901

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 53 (Partial)
Usability (can I run it?): 54
Reproducibility (can I retrain it?): 46
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

embeddings, foundation_model, genomics, language_model, mixture_of_experts, sequence_classification, transformer, variant_effect_prediction, virology

Resources

GitHub Repository, Research Paper, HuggingFace Model