A 422M-parameter pan-viral genomic language model that produces fixed genome-level embeddings reused across viral classification tasks without re-training.
vir2vec is a pan-viral genomic language model that produces fixed-length, genome-level embeddings of viral DNA and RNA sequences, designed so that a single frozen representation can be reused across many downstream classification tasks without task-specific re-training. It was developed by a team led by Simone Marini at the University of Florida (with collaborators at the University of Pavia and other institutions) and released as a bioRxiv preprint in December 2025. The model targets a persistent gap in viral genomics: most foundation models are trained on human or general microbial sequence and underperform on the extreme diversity, short genomes, and rapid evolution that characterize viruses.
The core idea is continual pretraining. Rather than training from scratch, the authors take Mistral-DNA, a decoder-only genomic language model, and continue pretraining it on a large, curated pan-viral corpus, yielding a model attuned to viral sequence statistics across many families. The resulting embeddings are then fed to simple downstream classifiers (logistic regression, SVM, random forest), so groups without large compute budgets can use the representation directly.
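The "frozen embedding plus lightweight classifier" pattern can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the embeddings are random placeholders standing in for precomputed model output, the dimension is a toy value, and the function names (`fit_logistic_regression`, `predict`, `placeholder_embeddings`) are hypothetical. Logistic regression is written in plain Python so the example has no dependencies; in practice a library implementation would be the usual choice.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_regression(X, y, lr=0.5, epochs=300):
    """Fit binary logistic regression with batch gradient descent."""
    n, dim = len(X), len(X[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - target  # gradient of the log-loss w.r.t. the logit
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    # Class 1 if the linear score is non-negative, else class 0.
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if score >= 0 else 0

def placeholder_embeddings(center, n, dim, rng):
    # ASSUMPTION: stand-in for frozen genome embeddings; real vectors
    # would be computed once by the model and cached.
    return [[rng.gauss(center, 0.5) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
DIM = 8  # toy size chosen for the demo only

# Two synthetic classes (e.g. "virus" = 1, "non-virus" = 0).
X_train = (placeholder_embeddings(+1.0, 20, DIM, rng)
           + placeholder_embeddings(-1.0, 20, DIM, rng))
y_train = [1] * 20 + [0] * 20
weights, bias = fit_logistic_regression(X_train, y_train)

X_test = (placeholder_embeddings(+1.0, 10, DIM, rng)
          + placeholder_embeddings(-1.0, 10, DIM, rng))
y_test = [1] * 10 + [0] * 10
accuracy = sum(predict(weights, bias, x) == t
               for x, t in zip(X_test, y_test)) / len(y_test)
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

Because the embeddings are never updated, adding a new task only means fitting another small classifier on the same cached vectors.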
Alongside the model, the authors introduce vGUE (Viral Genome Understanding Evaluation), a standardized benchmark for viral representation learning spanning eight heterogeneous tasks, addressing the field's lack of a common yardstick for comparing viral genome embeddings.
vir2vec is a 422-million-parameter, decoder-only transformer built on the Mistral/Mixtral architecture, using mixture-of-experts feed-forward layers, grouped-query attention, and sliding-window attention for efficient long-sequence processing. Sequences are tokenized with a byte-pair encoding scheme adapted to DNA (following the DNABERT-2 tokenizer). The autoregressive decoder is used as a frozen encoder by extracting hidden representations and max-pooling them into a single 4,096-dimensional genome embedding. The training corpus comprises 565,747 genomes spanning 295 species. Genomes were quality-filtered to under 1% ambiguous bases with no run of more than 20 consecutive Ns, and the corpus was split 70/30 at the species level; SARS-CoV-2 and Alphainfluenzavirus were down-sampled to 100,000 genomes each to limit over-representation. On the eight vGUE tasks, vir2vec reaches balanced accuracies including 0.98 (virus vs non-virus genomes), 0.97 (virus vs human reads), 0.96 (DNA vs RNA), 0.84 (host prediction), and 1.00 (HIV-1 vs HIV-2). It outperforms both a human-genome Mistral-DNA-138M baseline and a virus-specific ModernBERT-DNA-37M baseline on seven of eight tasks, by statistically significant margins after Holm correction for multiple comparisons.
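The corpus quality filter described above can be read as two checks per genome. The sketch below is one plausible interpretation, not the authors' code: it assumes "ambiguous" means any character other than A, C, G, T, or U, treats "under 1%" as a strict inequality, and allows N runs of up to and including 20. The function name `passes_quality_filter` and its parameters are hypothetical.

```python
import re

# ASSUMPTION: anything outside these symbols counts as an ambiguous base.
CANONICAL_BASES = set("ACGTU")

def passes_quality_filter(seq, max_ambiguous_fraction=0.01, max_n_run=20):
    """Return True if a genome meets both quality criteria.

    1. The fraction of ambiguous bases is strictly below the threshold.
    2. No run of consecutive 'N' characters is longer than max_n_run.
    """
    seq = seq.upper()
    if not seq:
        return False  # an empty record cannot be assessed

    ambiguous = sum(1 for base in seq if base not in CANONICAL_BASES)
    if ambiguous / len(seq) >= max_ambiguous_fraction:
        return False

    # Length of the longest stretch of consecutive Ns (0 if there are none).
    longest_n_run = max(
        (len(match.group()) for match in re.finditer(r"N+", seq)),
        default=0,
    )
    return longest_n_run <= max_n_run

clean = "ACGT" * 1000
print(passes_quality_filter(clean))             # no ambiguous bases
print(passes_quality_filter(clean + "N" * 21))  # N run longer than 20
```

Both checks are needed: a long genome can stay well under 1% ambiguous bases overall and still contain a single long stretch of Ns, which the run-length check catches.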
vir2vec supports a range of viral genomics workflows where a single embedding can power many classifiers: metagenomic virus identification, separating viral reads from human or bacterial background, distinguishing DNA from RNA viruses, predicting host range, differentiating closely related species (e.g., HIV-1 vs HIV-2), typing SARS-CoV-2 lineages, and detecting phenotypic signals such as HIV-1 brain-versus-plasma tropism. Because downstream tasks use lightweight classifiers over precomputed embeddings, surveillance, clinical virology, and evolutionary research groups can adapt the model with modest compute. Model weights are gated, requiring an institutional email, a description of intended use, and an associated IRB protocol number.
vir2vec contributes both a viral-specialized foundation model and vGUE, a standardized benchmark that the viral genomics field has lacked, giving future viral representation-learning methods a common evaluation framework. By showing that continual pretraining on a curated pan-viral corpus outperforms both human-genome and narrower virus-specific baselines across most tasks, it makes the case for domain-adapted genomic models in pathogen surveillance. The authors deliberately restrict the work to discriminative applications and gate access, noting that generative genome-scale viral models carry inherent biosafety risks and warrant ethical oversight. As a December 2025 preprint, its downstream adoption is still emerging and benchmark comparisons remain to be validated through peer review.