LucaVirus

Sun Yat-sen University / University of Sydney

Multimodal viral foundation model over nucleotide and protein sequence, built for virus discovery, function annotation, and antibody design.

Released: June 2025

Viruses are the most abundant and rapidly evolving biological entities on Earth, yet a large fraction of viral sequence in metagenomic data remains unannotated. This so-called "viral dark matter" escapes recognition by similarity-based tools because it has no close relatives in reference databases. LucaVirus is a multi-modal foundation model built to read this dark matter directly from sequence, learning a unified representation of viral biology that spans both the nucleotide and protein levels.

LucaVirus jointly models viral genomes and the proteins they encode in a single network, rather than treating nucleotide and amino-acid sequences as separate problems. It was developed by researchers at Sun Yat-sen University (Shenzhen Campus), in collaboration with virologist Edward C. Holmes at the University of Sydney, and released as a bioRxiv preprint in June 2025. The model belongs to the broader LucaOne family of biological foundation models and is trained on OpenVirus, a curated corpus assembled to cover nearly all known viral diversity.

By learning from sequence alone across the viral tree of life, LucaVirus is positioned as a general-purpose engine for viral genomics: a single pretrained model that can be applied to virus discovery, functional annotation, evolvability prediction, and antibody-candidate identification without building a bespoke pipeline for each task.

Key Features

Joint nucleotide-protein modeling: A single multi-modal network that represents both viral genome sequence and encoded protein sequence, capturing relationships that separate DNA- or protein-only models miss.
Near-complete viral coverage: Pretrained on OpenVirus, a corpus of 15.7 million non-redundant viral sequences spanning the known virosphere, giving the model broad exposure to extreme viral diversity.
Virus discovery in dark matter: Detects and characterizes divergent viral sequences in metagenomic data that similarity-search tools fail to recognize.
Multi-task downstream utility: Supports enzymatic-activity and domain annotation, protein fitness/evolvability prediction, and antibody-antigen binding prediction from the same pretrained representation.
Parameter efficiency: Reported by the authors to match or exceed comparable models on benchmark tasks while using roughly one-third of their parameters.
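To make the joint nucleotide-protein idea concrete, the sketch below shows one way a single tokenizer could serve both modalities: nucleotide and amino-acid symbols get disjoint ID ranges in one vocabulary, and a per-token type ID tells the network which modality it is reading. The vocabulary layout, the `encode` function, and the type-ID scheme are illustrative assumptions, not the tokenizer LucaVirus actually uses.

```python
# Illustrative only: a shared vocabulary for nucleotide and protein input.
# The layout and all names here are assumptions, not the LucaVirus tokenizer.

SPECIAL = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
NUCLEOTIDES = list("ACGTUN")
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWYX")

# Prefix each symbol with its modality so that "A" (adenine) and
# "A" (alanine) map to different IDs in the same vocabulary.
VOCAB = {tok: i for i, tok in enumerate(
    SPECIAL
    + [f"nt:{c}" for c in NUCLEOTIDES]
    + [f"aa:{c}" for c in AMINO_ACIDS]
)}

TYPE_ID = {"nt": 0, "aa": 1}


def encode(seq: str, modality: str) -> tuple[list[int], list[int]]:
    """Return (input_ids, token_type_ids) for one sequence.

    modality is "nt" for nucleotide or "aa" for protein sequence.
    """
    if modality not in TYPE_ID:
        raise ValueError(f"unknown modality: {modality!r}")
    unk = VOCAB["[UNK]"]
    # Unknown symbols fall back to [UNK] instead of raising.
    body = [VOCAB.get(f"{modality}:{c}", unk) for c in seq.upper()]
    ids = [VOCAB["[CLS]"]] + body + [VOCAB["[SEP]"]]
    return ids, [TYPE_ID[modality]] * len(ids)


nt_ids, nt_types = encode("ATGGCA", "nt")
aa_ids, aa_types = encode("MA", "aa")
```

Because both modalities share one ID space, the same embedding table and encoder can process a gene and the protein it encodes, which is the property the first feature above describes.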

Technical Details

LucaVirus is a transformer-based foundation model trained with a semi-supervised strategy that combines masked language modeling with several auxiliary biological tasks across nucleotide and protein sequences. Pretraining used the OpenVirus corpus, which totals 25.4 billion tokens: 23.7 billion nucleotide tokens from 10.4 million gene sequences and 1.6 billion amino-acid tokens from 5.2 million protein sequences.

According to the authors, LucaVirus achieves state-of-the-art performance on three of four evaluated tasks and matches the leading method on the fourth, while using approximately one-third of the parameters of comparable models.

Code, trained checkpoints, and the OpenVirus training data are openly released. The GitHub repository (Apache-2.0) provides training and embedding-inference scripts and a documented model-card README; weights are distributed via FTP and Zenodo, and the dataset is hosted on HuggingFace with data cards.
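For readers unfamiliar with the masked-language-modeling objective named in this section, here is a minimal sketch of the input corruption step. The 15% masking rate, the 80/10/10 replacement split, the `MASK_ID` value, and the `mask_tokens` helper are assumptions borrowed from the original BERT recipe; the source does not state which settings LucaVirus used, and the auxiliary biological tasks are not shown.

```python
import random

# Illustrative only: BERT-style token masking for masked language modeling.
# Rate, replacement split, and IDs are assumptions, not LucaVirus settings.

MASK_ID = 4      # hypothetical ID of the [MASK] token
IGNORE = -100    # label value for positions that contribute no loss


def mask_tokens(ids, vocab_size, special_ids, rate=0.15, seed=None):
    """Return (corrupted_ids, labels) for one tokenized sequence."""
    rng = random.Random(seed)
    # Random replacements are drawn from ordinary (non-special) tokens only.
    candidates = [t for t in range(vocab_size) if t not in special_ids]
    corrupted = list(ids)
    labels = [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        # Never select special tokens; select others with probability `rate`.
        if tok in special_ids or rng.random() >= rate:
            continue
        labels[i] = tok                      # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_ID           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(candidates)  # 10%: random ordinary token
        # remaining 10%: leave the input token unchanged
    return corrupted, labels


SPECIALS = {0, 1, 2, 3, 4}
ids = [2] + [5, 6, 7, 8] * 25 + [3]          # [CLS] + 100 tokens + [SEP]
corrupted, labels = mask_tokens(ids, vocab_size=32, special_ids=SPECIALS, seed=0)
```

The training loss would then be computed only at positions where `labels` is not `IGNORE`, so the model learns to reconstruct hidden nucleotides or residues from their context.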

Applications

LucaVirus is aimed at virologists, metagenomics researchers, and computational biologists studying viral diversity and evolution. Typical uses include identifying and annotating novel viruses in environmental and clinical metagenomes, predicting the functional consequences of mutations and the evolvability of viral proteins, annotating enzymatic activity across viral proteomes, and screening for antibody candidates against viral antigens. Because the model operates from sequence and is released with open weights and data, it can be embedded directly into surveillance and discovery pipelines.
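As a rough picture of what embedding the model into a discovery pipeline could look like, the sketch below reads contigs from FASTA text, embeds each one, scores the embedding, and keeps the hits above a threshold. The `embed` and `score` callables are hypothetical stand-ins: a real pipeline would replace them with the released embedding-inference script and a trained classifier, neither of which is reproduced here.

```python
from typing import Callable, Iterable, Iterator

# Illustrative only: a screening loop over metagenomic contigs.
# `embed` and `score` are hypothetical placeholders for a pretrained model
# and a downstream classifier head.


def read_fasta(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def screen(records, embed: Callable, score: Callable, threshold: float = 0.5):
    """Return (record_id, score) pairs at or above threshold, best first."""
    hits = []
    for rec_id, seq in records:
        s = score(embed(seq))
        if s >= threshold:
            hits.append((rec_id, s))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy stand-ins so the sketch runs end to end.
def toy_embed(seq):
    seq = seq.upper()
    return [seq.count(b) / max(len(seq), 1) for b in "ACGT"]


def toy_score(vec):
    return vec[1] + vec[2]   # C + G fraction, used purely as a placeholder


fasta = """>contig_1 sample=A
GGCCGGCC
GGCC
>contig_2
ATATATAT
""".splitlines()

hits = screen(read_fasta(fasta), toy_embed, toy_score, threshold=0.5)
```

Keeping the model behind a plain callable means the same loop works whether embeddings are computed on the fly or loaded from a precomputed cache.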

Impact

LucaVirus extends the foundation-model paradigm into virology, where rapid divergence and incomplete reference databases have long limited similarity-based methods. By unifying nucleotide and protein modeling and demonstrating competitive accuracy at a fraction of the parameter count, it offers an efficient, broadly applicable tool for studying the virosphere. The involvement of a leading viral evolution group and the open release of code, weights, and the OpenVirus corpus lower the barrier to adoption. As a 2025 preprint, its results await peer review, and distributing weights partly via FTP raises a long-term archival concern, but the open and well-documented release makes the model readily testable by the community.

Citation

Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus

Preprint

Pan, Y., et al. (2025) Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus. bioRxiv.

DOI: 10.1101/2025.06.14.659722

Recent citations

Papers that recently cited this model.

Scanning the horizon: deep mutational scanning approaches in virology
Jack Dorman, William Bakhache, Patrick T. Dolan
Journal of Virology · Jun 2026 · Citations: 0

ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks
Dongxin Ye, Fang Hu, Han Hu, et al.
May 2026 · Citations: 0

A sequence-based proactive intelligence for influenza antigenic profiling improves vaccine strain selection
Yihao Chen, Ying Xu, Yan-hui Cheng, et al.
bioRxiv · Apr 2026 · Citations: 0

Top citations

The most-cited papers that cite this model.

NABench: Large-Scale Benchmarks of Nucleotide Foundation Models for Fitness Prediction
Zhongmin Li, Runze Ma, Jia W. Tan, et al.
arXiv.org · Nov 2025 · Citations: 1

An AI for an AI: identifying zoonotic potential of avian influenza viruses via genomic machine learning
Liam Brierley, Joaquin Mould-Quevedo, Matthew Baylis
bioRxiv · Sep 2025 · Citations: 1

Scanning the horizon: deep mutational scanning approaches in virology
Jack Dorman, William Bakhache, Patrick T. Dolan
Journal of Virology · Jun 2026 · Citations: 0

ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks
Dongxin Ye, Fang Hu, Han Hu, et al.
May 2026 · Citations: 0

A sequence-based proactive intelligence for influenza antigenic profiling improves vaccine strain selection
Yihao Chen, Ying Xu, Yan-hui Cheng, et al.
bioRxiv · Apr 2026 · Citations: 0

Citations

Total Citations: 5
Influential: 0
References: 48

GitHub

Stars: 74
Forks: 4
Open Issues: 2
Contributors: 1
Last Push: 1mo ago
Language: Python
License: Apache-2.0

HuggingFace

Downloads: 261
Likes: 2
Last Modified: 7d ago

Fields of citing research

Biology: 100%
Computer Science: 80%
Medicine: 60%
Environmental Science: 40%

Share of papers citing this model.

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 88 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 86

Model Openness Framework: Class III (Open Model)

Resources

GitHub Repository Research Paper HuggingFace Model Dataset

