Sun Yat-sen University / University of Sydney
A multi-modal viral foundation model trained on 25.4B nucleotide and amino-acid tokens spanning nearly all known viruses, for virus discovery, function annotation, and antibody design.
Viruses are the most abundant and rapidly evolving biological entities on Earth, yet a large fraction of viral sequence in metagenomic data remains unannotated — the so-called "viral dark matter" that escapes recognition by similarity-based tools because it has no close relatives in reference databases. LucaVirus is a multi-modal foundation model built to read this dark matter directly from sequence, learning a unified representation of viral biology that spans both the nucleotide and protein levels.
LucaVirus jointly models viral genomes and the proteins they encode in a single network, rather than treating nucleotide and amino-acid sequences as separate problems. It was developed by researchers at Sun Yat-sen University (Shenzhen Campus), in collaboration with virologist Edward C. Holmes at the University of Sydney, and released as a bioRxiv preprint in June 2025. The model belongs to the broader LucaOne family of biological foundation models and is trained on OpenVirus, a curated corpus assembled to cover nearly all known viral diversity.
By learning from sequence alone across the viral tree of life, LucaVirus is positioned as a general-purpose engine for viral genomics: a single pretrained model that can be applied to virus discovery, functional annotation, evolvability prediction, and antibody-candidate identification without building a bespoke pipeline for each task.
LucaVirus is a transformer-based foundation model trained with a semi-supervised strategy that combines masked language modeling with several auxiliary biological tasks across nucleotide and protein sequences. Pretraining used the OpenVirus corpus: 25.4 billion tokens in total, reported as 23.7 billion nucleotide tokens from 10.4 million gene sequences and 1.6 billion amino-acid tokens from 5.2 million protein sequences (the rounded components sum to 25.3 billion, a gap that presumably reflects rounding). According to the authors, LucaVirus achieves state-of-the-art performance on three of four evaluated tasks and matches the leading method on the fourth, while using approximately one-third of the parameters of comparable models. Code, trained checkpoints, and the OpenVirus training data are released openly: the GitHub repository (Apache-2.0) provides training and embedding-inference scripts and a documented model-card README, with weights distributed via FTP and Zenodo and the dataset hosted as Hugging Face data cards.
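To make the masked language modeling objective mentioned above concrete, here is a minimal sketch of how input tokens might be hidden and paired with prediction targets for both modalities. It is an illustration only, not the LucaVirus training code: the `tokenize` and `mask_tokens` helpers, the `nt:`/`aa:` token prefixes, the `[MASK]` symbol, and the 15% mask rate are all assumptions chosen for the example, and the auxiliary biological tasks are not modeled.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the LucaVirus vocabulary


def tokenize(seq, kind):
    """Split a sequence into per-residue tokens tagged by modality.

    The tag keeps 'A' (adenine) distinct from 'A' (alanine) in one shared
    vocabulary. This tagging scheme is an assumption made for illustration.
    """
    prefix = {"nucleotide": "nt", "protein": "aa"}[kind]
    return [f"{prefix}:{ch}" for ch in seq.upper()]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens and return (masked_tokens, labels).

    labels maps each masked position to the original token the model would
    be trained to recover. The 15% rate is a common default, assumed here.
    """
    rng = random.Random(seed)
    # Mask at least one token when possible, never more than are available.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


if __name__ == "__main__":
    gene = tokenize("ATGGCGTACGTTAGC", "nucleotide")
    protein = tokenize("MAYVS", "protein")
    for name, toks in (("gene", gene), ("protein", protein)):
        masked, labels = mask_tokens(toks)
        print(name, masked, labels)
```

Running the same masking routine over nucleotide and protein tokens mirrors the idea of training one network on both modalities. A production setup would typically work on integer token IDs in batches and might replace some selected positions with random or unchanged tokens rather than always using the mask symbol.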
LucaVirus is aimed at virologists, metagenomics researchers, and computational biologists studying viral diversity and evolution. Typical uses include identifying and annotating novel viruses in environmental and clinical metagenomes, predicting the functional consequences of mutations and the evolvability of viral proteins, annotating enzymatic activity across viral proteomes, and screening for antibody candidates against viral antigens. Because the model operates from sequence and is released with open weights and data, it can be embedded directly into surveillance and discovery pipelines.
LucaVirus extends the foundation-model paradigm into virology, where rapid divergence and incomplete reference databases have long limited similarity-based methods. By unifying nucleotide and protein modeling and demonstrating competitive accuracy at a fraction of the parameter count, it offers an efficient, broadly applicable tool for studying the virosphere. The involvement of a leading viral evolution group and the open release of code, weights, and the OpenVirus corpus lower the barrier to adoption. As a 2025 preprint, its results still await peer review, and distributing weights partly over FTP raises a question about long-term archival stability; even so, the open and well-documented release makes the model readily testable by the community.