bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
DNA & Gene foundation models

LucaVirus

Sun Yat-sen University / University of Sydney

A multi-modal viral foundation model trained on 25.4B nucleotide and amino-acid tokens spanning nearly all known viruses, for virus discovery, function annotation, and antibody design.

Released: June 2025

Viruses are the most abundant and rapidly evolving biological entities on Earth, yet a large fraction of viral sequence in metagenomic data remains unannotated — the so-called "viral dark matter" that escapes recognition by similarity-based tools because it has no close relatives in reference databases. LucaVirus is a multi-modal foundation model built to read this dark matter directly from sequence, learning a unified representation of viral biology that spans both the nucleotide and protein levels.

LucaVirus jointly models viral genomes and the proteins they encode in a single network, rather than treating nucleotide and amino-acid sequences as separate problems. It was developed by researchers at Sun Yat-sen University (Shenzhen Campus), in collaboration with virologist Edward C. Holmes at the University of Sydney, and released as a bioRxiv preprint in June 2025. The model belongs to the broader LucaOne family of biological foundation models and is trained on OpenVirus, a curated corpus assembled to cover nearly all known viral diversity.

By learning from sequence alone across the viral tree of life, LucaVirus is positioned as a general-purpose engine for viral genomics: a single pretrained model that can be applied to virus discovery, functional annotation, evolvability prediction, and antibody-candidate identification without building a bespoke pipeline for each task.

Key Features

  • Joint nucleotide-protein modeling: A single multi-modal network that represents both viral genome sequence and encoded protein sequence, capturing relationships that nucleotide-only or protein-only models miss.
  • Near-complete viral coverage: Pretrained on OpenVirus, a corpus of 15.7 million non-redundant viral sequences spanning the known virosphere, giving the model broad exposure to extreme viral diversity.
  • Virus discovery in dark matter: Detects and characterizes divergent viral sequences in metagenomic data that similarity-search tools fail to recognize.
  • Multi-task downstream utility: Supports enzymatic-activity and domain annotation, protein fitness/evolvability prediction, and antibody-antigen binding prediction from the same pretrained representation.
  • Parameter efficiency: Reported by the authors to reach state-of-the-art results on three of four evaluated tasks and to match the leading method on the fourth, while using roughly one-third of the parameters of comparable models.
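
To make the joint nucleotide-protein idea in the list above more concrete, the following is a minimal sketch of masked-token corruption (the core of masked language modeling) applied to both sequence types through one shared vocabulary. It is a generic illustration, not LucaVirus's tokenizer or training code: the vocabulary layout, the `[NUC]`/`[PROT]` modality markers, the 15% mask rate, and the function names `encode` and `mask_tokens` are all assumptions made for this example.

```python
import random

# Hypothetical shared vocabulary: special tokens, then nucleotides, then amino acids.
# The real LucaVirus vocabulary is not described on this page.
SPECIAL = ["[PAD]", "[MASK]", "[NUC]", "[PROT]"]
NUCLEOTIDES = list("ACGTN")
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

tokens = SPECIAL + [f"nuc:{c}" for c in NUCLEOTIDES] + [f"aa:{c}" for c in AMINO_ACIDS]
VOCAB = {tok: i for i, tok in enumerate(tokens)}


def encode(seq, kind):
    """Prefix a modality marker, then map each residue to a modality-specific id."""
    if kind == "nucleotide":
        marker, prefix = "[NUC]", "nuc"
    elif kind == "protein":
        marker, prefix = "[PROT]", "aa"
    else:
        raise ValueError(f"unknown sequence kind: {kind!r}")
    return [VOCAB[marker]] + [VOCAB[f"{prefix}:{c}"] for c in seq.upper()]


def mask_tokens(ids, mask_rate=0.15, rng=None):
    """Replace a random subset of residue positions with [MASK].

    Returns (corrupted_ids, labels). Labels hold the original id at masked
    positions and -100 elsewhere (a common "ignore this position" convention).
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = list(ids), [-100] * len(ids)
    for pos in range(1, len(ids)):  # position 0 is the modality marker; never mask it
        if rng.random() < mask_rate:
            labels[pos] = ids[pos]
            corrupted[pos] = VOCAB["[MASK]"]
    return corrupted, labels


# The same letter maps to different ids depending on modality.
nuc_ids = encode("ATGGCGTACGTT", "nucleotide")
prot_ids = encode("MAYV", "protein")
corrupted, labels = mask_tokens(nuc_ids, rng=random.Random(0))
```

Keeping separate ids for `nuc:A` and `aa:A` is one simple way to let a single network distinguish adenine from alanine; other designs (for example, a separate type embedding) would serve the same purpose.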

Technical Details

LucaVirus is a transformer-based foundation model trained with a semi-supervised strategy that combines masked language modeling with several auxiliary biological tasks across nucleotide and protein sequences. Pretraining used the OpenVirus corpus: 25.4 billion tokens in total, comprising 23.7 billion nucleotide tokens from 10.4 million gene sequences and 1.6 billion amino-acid tokens from 5.2 million protein sequences. According to the authors, LucaVirus achieves state-of-the-art performance on three of four evaluated tasks and matches the leading method on the fourth, while using approximately one-third of the parameters of comparable models.

Code, trained checkpoints, and the OpenVirus training data are released openly. The GitHub repository (Apache-2.0) provides training and embedding-inference scripts and a documented model-card README; weights are distributed via FTP and Zenodo, and the dataset is hosted as HuggingFace data cards.
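
As a quick sanity check on the corpus figures quoted on this page, the short script below adds up the reported components and derives a few ratios. It uses only the numbers stated here (all rounded to one decimal place); the variable names and the `summarize` helper are introduced for illustration only.

```python
# Figures as reported on this page (rounded to one decimal place).
TOTAL_TOKENS_B = 25.4   # billions of tokens, stated total
NUC_TOKENS_B = 23.7     # billions of nucleotide tokens
AA_TOKENS_B = 1.6       # billions of amino-acid tokens
GENE_SEQS_M = 10.4      # millions of gene sequences
PROTEIN_SEQS_M = 5.2    # millions of protein sequences
TOTAL_SEQS_M = 15.7     # millions of non-redundant sequences, stated total


def summarize():
    """Compare stated totals with the sum of their parts and derive ratios."""
    token_sum = NUC_TOKENS_B + AA_TOKENS_B
    seq_sum = GENE_SEQS_M + PROTEIN_SEQS_M
    return {
        "token_sum_b": round(token_sum, 1),
        "token_gap_b": round(TOTAL_TOKENS_B - token_sum, 1),
        "seq_sum_m": round(seq_sum, 1),
        "seq_gap_m": round(TOTAL_SEQS_M - seq_sum, 1),
        "nucleotide_share": round(NUC_TOKENS_B / token_sum, 3),
        "mean_tokens_per_gene": round(NUC_TOKENS_B * 1e9 / (GENE_SEQS_M * 1e6)),
        "mean_tokens_per_protein": round(AA_TOKENS_B * 1e9 / (PROTEIN_SEQS_M * 1e6)),
    }


for name, value in summarize().items():
    print(f"{name}: {value}")
```

The components sum to 25.3 billion tokens and 15.6 million sequences, each 0.1 below the stated totals, which may simply reflect rounding in the reported figures. Nucleotide tokens make up about 94% of the corpus, with an average of roughly 2,279 tokens per gene sequence and 308 per protein sequence.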

Applications

LucaVirus is aimed at virologists, metagenomics researchers, and computational biologists studying viral diversity and evolution. Typical uses include identifying and annotating novel viruses in environmental and clinical metagenomes, predicting the functional consequences of mutations and the evolvability of viral proteins, annotating enzymatic activity across viral proteomes, and screening for antibody candidates against viral antigens. Because the model operates from sequence and is released with open weights and data, it can be embedded directly into surveillance and discovery pipelines.
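
As a rough illustration of what a pipeline step around a sequence model might look like, the sketch below parses FASTA records, drops short contigs, and ranks the rest with a caller-supplied scoring function. This page does not document the LucaVirus inference interface, so `score_fn` is a hypothetical placeholder; the `gc_fraction` stand-in used here only exercises the plumbing and is not a viral classifier. The function names, `min_length`, and `threshold` defaults are likewise assumptions.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    record_id, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            parts = line[1:].split()
            record_id, chunks = (parts[0] if parts else ""), []
        elif record_id is not None:
            chunks.append(line.upper())
    if record_id is not None:
        yield record_id, "".join(chunks)


def screen_contigs(
    records: Iterable[Tuple[str, str]],
    score_fn: Callable[[str], float],
    min_length: int = 500,
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (record_id, score) for contigs passing both cut-offs, best first."""
    hits = []
    for record_id, seq in records:
        if len(seq) < min_length:
            continue  # too short to score reliably
        score = score_fn(seq)
        if score >= threshold:
            hits.append((record_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def gc_fraction(seq: str) -> float:
    """Placeholder scorer: fraction of G/C bases. Swap in a real model call."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


fasta = """>contig_1 sample=A
ATGCGC
GGCC
>contig_2
ATATAT
""".splitlines()

hits = screen_contigs(parse_fasta(fasta), gc_fraction, min_length=6, threshold=0.5)
```

Passing the scorer in as a callable keeps the parsing and filtering logic independent of any particular model, so a real embedding-plus-classifier step could replace `gc_fraction` without touching the rest.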

Impact

LucaVirus extends the foundation-model paradigm into virology, where rapid divergence and incomplete reference databases have long limited similarity-based methods. By unifying nucleotide and protein modeling and demonstrating competitive accuracy at a fraction of the parameter count, it offers an efficient, broadly applicable tool for studying the virosphere. The involvement of a leading viral evolution group and the open release of code, weights, and the OpenVirus corpus lower the barrier to adoption. As a 2025 preprint, its results still await peer review, and the partial reliance on FTP for weight hosting raises a long-term archival concern, but the open, well-documented release makes the model readily testable by the community.

GitHub

Stars: 69
Forks: 3
Open Issues: 1
Contributors: 1
Last Push: 14d ago
Language: Python
License: Apache-2.0

Openness

bio.rodeo openness: Fully open · usable and reproducible
88 · Open
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 86
Model Openness Framework: Class III (Open Model)

Tags

virus_discovery, variant_effect_prediction, representation_learning, antibody_design, transformer, foundation_model, self_supervised, multimodal, viral_evolution, metagenomics, genomics

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset