AntigenLM

Chinese Academy of Sciences / Beijing Institute of Genomics

Structure-aware generative DNA language model pretrained on influenza genomes that forecasts future antigenic variants across regions and subtypes.

Released: February 2026

AntigenLM is a generative DNA language model built to forecast the antigenic evolution of influenza viruses directly from genomic sequence. Predicting which influenza strains will dominate in upcoming seasons is central to vaccine strain selection, and it remains difficult because antigenic change is driven by complex evolutionary pressures on the viral genome. AntigenLM approaches this as a sequence-modeling problem, pretraining on influenza genomes and then fine-tuning on time-series surveillance data to anticipate emerging variants.

The model's distinguishing idea is "structure-aware" DNA language modeling: rather than treating the genome as an undifferentiated string, AntigenLM preserves the integrity of functional genomic units during pretraining. The authors find that disrupting this organization — by fragmenting or shuffling the genome — severely degrades performance, indicating that the model learns evolutionary constraints tied to genome structure rather than surface statistics alone. AntigenLM was introduced by Yue Pei, Xuebin Chi, and Yu Kang (Chinese Academy of Sciences and affiliated institutes) in a February 2026 preprint accepted at ICLR 2026.

AntigenLM sits within a growing class of genomic language models applied to viral surveillance, alongside influenza-focused efforts such as Influ-BERT. It is notable for its generative formulation and for its explicit emphasis on preserving aligned functional units across whole genomes.

Key Features

Structure-aware pretraining: The model is trained on whole-genome influenza sequences with intact, aligned functional units, and the authors show that disrupting this structure markedly reduces forecasting accuracy.
Antigenic variant forecasting: After fine-tuning on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM predicts future antigenic variants across geographic regions and subtypes, including some unseen during training.
Generative DNA modeling: AntigenLM is formulated as a generative language model over nucleotide sequence rather than a discriminative classifier, enabling it to model the distribution of plausible evolutionary trajectories.
Near-perfect subtype classification: The learned representations support accurate influenza subtype classification as a downstream task.
Outperforms evolutionary baselines: The model is reported to beat phylogenetic and evolution-based forecasting methods on antigenic variant prediction.
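
The forecasting claim above implies a chronological evaluation, in which the model is fitted on earlier isolates and scored on later ones. The sketch below shows one way such a split could be set up. It is a minimal illustration, not the authors' protocol: the `Isolate` record, its fields, and the `temporal_split` and `unseen_variants` helpers are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Isolate:
    """Hypothetical surveillance record; field names are assumptions."""
    name: str
    collected: date
    region: str
    ha: str  # hemagglutinin nucleotide sequence
    na: str  # neuraminidase nucleotide sequence


def temporal_split(isolates: list[Isolate], cutoff: date):
    """Split chronologically so no future isolate leaks into training."""
    train = [x for x in isolates if x.collected < cutoff]
    test = [x for x in isolates if x.collected >= cutoff]
    return train, test


def unseen_variants(train: list[Isolate], test: list[Isolate]) -> list[Isolate]:
    """Keep test isolates whose (HA, NA) pair never occurs in training."""
    seen = {(x.ha, x.na) for x in train}
    return [x for x in test if (x.ha, x.na) not in seen]
```

Scoring only the output of `unseen_variants` is one way to check whether a forecaster anticipates new variants rather than repeating sequences it has already observed.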

Technical Details

AntigenLM is a generative DNA language model pretrained on influenza genomes in which functional units are kept aligned and intact, a design the authors argue is essential for capturing evolutionary constraints. For the forecasting task, the pretrained model is fine-tuned on time-series HA and NA sequences, the two surface glycoproteins that drive influenza antigenic drift. Ablations that fragment or shuffle the genomic input substantially degrade performance, which the authors present as evidence that genome structure carries the signal the model relies on. The preprint reports that AntigenLM forecasts future antigenic variants across regions and subtypes — including variants not seen during training — and outperforms phylogenetic and evolution-based baselines, while also achieving near-perfect subtype classification. Exact parameter counts, training corpus size, and full benchmark tables are described in the paper; as of this writing no code or trained weights have been publicly released.
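
The fragment and shuffle ablations described above can be illustrated with a small sketch of how the three input variants might be built. This assumes the "functional units" correspond to the eight influenza A genome segments and that units are joined with a separator character; both are assumptions made for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import random

# Standard influenza A segment order (segments 1 to 8).
# Treating these as the model's "functional units" is an assumption.
SEGMENT_ORDER = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]


def intact_input(genome: dict[str, str], sep: str = "|") -> str:
    """Join segments in canonical order, keeping every unit whole."""
    return sep.join(genome[name] for name in SEGMENT_ORDER)


def shuffled_input(genome: dict[str, str], rng: random.Random, sep: str = "|") -> str:
    """Keep each unit whole but randomize the order of the units."""
    order = SEGMENT_ORDER[:]
    rng.shuffle(order)
    return sep.join(genome[name] for name in order)


def fragmented_input(genome: dict[str, str], rng: random.Random,
                     fragment_len: int = 4, sep: str = "|") -> str:
    """Cut the genome into fixed-length pieces and shuffle the pieces."""
    flat = "".join(genome[name] for name in SEGMENT_ORDER)
    fragments = [flat[i:i + fragment_len] for i in range(0, len(flat), fragment_len)]
    rng.shuffle(fragments)
    return sep.join(fragments)
```

All three variants contain exactly the same nucleotides, so any performance gap between them would come from the arrangement of the input rather than its composition.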

Applications

AntigenLM targets influenza surveillance and vaccine strain selection, where anticipating antigenically novel variants ahead of a season has direct public-health value. By forecasting variants across regions and subtypes from genomic data, it could support epidemiologists and public-health agencies in prioritizing candidate vaccine strains and in monitoring for emerging drift variants. The subtype-classification capability is additionally useful for routine genomic surveillance pipelines that triage incoming influenza sequences.

Impact

AntigenLM advances the application of generative genomic language models to real-time viral surveillance, and its central finding — that preserving genome structure is critical to forecasting performance — offers a methodological lesson for DNA language modeling beyond influenza. Acceptance at ICLR 2026 reflects interest from the machine-learning community. As a recent preprint without released code or weights, its reported results have not yet been independently reproduced, and the absence of a public implementation currently limits external adoption and validation.

Citation

AntigenLM: Structure-Aware DNA Language Modeling for Influenza

Preprint

Pei, Y., et al. (2026) AntigenLM: Structure-Aware DNA Language Modeling for Influenza. arXiv.org.

DOI: 10.48550/arXiv.2602.09067

Citations

Total Citations: 0

Influential: 0

References: 51

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Openness score: 5 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 0 (not reproducible)

Model Openness Framework: Unclassified (restrictive license on core components)

Resources

Research Paper
