bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

DNA & Gene foundation models

AntigenLM

Chinese Academy of Sciences / Beijing Institute of Genomics

A structure-aware generative DNA language model pretrained on influenza genomes that forecasts future antigenic variants across regions and subtypes.

Released: February 2026

AntigenLM is a generative DNA language model built to forecast the antigenic evolution of influenza viruses directly from genomic sequence. Predicting which influenza strains will dominate in upcoming seasons is central to vaccine strain selection, and it remains difficult because antigenic change is driven by complex evolutionary pressures on the viral genome. AntigenLM approaches this as a sequence-modeling problem, pretraining on influenza genomes and then fine-tuning on time-series surveillance data to anticipate emerging variants.

The model's distinguishing idea is "structure-aware" DNA language modeling: rather than treating the genome as an undifferentiated string, AntigenLM preserves the integrity of functional genomic units during pretraining. The authors find that disrupting this organization, by fragmenting or shuffling the genome, severely degrades performance, which they take as evidence that the model learns evolutionary constraints tied to genome structure rather than surface statistics alone. AntigenLM was introduced by Yue Pei, Xuebin Chi, and Yu Kang (Chinese Academy of Sciences and affiliated institutes) in a February 2026 preprint accepted at ICLR 2026.

It sits within a growing class of genomic language models applied to viral surveillance, alongside influenza-focused efforts such as Influ-BERT, but is notable for its generative formulation and its explicit emphasis on preserving aligned functional units across whole genomes.

#Key Features

  • Structure-aware pretraining: The model is trained on whole-genome influenza sequences with intact, aligned functional units, and the authors show that disrupting this structure markedly reduces forecasting accuracy.
  • Antigenic variant forecasting: After fine-tuning on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM predicts future antigenic variants across geographic regions and subtypes, including some unseen during training.
  • Generative DNA modeling: AntigenLM is formulated as a generative language model over nucleotide sequence rather than a discriminative classifier, enabling it to model the distribution of plausible evolutionary trajectories.
  • Near-perfect subtype classification: The preprint reports that the learned representations support near-perfect influenza subtype classification as a downstream task.
  • Outperforms evolutionary baselines: The model is reported to beat phylogenetic and evolution-based forecasting methods on antigenic variant prediction.
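The structure-aware idea and its ablations can be illustrated with a small sketch of how the three kinds of input might be built. This is not the authors' code, which has not been released. It assumes, purely for illustration, that the "functional units" are the eight influenza A genome segments and that units are joined with a `|` separator; the function names and toy sequences are invented.

```python
import random

# Influenza A segment order (standard numbering 1-8). Treating whole segments
# as the "functional units" is an assumption made for this illustration.
SEGMENT_ORDER = ["PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"]


def intact_genome(segments: dict[str, str], sep: str = "|") -> str:
    """Join segments in a fixed order so each unit stays whole and aligned."""
    return sep.join(segments[name] for name in SEGMENT_ORDER)


def shuffled_genome(segments: dict[str, str], rng: random.Random, sep: str = "|") -> str:
    """Ablation: keep each segment whole but randomize the segment order."""
    order = SEGMENT_ORDER[:]
    rng.shuffle(order)
    return sep.join(segments[name] for name in order)


def fragmented_genome(segments: dict[str, str], rng: random.Random, chunk: int = 4) -> str:
    """Ablation: cut the genome into fixed-size chunks and shuffle them,
    destroying unit boundaries as well as unit order."""
    flat = "".join(segments[name] for name in SEGMENT_ORDER)
    pieces = [flat[i:i + chunk] for i in range(0, len(flat), chunk)]
    rng.shuffle(pieces)
    return "".join(pieces)


# Toy sequences, far shorter than real segments and with no biological meaning.
toy = {name: base * 6 for name, base in zip(SEGMENT_ORDER, "ACGTACGT")}
rng = random.Random(0)
print(intact_genome(toy))
print(shuffled_genome(toy, rng))
print(fragmented_genome(toy, rng))
```

All three variants contain exactly the same nucleotides, so any performance gap between them would come from organization rather than composition, which is the comparison the ablations are described as making.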

#Technical Details

AntigenLM is a generative DNA language model pretrained on influenza genomes in which functional units are kept aligned and intact, a design the authors argue is essential for capturing evolutionary constraints. For the forecasting task, the pretrained model is fine-tuned on time-series HA and NA sequences, which encode the two surface glycoproteins that drive influenza antigenic drift. Ablations that fragment or shuffle the genomic input substantially degrade performance, which the authors present as evidence that genome structure carries the signal the model relies on. The preprint reports that AntigenLM forecasts future antigenic variants across regions and subtypes, including variants not seen during training, and outperforms phylogenetic and evolution-based baselines, while also achieving near-perfect subtype classification. Exact parameter counts, training corpus size, and full benchmark tables are described in the paper; as of this writing no code or trained weights have been publicly released.
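Forecasting on time-series data implies a split by date rather than at random, so that future variants stay out of training. The sketch below shows one way such a split could look. It is an assumption about the evaluation setup, not the paper's protocol: the `Record` fields, the cutoff date, and the toy records are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """One dated surveillance sequence (all field names are hypothetical)."""
    strain: str
    gene: str          # "HA" or "NA"
    region: str
    collected: date
    sequence: str


def temporal_split(records, cutoff):
    """Train on sequences collected before `cutoff`; hold out the rest.

    Splitting by date keeps future variants out of the training window,
    which a forecasting evaluation requires.
    """
    ordered = sorted(records, key=lambda r: r.collected)
    train = [r for r in ordered if r.collected < cutoff]
    future = [r for r in ordered if r.collected >= cutoff]
    return train, future


def unseen_variants(train, future):
    """Held-out records whose (gene, sequence) never appears in training."""
    seen = {(r.gene, r.sequence) for r in train}
    return [r for r in future if (r.gene, r.sequence) not in seen]


# Invented toy records; strain names and sequences are placeholders.
records = [
    Record("toy-1", "HA", "north", date(2020, 1, 10), "ATGAAA"),
    Record("toy-2", "HA", "south", date(2021, 6, 1), "ATGAAC"),
    Record("toy-3", "HA", "north", date(2022, 2, 15), "ATGAAC"),
    Record("toy-4", "NA", "north", date(2022, 3, 1), "ATGCCC"),
]
train, future = temporal_split(records, date(2022, 1, 1))
novel = unseen_variants(train, future)
print(len(train), len(future), [r.strain for r in novel])
```

Separating `unseen_variants` from the full held-out set mirrors the distinction the preprint draws between forecasting in general and forecasting variants not seen during training.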

#Applications

AntigenLM targets influenza surveillance and vaccine strain selection, where anticipating antigenically novel variants ahead of a season has direct public-health value. By forecasting variants across regions and subtypes from genomic data, it could support epidemiologists and public-health agencies in prioritizing candidate vaccine strains and in monitoring for emerging drift variants. The subtype-classification capability is additionally useful for routine genomic surveillance pipelines that triage incoming influenza sequences.
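As a rough picture of the triage use case, a surveillance pipeline could route incoming sequences by predicted subtype and send low-confidence calls to manual review. The sketch below is hypothetical throughout: AntigenLM has no public API, so the classifier signature, the confidence threshold, and the toy classifier (a rule with no biological meaning) are stand-ins.

```python
from collections import defaultdict
from typing import Callable, Iterable

# A classifier maps a nucleotide sequence to (subtype label, confidence).
# This signature is an assumption; no real model interface is implied.
Classifier = Callable[[str], tuple[str, float]]


def triage(sequences: Iterable[tuple[str, str]], classify: Classifier,
           min_conf: float = 0.9) -> dict[str, list[str]]:
    """Group (sample_id, sequence) pairs by predicted subtype.

    Calls below `min_conf` go to a "review" queue instead of a subtype bin.
    """
    queues: dict[str, list[str]] = defaultdict(list)
    for sample_id, seq in sequences:
        subtype, conf = classify(seq)
        queues[subtype if conf >= min_conf else "review"].append(sample_id)
    return dict(queues)


def toy_classifier(seq: str) -> tuple[str, float]:
    """Stand-in rule with no biological meaning: label by GC fraction."""
    gc = sum(base in "GC" for base in seq) / max(len(seq), 1)
    return ("H3N2" if gc >= 0.5 else "H1N1", abs(gc - 0.5) * 2)


incoming = [("s1", "GGGGCCCC"), ("s2", "ATATATAT"), ("s3", "ACGTACGA")]
print(triage(incoming, toy_classifier))
```

Passing the classifier in as a callable keeps the routing logic independent of whichever model eventually supplies the predictions.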

#Impact

AntigenLM advances the application of generative genomic language models to real-time viral surveillance, and its central finding, that preserving genome structure is critical to forecasting performance, offers a methodological lesson for DNA language modeling beyond influenza. Acceptance at ICLR 2026 reflects interest from the machine-learning community. Because no code or weights have been released, the reported results have not yet been independently reproduced, and the absence of a public implementation currently limits external adoption and validation.

Tags

variant_effect_prediction, sequence_generation, transformer, language_model, generative, foundation_model, genomics, dna