Self-supervised transformer for population genetics, pretrained on 1000 Genomes data, that learns selection signatures via site- and haplotype-wise attention.
Popformer is a self-supervised transformer for population genetics, introduced by Leon Zong, Sorelle A. Friedler, and Sara Mathieson in a March 2026 bioRxiv preprint. The model brings to population-scale analysis the pretraining-then-fine-tuning paradigm that transformed protein and genomic sequence modeling. Here the unit of study is not a single sequence but a panel of haplotypes sampled across many individuals. It is among the first population-genetics foundation models, learning general representations of genetic variation that can be reused across downstream evolutionary inference tasks.
The central problem Popformer addresses is detecting signatures of positive selection: genomic regions where particular variants have risen in frequency faster than expected under neutrality. Traditional approaches rely on hand-engineered summary statistics or on supervised classifiers trained on simulations under a specific demographic model, and both can fail when the assumed demography is misspecified. Popformer instead pretrains on real human data and learns representations that transfer, so that, according to the authors, the downstream selection classifier remains accurate even when it is evaluated under misspecified demographic scenarios.
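As a point of reference for what a hand-engineered summary statistic looks like, the sketch below computes per-site derived-allele frequencies and nucleotide diversity (the mean number of pairwise differences between haplotypes) on a toy 0/1 haplotype matrix. It is a generic illustration and is not taken from the preprint; the function names and the toy panel are assumptions made for this example. Classical selection scans build on statistics of this kind, for instance by looking for windows where diversity is unusually low.

```python
from itertools import combinations


def allele_frequencies(haplotypes):
    """Derived-allele frequency at each site of a 0/1 haplotype matrix."""
    n_haps = len(haplotypes)
    n_sites = len(haplotypes[0])
    return [sum(h[j] for h in haplotypes) / n_haps for j in range(n_sites)]


def nucleotide_diversity(haplotypes):
    """Mean number of differing sites over all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


# Toy panel (hypothetical data): 4 haplotypes (rows) x 5 SNP sites (columns),
# where 1 marks the derived allele and 0 the ancestral allele.
panel = [
    [0, 1, 0, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

freqs = allele_frequencies(panel)
pi = nucleotide_diversity(panel)
print(freqs, pi)
```

A statistic like this compresses the haplotype matrix into a single number per window, which is the step a learned representation is meant to replace.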
Popformer is a transformer architecture adapted to operate on haplotype matrices, combining site-wise attention (across SNP positions) with haplotype-wise attention (across the sampled haplotypes). Pretraining uses a masked-modeling objective on real human genomic data from the 1000 Genomes Project, conceptually analogous to genotype imputation: the model learns to reconstruct masked genotypes from the surrounding context. The resulting embeddings of genomic windows align with known population structure in a zero-shot setting. For selection detection, the pretrained encoder is fine-tuned as a classifier and benchmarked against specialized selection-scan methods on simulations spanning both correctly specified and misspecified demographic histories, where the authors report higher accuracy than those methods.
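The two ideas in this paragraph, masking genotypes and attending along both axes of the haplotype matrix, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the mask token value, the 15% masking rate, the random embedding table, the identity query/key/value projections, and the residual connections are all hypothetical choices made to keep the example short. A real model would add learned projections, multiple heads, normalization, positional information, and a prediction head trained to recover the masked entries.

```python
import numpy as np


def mask_genotypes(haps, mask_rate, rng, mask_token=2):
    """Replace a random subset of entries with a mask token.

    Returns the masked copy and the boolean mask of hidden positions.
    """
    mask = rng.random(haps.shape) < mask_rate
    masked = haps.copy()
    masked[mask] = mask_token
    return masked, mask


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x has shape (batch, length, d); attention runs over `length`.
    Queries, keys, and values are x itself (no learned weights).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def axial_block(x):
    """Apply site-wise then haplotype-wise attention to (n_haps, n_sites, d)."""
    # Site-wise: within each haplotype, every site attends to the other sites.
    x = x + self_attention(x)
    # Haplotype-wise: at each site, every haplotype attends to the others.
    xt = np.swapaxes(x, 0, 1)  # (n_sites, n_haps, d)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)


rng = np.random.default_rng(0)
haps = rng.integers(0, 2, size=(8, 16))      # toy panel: 8 haplotypes x 16 sites
masked, mask = mask_genotypes(haps, 0.15, rng)
embed = rng.normal(size=(3, 4))              # rows: allele 0, allele 1, mask token
x = embed[masked]                            # (8, 16, 4) token embeddings
out = axial_block(x)                         # same shape, context-mixed
print(out.shape)
```

Alternating the two attention passes keeps the cost closer to the sum of the two axis lengths squared than to the square of their product, which is the usual motivation for this kind of axial factorization on matrix-shaped inputs.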
Popformer is intended for population geneticists and evolutionary biologists who study natural selection, demographic history, and the structure of human genetic variation. Beyond selection scans, the authors point to future applications such as inferring recombination rates and local ancestry, leveraging the same pretrained backbone. Because the model learns transferable representations, it can serve as a shared starting point for multiple population-genomic inference tasks rather than requiring a bespoke estimator for each.
By demonstrating that a self-supervised transformer pretrained on real human genomes can capture population structure zero-shot and improve selection inference under model misspecification, Popformer extends foundation-model methodology into a field that has historically depended on simulation-trained, demography-specific estimators. As an early population-genetics foundation model, it charts a path toward reusable representations for evolutionary inference. It is still a recent preprint with no confirmed public code release or pretrained weights, so its broader adoption and independent benchmarking remain to be seen.