University of Oregon / Technical University of Munich
Decoder-only transformer that reframes ancestral recombination graph inference as next-token prediction, estimating coalescence times from genetic variation at scale.
Reconstructing the genealogical history of a sample of genomes (who shares ancestry with whom, and how long ago their lineages last coalesced) is a central problem in population genetics. These histories, encoded as ancestral recombination graphs (ARGs) and the coalescence times along them, underpin inferences about demography, selection, and recombination. ARG inference has traditionally relied on computationally intensive Markov chain Monte Carlo (MCMC) or heuristic methods that scale poorly to large samples and many loci.
cxt, short for Coalescence and Translation, recasts this problem as a sequence modeling task. Developed by the Andrew D. Kern lab at the University of Oregon with collaborators at the Technical University of Munich and released as a bioRxiv preprint in June 2025, cxt is a decoder-only transformer that treats ARG inference as next-token prediction, "translating" patterns of mutations into estimates of coalescence times. In doing so it brings the language-model paradigm that has reshaped protein and genomic modeling to the inference of evolutionary genealogies.
A defining and unusual feature of cxt is its training regime: rather than learning from empirical sequence data, it is pretrained entirely on coalescent simulations generated with stdpopsim, a community-maintained library of standard population-genetic simulation models. Because every simulated dataset comes with its true genealogy, the model sees a broad range of demographic scenarios with known ground-truth coalescence times, and the authors report that the resulting model transfers to real data without retraining.
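To make the idea of simulated training data with known ground truth concrete, here is a minimal sketch that uses only the Python standard library. It is not the stdpopsim API and not cxt's pipeline. It assumes the simplest possible setting: two sampled lineages, one non-recombining locus, and a constant-size population, where the pairwise coalescence time is exponentially distributed with mean 2N generations and mutations accumulate as a Poisson process along both branches. The function names and parameter values are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for the small means used here; not meant for large means.
    """
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_pair(rng, n_diploid, mu, n_sites):
    """Simulate one locus for two lineages; return (n_mutations, tmrca).

    n_diploid: constant diploid population size N (assumed value)
    mu:        mutation rate per site per generation (assumed value)
    n_sites:   locus length in base pairs (assumed value)
    """
    # Pairwise coalescence time: exponential with mean 2N generations.
    tmrca = rng.expovariate(1.0 / (2 * n_diploid))
    # Mutations fall on both branches, so total branch length is 2 * tmrca.
    n_mut = poisson(rng, 2 * mu * n_sites * tmrca)
    return n_mut, tmrca


# Each pair is (observed input, ground-truth label), as in supervised training.
rng = random.Random(42)
training_pairs = [
    simulate_pair(rng, n_diploid=10_000, mu=1e-8, n_sites=10_000)
    for _ in range(5_000)
]
```

The point of the sketch is only that the label (the true coalescence time) is available for every simulated example, which is never the case for empirical data. Real training sets of this kind would use many samples, recombination, and the demographic models that stdpopsim catalogs.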
cxt is a decoder-only autoregressive transformer trained with a next-token prediction objective. Inputs are patterns of mutations along the genome; outputs are coalescence times, including estimates of the time to the most recent common ancestor (TMRCA), framed as the tokens to be predicted. Training data come exclusively from coalescent simulations produced with stdpopsim, which supplies diverse demographic models and known genealogies. The authors report that the trained model matches MCMC accuracy across in- and out-of-distribution scenarios while operating far faster, producing over one million TMRCA estimates in minutes with calibrated posterior distributions. They validate transfer by applying the simulation-trained model directly to empirical human and mosquito data without fine-tuning.
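One way to picture coalescence times as tokens is to discretize time into log-spaced bins, so that a distribution over the next token doubles as a posterior over time. The sketch below illustrates that idea under stated assumptions; it is not cxt's documented tokenization. The vocabulary size, the time range, and all function names are hypothetical.

```python
import math

N_BINS = 64                # hypothetical number of time tokens
T_MIN, T_MAX = 10.0, 1e6   # hypothetical time range, in generations
_LOG_MIN = math.log(T_MIN)
_LOG_WIDTH = (math.log(T_MAX) - _LOG_MIN) / N_BINS


def time_to_token(t):
    """Map a coalescence time (generations) to a log-spaced bin index."""
    t = min(max(t, T_MIN), T_MAX)  # clamp to the supported range
    return min(int((math.log(t) - _LOG_MIN) / _LOG_WIDTH), N_BINS - 1)


def token_to_time(k):
    """Return the geometric midpoint of bin k, in generations."""
    return math.exp(_LOG_MIN + (k + 0.5) * _LOG_WIDTH)


def softmax(logits):
    """Convert raw scores over tokens into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(logits, level=0.9):
    """Summarize a predicted token distribution as a posterior over time.

    Returns (geometric_mean_time, lower, upper), where lower and upper are
    the bin midpoints at the central `level` credible interval bounds.
    """
    probs = softmax(logits)
    mean_log = sum(p * math.log(token_to_time(k)) for k, p in enumerate(probs))
    lo_q, hi_q = (1 - level) / 2, 1 - (1 - level) / 2
    cum, lower, upper = 0.0, None, None
    for k, p in enumerate(probs):
        cum += p
        if lower is None and cum >= lo_q:
            lower = token_to_time(k)
        if upper is None and cum >= hi_q:
            upper = token_to_time(k)
    if upper is None:  # guard against rounding in the cumulative sum
        upper = token_to_time(N_BINS - 1)
    return math.exp(mean_log), lower, upper


# Example: uniform scores stand in for a model's output over time tokens.
point, lower, upper = summarize([0.0] * N_BINS)
```

Log spacing is a natural choice here because coalescence times span several orders of magnitude, but the bin layout a real model uses is a design decision that this sketch does not attempt to reproduce.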
cxt is aimed at population and evolutionary geneticists who need fast, calibrated estimates of coalescence times and genealogical structure across large genomic datasets. Use cases include demographic inference, scans for selection, recombination-rate estimation, and any downstream analysis that consumes ARGs or TMRCA distributions. Its speed makes genome-wide, many-sample analyses tractable, and its direct transfer to human and disease-vector (mosquito) data illustrates utility for both human-genetics and ecological/epidemiological genomics.
cxt is part of a wave of work applying transformer language models to genealogical and population-genetic inference, and it is notable for demonstrating that a model trained purely on simulations can rival established MCMC pipelines while running orders of magnitude faster with calibrated uncertainty. This simulation-based training strategy is both its strength and its main caveat: because cxt never sees empirical sequences during training, its real-world accuracy depends on how faithfully stdpopsim simulations capture the demography and biology of the target species. As a preprint, its results await peer review, but the code and pretrained checkpoints are openly released under an MIT license at kr-colab/cxt, supporting independent reproduction and reuse.
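For readers who want to probe the calibration claim independently, a common check is empirical coverage: on held-out simulations where the true coalescence time is known, a nominal 90% credible interval should contain the truth about 90% of the time. The sketch below shows the check on synthetic Gaussian data rather than on cxt output. The function name and the toy data are assumptions for illustration.

```python
import random


def coverage(truths, intervals):
    """Fraction of true values that fall inside their (low, high) interval."""
    hits = sum(low <= t <= high for t, (low, high) in zip(truths, intervals))
    return hits / len(truths)


# Toy setup: each truth is drawn from a unit-variance normal around a center.
rng = random.Random(0)
centers = [rng.uniform(-5.0, 5.0) for _ in range(20_000)]
truths = [rng.gauss(c, 1.0) for c in centers]

# A calibrated 90% interval for a unit normal is center +/- 1.645.
calibrated = [(c - 1.645, c + 1.645) for c in centers]
# An overconfident predictor reports intervals half as wide.
overconfident = [(c - 0.8225, c + 0.8225) for c in centers]

calibrated_cov = coverage(truths, calibrated)        # expected near 0.90
overconfident_cov = coverage(truths, overconfident)  # expected well below 0.90
```

Coverage measured on simulations says nothing about model misspecification, so the same check cannot confirm calibration on empirical data where the true genealogy is unknown.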