cxt (Coalescence and Translation LM)

University of Oregon / Technical University of Munich

Decoder-only transformer that recasts ancestral recombination graph inference as next-token prediction, estimating coalescence times from patterns of mutations.

Released: June 2025

Reconstructing the genealogical history of a sample of genomes (who shares ancestry with whom, and how long ago their lineages last coalesced) is a central problem in population genetics. These histories, encoded as ancestral recombination graphs (ARGs) and the coalescence times along them, underpin inferences about demography, selection, and recombination. State-of-the-art ARG inference has traditionally relied on computationally intensive Markov chain Monte Carlo (MCMC) or heuristic methods that scale poorly to large samples and many loci.

cxt, short for Coalescence and Translation, recasts this problem as a sequence modeling task. Developed by the Andrew D. Kern lab at the University of Oregon with collaborators at the Technical University of Munich and released as a bioRxiv preprint in June 2025, cxt is a decoder-only transformer that treats ARG inference as next-token prediction, "translating" patterns of mutations into estimates of coalescence times. In doing so it brings the language-model paradigm that has reshaped protein and genomic modeling to the inference of evolutionary genealogies.

A defining and unusual feature of cxt is its training regime: rather than learning from empirical sequence data, it is pretrained entirely on coalescent simulations generated with stdpopsim, the community standard library of population-genetic simulation models. This lets the model see a broad range of demographic scenarios with known ground-truth genealogies, and the resulting model transfers to real data without retraining.
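To make "simulations with known ground-truth genealogies" concrete, here is a minimal toy sketch in pure Python. It is not the stdpopsim pipeline used for cxt: it assumes two lineages, a constant diploid population size, and no recombination, and the helpers `poisson` and `simulate_pair` are hypothetical names introduced only for illustration. Each draw yields a true TMRCA together with the mutation count an inference method would observe.

```python
import random

def poisson(mean, rng):
    """Sample a Poisson count by counting unit-rate arrivals in [0, mean)."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_pair(n_e, mu, seq_len, rng):
    """Draw one (true TMRCA, observed mutation count) pair.

    Toy assumptions: two lineages, constant diploid population size n_e,
    per-base per-generation mutation rate mu, no recombination.
    """
    # Two lineages coalesce at rate 1 / (2 * n_e) per generation.
    tmrca = rng.expovariate(1.0 / (2.0 * n_e))
    # Mutations accumulate on both branches, each of length tmrca.
    n_mut = poisson(2.0 * mu * seq_len * tmrca, rng)
    return tmrca, n_mut

rng = random.Random(42)
pairs = [simulate_pair(n_e=10_000, mu=1.25e-8, seq_len=10_000, rng=rng)
         for _ in range(20_000)]
mean_tmrca = sum(t for t, _ in pairs) / len(pairs)
mean_muts = sum(m for _, m in pairs) / len(pairs)
print(f"mean TMRCA: {mean_tmrca:.0f} generations, mean mutations: {mean_muts:.2f}")
```

Because the simulator records the TMRCA that produced each mutation count, every training example comes with an exact label, which is what empirical sequence data cannot provide.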

Key Features

ARG inference as next-token prediction: cxt frames the inference of coalescence times from observed mutations as a translation/language-modeling problem, handled by a decoder-only transformer.
Simulation-trained, no empirical labels: The model is pretrained on stdpopsim coalescent simulations, giving it ground-truth genealogies across many demographic scenarios without needing labeled empirical sequences.
Robust across distributions: cxt performs on par with MCMC-based methods across both in-distribution and out-of-distribution demographic scenarios, indicating generalization beyond its training regime.
Direct transfer to real data: It is applied to human and mosquito empirical datasets without any retraining, demonstrating cross-organism and simulation-to-reality transfer.
Calibrated, high-throughput posteriors: cxt generates more than one million TMRCA (time to most recent common ancestor) estimates in minutes, each with calibrated posterior uncertainty.
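One common reading of "calibrated" in the last feature is that a nominal 90% credible interval should contain the true TMRCA in roughly 90% of cases. The sketch below illustrates that check on synthetic data only. It is an assumption about how calibration might be assessed, not the evaluation code from the cxt preprint, and `interval_coverage` and `exp_quantile` are hypothetical helper names.

```python
import math
import random

def interval_coverage(truths, lowers, uppers):
    """Fraction of true values that fall inside their [lower, upper] interval."""
    hits = sum(lo <= t <= hi for t, lo, hi in zip(truths, lowers, uppers))
    return hits / len(truths)

def exp_quantile(mean, p):
    """Quantile function of an exponential distribution with the given mean."""
    return -mean * math.log(1.0 - p)

# Toy check (assumption): the truths are drawn from the same exponential
# distribution that the intervals are computed from, so a central 90%
# interval should cover about 90% of them.
rng = random.Random(0)
mean_tmrca = 20_000.0  # generations; arbitrary illustrative value
truths = [rng.expovariate(1.0 / mean_tmrca) for _ in range(20_000)]
lower = exp_quantile(mean_tmrca, 0.05)
upper = exp_quantile(mean_tmrca, 0.95)
coverage = interval_coverage(truths, [lower] * len(truths), [upper] * len(truths))
print(f"empirical coverage of the nominal 90% interval: {coverage:.3f}")
```

With real model output, `lowers` and `uppers` would be per-estimate interval bounds and `truths` would come from simulations where the genealogy is known. Coverage well below the nominal level signals overconfident posteriors, and coverage well above it signals intervals that are wider than necessary.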

Technical Details

cxt is a decoder-only autoregressive transformer trained with a next-token prediction objective. Inputs are patterns of mutations along the genome; outputs are coalescence times, including TMRCA estimates, framed as the tokens to be predicted. Training data come exclusively from coalescent simulations produced with stdpopsim, which supplies diverse demographic models and known genealogies. The trained model matches MCMC accuracy across in- and out-of-distribution scenarios while operating far faster, producing over one million TMRCA estimates in minutes with calibrated posterior distributions. The authors validate transfer by applying the simulation-trained model directly to empirical human and mosquito data without fine-tuning.
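Predicting coalescence times as tokens requires mapping continuous times onto a finite vocabulary. The description above does not say how cxt does this, so the following is a sketch under stated assumptions: log-spaced time bins, a bin count chosen arbitrarily, and hypothetical helper names (`make_log_bins`, `time_to_token`, `token_to_time`, `credible_interval`). It also shows how a predicted probability distribution over time tokens could be turned into a credible interval.

```python
import bisect
import math

def make_log_bins(t_min, t_max, n_bins):
    """Return n_bins + 1 log-spaced bin edges between t_min and t_max."""
    step = (math.log(t_max) - math.log(t_min)) / n_bins
    return [math.exp(math.log(t_min) + i * step) for i in range(n_bins + 1)]

def time_to_token(t, edges):
    """Map a coalescence time to its bin index, clamping out-of-range values."""
    n_bins = len(edges) - 1
    if t <= edges[0]:
        return 0
    if t >= edges[-1]:
        return n_bins - 1
    return bisect.bisect_right(edges, t) - 1

def token_to_time(token, edges):
    """Represent a bin by its geometric midpoint."""
    return math.sqrt(edges[token] * edges[token + 1])

def credible_interval(probs, edges, level=0.9):
    """Central credible interval from per-bin probabilities (bin-edge resolution)."""
    lo_p = (1.0 - level) / 2.0
    hi_p = 1.0 - lo_p
    lower = upper = None
    cum = 0.0
    for k, p in enumerate(probs):
        cum += p
        if lower is None and cum >= lo_p:
            lower = edges[k]
        if cum >= hi_p:
            upper = edges[k + 1]
            break
    if lower is None:
        lower = edges[0]
    if upper is None:
        upper = edges[-1]
    return lower, upper

# Illustrative vocabulary (assumption): 50 bins from 10 to 1e6 generations.
edges = make_log_bins(10.0, 1e6, 50)
token = time_to_token(20_000.0, edges)
print(token, token_to_time(token, edges))
```

Log spacing is a natural choice here because coalescence times span several orders of magnitude, but a different discretization would fit the same next-token framing.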

Applications

cxt is aimed at population and evolutionary geneticists who need fast, calibrated estimates of coalescence times and genealogical structure across large genomic datasets. Use cases include demographic inference, scans for selection, recombination-rate estimation, and any downstream analysis that consumes ARGs or TMRCA distributions. Its speed makes genome-wide, many-sample analyses tractable, and its direct transfer to human and disease-vector (mosquito) data illustrates utility for both human-genetics and ecological/epidemiological genomics.

Impact

cxt is part of a wave of work applying transformer language models to genealogical and population-genetic inference, and it is notable for demonstrating that a model trained purely on simulations can rival established MCMC pipelines while running orders of magnitude faster with calibrated uncertainty. This simulation-based training strategy is both its strength and its main caveat: because cxt never sees empirical sequences during training, its real-world accuracy depends on how faithfully stdpopsim simulations capture the demography and biology of the target species. As a preprint, its results await peer review, but the code and pretrained checkpoints are openly released under an MIT license at kr-colab/cxt, supporting independent reproduction and reuse.

Citation

Coalescence and Translation

Preprint

Korfmann, K., et al. (2025) Coalescence and Translation. bioRxiv.

DOI: 10.1101/2025.06.24.661337

Citations

Total Citations: 0

Influential: 0

References: 70

GitHub

Stars: 7

Forks: 1

Open Issues: 1

Contributors: 3

Last Push: 4mo ago

Language: TeX

License: MIT

Openness

bio.rodeo openness: Fully open · usable and reproducible

Score: 83 (Open)

Usability (can I run it?): 91

Reproducibility (can I retrain it?): 83

Model Openness Framework: Unclassified (missing required components)
