bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
Built by Pulsatance
DNA & Gene foundation models

cxt (Coalescence and Translation LM)

University of Oregon / Technical University of Munich

Decoder-only transformer that reframes ancestral recombination graph inference as next-token prediction, estimating coalescence times from genetic variation at scale.

Released: June 2025

Reconstructing the genealogical history of a sample of genomes — who shares ancestry with whom, and how long ago their lineages last coalesced — is a central problem in population genetics. These histories, encoded as ancestral recombination graphs (ARGs) and the coalescence times along them, underpin inferences about demography, selection, and recombination. State-of-the-art ARG inference has traditionally relied on computationally intensive Markov chain Monte Carlo (MCMC) or heuristic methods that scale poorly to large samples and many loci.

cxt, short for Coalescence and Translation, recasts this problem as a sequence modeling task. Developed by the Andrew D. Kern lab at the University of Oregon with collaborators at the Technical University of Munich and released as a bioRxiv preprint in June 2025, cxt is a decoder-only transformer that treats ARG inference as next-token prediction, "translating" patterns of mutations into estimates of coalescence times. In doing so, it brings the language-model paradigm that has reshaped protein and genomic modeling to the inference of evolutionary genealogies.

A defining and unusual feature of cxt is its training regime: rather than learning from empirical sequence data, it is pretrained entirely on coalescent simulations generated with stdpopsim, the community standard library of population-genetic simulation models. This lets the model see a broad range of demographic scenarios with known ground-truth genealogies, and the resulting model transfers to real data without retraining.
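The idea of learning from simulations with known ground truth can be illustrated with a toy sketch. This is not stdpopsim and not the cxt training pipeline: it assumes a constant-size population, two lineages, and no recombination, and the names `simulate_pair`, `n_e`, `mu`, and `seq_len` are hypothetical. It only shows how a simulator yields paired inputs (mutation counts) and labels (coalescence times) without any empirical data.

```python
import random

def poisson(lam, rng):
    """Sample a Poisson(lam) count by counting unit-rate arrivals before time lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate_pair(rng, n_e=10_000, mu=1.25e-8, seq_len=100_000):
    """Return (mutation count, TMRCA in generations) for two lineages.

    Assumed toy model: constant diploid population size n_e, per-base
    mutation rate mu, a non-recombining segment of seq_len bases.
    """
    # Pairwise coalescence time is exponential with mean 2 * n_e generations.
    tmrca = rng.expovariate(1.0 / (2 * n_e))
    # Mutations fall on both lineages, so the expected count is 2 * mu * seq_len * tmrca.
    n_mut = poisson(2 * mu * seq_len * tmrca, rng)
    return n_mut, tmrca

rng = random.Random(42)
training_pairs = [simulate_pair(rng) for _ in range(5000)]
mean_tmrca = sum(t for _, t in training_pairs) / len(training_pairs)
mean_mut = sum(m for m, _ in training_pairs) / len(training_pairs)
print(f"mean TMRCA: {mean_tmrca:.0f} generations, mean mutations: {mean_mut:.1f}")
```

Each pair carries its own label, which is what lets a model be trained without labeled empirical sequences. A realistic pipeline would replace the toy model with full demographic simulations that include recombination.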

#Key Features

  • ARG inference as next-token prediction: cxt frames the inference of coalescence times from observed mutations as a translation/language-modeling problem, handled by a decoder-only transformer.
  • Simulation-trained, no empirical labels: The model is pretrained on stdpopsim coalescent simulations, giving it ground-truth genealogies across many demographic scenarios without needing labeled empirical sequences.
  • Robust across distributions: cxt performs on par with MCMC-based methods across both in-distribution and out-of-distribution demographic scenarios, indicating generalization beyond its training regime.
  • Direct transfer to real data: It is applied to human and mosquito empirical datasets without any retraining, demonstrating cross-organism and simulation-to-reality transfer.
  • Calibrated, high-throughput posteriors: cxt generates more than one million TMRCA (time to most recent common ancestor) estimates in minutes, each with calibrated posterior uncertainty.
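What "calibrated" means for these posteriors can be made concrete with a small coverage check: across many sites, a 90% credible interval should contain the true value about 90% of the time. The sketch below is a generic illustration, not the authors' evaluation code. It assumes Gaussian posteriors over log-TMRCA purely for convenience, and `interval_coverage` and `central_interval` are hypothetical helper names.

```python
import random
from statistics import NormalDist

def interval_coverage(truths, intervals):
    """Fraction of true values that fall inside their (low, high) interval."""
    hits = sum(1 for t, (lo, hi) in zip(truths, intervals) if lo <= t <= hi)
    return hits / len(truths)

def central_interval(mean, sd, level=0.9):
    """Central credible interval of an assumed Gaussian posterior."""
    tail = (1.0 - level) / 2.0
    post = NormalDist(mean, sd)
    return post.inv_cdf(tail), post.inv_cdf(1.0 - tail)

rng = random.Random(0)
truths, calibrated, overconfident = [], [], []
for _ in range(5000):
    mean = rng.uniform(8.0, 11.0)  # hypothetical posterior mean of log-TMRCA
    sd = rng.uniform(0.2, 1.0)     # hypothetical posterior standard deviation
    truths.append(rng.gauss(mean, sd))  # truth drawn from the stated posterior
    calibrated.append(central_interval(mean, sd))
    overconfident.append(central_interval(mean, sd / 2))  # intervals too narrow

coverage_ok = interval_coverage(truths, calibrated)
coverage_narrow = interval_coverage(truths, overconfident)
print(f"calibrated: {coverage_ok:.3f}, overconfident: {coverage_narrow:.3f}")
```

Coverage close to the nominal level indicates calibration. Coverage well below it, as in the deliberately narrowed intervals, indicates overconfident uncertainty estimates.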

#Technical Details

cxt is a decoder-only autoregressive transformer trained with a next-token prediction objective. Inputs are patterns of mutations along the genome; outputs are coalescence times, including TMRCA estimates, framed as the tokens to be predicted. Training data come exclusively from coalescent simulations produced with stdpopsim, which supplies diverse demographic models and known genealogies. The trained model matches MCMC accuracy across in- and out-of-distribution scenarios while operating far faster, producing over one million TMRCA estimates in minutes with calibrated posterior distributions. The authors validate transfer by applying the simulation-trained model directly to empirical human and mosquito data without fine-tuning.
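Framing coalescence times as tokens implies some mapping between continuous times and a discrete vocabulary. The description above does not specify how cxt does this, so the following is only a sketch under stated assumptions: log-spaced time bins, one token per bin, and a posterior read off as per-token probabilities. All function names, the bin range, and the example probabilities are hypothetical.

```python
import math
from bisect import bisect_right

def make_log_bins(t_min, t_max, n_bins):
    """Return n_bins + 1 log-spaced bin edges between t_min and t_max."""
    step = (math.log(t_max) - math.log(t_min)) / n_bins
    return [math.exp(math.log(t_min) + i * step) for i in range(n_bins + 1)]

def time_to_token(t, edges):
    """Map a coalescence time to its bin index, clamping out-of-range values."""
    idx = bisect_right(edges, t) - 1
    return min(max(idx, 0), len(edges) - 2)

def token_to_time(token, edges):
    """Representative time for a token: the geometric midpoint of its bin."""
    return math.sqrt(edges[token] * edges[token + 1])

def quantile(probs, edges, q):
    """Time at posterior quantile q, given per-token probabilities."""
    cum = 0.0
    for token, p in enumerate(probs):
        cum += p
        if cum >= q:
            return token_to_time(token, edges)
    return token_to_time(len(probs) - 1, edges)

edges = make_log_bins(10.0, 1e6, 5)   # assumed range, in generations
probs = [0.1, 0.2, 0.4, 0.2, 0.1]     # made-up token probabilities for one site
median = quantile(probs, edges, 0.5)
low, high = quantile(probs, edges, 0.05), quantile(probs, edges, 0.95)
print(f"median ~ {median:.0f}, 90% interval ~ [{low:.0f}, {high:.0f}] generations")
```

Log spacing is a natural choice here because coalescence times span several orders of magnitude, but the bin count and range would need tuning to the species and demography in question.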

#Applications

cxt is aimed at population and evolutionary geneticists who need fast, calibrated estimates of coalescence times and genealogical structure across large genomic datasets. Use cases include demographic inference, scans for selection, recombination-rate estimation, and any downstream analysis that consumes ARGs or TMRCA distributions. Its speed makes genome-wide, many-sample analyses tractable, and its direct transfer to human and disease-vector (mosquito) data illustrates utility for both human-genetics and ecological/epidemiological genomics.

#Impact

cxt is part of a wave of work applying transformer language models to genealogical and population-genetic inference, and it is notable for demonstrating that a model trained purely on simulations can rival established MCMC pipelines while running orders of magnitude faster with calibrated uncertainty. This simulation-based training strategy is both its strength and its main caveat: because cxt never sees empirical sequences during training, its real-world accuracy depends on how faithfully stdpopsim simulations capture the demography and biology of the target species. As a preprint, its results await peer review, but the code and pretrained checkpoints are openly released under an MIT license at kr-colab/cxt, supporting independent reproduction and reuse.

GitHub

Stars: 6
Forks: 0
Open Issues: 1
Contributors: 3
Last Push: 2mo ago
Language: TeX
License: MIT

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 83 (Open)
Usability (can I run it?): 91
Reproducibility (can I retrain it?): 83
Model Openness Framework: Unclassified (missing required components)

Tags

ancestral_recombination_graph_inference, coalescence_time_estimation, transformer, language_model, self_supervised, population_genetics, genomics

Resources

GitHub Repository · Research Paper