Technion – Israel Institute of Technology / Tel Aviv University / Kempner Institute
Generative transformer framework that infers phylogenetic trees by transducing sets of unaligned molecular sequences directly into Newick-format trees.
Phylogenetic tree inference, the reconstruction of evolutionary relationships among a set of related sequences, is a foundational problem in computational biology. Conventional pipelines are multi-stage: sequences are first aligned, and a tree is then inferred under a likelihood-based or distance-based criterion, typically with iterative heuristic search over the enormous space of possible topologies. BetaInfer, introduced in a 2026 bioRxiv preprint by Edo Dotan and colleagues, reframes this entire process as a single sequence transduction task, learning to map an unaligned set of molecular sequences directly to a tree.
BetaInfer treats a tree as a string in Newick notation and trains a hybrid transformer-based architecture to generate that string from the raw input sequences, borrowing the encoder–decoder transduction paradigm from natural language processing. This places it in the same lineage as the authors' earlier BetaAlign work, which applied transformers to multiple sequence alignment, and reflects a broader shift toward learned, end-to-end replacements for classical bioinformatics pipelines.
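To make the "tree as a string" idea concrete, the following minimal Python sketch converts a small tree to Newick notation and back. It is an illustration only, not BetaInfer's code: the nested-tuple representation and the helper names `to_newick` and `parse_newick` are assumptions introduced here, and the parser handles topology only (no branch lengths or internal node labels).

```python
# Illustrative sketch, not the authors' implementation.
# A tree is a nested tuple; leaves are plain strings (hypothetical representation).

def to_newick(node):
    """Serialize a nested-tuple tree (topology only) to Newick, without the trailing ';'."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def parse_newick(text):
    """Parse a topology-only Newick string back into nested tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
            return tuple(children)
        # Leaf: read characters up to the next structural symbol.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


tree = (("human", "chimp"), ("mouse", "rat"))
newick = to_newick(tree) + ";"
print(newick)  # ((human,chimp),(mouse,rat));
assert parse_newick(newick) == tree
```

Because the target is an ordinary character string, a sequence-to-sequence decoder can in principle emit it token by token, which is what makes the transduction framing possible.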
The model is trained once on large-scale simulated evolutionary data with known ground-truth trees and then applied, without any retraining or fine-tuning, to both held-out simulated datasets and empirical datasets. In this sense it acts as a pretrained, generative foundation model for phylogenetics rather than a per-dataset optimizer.
BetaInfer uses hybrid transformer-based encoder–decoder architectures that consume a set of unaligned sequences and autoregressively emit a tree as a Newick string. Training relies on large-scale simulation: evolutionary histories with known ground-truth topologies are generated, and the model learns to recover the generating tree from the resulting sequences. Because supervision comes from simulated trees, the approach sidesteps the need for curated empirical training labels, and the same trained model is then applied zero-shot to new data. The reported headline result is that generating an ensemble of candidate trees lowers reconstruction error by over 30% compared with single predictions, and that the model remains competitive with standard likelihood- and distance-based baselines. An interpretability analysis suggests that the network implicitly computes pairwise distances between sequences as part of its inference. The work is released as a preprint under a CC BY-NC license; at the time of writing, no public code or model weights were available.
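The summary above does not say how the ensemble of candidate trees is combined or which error metric is used. As one plausible illustration, the Python sketch below scores trees with a rooted, clade-based Robinson-Foulds distance and picks the candidate that is closest on average to the others (a medoid). Both choices, and the helper names `clades`, `rf_distance`, and `medoid`, are assumptions made here for illustration, not the authors' method.

```python
# Illustrative sketch under stated assumptions; not BetaInfer's aggregation rule.

def clades(newick):
    """Return the non-trivial clades (leaf sets of internal nodes) of a
    rooted, topology-only Newick string."""
    text = newick.strip().rstrip(";")
    pos = 0
    found = set()

    def walk():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            leaves = set(walk())
            while text[pos] == ",":
                pos += 1  # consume ","
                leaves |= walk()
            pos += 1  # consume ")"
            found.add(frozenset(leaves))
            return leaves
        # Leaf: read characters up to the next structural symbol.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return {text[start:pos]}

    all_leaves = walk()
    found.discard(frozenset(all_leaves))  # the root clade is shared by every tree
    return found


def rf_distance(a, b):
    """Rooted Robinson-Foulds distance: number of clades found in exactly one tree."""
    return len(clades(a) ^ clades(b))


def medoid(candidates):
    """Pick the candidate with the smallest total RF distance to all candidates."""
    return min(candidates, key=lambda t: sum(rf_distance(t, u) for u in candidates))


# Hypothetical ensemble: two candidates agree on topology, one disagrees.
candidates = ["((A,B),(C,D));", "((B,A),(D,C));", "((A,C),(B,D));"]
truth = "((A,B),(C,D));"

best = medoid(candidates)
print(best, rf_distance(truth, best))
```

In this toy ensemble the two agreeing candidates outvote the outlier, which is the general intuition behind trading extra sampled trees for lower error.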
BetaInfer targets researchers in molecular evolution, comparative genomics, and systematics who need to reconstruct phylogenies from sets of related sequences. By replacing a multi-stage alignment-plus-search pipeline with a single forward pass, it offers a potentially faster and more scalable route to candidate trees, with the ensemble mechanism providing a built-in way to trade compute for accuracy. Its zero-shot applicability to empirical data means practitioners could, in principle, apply a pretrained model without configuring substitution models or tuning search heuristics for each dataset.
BetaInfer is part of an emerging body of work demonstrating that generative, NLP-style models can serve as viable and scalable alternatives to classical phylogenetic inference pipelines. By showing competitive accuracy and a substantial ensemble-driven error reduction on both simulated and real data, it strengthens the case for learned end-to-end methods in a domain long dominated by likelihood and distance heuristics. As a preprint without released code or weights, its near-term adoption is limited and its results await independent benchmarking, but it points toward a future in which pretrained foundation models handle core comparative-genomics tasks directly from raw sequences.
Dotan, E., et al. (2026) Phylogenetic tree inference using generative models. bioRxiv.
DOI: 10.64898/2026.06.14.732140