Phylogenetic tree inference — reconstructing the evolutionary relationships among a set of related sequences — is a foundational problem in computational biology. Conventional pipelines are multi-stage: sequences are first aligned, and a tree is then inferred under a likelihood-based or distance-based criterion, typically with iterative heuristic search over the enormous space of possible topologies. BetaInfer, introduced in a 2026 bioRxiv preprint by Edo Dotan and colleagues, reframes this entire process as a single sequence transduction task, learning to map an unaligned set of molecular sequences directly to a tree.

BetaInfer treats a tree as a string in Newick notation and trains a hybrid transformer-based architecture to generate that string from the raw input sequences, borrowing the encoder–decoder transduction paradigm from natural language processing. This places it in the same lineage as the authors' earlier BetaAlign work, which applied transformers to multiple sequence alignment, and reflects a broader shift toward learned, end-to-end replacements for classical bioinformatics pipelines.
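
To make the target representation concrete, the following is a minimal sketch of reading a Newick string back into a nested tree. The function names `parse_newick` and `leaf_names` and the example tree are illustrative assumptions, not part of BetaInfer, and the parser handles only plain labels and branch lengths (no quoted labels or comments).

```python
import re

PUNCT = set("(),;:")


def parse_newick(newick: str) -> dict:
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    # Split into structural symbols and label/number tokens.
    tokens = re.findall(r"[(),;:]|[^(),;:\s]+", newick)
    pos = 0

    def parse_node() -> dict:
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while tokens[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if tokens[pos] != ")":
                raise ValueError("expected ')'")
            pos += 1  # consume ")"
        name = ""
        if pos < len(tokens) and tokens[pos] not in PUNCT:
            name = tokens[pos]
            pos += 1
        length = None
        if pos < len(tokens) and tokens[pos] == ":":
            length = float(tokens[pos + 1])
            pos += 2
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if pos >= len(tokens) or tokens[pos] != ";":
        raise ValueError("expected ';' at end of tree")
    return root


def leaf_names(node: dict) -> list:
    """Collect leaf labels in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [n for child in node["children"] for n in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
print(leaf_names(tree))
```

A sequence-to-tree model only has to emit the characters of such a string; a downstream parser like this one recovers the topology and branch lengths.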

The model is trained once on large-scale simulated evolutionary data with known ground-truth trees and then applied, without any retraining or fine-tuning, to both held-out simulated datasets and empirical datasets. In this sense it functions as a pretrained, generative foundation model for phylogenetics rather than a per-dataset optimizer.

Key Features

Sequence-to-tree transduction: Maps a set of unaligned input sequences directly to a Newick-format tree, collapsing alignment, distance estimation, and tree search into one learned generative step.
Zero-shot generalization to real data: A single fixed checkpoint trained on simulated evolution is evaluated on simulated and empirical datasets without retraining, demonstrating transfer beyond the training distribution.
Ensemble candidate generation: Sampling multiple candidate trees and aggregating them reduces reconstruction error by more than 30% relative to a single greedy prediction.
Competitive accuracy: Reconstructions are competitive with established likelihood-based and distance-based phylogenetic methods.
Interpretable internal mechanism: Analysis of the trained model indicates it leverages internal pairwise-distance computations, echoing the logic of classical distance-based inference.
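
The pairwise-distance idea in the last item can be illustrated with a classical computation on unaligned sequences. This sketch assumes normalized edit (Levenshtein) distance as the measure; the source does not say which quantity the network approximates internally, and the names `edit_distance` and `distance_matrix` and the toy sequences are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def distance_matrix(seqs: dict):
    """Return sorted names and a matrix of length-normalized edit distances."""
    names = sorted(seqs)
    matrix = []
    for x in names:
        row = []
        for y in names:
            longest = max(len(seqs[x]), len(seqs[y])) or 1  # avoid division by zero
            row.append(edit_distance(seqs[x], seqs[y]) / longest)
        matrix.append(row)
    return names, matrix


# Toy unaligned sequences (assumed for illustration only).
names, matrix = distance_matrix({"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACTTCG"})
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

A matrix of this kind is the input to classical distance-based methods such as neighbor joining, which is the logic the interpretability analysis is said to echo.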

Technical Details

BetaInfer uses hybrid transformer-based encoder–decoder architectures that consume a set of unaligned sequences and autoregressively emit a tree as a Newick string. Training relies on large-scale simulation: evolutionary histories with known ground-truth topologies are generated, and the model learns to recover the generating tree from the resulting sequences. Because supervision comes from simulated trees, the approach sidesteps the need for curated empirical training labels, and the same trained model is then applied zero-shot to new data. The reported headline result is that ensemble-based generation of candidate trees lowers reconstruction error by over 30% compared with single predictions, while remaining competitive against standard likelihood- and distance-based baselines. Interpretability analysis suggests the network implicitly computes pairwise distances between sequences as part of its inference. The work is released as a preprint under a CC BY-NC license; at the time of writing no public code or model weights were available.
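
The ensemble step can be sketched as follows. The source does not specify how candidate trees are aggregated, so this example assumes one simple rule: pick the candidate whose clade set is closest, in total, to all the others (a medoid), and report how often each clade appears. It compares rooted clades rather than the unrooted splits used in standard Robinson-Foulds distance, and the names `clades`, `pick_medoid`, and `clade_support` are hypothetical.

```python
import re
from collections import Counter


def clades(newick: str) -> frozenset:
    """Return every parenthesized group of a Newick tree as a frozenset of leaf names."""
    stripped = re.sub(r":[^(),;]+", "", newick)  # drop branch lengths
    tokens = re.findall(r"[(),;]|[^(),;\s]+", stripped)
    stack, found, prev = [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.add(frozenset(done))
            if stack:
                stack[-1] |= done  # leaves also belong to the enclosing group
        elif tok not in {",", ";"}:
            # A label right after ")" names an internal node; skip it.
            if prev != ")" and stack:
                stack[-1].add(tok)
        prev = tok
    return frozenset(found)


def pick_medoid(candidates: list) -> str:
    """Choose the candidate with the smallest summed clade difference to all others."""
    clade_sets = [clades(c) for c in candidates]
    totals = [sum(len(a ^ b) for b in clade_sets) for a in clade_sets]
    return candidates[totals.index(min(totals))]


def clade_support(candidates: list) -> dict:
    """Fraction of candidates that contain each clade."""
    counts = Counter(cl for c in candidates for cl in clades(c))
    return {cl: n / len(candidates) for cl, n in counts.items()}


# Toy candidates standing in for sampled model outputs (assumed).
samples = ["((A,B),(C,D));", "((B,A),(D,C));", "((A,C),(B,D));"]
print(pick_medoid(samples))
print(clade_support(samples)[frozenset({"A", "B"})])
```

Sampling more candidates increases the cost linearly while giving the aggregation step more evidence to work with, which is the compute-for-accuracy trade the ensemble offers.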

Applications

BetaInfer targets researchers in molecular evolution, comparative genomics, and systematics who need to reconstruct phylogenies from sets of related sequences. By replacing a multi-stage alignment-plus-search pipeline with a single learned generation step, it offers a potentially faster and more scalable route to candidate trees, and the ensemble mechanism provides a built-in way to trade compute for accuracy. Because it applies zero-shot to empirical data, practitioners could, in principle, use a pretrained model without configuring substitution models or tuning search heuristics for each dataset.

Impact

BetaInfer is part of an emerging body of work demonstrating that generative, NLP-style models can serve as viable and scalable alternatives to classical phylogenetic inference pipelines. By showing competitive accuracy and a substantial ensemble-driven error reduction on both simulated and real data, it strengthens the case for learned end-to-end methods in a domain long dominated by likelihood and distance heuristics. As a preprint without released code or weights, its near-term adoption is limited and its results await independent benchmarking, but it points toward a future in which pretrained foundation models handle core comparative-genomics tasks directly from raw sequences.

Citation

Phylogenetic tree inference using generative models
