Generative hybrid-transformer for ancestral protein sequence reconstruction that needs no MSA or phylogenetic tree and is reported to outperform maximum-likelihood ASR pipelines.
BetaReconstruct is a generative deep-learning model for ancestral sequence reconstruction (ASR), the task of inferring the protein sequences of extinct ancestral organisms from their extant descendants. Developed by researchers at Tel Aviv University and released as a bioRxiv preprint in January 2026, it reframes ASR as a sequence-to-sequence generation problem rather than the position-by-position statistical inference used by conventional phylogenetic pipelines.
ASR has long been a cornerstone of molecular evolution and protein engineering: resurrected ancestral proteins are frequently more thermostable, more soluble, and more catalytically promiscuous than their modern counterparts, making them attractive scaffolds for enzyme design and biotechnology. The dominant approach pairs a multiple sequence alignment (MSA) and an inferred phylogenetic tree with a probabilistic substitution model, then reconstructs the maximum-likelihood (ML) sequence at internal tree nodes. This pipeline is powerful but brittle: its output is sensitive to alignment errors, uncertainty in tree topology, and substitution-model assumptions, and it handles insertions and deletions poorly.
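To make the conventional approach concrete, the sketch below reconstructs a root sequence column by column from an already aligned set of sequences by picking the maximum-likelihood residue at each position. It is a deliberately minimal illustration, not a production ASR pipeline: the star-shaped tree, the equal-rates (Poisson) substitution model, the uniform prior over residues, the toy sequences and branch lengths, and all function names are assumptions introduced here.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def transition_prob(a, b, t, k=len(AMINO_ACIDS)):
    """P(residue b at branch end | residue a at branch start).

    Assumes an equal-rates (Poisson) model with k states, where t is the
    expected number of substitutions per site along the branch.
    """
    decay = math.exp(-k * t / (k - 1))
    if a == b:
        return 1 / k + (k - 1) / k * decay
    return 1 / k - decay / k


def ml_root_state(leaf_states, branch_lengths):
    """Most likely root residue for one alignment column of a star tree."""
    def likelihood(root):
        # Leaves are conditionally independent given the root state.
        return math.prod(
            transition_prob(root, leaf, t)
            for leaf, t in zip(leaf_states, branch_lengths)
        )
    # With a uniform prior, the posterior is proportional to the likelihood.
    return max(AMINO_ACIDS, key=likelihood)


def reconstruct_root(aligned_seqs, branch_lengths):
    """Reconstruct the root sequence one column at a time (no gap handling)."""
    columns = zip(*aligned_seqs)
    return "".join(ml_root_state(col, branch_lengths) for col in columns)


# Toy, gap-free alignment of three extant sequences (illustrative only).
aligned = ["MKVLA", "MKILA", "MRILA"]
print(reconstruct_root(aligned, [0.1, 0.1, 0.1]))  # prints "MKILA"
```

Each column is treated independently and must already be aligned, so this sketch inherits exactly the dependencies described above: a misaligned column or a wrong branch length changes the answer, and gap characters have no place in the model.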
BetaReconstruct sidesteps these dependencies entirely. It is a hybrid-transformer model trained first on large-scale simulated evolutionary datasets, where the ground-truth ancestral sequences are known by construction, and then fine-tuned on real protein families. Because the model learns to map a set of related extant sequences directly to a predicted ancestor, it requires neither an MSA nor an explicit phylogenetic tree at inference time.
BetaReconstruct is built on a hybrid-transformer generative architecture that ingests a collection of related extant protein sequences and emits a predicted ancestral sequence. Training proceeds in two stages. In the first, the model is pretrained on large-scale in-silico evolutionary datasets generated by simulating sequence evolution along trees; because these simulations track the true ancestral state at every internal node, they supply an effectively unlimited stream of perfectly labeled training pairs. In the second stage, the model is fine-tuned on real protein families to close the gap between simulated and natural sequence statistics. At inference, the model does not construct an alignment or estimate a tree, and it does not require a user-specified substitution model. The authors benchmark reconstruction accuracy against standard maximum-likelihood ASR pipelines and report improved performance. As of the preprint, no public code or trained weights have been released, which currently limits independent reproduction.
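The idea behind the first training stage, that simulation yields labeled examples for free, can be illustrated with a toy simulator. The sketch below evolves a random root sequence down a balanced binary tree and returns the unaligned leaf sequences together with the known root. It is not the authors' simulator: the balanced tree, the crude per-residue substitution and indel probabilities, the default parameter values, and the names `evolve`, `simulate_tree`, and `make_training_pair` are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve(seq, branch_length, rng, indel_rate=0.05):
    """Return a mutated copy of seq after one branch.

    Crude assumption: each residue is substituted with probability
    branch_length, and deleted or followed by an insertion with
    probability indel_rate * branch_length.
    """
    out = []
    for residue in seq:
        if rng.random() < indel_rate * branch_length:
            continue  # deletion: drop this residue
        if rng.random() < branch_length:
            residue = rng.choice(AMINO_ACIDS)  # substitution (may redraw the same residue)
        out.append(residue)
        if rng.random() < indel_rate * branch_length:
            out.append(rng.choice(AMINO_ACIDS))  # insertion after this residue
    return "".join(out)


def simulate_tree(seq, depth, rng, branch_length=0.2):
    """Evolve seq down a balanced binary tree.

    Returns (leaves, internals), where internals lists every internal-node
    sequence and internals[0] is the sequence passed in.
    """
    if depth == 0:
        return [seq], []
    leaves, internals = [], [seq]
    for _ in range(2):
        child = evolve(seq, branch_length, rng)
        sub_leaves, sub_internals = simulate_tree(child, depth - 1, rng, branch_length)
        leaves += sub_leaves
        internals += sub_internals
    return leaves, internals


def make_training_pair(seed=0, depth=2, length=30):
    """One labeled example: (unaligned extant sequences, true root ancestor)."""
    rng = random.Random(seed)
    root = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    leaves, _ = simulate_tree(root, depth, rng)
    return leaves, root


leaves, ancestor = make_training_pair(seed=0)
print(len(leaves), "extant sequences; true ancestor:", ancestor)
```

Because the simulator records the sequence at every internal node, any of them could serve as a training target, and changing the seed produces as many new examples as needed. Note that the leaves can differ in length because of indels, which is the kind of input a sequence-to-sequence model can consume without an alignment step.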
Ancestral sequence reconstruction is widely used to engineer robust enzymes, to study the evolutionary trajectory of protein families, and to test hypotheses about historical biochemistry. BetaReconstruct is aimed at protein engineers and evolutionary biologists who want fast, alignment-independent ancestor predictions, for example to generate thermostable enzyme variants for industrial biocatalysis, to probe the emergence of new functions across a gene family, or to design ancestral scaffolds as starting points for directed evolution. By removing the manual, error-prone alignment and tree-building steps, it lowers the barrier to running ASR on large or poorly characterized protein families.
BetaReconstruct contributes to a broader shift in computational evolution toward learned, end-to-end models that replace hand-built statistical pipelines, paralleling how protein language models have begun to supplant alignment-based methods in some structure- and function-prediction tasks. If its reported gains over maximum-likelihood ASR hold under independent evaluation, it could make ancestral reconstruction both faster and more accessible, particularly for families where reliable alignments and trees are difficult to obtain. The principal caveats are that the work is a preprint that has not yet been peer reviewed, that performance on simulated data may not fully transfer to deeply divergent real families, and that the absence of released code or weights makes the results difficult to reproduce or deploy at present.