BetaReconstruct

Generative transformer for ancestral protein sequence reconstruction that needs no multiple sequence alignment or phylogenetic tree as input.

Released: January 2026

BetaReconstruct is a generative deep-learning model for ancestral sequence reconstruction (ASR), the task of inferring the protein sequences of extinct ancestral organisms from their extant descendants. Developed by researchers at Tel Aviv University and released as a bioRxiv preprint in January 2026, it reframes ASR as a sequence-to-sequence generation problem rather than the position-by-position statistical inference used by conventional phylogenetic pipelines.

ASR has long been a cornerstone of molecular evolution and protein engineering: resurrected ancestral proteins are frequently more thermostable, more soluble, and more catalytically promiscuous than their modern counterparts, making them attractive scaffolds for enzyme design and biotechnology. The dominant approach pairs a multiple sequence alignment (MSA) and an inferred phylogenetic tree with a probabilistic substitution model, then reconstructs the maximum-likelihood (ML) sequence at internal tree nodes. This pipeline is powerful but brittle: its output is sensitive to alignment errors, tree topology uncertainty, and substitution-model assumptions, and it struggles with insertions and deletions. (Throughout this entry, "ML" abbreviates maximum likelihood, not machine learning.)
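
To make the classical pipeline concrete, the sketch below reconstructs the most probable root residue for a single alignment column on a fixed tree, using Felsenstein's pruning algorithm. It is a toy illustration, not any specific tool's implementation. The symmetric 20-state substitution model (an amino-acid analogue of Jukes-Cantor), the nested-list tree format, the species names, and the branch lengths are all assumptions chosen for brevity; real pipelines use empirical matrices such as LG, JTT, or WAG and estimate the tree from data.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def transition_prob(a, b, t):
    """P(child has b | parent has a) after branch length t.

    Assumed toy model: every substitution is equally likely, and t is
    measured in expected substitutions per site.
    """
    decay = math.exp(-K * t / (K - 1))
    if a == b:
        return 1 / K + (K - 1) / K * decay
    return 1 / K - decay / K


def partial_likelihoods(node, column):
    """Felsenstein pruning: P(observed leaves below node | node state)."""
    if isinstance(node, str):  # leaf: the observed residue has likelihood 1
        return {a: 1.0 if a == column[node] else 0.0 for a in AMINO_ACIDS}
    result = {a: 1.0 for a in AMINO_ACIDS}
    for child, branch_length in node:
        child_l = partial_likelihoods(child, column)
        for a in AMINO_ACIDS:
            result[a] *= sum(
                transition_prob(a, b, branch_length) * child_l[b]
                for b in AMINO_ACIDS
            )
    return result


def reconstruct_root(tree, column):
    """Return the most probable root residue and its posterior probability."""
    likelihoods = partial_likelihoods(tree, column)
    total = sum(likelihoods.values())  # uniform prior over residues cancels
    posterior = {a: value / total for a, value in likelihoods.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior[best]


# Internal node = list of (child, branch_length); leaf = sequence name.
tree = [([("human", 0.1), ("mouse", 0.1)], 0.05), ("chicken", 0.3)]
column = {"human": "L", "mouse": "L", "chicken": "I"}  # one aligned site
state, prob = reconstruct_root(tree, column)
print(state, round(prob, 3))
```

Every input to this function (the aligned column, the tree topology, the branch lengths, and the substitution model) comes from an upstream step, which is why errors in any of them propagate into the reconstructed ancestor.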

BetaReconstruct sidesteps these dependencies entirely. It is a hybrid-transformer model trained first on large-scale simulated evolutionary datasets, where the ground-truth ancestral sequences are known by construction, and then fine-tuned on real protein families. Because the model learns to map a set of related extant sequences directly to a predicted ancestor, it requires neither an MSA nor an explicit phylogenetic tree at inference time.

Key Features

MSA- and tree-free reconstruction: Predicts ancestral sequences directly from a set of extant homologs, removing the alignment and phylogeny-building steps that dominate the error budget of classical ASR.
Simulation-pretrained, real-fine-tuned: Trained on large simulated evolutionary datasets with known ancestral ground truth, then fine-tuned on real proteins, combining abundant labeled supervision with biological realism.
Hybrid-transformer architecture: Couples transformer-based sequence modeling with a generative decoder, enabling the model to handle indels and variable-length reconstructions that challenge position-wise ML inference.
Outperforms maximum-likelihood pipelines: Reported to exceed the accuracy of established ML-based ASR tools on benchmark reconstructions, providing a learned alternative to model-based inference.
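
Because a generative model can emit an ancestor whose length differs from the true one, position-by-position identity is not always well defined. One alignment-free way to score such a prediction is a normalized edit-distance similarity, sketched below. This is an assumption for illustration only: this entry does not state which accuracy metric the preprint uses, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def reconstruction_similarity(predicted, true):
    """1.0 for an exact match, decreasing toward 0.0 as the edits grow."""
    longest = max(len(predicted), len(true))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(predicted, true) / longest


# A prediction that is missing one residue of an 8-residue ancestor.
print(reconstruction_similarity("MKTYIAK", "MKTAYIAK"))  # 0.875
```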

Technical Details

BetaReconstruct is built on a hybrid-transformer generative architecture that ingests a collection of related extant protein sequences and emits a predicted ancestral sequence. Training proceeds in two stages. In the first, the model is pretrained on large-scale in-silico evolutionary datasets generated by simulating sequence evolution along trees; because these simulations track the true ancestral state at every internal node, they supply an effectively unlimited stream of perfectly labeled training pairs. In the second stage, the model is fine-tuned on real protein families to close the gap between simulated and natural sequence statistics. At inference, the model does not construct an alignment or estimate a tree, and it does not require a user-specified substitution model. The authors benchmark reconstruction accuracy against standard maximum-likelihood ASR pipelines and report improved performance. As of the preprint, no public code or trained weights have been released, which currently limits independent reproduction.
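
The first training stage relies on simulations in which every internal node's sequence is known. The sketch below shows the idea in miniature: a root sequence is evolved down a tree with substitutions and single-residue indels, and the sequences at the tips and at the internal nodes are recorded as an (extant set, ancestor) pair. It is not the authors' simulator; the uniform substitution model, the `indel_rate` parameter, the tree format, and all function names are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve(seq, branch_length, rng, indel_rate=0.05):
    """Evolve seq along one branch (assumed toy model, uniform residues)."""
    out = []
    for residue in seq:
        if rng.random() < branch_length * indel_rate:
            continue  # deletion: drop this residue
        if rng.random() < branch_length:
            residue = rng.choice(AMINO_ACIDS)  # substitution (may redraw same)
        out.append(residue)
        if rng.random() < branch_length * indel_rate:
            out.append(rng.choice(AMINO_ACIDS))  # insertion after this residue
    return "".join(out)


def simulate(tree, root_seq, rng):
    """Return (leaves, ancestors): sequences at tips and at internal nodes."""
    leaves, ancestors = {}, {}

    def walk(node, seq, label):
        if isinstance(node, str):  # leaf: an "extant" sequence
            leaves[node] = seq
            return
        ancestors[label] = seq  # ground truth, known by construction
        for i, (child, branch_length) in enumerate(node):
            walk(child, evolve(seq, branch_length, rng), f"{label}.{i}")

    walk(tree, root_seq, "root")
    return leaves, ancestors


# Internal node = list of (child, branch_length); leaf = sequence name.
rng = random.Random(0)
root = "".join(rng.choice(AMINO_ACIDS) for _ in range(30))
tree = [([("A", 0.1), ("B", 0.1)], 0.05), ("C", 0.3)]
leaves, ancestors = simulate(tree, root, rng)

# One labeled example: unaligned extant sequences -> true root ancestor.
training_pair = (sorted(leaves.values()), ancestors["root"])
```

Repeating this over many random trees and root sequences yields as many labeled pairs as needed, and the indels mean the extant sequences generally differ in length, so no alignment is implied by the training data.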

Applications

Ancestral sequence reconstruction is widely used to engineer robust enzymes, to study the evolutionary trajectory of protein families, and to test hypotheses about historical biochemistry. BetaReconstruct is aimed at protein engineers and evolutionary biologists who want fast, alignment-independent ancestor predictions — for example, to generate thermostable enzyme variants for industrial biocatalysis, to probe the emergence of new functions across a gene family, or to design ancestral scaffolds as starting points for directed evolution. By removing the manual, error-prone alignment and tree-building steps, it lowers the barrier to running ASR on large or poorly characterized protein families.

Impact

BetaReconstruct contributes to a broader shift in computational evolution toward learned, end-to-end models that replace hand-built statistical pipelines, paralleling how protein language models have displaced alignment-based methods in structure and function prediction. If its reported gains over maximum-likelihood ASR hold under independent evaluation, it could make ancestral reconstruction both faster and more accessible, particularly for families where reliable alignments and trees are difficult to obtain. The principal caveats are that the work is a preprint, that performance on simulated data may not fully transfer to deeply divergent real families, and that the absence of released code or weights makes the results difficult to reproduce or deploy at present.

Citation

Ancestral sequence reconstruction using generative models

Dotan, E., et al. (2026) Ancestral sequence reconstruction using generative models. bioRxiv.

DOI: 10.64898/2026.01.18.700141

Recent citations

Papers that recently cited this model.

Tree-Conditioned Edit Flows for Ancestral Sequence Reconstruction
Emil R. Sharafutdinov, I. André
May 2026
Citations: 0

Citations

Total citations: 1
Influential: 0
References: 20

Fields of citing research

Biology: 100%
Computer Science: 100%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 4 (Closed)
Usability (can I run it?): 7
Reproducibility (can I retrain it?): 0, not reproducible

Model Openness Framework: Unclassified
Restrictive license on core components

Resources

Research Paper
