bio.rodeo

© 2026 Pulsatance. All rights reserved.
Protein

LucaPhylo

Alibaba Cloud / Sun Yat-sen University / University of Sydney

A hyperbolic protein language model for alignment-free phylogenetic inference, producing distance matrices for tree placement without multiple sequence alignment.

Released: May 2026
Parameters: 650 Million

Phylogenetic inference — reconstructing the evolutionary relationships among sequences — is a cornerstone of molecular biology, but conventional methods depend on multiple sequence alignment (MSA), a step that becomes unreliable or intractable for highly divergent sequences such as those found across the viral tree of life. LucaPhylo addresses this bottleneck with an alignment-free approach: it learns to map protein sequences into a representation space whose geometry encodes evolutionary distance directly, sidestepping MSA altogether.

LucaPhylo is a protein language model that adapts the pretrained ESM2-650M backbone to a hyperbolic embedding space, which is naturally suited to representing the branching, tree-like structure of evolutionary relationships. From these embeddings the model produces pairwise distance matrices that can be fed into standard tree-building or phylogenetic placement procedures. It was developed by the LucaOne research group at Alibaba Cloud together with collaborators at Sun Yat-sen University and the University of Sydney, and released as a bioRxiv preprint in May 2026.
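To make the step from embeddings to a distance matrix concrete, here is a minimal Python sketch. It assumes the Poincaré ball model with curvature -1, which is one common choice of hyperbolic space; the page does not say which hyperbolic model or curvature LucaPhylo uses. The function names `poincare_distance` and `distance_matrix` and the toy 2-D points are illustrative and are not taken from the LucaPhylo code.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points of the open unit ball (curvature -1)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 |u - v|^2 / ((1 - |u|^2) (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def distance_matrix(points):
    """Symmetric pairwise distance matrix for a list of embedded points."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = poincare_distance(points[i], points[j])
    return matrix


# Hypothetical 2-D embeddings; real embeddings would be high-dimensional.
embeddings = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.9], [0.0, -0.9]]
for row in distance_matrix(embeddings):
    print(" ".join(f"{x:.3f}" for x in row))
```

In this geometry, points near the boundary of the ball are far from each other even when their Euclidean separation is modest, which is what lets many diverging lineages fit around a shared ancestor placed near the center.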

The model is positioned as a practical tool for placing newly discovered viral sequences onto reference phylogenies, a task where alignment-based pipelines frequently fail because of extreme sequence divergence. By learning evolutionary distance from sequence alone, LucaPhylo extends the reach of phylogenetics into regions of sequence space that have been difficult to analyze with traditional methods.

#Key Features

  • Alignment-free inference: Produces phylogenetic distance matrices directly from sequences, removing the dependence on multiple sequence alignment that limits conventional methods on divergent inputs.
  • Hyperbolic geometry: Embeds protein sequences in a hyperbolic space, whose negative curvature suits the hierarchical, tree-like structure of evolutionary relationships, so that learned distances can track evolutionary distances more faithfully.
  • Continued pretraining from ESM2: Builds on the 650M-parameter ESM2 backbone, inheriting broad protein representation knowledge before specializing for phylogenetic tasks.
  • Four-stage cascaded pipeline: Combines successive stages of adaptation and refinement to convert raw embeddings into reliable phylogenetic distances.
  • Zero- and few-shot placement: Supports placing query sequences onto reference trees with little or no task-specific tuning, and is benchmarked against established placement tools.
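As a rough illustration of distance-based placement, the sketch below nominates the closest reference leaves for each query from a query-by-reference distance matrix. This is a deliberately simple nearest-neighbor rule and an assumption made for illustration; the page does not describe LucaPhylo's actual placement procedure, and tools such as EPA-ng and pplacer place queries on branches rather than next to leaves. The function name `place_queries` and all sequence names are hypothetical.

```python
def place_queries(query_names, ref_names, query_to_ref, k=1):
    """Return, for each query, its k closest reference leaves as (name, distance)."""
    placements = {}
    for query, row in zip(query_names, query_to_ref):
        if len(row) != len(ref_names):
            raise ValueError(f"row for {query} does not match the reference list")
        # Sort references by distance to this query, smallest first.
        ranked = sorted(zip(row, ref_names))
        placements[query] = [(name, dist) for dist, name in ranked[:k]]
    return placements


# Hypothetical distances from two queries to three reference leaves.
refs = ["ref_A", "ref_B", "ref_C"]
distances = [
    [0.9, 0.2, 0.5],  # query_1
    [0.1, 0.7, 0.8],  # query_2
]
for query, hits in place_queries(["query_1", "query_2"], refs, distances, k=2).items():
    print(query, hits)
```

A nearest-leaf rule like this is only a first approximation: it says which reference a query resembles most, not where on the tree the query's lineage branched off.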

#Technical Details

LucaPhylo continues pretraining from the ESM2-650M transformer and introduces a hyperbolic adaptation so that embedding-space distances correspond to evolutionary distances. Training used 22,063 representative viral polyprotein sequences spanning 180 viral supergroups, giving the model exposure to the deep divergence characteristic of the viral world. The full method is structured as a four-stage cascaded pipeline that progressively transforms sequence embeddings into pairwise distance matrices suitable for downstream tree construction and placement. The authors evaluate LucaPhylo on phylogenetic placement in zero-shot and few-shot settings, comparing it against alignment-based tools including EPA-ng, pplacer, and the deep-learning placement method H-DEPP. Model weights are released on HuggingFace and Zenodo, with code on GitHub under the Apache 2.0 license; a companion repository targets RNA virus applications.
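The page does not name the tree-building method applied to the distance matrices. As one standard option, the sketch below implements neighbor joining in plain Python and returns an unrooted tree in Newick format. The function name `neighbor_joining` and the four-taxon example are illustrative; production analyses would normally use an established phylogenetics package rather than this minimal version.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree (Newick string) from a symmetric distance matrix."""
    n = len(names)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    # d maps ordered pairs of node ids to distances; ids 0..n-1 are the leaves.
    d = {(i, j): float(dist[i][j]) for i in range(n) for j in range(n) if i != j}
    subtree = {i: names[i] for i in range(n)}  # Newick fragment per active node
    active = list(range(n))
    next_id = n

    while len(active) > 3:
        m = len(active)
        totals = {i: sum(d[i, k] for k in active if k != i) for i in active}
        # Pick the pair that minimises the neighbor-joining Q criterion.
        best = None
        for pos, i in enumerate(active):
            for j in active[pos + 1:]:
                q = (m - 2) * d[i, j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Branch lengths from the joined pair to their new parent node u.
        # These can come out negative when the distances are far from tree-like.
        len_i = 0.5 * d[i, j] + (totals[i] - totals[j]) / (2 * (m - 2))
        len_j = d[i, j] - len_i
        u = next_id
        next_id += 1
        subtree[u] = f"({subtree[i]}:{len_i:.4f},{subtree[j]}:{len_j:.4f})"
        active = [k for k in active if k != i and k != j]
        for k in active:
            d[u, k] = d[k, u] = 0.5 * (d[i, k] + d[j, k] - d[i, j])
        active.append(u)

    # Three nodes remain: join them at a central trifurcation.
    a, b, c = active
    len_a = 0.5 * (d[a, b] + d[a, c] - d[b, c])
    len_b = 0.5 * (d[a, b] + d[b, c] - d[a, c])
    len_c = 0.5 * (d[a, c] + d[b, c] - d[a, b])
    return (f"({subtree[a]}:{len_a:.4f},{subtree[b]}:{len_b:.4f},"
            f"{subtree[c]}:{len_c:.4f});")


# Toy distances generated from a known tree: A:1 and B:2 on one side,
# C:3 and D:4 on the other, joined by an internal branch of length 1.
names = ["A", "B", "C", "D"]
dist = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
print(neighbor_joining(names, dist))
```

Because the toy distances are exactly additive, the printed tree reproduces the branch lengths they were generated from. Distances derived from learned embeddings will generally not be additive, so the recovered tree is an estimate.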

#Applications

LucaPhylo is aimed at virologists, evolutionary biologists, and metagenomics researchers who need to classify and place novel sequences — particularly viral polyproteins — onto reference phylogenies when alignment is unreliable. Typical use cases include rapid placement of newly sequenced viruses during outbreak surveillance, characterizing viral diversity uncovered in environmental or metagenomic sampling, and building distance matrices for downstream phylogenetic analysis without the manual curation that MSA-based pipelines require.
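For handing a distance matrix to downstream software, one widely supported interchange format is the PHYLIP square distance matrix. The sketch below writes the relaxed variant, in which each name is followed by whitespace rather than padded to ten characters. Whether LucaPhylo itself emits this format is not stated on the page; the function name `to_phylip` and the sequence names are illustrative.

```python
def to_phylip(names, dist, precision=6):
    """Format a square distance matrix as relaxed PHYLIP text."""
    n = len(names)
    if len(dist) != n or any(len(row) != n for row in dist):
        raise ValueError("matrix must be square and match the number of names")
    lines = [str(n)]  # first line: number of taxa
    for name, row in zip(names, dist):
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid taxon name: {name!r}")
        values = " ".join(f"{x:.{precision}f}" for x in row)
        lines.append(f"{name} {values}")
    return "\n".join(lines) + "\n"


# Hypothetical sequence names and distances.
names = ["virus_1", "virus_2", "virus_3"]
dist = [
    [0.0, 0.42, 1.37],
    [0.42, 0.0, 1.21],
    [1.37, 1.21, 0.0],
]
print(to_phylip(names, dist, precision=3), end="")
```

Tools that expect strict PHYLIP need names padded or truncated to exactly ten characters, so check the requirements of the program that will read the file.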

#Impact

By demonstrating that a hyperbolic protein language model can perform competitive phylogenetic placement without alignment, LucaPhylo offers a path toward scalable analysis of the most divergent regions of sequence space, where traditional methods break down. Because it is a preprint released in May 2026, it has not been peer reviewed and its adoption cannot yet be assessed, so the reported results should be read with that caveat. The open release of code and weights under the permissive Apache 2.0 license, together with a companion repository for RNA virus applications, makes it readily available for the virology and evolutionary genomics communities to test and extend.

Citation

Alignment-free phylogenetic inference via hyperbolic protein language models

Shan, Y., et al. (2026) Alignment-free phylogenetic inference via hyperbolic protein language models. bioRxiv.

DOI: 10.64898/2026.05.26.723419

Openness

Class II
Open Tooling

Tags

few_shot, language_model, phylogenetic_inference, representation_learning, transformer, tree_placement, viral_evolution, zero_shot

Resources

GitHub Repository
GitHub Repository
Research Paper
HuggingFace Model