bio.rodeo

© 2026 Pulsatance. All rights reserved.
DNA & Gene foundation models

evoRate

University of Toronto

A genome language model that adds evolutionary-rate prediction as a pretraining task, improving representations and variant effect prediction over sequence-only training.

Released: February 2026

evoRate is a genome language model (gLM) training approach that introduces evolutionary rate prediction as a pretraining objective, described in a February 2026 bioRxiv preprint led by researchers at the University of Toronto (with collaborators including Microsoft Research and the Broad Institute). Most genome language models are pretrained with sequence-reconstruction objectives borrowed from natural language processing — masked or autoregressive token prediction — yet recent studies have shown that such models often fail to capture meaningful biological signal. evoRate addresses this gap by training the model to predict how fast each position in the genome evolves.

The key design choice is that the evolutionary-rate objectives are composable with standard sequence reconstruction. This enables a clean, controlled comparison between predicting sequence only, evolutionary rate only, or both together, isolating the contribution of the evolutionary signal. To support this analysis, the authors build a suite of biologically grounded benchmarks, since existing gLM evaluations have notable gaps in measuring whether models learn functional biology.
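One way to picture the composability described above is as a weighted sum of two per-position losses, where the three arms of the comparison are just different weight settings. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the source does not specify the loss forms or weights, so the cross-entropy sequence term, the squared-error rate term, the names `sequence_loss`, `rate_loss`, `composed_loss` and `ARMS`, and all numbers are hypothetical.

```python
import math


def sequence_loss(probs, target_index):
    """Cross-entropy at one position: negative log-probability of the true base."""
    return -math.log(probs[target_index])


def rate_loss(predicted_rate, observed_rate):
    """Squared error at one position (assumes the rate task is a regression)."""
    return (predicted_rate - observed_rate) ** 2


def composed_loss(seq_terms, rate_terms, w_seq, w_rate):
    """Weighted sum of the mean per-position losses of the two objectives."""
    mean_seq = sum(seq_terms) / len(seq_terms)
    mean_rate = sum(rate_terms) / len(rate_terms)
    return w_seq * mean_seq + w_rate * mean_rate


# The three arms of the controlled comparison, expressed as (w_seq, w_rate).
# Equal weights in the combined arm are an assumption made for illustration.
ARMS = {
    "sequence_only": (1.0, 0.0),
    "rate_only": (0.0, 1.0),
    "sequence_and_rate": (1.0, 1.0),
}

# Toy model outputs for two positions (made-up values, not from the preprint).
seq_terms = [
    sequence_loss([0.7, 0.1, 0.1, 0.1], 0),
    sequence_loss([0.25, 0.25, 0.25, 0.25], 2),
]
rate_terms = [rate_loss(0.2, 0.1), rate_loss(1.5, 1.0)]

losses = {
    name: composed_loss(seq_terms, rate_terms, w_seq, w_rate)
    for name, (w_seq, w_rate) in ARMS.items()
}
```

Because only the weights change between arms, the architecture, data and optimizer can stay fixed, which is what makes a comparison of this kind controlled.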

By making evolution an explicit training target rather than a property expected to emerge from sequence reconstruction, evoRate contributes to a broader shift in genomic foundation modeling toward objectives that encode functional and evolutionary constraints directly.

#Key Features

  • Evolutionary-rate pretraining: Adds objectives that predict the rate of evolution at genomic positions, encoding functional constraint as a direct training signal.
  • Composable objectives: Evolutionary-rate tasks combine with sequence reconstruction, enabling controlled sequence-only vs. rate-only vs. combined comparisons.
  • Biologically grounded benchmarks: Introduces a new evaluation suite designed to address gaps in existing gLM benchmarks for functional and regulatory signal.
  • Parameter efficiency: Training on evolutionary rate makes relatively small models competitive with much larger existing gLMs on some tasks.
  • Variant effect gains: Models trained on both sequence and evolutionary rate outperform sequence-only models on established variant effect prediction benchmarks.

#Technical Details

evoRate augments transformer-based genome language model pretraining with evolutionary rate prediction tasks, including predicting the evolutionary rate at each position given the preceding sequence, that can be composed with conventional sequence reconstruction. Across the authors' new biologically grounded benchmarks and on established variant effect prediction benchmarks, models pretrained on both sequence and evolutionary rate consistently outperform those trained on sequence alone. Incorporating the evolutionary-rate objective also allows the relatively small models studied here to rival substantially larger existing gLMs on certain tasks, which supports evolution as a training target for genome-scale models. The manuscript, a recent preprint, references no public code or weight release.
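To make "predicting the evolutionary rate at each position given the preceding sequence" concrete, the sketch below builds training examples in which each prefix is paired with two targets: the next base (the reconstruction target) and that base's evolutionary rate (the rate target). This is an assumed data layout for illustration only; the function name `make_autoregressive_examples`, the dictionary keys and the rate values are hypothetical, and the preprint's actual tokenization and rate estimates are not described in this summary.

```python
def make_autoregressive_examples(sequence, rates):
    """Pair each prefix with the next base and that base's evolutionary rate.

    For position i, a model would see sequence[:i] and be asked to predict
    sequence[i] (reconstruction target) and rates[i] (rate target).
    Position 0 is skipped because it has no preceding sequence.
    """
    if len(sequence) != len(rates):
        raise ValueError("sequence and rates need one entry per position")
    examples = []
    for i in range(1, len(sequence)):
        examples.append(
            {
                "context": sequence[:i],
                "next_base": sequence[i],
                "rate": rates[i],
            }
        )
    return examples


# Hypothetical 5 bp window with made-up per-position rates.
examples = make_autoregressive_examples("ACGTA", [0.9, 0.1, 0.1, 0.4, 1.2])
```

Dropping the `rate` target from each example gives a sequence-only setup, and dropping `next_base` gives a rate-only one, so the same examples can serve all three training configurations.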

#Applications

evoRate is aimed at regulatory genomics and variant interpretation, where genome language models trained on unlabeled sequence promise to advance understanding without curated training labels. Improved representations and variant effect prediction make the approach relevant for prioritizing noncoding and coding variants, studying functional constraint, and building more sample-efficient genomic foundation models for downstream genomics tasks.
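As an illustration of what variant prioritization with a genome language model can look like, the sketch below ranks variants by a log-likelihood ratio between the alternate and reference alleles. This scoring scheme is a generic approach often used with sequence models and is an assumption here: the summary above does not state how evoRate scores variants, and the names `llr_score` and `prioritize`, the variant identifiers and the probabilities are all hypothetical.

```python
import math


def llr_score(position_probs, ref, alt):
    """Log-likelihood ratio log P(alt) - log P(ref) at the variant position."""
    return math.log(position_probs[alt]) - math.log(position_probs[ref])


def prioritize(variants):
    """Sort variants so the most disfavoured (most negative score) come first."""
    scored = [
        (v["id"], llr_score(v["probs"], v["ref"], v["alt"])) for v in variants
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Made-up per-position base probabilities standing in for model output.
variants = [
    {
        "id": "var_a",
        "ref": "A",
        "alt": "G",
        "probs": {"A": 0.90, "C": 0.03, "G": 0.02, "T": 0.05},
    },
    {
        "id": "var_b",
        "ref": "C",
        "alt": "T",
        "probs": {"A": 0.20, "C": 0.40, "G": 0.10, "T": 0.30},
    },
]
ranking = prioritize(variants)
```

In this toy input, `var_a` ranks first because the model strongly prefers the reference base at that position, while `var_b` sits at a position where the alternate allele is nearly as likely as the reference.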

#Impact

evoRate provides evidence that evolution-aware pretraining objectives address a recognized weakness of sequence-only genome language models, namely their tendency to miss biological signal, and that they can substitute for raw scale on some tasks. By introducing a biologically grounded benchmark suite alongside the method, the work also offers tools to better measure functional understanding in gLMs. Because the work is an unreviewed preprint without a referenced code release, the breadth of these gains awaits independent replication.

Tags

variant_effect_prediction, regulatory_genomics, transformer, self_supervised, representation_learning, genomics, molecular_evolution