bio.rodeo
Single-cell

Zebraformer

MNCN-CSIC

A BERT-style transformer language model built on the Geneformer framework and trained on zebrafish single-cell transcriptomics to produce gene and cell embeddings for developmental analysis.

Released: July 2025

Zebraformer is a transformer-based language model for zebrafish (Danio rerio) single-cell transcriptomics. It was developed by Juan F. Poyatos at the Museo Nacional de Ciencias Naturales (MNCN-CSIC), part of the Spanish National Research Council, and first released as a bioRxiv preprint in July 2025. Most single-cell foundation models are pretrained on human or mouse data, including Geneformer, on whose framework Zebraformer is built. Zebraformer instead targets the zebrafish, one of the most important vertebrate model organisms for developmental biology, regeneration, and disease modeling, for which no dedicated foundation model previously existed.

The model is trained on the Zebrahub atlas, a single-cell timecourse spanning zebrafish development from 10 hours post-fertilization (hpf) to 10 days post-fertilization (dpf). From this data it learns gene and cell embeddings that encode temporal progression along development, anatomical and cell-type identity, and regulatory relationships between genes. A defining design choice is that these learned embeddings are used directly, without any task-specific re-training: downstream analyses operate on the frozen representations produced by the pretrained model.
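
To make the "frozen embeddings" idea concrete, here is a minimal sketch of one common way to turn per-gene token embeddings from a frozen encoder into a single cell embedding: averaging over non-padding positions. The description above does not say how Zebraformer pools its embeddings, so the pooling scheme, the function name `mean_pool`, and the toy numbers are all assumptions made for illustration.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the embeddings of non-padding tokens into one cell vector."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("cell has no non-padding tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input (hypothetical): three gene tokens plus one padding position,
# each with a 2-dimensional embedding. No model weights are updated here;
# the vectors stand in for the output of a frozen encoder.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
cell_vector = mean_pool(tokens, mask)
```

The resulting vector can be passed to any standard clustering or nearest-neighbour routine, which is what makes the no-retraining workflow inexpensive.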

Using this approach, Poyatos demonstrates clustering of cell states, in silico perturbation analysis, inference of gene regulatory networks (GRNs), and a quantitative test of the developmental "hourglass" hypothesis — the idea that mid-embryonic (phylotypic) stages are the most conserved phase of development. Zebraformer thus serves both as a practical tool for zebrafish researchers and as a case study in transferring the single-cell language-model paradigm to a non-mammalian organism.

#Key Features

  • Zebrafish-specific pretraining: Trained on the Zebrahub developmental atlas (10 hpf to 10 dpf), Zebraformer fills a gap left by human- and mouse-centric single-cell foundation models for a key vertebrate model organism.
  • Embeddings used without re-training: Gene and cell embeddings from the frozen pretrained model are applied directly to downstream tasks, avoiding fine-tuning and lowering the data and compute barrier for individual labs.
  • Developmental structure encoded: The learned representations capture temporal progression, anatomical and cell-type identity, and gene regulatory relationships across embryonic and larval stages.
  • Multiple downstream analyses from one model: A single set of embeddings supports cell-state clustering, in silico perturbation, GRN inference, and a developmental hourglass analysis.
  • Open weights and code: Trained weights (model.safetensors, ~57 MB, plus config.json) are deposited on Zenodo under CC-BY 4.0, and analysis code is available on GitHub.
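
Because the weights ship in the safetensors format, their layout can be inspected without loading any tensors. A safetensors file starts with an 8-byte little-endian length followed by a JSON header describing each tensor. The sketch below reads that header using only the Python standard library. The helper names are hypothetical, and nothing is assumed here about which tensors the Zebraformer checkpoint actually contains.

```python
import json
import struct
from math import prod


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))  # little-endian u64
        return json.loads(fh.read(header_len).decode("utf-8"))


def count_parameters(header: dict) -> int:
    """Sum element counts over all tensors, skipping the optional metadata entry."""
    return sum(prod(info["shape"])
               for name, info in header.items()
               if name != "__metadata__")


# Usage, assuming the Zenodo file has been downloaded to the working directory:
# header = read_safetensors_header("model.safetensors")
# print(count_parameters(header))
```

Comparing the tensor shapes in the header against config.json is a quick sanity check before running any of the notebooks.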

#Technical Details

Zebraformer is a BERT-style transformer encoder built on the Geneformer framework. It applies Geneformer's rank-value gene encoding and masked-gene pretraining to zebrafish transcriptomes rather than mammalian ones. Pretraining data is drawn from the Zebrahub single-cell atlas, which densely samples zebrafish development from 10 hours to 10 days post-fertilization and so exposes the model to the full embryonic-to-larval transition. The released checkpoint is distributed as a model.safetensors file of about 57 MB with an accompanying config.json on Zenodo (record 18559841, published February 2026, CC-BY 4.0). The work is provided as a research codebase (a set of notebooks layered on top of the Geneformer library) rather than as an installable software package, and the GitHub repository does not state an explicit license even though the deposited weights are CC-BY 4.0. The model has not been benchmarked for cross-organism transfer and is intended specifically for zebrafish data.
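
The rank-value encoding mentioned above can be sketched in a few lines. In the Geneformer scheme, each gene's count is normalized by the cell's total counts and by a corpus-wide median for that gene, and the cell is then represented as the list of gene tokens ordered from highest to lowest normalized value. The code below is an illustration of that idea only: the function name, the toy counts, medians, and vocabulary are made up, and the default `max_len` of 2048 follows the original Geneformer input size rather than anything stated for Zebraformer.

```python
from typing import Dict, List


def rank_value_encode(counts: Dict[str, float],
                      gene_medians: Dict[str, float],
                      vocab: Dict[str, int],
                      max_len: int = 2048) -> List[int]:
    """Order a cell's expressed genes by median-normalized expression."""
    total = sum(counts.values())
    scored = []
    for gene, count in counts.items():
        if count > 0 and gene in vocab and gene in gene_medians:
            # Normalize by sequencing depth, then by the gene's corpus median,
            # so ubiquitously high genes do not dominate the ranking.
            scored.append((count / total / gene_medians[gene], gene))
    # Highest normalized value first; gene name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [vocab[gene] for _, gene in scored[:max_len]]


# Toy cell (hypothetical values): actb1 has the highest raw count but also a
# high corpus median, so it ends up ranked last among expressed genes.
counts = {"sox2": 4, "myod1": 4, "actb1": 40, "tbxta": 0}
medians = {"sox2": 1.0, "myod1": 2.0, "actb1": 40.0, "tbxta": 1.0}
vocab = {"actb1": 10, "myod1": 11, "sox2": 12, "tbxta": 13}
tokens = rank_value_encode(counts, medians, vocab)
```

With these toy values the token order is sox2, myod1, actb1, and the unexpressed tbxta is dropped, which shows how the encoding favors genes that are distinctive for a cell over genes that are highly expressed everywhere.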

#Applications

Zebraformer is aimed at zebrafish developmental biologists who want to mine single-cell timecourse data without training a model from scratch. Because embeddings are used directly, researchers can cluster cells into states and lineages, run in silico perturbations to predict the effects of gene knockouts or overexpression, and infer gene regulatory networks underlying developmental transitions. The developmental hourglass analysis illustrates how the embeddings can be used to test evolutionary and developmental hypotheses quantitatively. More broadly, the work offers a template for adapting single-cell language models to other non-mammalian model organisms.
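
As an illustration of how an in silico knockout can be scored on frozen embeddings, the sketch below removes one gene token from a rank-ordered cell, re-embeds the cell, and reports how far the cell embedding moves in cosine terms. This mirrors the general deletion-style perturbation used with Geneformer-type models, but it is not taken from the Zebraformer notebooks. The `toy_embed` function and its lookup table are hypothetical stand-ins for a real frozen encoder.

```python
import math
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def deletion_shift(tokens: List[int], target: int,
                   embed: Callable[[List[int]], Sequence[float]]) -> float:
    """Return 1 - cosine similarity of the cell embedding before/after deleting `target`."""
    if target not in tokens:
        raise ValueError("target gene is not expressed in this cell")
    perturbed = [t for t in tokens if t != target]
    return 1.0 - cosine_similarity(embed(tokens), embed(perturbed))


# Hypothetical stand-in for a frozen encoder: mean of fixed per-token vectors.
TOY_TABLE = {10: [1.0, 0.0], 11: [0.0, 1.0], 12: [1.0, 1.0]}


def toy_embed(tokens: List[int]) -> List[float]:
    vecs = [TOY_TABLE[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


cell = [12, 11, 10]
shift_11 = deletion_shift(cell, 11, toy_embed)  # deleting token 11 rotates the embedding
shift_12 = deletion_shift(cell, 12, toy_embed)  # deleting token 12 leaves its direction unchanged
```

Ranking genes by this shift gives a simple, model-agnostic way to prioritize candidates for follow-up. A real analysis would replace `toy_embed` with a forward pass through the pretrained model.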

#Impact

Zebraformer extends the single-cell foundation-model paradigm, established for human and mouse by models such as Geneformer and scGPT, to the zebrafish, a workhorse of developmental, regeneration, and disease research. By showing that frozen Geneformer-style embeddings can support clustering, perturbation, GRN inference, and developmental hypothesis testing in a new organism, it provides both a usable resource for the zebrafish community and evidence that the approach generalizes beyond mammals. The work is a single-author preprint with open weights, and its main limitations are practical: the model is zebrafish-specific and not validated for cross-organism transfer, it ships as notebooks rather than a packaged tool, and the code repository lacks an explicit license even though the Zenodo weights are openly licensed.

Citation

A transformer-based language model reveals developmental constraint and network complexity during zebrafish embryogenesis

Preprint

Poyatos, J. F. (2025) A transformer-based language model reveals developmental constraint and network complexity during zebrafish embryogenesis. bioRxiv.

DOI: 10.1101/2025.07.09.663853

Openness

Unclassified
Restrictive license on core components

Tags

bert, cell_embeddings, developmental_biology, embeddings, foundation_model, gene_regulatory_network_inference, perturbation_analysis, self_supervised, transcriptomics, transformer

Resources

GitHub Repository, Research Paper, Dataset