A BERT-style transformer language model built on the Geneformer framework and trained on zebrafish single-cell transcriptomics to produce gene and cell embeddings for developmental analysis.
Zebraformer is a transformer-based language model for zebrafish (Danio rerio) single-cell transcriptomics, developed by Juan F. Poyatos at the Museo Nacional de Ciencias Naturales (MNCN-CSIC), part of the Spanish National Research Council, and first released as a bioRxiv preprint in July 2025. Most single-cell foundation models — including Geneformer, on whose framework Zebraformer is built — are pretrained on human or mouse data. Zebraformer instead targets the zebrafish, one of the most important vertebrate model organisms for developmental biology, regeneration, and disease modeling, where no dedicated foundation model previously existed.
The model is trained on the Zebrahub atlas, a single-cell timecourse spanning zebrafish development from 10 hours post-fertilization (hpf) to 10 days post-fertilization (dpf). From this data it learns gene and cell embeddings that encode temporal progression along development, anatomical and cell-type identity, and regulatory relationships between genes. A defining design choice is that these learned embeddings are used directly, without any task-specific re-training: downstream analyses operate on the frozen representations produced by the pretrained model.
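As a rough illustration of what working with frozen representations means in practice, the sketch below averages per-gene embedding vectors into a single cell embedding and compares cells by cosine similarity, with no parameter updates anywhere. Mean pooling is the convention commonly used with Geneformer-style models; that Zebraformer's notebooks pool in exactly this way is an assumption, and the three-dimensional vectors are invented for illustration rather than taken from the model.

```python
import math


def mean_pool(gene_vectors):
    """Average per-gene embedding vectors into one cell embedding."""
    if not gene_vectors:
        raise ValueError("need at least one gene vector")
    dim = len(gene_vectors[0])
    n = len(gene_vectors)
    return [sum(vec[i] for vec in gene_vectors) / n for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical gene embeddings (toy numbers, not model output).
cell_a = mean_pool([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])
cell_b = mean_pool([[0.9, 0.1, 0.0], [1.0, 0.1, 0.0]])
cell_c = mean_pool([[0.0, 0.0, 1.0], [0.0, 0.2, 0.8]])

# Cells a and b point in nearly the same direction; c does not.
sim_ab = cosine(cell_a, cell_b)
sim_ac = cosine(cell_a, cell_c)
print(sim_ab > sim_ac)
```

Because the embeddings are fixed, any downstream step (clustering, nearest-neighbor lookup, trajectory ordering) reduces to geometry on these vectors.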
Using this approach, Poyatos demonstrates clustering of cell states, in silico perturbation analysis, inference of gene regulatory networks (GRNs), and a quantitative test of the developmental "hourglass" hypothesis — the idea that mid-embryonic (phylotypic) stages are the most conserved phase of development. Zebraformer thus serves both as a practical tool for zebrafish researchers and as a case study in transferring the single-cell language-model paradigm to a non-mammalian organism.
Zebraformer is a BERT-style transformer encoder built on the Geneformer framework, applying Geneformer's rank-value gene encoding and masked-gene pretraining to zebrafish transcriptomes rather than mammalian ones. Pretraining data is drawn from the Zebrahub single-cell atlas, which densely samples zebrafish development from 10 hours to 10 days post-fertilization, giving the model exposure to the full embryonic-to-larval transition. The released checkpoint is distributed on Zenodo (record 18559841, published February 2026, CC-BY 4.0) as a ~57 MB model.safetensors file with an accompanying config.json, and the analysis code is available on GitHub. The work is provided as a research codebase (a set of notebooks layered on top of the Geneformer library) rather than as an installable software package, and the GitHub repository does not state an explicit license even though the deposited weights are CC-BY 4.0. The model has not been benchmarked for cross-organism transfer and is intended specifically for zebrafish data.
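To make the rank-value encoding concrete, here is a minimal sketch of the idea as described for Geneformer: each gene's count is normalized by the cell's sequencing depth and by that gene's nonzero median across the corpus, and the cell is then represented as the list of genes ordered from highest to lowest normalized value. The function name, the `max_len` cutoff, and all counts and medians below are hypothetical; the actual preprocessing in the Zebraformer notebooks may differ in detail.

```python
def rank_value_encode(cell_counts, gene_medians, max_len=2048):
    """Return gene tokens ordered by median-normalized expression.

    cell_counts:  raw counts per gene for one cell
    gene_medians: corpus-wide nonzero median per gene (assumed precomputed)
    max_len:      hypothetical cap on sequence length
    """
    total = sum(cell_counts.values())
    if total == 0:
        return []
    scored = []
    for gene, count in cell_counts.items():
        # Undetected genes and genes without a median are left out.
        if count <= 0 or gene not in gene_medians:
            continue
        # Depth-normalize, then scale by the gene's typical expression.
        scored.append((count / total / gene_medians[gene], gene))
    # Highest normalized value first; ties broken by gene name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in scored][:max_len]


# Toy cell: actb1 has the most raw counts but is highly expressed everywhere.
counts = {"actb1": 100, "sox2": 5, "tbxta": 3, "myod1": 0}
medians = {"actb1": 0.05, "sox2": 0.001, "tbxta": 0.002, "myod1": 0.001}

tokens = rank_value_encode(counts, medians)
print(tokens)  # ['sox2', 'actb1', 'tbxta']
```

The point of dividing by the corpus median is visible in the toy output: a ubiquitously expressed gene is pushed down the ranking, while a gene that is unusually high for this cell moves to the front.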
Zebraformer is aimed at zebrafish developmental biologists who want to mine single-cell timecourse data without training a model from scratch. Because embeddings are used directly, researchers can cluster cells into states and lineages, run in silico perturbations to predict the effects of gene knockouts or overexpression, and infer gene regulatory networks underlying developmental transitions. The developmental hourglass analysis illustrates how the embeddings can be used to test evolutionary and developmental hypotheses quantitatively. More broadly, the work offers a template for adapting single-cell language models to other non-mammalian model organisms.
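An in silico perturbation on a rank-encoded cell can be sketched as an edit to the token sequence followed by a comparison of embeddings. Following the convention described for Geneformer, a knockout removes the gene's token and overexpression moves it to the front of the ranking; whether the Zebraformer notebooks implement perturbation exactly this way is an assumption. The `toy_embed` function and its two-dimensional vectors are stand-ins for the real model, used only so the example runs on its own.

```python
import math


def delete_gene(tokens, gene):
    """Simulate a knockout by dropping the gene from the ranking."""
    return [t for t in tokens if t != gene]


def overexpress_gene(tokens, gene):
    """Simulate overexpression by moving the gene to rank 1."""
    return [gene] + [t for t in tokens if t != gene]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def perturbation_shift(tokens, perturbed, embed):
    """Effect size: 1 - cosine similarity of cell embeddings."""
    return 1.0 - cosine(embed(tokens), embed(perturbed))


# Hypothetical stand-in for the frozen model: a rank-weighted sum of
# made-up gene vectors, so earlier (higher-ranked) genes count more.
TOY_VECTORS = {"sox2": [1.0, 0.0], "actb1": [0.5, 0.5], "tbxta": [0.0, 1.0]}


def toy_embed(tokens):
    out = [0.0, 0.0]
    for pos, gene in enumerate(tokens):
        weight = 1.0 / (pos + 1)
        out = [o + weight * x for o, x in zip(out, TOY_VECTORS[gene])]
    return out


baseline = ["sox2", "actb1", "tbxta"]
knockout = delete_gene(baseline, "sox2")
boosted = overexpress_gene(baseline, "tbxta")

ko_shift = perturbation_shift(baseline, knockout, toy_embed)
oe_shift = perturbation_shift(baseline, boosted, toy_embed)
print(knockout, boosted, ko_shift > 0, oe_shift > 0)
```

With a real model, `embed` would be replaced by a forward pass through the frozen encoder, and genes could be ranked by the size of the shift they induce.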
Zebraformer extends the single-cell foundation-model paradigm, established for human and mouse by models such as Geneformer and scGPT, to the zebrafish, a workhorse of developmental, regeneration, and disease research. By showing that frozen Geneformer-style embeddings can support clustering, perturbation, GRN inference, and developmental hypothesis testing in a new organism, it provides both a usable resource for the zebrafish community and evidence that the approach generalizes beyond mammals. For a single-author preprint with open weights, the main limitations are practical: the model is zebrafish-specific and not validated for cross-organism transfer, it ships as notebooks rather than a packaged tool, and the code repository lacks an explicit license despite the Zenodo weights being openly licensed.
Poyatos, J. F. (2026). A transformer-based language model reveals developmental constraint and network complexity during zebrafish embryogenesis. bioRxiv.
DOI: 10.1101/2025.07.09.663853