A 265M-parameter genomic foundation model that uses biologically-informed BioToken tokenization to match or exceed much larger models at a fraction of the compute.
BioFM is a 265-million-parameter genomic foundation model developed by M42, an Abu Dhabi-based health technology company, and released as a bioRxiv preprint in March 2025. It addresses a persistent tension in genomic deep learning: models that capture functionally important signals — particularly the effects of genetic variants — have tended to require enormous scale and compute, while smaller, more efficient models sacrifice accuracy on the downstream tasks that matter most for clinical and biological interpretation.
The central innovation behind BioFM is not the network itself but how the genome is encoded for it. The authors introduce BioToken, a biologically-informed tokenization scheme that folds genomic variants and biological region annotations directly into the token stream the model consumes. Rather than treating DNA as an undifferentiated string of nucleotides or fixed k-mers, BioToken makes variant identity and functional context first-class inputs, so the model can learn representations that are sensitive to the kinds of sequence changes and genomic regions that drive phenotype. BioFM is the foundation model trained on top of this tokenization.
This pairing lets a comparatively compact model compete with, and on several tasks exceed, specialized predictors and foundation models that are an order of magnitude or more larger, suggesting that thoughtful, biology-aware input representation can substitute for raw parameter count.
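To make the idea of folding variants and region annotations into the token stream concrete, here is a minimal sketch. It is not the BioToken vocabulary or algorithm, which the description above does not spell out. The single-base tokens, the `<label>`/`</label>` region markers, the `[ref>alt]` variant tokens, and the names `Region`, `Variant`, and `annotate_tokens` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    label: str  # illustrative label such as "exon" or "promoter"


@dataclass(frozen=True)
class Variant:
    pos: int  # 0-based position of the substituted base
    ref: str  # reference base expected at pos
    alt: str  # alternate base


def annotate_tokens(seq, regions, variants):
    """Return a token list that mixes bases, region markers, and variant tokens.

    Hypothetical scheme for illustration only; it handles single-nucleotide
    substitutions and does not reproduce the BioToken vocabulary.
    """
    starts, ends = {}, {}
    for r in regions:
        starts.setdefault(r.start, []).append(r.label)
        ends.setdefault(r.end, []).append(r.label)
    snvs = {v.pos: v for v in variants}

    tokens = []
    for i, base in enumerate(seq):
        # Close regions that end before this base, then open new ones.
        tokens.extend(f"</{label}>" for label in ends.get(i, []))
        tokens.extend(f"<{label}>" for label in starts.get(i, []))
        v = snvs.get(i)
        if v is None:
            tokens.append(base)
        else:
            if v.ref != base:
                raise ValueError(f"reference mismatch at position {i}")
            # The variant becomes a token of its own instead of a plain base.
            tokens.append(f"[{v.ref}>{v.alt}]")
    # Close regions that run to the end of the sequence.
    tokens.extend(f"</{label}>" for label in ends.get(len(seq), []))
    return tokens


example = annotate_tokens(
    "ACGTAC",
    regions=[Region(start=2, end=5, label="exon")],
    variants=[Variant(pos=3, ref="T", alt="C")],
)
print(example)
```

In this toy scheme the model input carries both the functional context (the exon markers) and the variant itself, so neither has to be supplied later as a separate label or feature.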
BioFM is a transformer-based genomic foundation model with approximately 265 million parameters, pretrained in a self-supervised fashion on genomic sequence encoded via the BioToken scheme. BioToken's defining property is that genomic variants and biological region annotations are incorporated into the tokens themselves rather than supplied only as downstream labels or post-hoc features; the authors argue that this input-level design choice is responsible for the model's variant sensitivity. The preprint reports evaluation across four task families: noncoding pathogenicity, expression modulation, sQTL (splicing quantitative trait loci) prediction, and long-range genomic interactions. Across these, BioFM is reported to match or outperform specialized baselines such as Enformer and SpliceTransformer and genomic foundation models including the Nucleotide Transformer, and to be competitive with models of up to 7B parameters despite its much smaller size and lower compute footprint. Because the work is still a preprint (v2, November 2025), these results await peer review and independent replication.
BioFM targets researchers in regulatory and clinical genomics who need accurate interpretation of genetic variants without access to large-scale compute. Its native handling of variants makes it directly applicable to noncoding pathogenicity assessment, prioritizing variants of uncertain significance, predicting effects on gene expression, identifying splice-altering variants through sQTL prediction, and modeling long-range regulatory interactions. The model's efficiency lowers the barrier for groups that cannot run multi-billion-parameter foundation models, making variant-aware genomic prediction more broadly accessible.
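The size gap behind this accessibility argument can be put in rough numbers. The sketch below compares weight storage for a 265M-parameter model and a 7B-parameter model. It assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, optimizer state, and framework overhead; these assumptions and the helper name `weight_memory_gb` are illustrative and do not come from the preprint.

```python
BIOFM_PARAMS = 265_000_000    # approximate parameter count of BioFM
LARGE_PARAMS = 7_000_000_000  # size of the largest models it is compared with


def weight_memory_gb(n_params, bytes_per_param=2):
    """Weights-only memory in GB, assuming 2 bytes per parameter by default."""
    return n_params * bytes_per_param / 1e9


ratio = LARGE_PARAMS / BIOFM_PARAMS

print(f"BioFM weights:    {weight_memory_gb(BIOFM_PARAMS):.2f} GB")
print(f"7B model weights: {weight_memory_gb(LARGE_PARAMS):.2f} GB")
print(f"Parameter ratio:  {ratio:.1f}x")
```

Under these assumptions the smaller model's weights take about half a gigabyte against roughly 14 GB for a 7B-parameter model, a parameter ratio of about 26.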
BioFM contributes to an active debate in genomic AI about whether scale or representation is the binding constraint on performance. By showing that a 265M-parameter model with biologically-informed tokenization can rival models more than 20 times larger, it offers evidence that encoding domain knowledge into the input — here, variants and region annotations via BioToken — can be a more compute-efficient path than simply enlarging the network. As of June 2026, the work remains a preprint and its full influence is still emerging. Notably, the authors state that code and model checkpoints are to be provided, but the m42-health/BioFM GitHub and HuggingFace repositories did not resolve as of June 2026, and the weight license is unconfirmed; the paper itself is released under CC-BY-NC. Prospective users should verify artifact availability and licensing terms before adoption.
Medvedev, A., et al. (2025) BioToken and BioFM – Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models. bioRxiv.
DOI: 10.1101/2025.03.27.645711