BioFM

A 265M-parameter genomic foundation model that uses biologically-informed BioToken tokenization to match or exceed much larger models at a fraction of the compute.

Released: March 2025

Parameters: 265 Million

BioFM is a 265-million-parameter genomic foundation model developed by M42, an Abu Dhabi-based health technology company, and released as a bioRxiv preprint in March 2025. It addresses a persistent tension in genomic deep learning: models that capture functionally important signals — particularly the effects of genetic variants — have tended to require enormous scale and compute, while smaller, more efficient models sacrifice accuracy on the downstream tasks that matter most for clinical and biological interpretation.

The central innovation behind BioFM is not the network itself but how the genome is encoded for it. The authors introduce BioToken, a biologically-informed tokenization scheme that folds genomic variants and biological region annotations directly into the token stream the model consumes. Rather than treating DNA as an undifferentiated string of nucleotides or fixed k-mers, BioToken makes variant identity and functional context first-class inputs, so the model can learn representations that are sensitive to the kinds of sequence changes and genomic regions that drive phenotype. BioFM is the foundation model trained on top of this tokenization.
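To make the idea of variant and region information as "first-class inputs" concrete, here is a minimal sketch in Python. It is not BioToken itself: this page does not describe BioToken's actual vocabulary or token layout, so the token formats (`<exon>`, `[G>T]`), the `Variant` class, and the `tokenize` function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    pos: int  # 0-based position within the sequence
    ref: str  # reference base expected at that position
    alt: str  # alternate base (single-nucleotide variants only in this sketch)


def tokenize(sequence, variants, regions):
    """Toy variant- and region-aware tokenizer (hypothetical, NOT BioToken).

    regions is a list of (start, end, name) half-open intervals.
    A region marker is emitted when an annotated region begins, and a
    variant position is emitted as a dedicated token instead of a plain base.
    """
    by_pos = {v.pos: v for v in variants}
    tokens = []
    current_region = None
    for i, base in enumerate(sequence):
        # Look up the first annotation covering this position, if any.
        region = next((name for start, end, name in regions if start <= i < end), None)
        if region != current_region:
            if region is not None:
                tokens.append(f"<{region}>")
            current_region = region
        variant = by_pos.get(i)
        if variant is None:
            tokens.append(base)
        else:
            if variant.ref != base:
                raise ValueError(f"reference mismatch at position {i}")
            tokens.append(f"[{variant.ref}>{variant.alt}]")
    return tokens


example = tokenize("ACGT", [Variant(pos=2, ref="G", alt="T")], [(1, 3, "exon")])
print(example)
```

In this toy encoding the variant and its surrounding annotation are visible to the model as distinct tokens, rather than being recoverable only by comparing two raw sequences.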

This pairing lets a comparatively compact model compete with, and on several tasks exceed, specialized predictors and foundation models of up to 7B parameters, suggesting that thoughtful, biology-aware input representation can partly substitute for raw parameter count.

Key Features

BioToken tokenization: A biologically-informed tokenization framework that encodes genomic variants and biological region annotations into the token representation, giving the model explicit access to variant identity and functional context rather than raw sequence alone.
Compute efficiency: At 265M parameters, BioFM reaches competitive or state-of-the-art accuracy against models up to 7B parameters while using substantially less training and inference compute.
Variant-centric design: Because variants are represented natively through BioToken, the model is well suited to variant-effect tasks such as noncoding pathogenicity prediction, where many sequence models struggle.
Broad task coverage: A single pretrained backbone is evaluated across noncoding pathogenicity, gene expression modulation, splicing/sQTL prediction, and long-range genomic interactions.
Strong benchmark standing: BioFM matches or surpasses established genomic models including Enformer, SpliceTransformer, and Nucleotide Transformer across these benchmarks.

Technical Details

BioFM is a transformer-based genomic foundation model with approximately 265 million parameters, pretrained in a self-supervised fashion on genomic sequence encoded via the BioToken scheme. BioToken's defining property is that genomic variants and biological region annotations are incorporated into the tokens themselves rather than supplied only as downstream labels or post-hoc features; the authors argue that this input-level design choice is responsible for the model's variant sensitivity. The preprint reports evaluation across four task families: noncoding pathogenicity, expression modulation, sQTL (splicing quantitative trait loci) prediction, and long-range genomic interactions. Across these, BioFM is reported to match or outperform specialized baselines such as Enformer and SpliceTransformer and genomic foundation models including the Nucleotide Transformer, and to remain competitive with models of up to 7B parameters despite its much smaller size and lower compute footprint. Because the work is a preprint (v2, November 2025), these results await peer review and independent replication.

Applications

BioFM targets researchers in regulatory and clinical genomics who need accurate interpretation of genetic variants without access to large-scale compute. Its native handling of variants makes it directly applicable to noncoding pathogenicity assessment, prioritizing variants of uncertain significance, predicting effects on gene expression, identifying splice-altering variants through sQTL prediction, and modeling long-range regulatory interactions. The model's efficiency lowers the barrier for groups that cannot run multi-billion-parameter foundation models, making variant-aware genomic prediction more broadly accessible.
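One generic way sequence models are used for variant prioritization is to compare a representation of the reference sequence with a representation of the sequence carrying the alternate allele, then rank variants by how far apart the two are. The sketch below illustrates that pattern only. It is not taken from the BioFM preprint, and because no BioFM API is documented here, `toy_embed` (simple k-mer counts) is a hypothetical stand-in for a real model embedding; `apply_snv` and `rank_variants` are likewise illustrative names.

```python
import math
from collections import Counter


def toy_embed(seq, k=3):
    """Hypothetical stand-in for a model embedding: overlapping k-mer counts."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def apply_snv(seq, pos, alt):
    """Return seq with the base at 0-based position pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]


def rank_variants(ref_seq, variants, embed=toy_embed):
    """Rank (pos, alt) variants by embedding distance from the reference, largest first."""
    ref_vec = embed(ref_seq)
    scored = [
        (cosine_distance(ref_vec, embed(apply_snv(ref_seq, pos, alt))), pos, alt)
        for pos, alt in variants
    ]
    return sorted(scored, reverse=True)


ranked = rank_variants("ACGTACGT", [(3, "T"), (3, "A")])
for score, pos, alt in ranked:
    print(f"pos={pos} alt={alt} score={score:.3f}")
```

Swapping `toy_embed` for a real model's embedding function would keep the ranking logic unchanged, though whether embedding distance tracks pathogenicity for any given model is an empirical question.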

Impact

BioFM contributes to an active debate in genomic AI about whether scale or representation is the binding constraint on performance. By showing that a 265M-parameter model with biologically-informed tokenization can rival models more than 20 times larger, it offers evidence that encoding domain knowledge into the input (here, variants and region annotations via BioToken) can be a more compute-efficient path than simply enlarging the network. As of June 2026, the work remains a preprint and its full influence is still emerging. The authors state that code and model checkpoints will be provided, but at that time the m42-health/BioFM GitHub and HuggingFace repositories did not resolve and the weight license was unconfirmed; the paper itself is released under CC-BY-NC. Prospective users should verify artifact availability and licensing terms before adoption.
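The "more than 20 times larger" figure follows directly from the parameter counts quoted on this page (265M versus up to 7B). The short calculation below checks that ratio and adds a rough weights-only memory estimate. The 2 bytes per parameter is an assumption (16-bit weights), not a detail from the preprint, and the estimate ignores activations, optimizer state, and any tokenizer overhead.

```python
BIOFM_PARAMS = 265e6            # from this page
LARGEST_BASELINE_PARAMS = 7e9   # "models up to 7B parameters"
BYTES_PER_PARAM = 2             # assumption: 16-bit weights


def weight_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


ratio = LARGEST_BASELINE_PARAMS / BIOFM_PARAMS
print(f"size ratio: {ratio:.1f}x")
print(f"BioFM weights: ~{weight_gb(BIOFM_PARAMS):.2f} GB")
print(f"7B model weights: ~{weight_gb(LARGEST_BASELINE_PARAMS):.1f} GB")
```

Under these assumptions the ratio is about 26, consistent with the "more than 20 times" wording.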

Citation

BioToken and BioFM – Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models

Preprint

Medvedev, A., et al. (2025) BioToken and BioFM – Biologically-Informed Tokenization Enables Accurate and Efficient Genomic Foundation Models. bioRxiv.

DOI: 10.1101/2025.03.27.645711

Recent citations

Papers that recently cited this model.

JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures
Ariel Larey, Elay Dahan, Amit Bleiweiss, et al.
arXiv.org · Feb 2026
Citations: 7

GFMBench-API: A Standardized Interface for Benchmarking Genomic Foundation Models
Ariel Larey, Elay Dahan, Amit Bleiweiss, et al.
bioRxiv · Feb 2026
Citations: 4

The DNA dialect: a comprehensive guide to pretrained genomic language models
Marcell Veiner, Fran Supek
Molecular Systems Biology · Jan 2026
Citations: 2

Top citations

The most-cited papers that cite this model.

JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures
Ariel Larey, Elay Dahan, Amit Bleiweiss, et al.
arXiv.org · Feb 2026
Citations: 7

GFMBench-API: A Standardized Interface for Benchmarking Genomic Foundation Models
Ariel Larey, Elay Dahan, Amit Bleiweiss, et al.
bioRxiv · Feb 2026
Citations: 4

Fluctuation structure predicts genome-wide perturbation outcomes
Benjamin Kuznets-Speck, Leon Schwartz, Hanxiao Sun, et al.
bioRxiv · Jul 2025
Citations: 4

The DNA dialect: a comprehensive guide to pretrained genomic language models
Marcell Veiner, Fran Supek
Molecular Systems Biology · Jan 2026
Citations: 2

GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling
Jaskaran Singh, Prabhav Sanga, A. K. Dubey, et al.
Citations: 0

Citations

Total Citations: 5

Influential: 0

References: 59

Fields of citing research

Biology: 100%
Computer Science: 100%
Medicine: 40%

Share of papers citing this model.

Openness

bio.rodeo openness: Closed · low usability and reproducibility

Score: 4 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 0

not reproducible

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

Research Paper
