BioMatrix is a multimodal biological foundation model that natively integrates 1D sequences, 3D structures, and natural language for both small molecules and proteins within a single decoder-only architecture. Most prior biological foundation models specialize in one modality (e.g., protein language models such as ESM, or molecular sequence models) and bolt on additional modalities through external encoders, projection adapters, or modality-specific output heads. BioMatrix instead casts every modality into a shared discrete token space and trains a single next-token-prediction objective over the entire "modality matrix," removing the architectural seams that typically separate sequence, structure, and text processing.

The model was developed by researchers at Shanghai AI Laboratory and the Gaoling School of Artificial Intelligence at Renmin University of China, led by first author Qizhi Pei with senior author Lijun Wu, and released as an arXiv preprint in June 2026. It is built on the Qwen3 language model backbone, offered in 1.7B and 4B parameter sizes, and continually pretrained on 304.4 billion tokens spanning general and domain-specific text, molecular and protein data in both 1D and 3D forms, and cross-modal corpora.

By unifying tokenization across modalities, BioMatrix supports cross-modal generation within one model, such as converting between a protein sequence and its structure, or between a molecule and a textual description. The authors report state-of-the-art or competitive results on 77 of 80 downstream tasks. Because the work is a recent preprint that has not yet appeared in a peer-reviewed venue, these results should be read as author-reported.

Key Features

Unified discrete tokenization: All modalities—SMILES/SELFIES strings, protein sequences, 3D molecular and protein structures, and natural language—are mapped into a single token vocabulary, so no external encoders, projection adapters, or modality-specific output heads are required.
Structure tokenizers: 3D structures are quantized into discrete tokens via MolStructTok (a 512-entry codebook) for molecules and a GCP-VQVAE (a 4,096-entry codebook) for proteins, yielding roughly 11,294 added structural tokens in the extended Qwen3 vocabulary.
Cross-modal generation: Because every modality shares one autoregressive objective, the model can translate across sequence, structure, and language in either direction, including structure prediction and structure-conditioned generation.
Four released checkpoints: 1.7B and 4B variants are each available as a continual-pretraining-only (CPT) checkpoint and an instruction-tuned (SFT) checkpoint, with the SFT variants recommended for practical use.
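
The tokenization features above can be illustrated with a minimal sketch, assuming a nearest-neighbor codebook lookup followed by an ID offset into the shared vocabulary. The codebook sizes (512 and 4,096) come from the text; `BASE_VOCAB_SIZE`, the ordering of the two token blocks, and the function names are hypothetical and are not taken from the BioMatrix release.

```python
from math import dist

# Codebook sizes are from the text. BASE_VOCAB_SIZE is a hypothetical
# placeholder, not the real Qwen3 vocabulary size.
BASE_VOCAB_SIZE = 100_000
MOL_CODEBOOK_SIZE = 512
PROT_CODEBOOK_SIZE = 4096

# Hypothetical layout: molecule structure tokens first, then protein ones.
MOL_OFFSET = BASE_VOCAB_SIZE
PROT_OFFSET = MOL_OFFSET + MOL_CODEBOOK_SIZE


def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (L2)."""
    return min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))


def structure_token_id(code_index, modality):
    """Map a codebook index to an ID in the shared vocabulary."""
    if modality == "molecule":
        size, offset = MOL_CODEBOOK_SIZE, MOL_OFFSET
    elif modality == "protein":
        size, offset = PROT_CODEBOOK_SIZE, PROT_OFFSET
    else:
        raise ValueError(f"unknown modality: {modality}")
    if not 0 <= code_index < size:
        raise ValueError(f"code index {code_index} out of range for {modality}")
    return offset + code_index


# Toy 2-D codebook with three entries, standing in for a learned one.
toy_codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
code = quantize((0.9, 0.2), toy_codebook)
token = structure_token_id(code, "molecule")
```

Under these assumptions the two codebooks occupy 512 + 4,096 = 4,608 IDs. The text does not break down the remainder of the added structural tokens, so the sketch does not model them.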

Technical Details

BioMatrix uses a decoder-only transformer based on Qwen3-1.7B-Base and Qwen3-4B-Base, with the 4B-SFT checkpoint supporting an 8,192-token context length. Continual pretraining covers 304.4 billion tokens across text, molecular and protein 1D/3D data, and cross-modal pairs, followed by instruction tuning over 80 downstream tasks (generation, name conversion, property prediction, captioning, folding, and binding-affinity estimation). The instruction-tuning corpus (BioMatrix-SFT) comprises roughly 23.6 million examples drawn from sources including SMolInstruct, MoleculeQA, OpenMolIns, DPLM-2, and PDBBind. Across the 80 evaluation tasks in six categories spanning molecules, proteins, and their interactions, the authors report state-of-the-art or competitive performance on 77, without relying on modality-specific architectural components.
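
Because the model is decoder-only over a shared vocabulary, one next-token loss can cover whatever mix of text, sequence, and structure tokens a training sequence contains. The following is a minimal sketch of that loss in plain Python, not the authors' training code; the toy vocabulary, the function name, and the uniform logits are assumptions made for illustration.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits[t] is a list of unnormalized scores over the vocabulary,
    produced after reading token_ids[: t + 1].
    """
    steps = len(token_ids) - 1
    total = 0.0
    for t in range(steps):
        scores = logits[t]
        target = token_ids[t + 1]
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total / steps


# Toy vocabulary of 4 IDs: 0-1 stand for text tokens, 2-3 for structure tokens.
# The loss treats both kinds identically; modality never enters the formula.
sequence = [0, 2, 3, 1]
uniform_logits = [[0.0] * 4 for _ in sequence]
loss = next_token_loss(uniform_logits, sequence)
```

With uniform scores the loss equals the natural log of the vocabulary size, which is a convenient sanity check for any implementation of this objective.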

Applications

BioMatrix targets computational chemists, structural biologists, and drug-discovery researchers who otherwise juggle separate specialized models for molecular property prediction, protein structure prediction, and biomedical text understanding. A single model handles molecule generation and captioning, protein folding and inverse folding, property and binding-affinity prediction, and cross-modal translation, which simplifies pipelines that combine small-molecule and protein reasoning—for example, structure-aware molecule design or protein-ligand interaction analysis. The Apache-2.0 license and four open checkpoints make it accessible for both direct use and downstream fine-tuning.

Impact

The authors position BioMatrix as the first foundation model to natively span the full "modality matrix" of sequences, structures, and language across both small molecules and proteins in one decoder-only model. Their reported results suggest that a unified token space can match or exceed adapter-based and specialized approaches on a broad task suite. If the reported breadth holds up under independent evaluation, the design points toward a simpler recipe for multimodal biological modeling: scaling one autoregressive objective rather than stitching together modality-specific encoders. Because the work is a recent preprint, its real-world adoption and reproducibility remain to be established, and the authors note limitations on complex structures and domain coverage.

Citation

BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

Preprint

Pei, Q., et al. (2026) BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language. arXiv.

DOI: 10.48550/arXiv.2606.22138

