AMix-2

Shanghai AI Laboratory / Tsinghua University / Fudan University / City University of Hong Kong / Chinese University of Hong Kong, Shenzhen

Protein-text foundation model placing amino acid sequences and natural language in one token space for protein understanding and de novo design.

Released: May 2026

AMix-2 is a protein–text foundation model that places amino-acid sequences and natural-language descriptions into a single shared token space, allowing one pretrained checkpoint to handle both protein understanding (answering questions about a protein, predicting its function or fold) and protein generation (designing new sequences conditioned on a textual specification). Rather than training separate encoders for protein and language and stitching them together, AMix-2 treats both modalities as sequences of tokens consumed by a unified backbone, so the same model can read a protein, read a prompt about it, and write a new protein in response.

The model was introduced in a 2026 preprint (arXiv:2605.30963) by Keyue Qiu, Yixin Wu, Lihao Wang and colleagues at Shanghai AI Laboratory, in collaboration with groups at Tsinghua University (GenSI / AIR), Fudan University, City University of Hong Kong, and the Chinese University of Hong Kong, Shenzhen. It sits alongside a growing class of multimodal protein–language systems (for example ProtST, ProtLLM, and instruction-tuned protein models) but distinguishes itself by using a single diffusion language-model backbone for both directions of the task rather than a retrieval or adapter-based bridge.

To measure these capabilities, the authors also introduce ProteinArena, a benchmark suite spanning protein question answering, enzyme-commission (EC) prediction, CATH fold classification, and function-conditioned sequence design. The paper states that both AMix-2 and ProteinArena will be released.
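As an illustration of how EC prediction can be scored, the sketch below computes accuracy at each level of the four-part EC hierarchy. The metrics ProteinArena actually reports are not described here, so the function names, the level-wise scoring scheme, and the example predictions are all assumptions made for illustration.

```python
from typing import List


def ec_match_depth(predicted: str, reference: str) -> int:
    """Count how many leading EC levels (0 to 4) agree between two EC numbers."""
    depth = 0
    for p, r in zip(predicted.split("."), reference.split(".")):
        # Stop at the first disagreement or at an unassigned level ("-").
        if p != r or p == "-":
            break
        depth += 1
    return depth


def accuracy_at_level(preds: List[str], refs: List[str], level: int) -> float:
    """Fraction of predictions that match the reference down to `level`."""
    hits = sum(ec_match_depth(p, r) >= level for p, r in zip(preds, refs))
    return hits / len(refs)


# Made-up predictions and references, not results from the paper.
preds = ["3.1.1.1", "2.7.11.1", "1.1.1.1"]
refs = ["3.1.1.3", "2.7.11.1", "4.2.1.1"]

for level in range(1, 5):
    print(level, accuracy_at_level(preds, refs, level))
```

Scoring by level separates a prediction that gets the enzyme class right but the substrate wrong from one that misses the class entirely.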

Key Features

Shared protein–text token space: Sequences and natural language are encoded in one vocabulary, so understanding and generation are handled by the same model without modality-specific bridges.
Block-wise diffusion language-model backbone: AMix-2 uses a block-wise diffusion LM rather than a purely autoregressive decoder, which supports flexible, non-left-to-right generation of protein sequences.
Single checkpoint, dual capability: One pretrained model performs both protein analysis (QA, function and fold prediction) and de novo sequence design, reducing the need for task-specific fine-tuned variants.
Zero- and few-shot evaluation: The model is assessed in zero- and few-shot settings on ProteinArena, probing how well its pretraining transfers without extensive task-specific tuning.
Function-conditioned design: Sequences can be generated conditioned on a textual description of desired function, linking natural-language intent to protein output.
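To make the shared token space concrete, here is a minimal toy sketch in which text words and amino-acid residues draw their ids from one vocabulary table. AMix-2's actual tokenizer is not described in this entry, so the `<prot>` delimiters, the `aa:` prefix, and the whitespace word splitting are hypothetical choices made only for illustration.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SPECIAL = ["<pad>", "<mask>", "<prot>", "</prot>"]  # hypothetical control tokens


def build_vocab(text_words: List[str]) -> Dict[str, int]:
    """Assign ids from one shared table: specials, residues, then text words."""
    vocab: Dict[str, int] = {}
    residue_tokens = [f"aa:{a}" for a in AMINO_ACIDS]
    for tok in SPECIAL + residue_tokens + sorted(set(text_words)):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(prompt: str, protein: str, vocab: Dict[str, int]) -> List[int]:
    """Encode a text prompt followed by a protein as a single id sequence."""
    tokens = prompt.lower().split()
    tokens += ["<prot>"] + [f"aa:{a}" for a in protein] + ["</prot>"]
    return [vocab[t] for t in tokens]


prompt = "design a small hydrolase"
vocab = build_vocab(prompt.lower().split())
ids = encode(prompt, "MKT", vocab)
print(len(vocab), ids)
```

Because both modalities index the same table, a single model can consume the prompt and the protein as one sequence. The `aa:` prefix keeps the residue "A" distinct from the English word "a".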

Technical Details

AMix-2 is built on a block-wise diffusion language-model backbone that operates over a unified token space covering both amino-acid sequences and natural language. Pretraining draws on UniRef50 for protein sequences together with text and annotations from UniProtKB, Swiss-Prot, and InterPro, aligning sequence content with the descriptions, functions, and family/domain annotations attached to those entries. Evaluation is carried out on the accompanying ProteinArena benchmark, which covers protein question answering, EC-number prediction, CATH fold classification, and function-conditioned design, with results reported in zero- and few-shot regimes. The exact parameter count and the number of released model sizes are not stated in the preprint, so the model's scale should be treated as unspecified pending the official release.
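The following toy sketch illustrates the general idea of a block-wise diffusion decoding schedule: blocks are filled left to right, while positions inside a block are unmasked in an arbitrary order over several steps. It is not AMix-2's algorithm. The `random_denoiser` stand-in, the block size, and the step count are assumptions, and a trained model would typically choose which positions to unmask from its own predictions rather than at random.

```python
import random
from typing import Callable, List

MASK = "_"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

Denoiser = Callable[[List[str], int, random.Random], str]


def random_denoiser(seq: List[str], pos: int, rng: random.Random) -> str:
    """Stand-in for a trained network: proposes a residue for one masked slot."""
    return rng.choice(AMINO_ACIDS)


def blockwise_decode(length: int, block_size: int, steps_per_block: int,
                     denoise: Denoiser = random_denoiser, seed: int = 0) -> str:
    """Fill a fully masked sequence block by block, several slots per step."""
    rng = random.Random(seed)
    seq = [MASK] * length
    for start in range(0, length, block_size):          # blocks go left to right
        masked = list(range(start, min(start + block_size, length)))
        rng.shuffle(masked)                              # order inside a block is free
        per_step = -(-len(masked) // steps_per_block)    # ceiling division
        while masked:
            chosen, masked = masked[:per_step], masked[per_step:]
            for pos in chosen:
                seq[pos] = denoise(seq, pos, rng)        # unmask the chosen slots
    return "".join(seq)


designed = blockwise_decode(length=10, block_size=4, steps_per_block=2)
print(designed)
```

The contrast with a purely autoregressive decoder lies in the inner loop: within a block, residues need not be committed in left-to-right order.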

Applications

By unifying understanding and design, AMix-2 is aimed at protein engineers and computational biologists who want to move directly from a natural-language specification of a desired protein to candidate sequences, or to query an unfamiliar protein for its likely function and fold. Potential use cases include function-conditioned de novo design, rapid functional annotation of uncharacterized sequences, enzyme-class prediction, and fold-aware hypothesis generation, all from a single model rather than a pipeline of specialized tools. Such candidates would still require downstream structural and wet-lab validation before experimental use.

Impact

AMix-2 contributes to the trend of treating protein modeling as a multimodal language problem, where one foundation model spans both interpreting and generating biological sequences. Its paired release with the ProteinArena benchmark could give the community a common yardstick for protein understanding and function-conditioned design across QA, EC prediction, and fold classification. As of this writing the work is a preprint and has not been peer reviewed. Pretrained weights are not publicly downloadable and no runnable code repository has been posted (the linked project page is currently static HTML only), so independent reproduction and adoption depend on the promised release of the model and benchmark.

Citation

AMix-2: Establishing Protein as a Native Modality in Large Language Models

Preprint

Qiu, K., et al. (2026) AMix-2: Establishing Protein as a Native Modality in Large Language Models.

DOI: 10.48550/arXiv.2605.30963

Citations

Total Citations: 0

Influential: 0

References: 70

GitHub

Stars: 0

Forks: 0

Open Issues: 0

Contributors: 1

Last Push: 1mo ago

Language: JavaScript

Openness

bio.rodeo openness: Closed · low usability and reproducibility

10 (Closed)

Usability (can I run it?): 7

Reproducibility (can I retrain it?): 14

Model Openness Framework

Unclassified

Restrictive license on core components

Resources

GitHub Repository Research Paper Official Website
