bio.rodeo


© 2026 Pulsatance. All rights reserved.
Protein / Language model

AMix-2

Shanghai AI Laboratory / Tsinghua University / Fudan University / City University of Hong Kong / Chinese University of Hong Kong, Shenzhen

A protein-text foundation model embedding sequences and natural language in a shared token space, enabling protein understanding and de novo design from one checkpoint.

Released: May 2026

AMix-2 is a protein–text foundation model that places amino-acid sequences and natural-language descriptions into a single shared token space, allowing one pretrained checkpoint to handle both protein understanding (answering questions about a protein, predicting its function or fold) and protein generation (designing new sequences conditioned on a textual specification). Rather than training separate encoders for protein and language and stitching them together, AMix-2 treats both modalities as sequences of tokens consumed by a unified backbone, so the same model can read a protein, read a prompt about it, and write a new protein in response.

The model was introduced in a 2026 preprint (arXiv:2605.30963) by Keyue Qiu, Yixin Wu, Lihao Wang and colleagues at Shanghai AI Laboratory, in collaboration with groups at Tsinghua University (GenSI / AIR), Fudan University, City University of Hong Kong, and the Chinese University of Hong Kong, Shenzhen. It sits alongside a growing class of multimodal protein–language systems (for example ProtST, ProtLLM, and instruction-tuned protein models) but distinguishes itself by using a single diffusion language-model backbone for both directions of the task rather than a retrieval or adapter-based bridge.

To measure these capabilities, the authors also introduce ProteinArena, a benchmark suite spanning protein question answering, enzyme-commission (EC) prediction, CATH fold classification, and function-conditioned sequence design. The paper states that both AMix-2 and ProteinArena will be released.
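As an illustration of how an EC-prediction task of this kind could be scored, the sketch below counts how many leading levels of the four-level EC hierarchy a prediction shares with the reference. This is an assumed, generic scoring scheme, not the metric ProteinArena defines; the function names `ec_match_depth` and `accuracy_at_level` and the example prediction pairs are hypothetical.

```python
def ec_match_depth(predicted, reference):
    """Count the leading EC levels (0 to 4) on which two EC numbers agree."""
    depth = 0
    for p, r in zip(predicted.split("."), reference.split(".")):
        # A "-" marks an unspecified level, so it never counts as a match.
        if p != r or p == "-":
            break
        depth += 1
    return depth


def accuracy_at_level(pairs, level):
    """Fraction of (predicted, reference) pairs agreeing on at least `level` levels."""
    hits = sum(ec_match_depth(p, r) >= level for p, r in pairs)
    return hits / len(pairs)


# Illustrative (predicted, reference) pairs, not real model output.
pairs = [
    ("3.4.21.4", "3.4.21.4"),  # exact match
    ("3.4.21.5", "3.4.21.4"),  # right sub-subclass, wrong serial number
    ("2.7.11.1", "3.4.21.4"),  # wrong top-level class
]

for level in (1, 2, 3, 4):
    print(f"accuracy at EC level {level}: {accuracy_at_level(pairs, level):.2f}")
```

Reporting accuracy per level separates a model that finds the right enzyme class but misses the exact serial number from one that is wrong at the top of the hierarchy.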

#Key Features

  • Shared protein–text token space: Sequences and natural language are encoded in one vocabulary, so understanding and generation are handled by the same model without modality-specific bridges.
  • Block-wise diffusion language-model backbone: AMix-2 uses a block-wise diffusion LM rather than a purely autoregressive decoder, which supports flexible, non-left-to-right generation of protein sequences.
  • Single checkpoint, dual capability: One pretrained model performs both protein analysis (QA, function and fold prediction) and de novo sequence design, reducing the need for task-specific fine-tuned variants.
  • Zero- and few-shot evaluation: The model is assessed in zero- and few-shot settings on ProteinArena, probing how well its pretraining transfers without extensive task-specific tuning.
  • Function-conditioned design: Sequences can be generated conditioned on a textual description of desired function, linking natural-language intent to protein output.
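
The shared token space described above can be pictured with a toy vocabulary in which text tokens and amino-acid tokens draw their ids from one table. This is a minimal sketch under stated assumptions: AMix-2's actual tokenizer and vocabulary are not described here, and the special tokens (`<prot>`, `</prot>`, `<mask>`, `<pad>`), the `aa:` prefix, and the whitespace word splitting are all hypothetical choices made for illustration.

```python
# Hypothetical illustration only; this is not AMix-2's tokenizer.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocab(text_words):
    """Assign ids from a single table to special, amino-acid, and text tokens."""
    tokens = ["<pad>", "<mask>", "<prot>", "</prot>"]
    # Prefix residues so the amino acid "A" never collides with the word "a".
    tokens += [f"aa:{a}" for a in AMINO_ACIDS]
    tokens += sorted(set(text_words))
    return {tok: i for i, tok in enumerate(tokens)}


def encode(prompt, sequence, vocab):
    """Encode a text prompt followed by a protein as one id sequence."""
    ids = [vocab[w] for w in prompt.lower().split()]
    ids.append(vocab["<prot>"])
    ids += [vocab[f"aa:{a}"] for a in sequence]
    ids.append(vocab["</prot>"])
    return ids


prompt = "design a zinc binding protein"
vocab = build_vocab(prompt.lower().split())
ids = encode(prompt, "MKHC", vocab)
print(ids)
```

Because both modalities map into one id space, a single backbone can consume the whole list without a modality-specific adapter, which is the property the first bullet describes.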

#Technical Details

AMix-2 is built on a block-wise diffusion language-model backbone that operates over a unified token space covering both amino-acid sequences and natural language. Pretraining draws on UniRef50 for protein sequences together with text and annotations from UniProtKB, Swiss-Prot, and InterPro, aligning sequence content with the descriptions, functions, and family/domain annotations attached to those entries. Evaluation is carried out on the accompanying ProteinArena benchmark, which covers protein question answering, EC-number prediction, CATH fold classification, and function-conditioned design, with results reported in zero- and few-shot regimes. The exact parameter count and the number of released model sizes are not stated in the preprint, so the model's scale should be treated as unspecified pending the official release.
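
To make the idea of block-wise diffusion decoding concrete, the toy sketch below fills a masked sequence one block at a time, and within each block unmasks positions over several denoising steps in order of confidence rather than left to right. It is a generic illustration of the technique, not AMix-2's sampler: `toy_denoiser` is a random stand-in for a trained network, and the block size, step count, and confidence-based unmasking rule are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "_"


def toy_denoiser(seq, position, rng):
    """Stand-in for a trained network: proposes a residue and a confidence score."""
    return rng.choice(AMINO_ACIDS), rng.random()


def blockwise_decode(length, block_size, steps_per_block, seed=0):
    """Fill a fully masked sequence block by block, several steps per block."""
    rng = random.Random(seed)
    seq = [MASK] * length
    # Blocks are visited left to right; positions inside a block are not.
    for start in range(0, length, block_size):
        block = range(start, min(start + block_size, length))
        for step in range(steps_per_block):
            masked = [i for i in block if seq[i] == MASK]
            if not masked:
                break
            # Unmask an even share of what remains; the last step takes all.
            remaining_steps = steps_per_block - step
            k = -(-len(masked) // remaining_steps)  # ceiling division
            proposals = {i: toy_denoiser(seq, i, rng) for i in masked}
            # Commit the k most confident proposals, in any positional order.
            best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
            for i in best:
                seq[i] = proposals[i][0]
    return "".join(seq)


designed = blockwise_decode(length=10, block_size=4, steps_per_block=2)
print(designed)
```

In a real system the denoiser would condition on the text prompt and on previously completed blocks; here the output is random residues and only the decoding order is meaningful.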

#Applications

By unifying understanding and design, AMix-2 is aimed at protein engineers and computational biologists who want to move directly from a natural-language specification of a desired protein to candidate sequences, or to query an unfamiliar protein for its likely function and fold. Potential use cases include function-conditioned de novo design, rapid functional annotation of uncharacterized sequences, enzyme-class prediction, and fold-aware hypothesis generation, all from a single model rather than a pipeline of specialized tools. Such candidates would still require downstream structural and wet-lab validation before experimental use.

#Impact

AMix-2 contributes to the trend of treating protein modeling as a multimodal language problem, in which one foundation model both interprets and generates biological sequences. Its paired release with the ProteinArena benchmark could give the community a common yardstick for protein understanding and function-conditioned design across QA, EC prediction, and fold classification. As of this writing, the work is a preprint and has not been peer reviewed. Pretrained weights are not publicly downloadable and no runnable code repository has been posted (the linked project page is currently static HTML only), so independent reproduction and adoption will depend on the promised release of the model and benchmark.

Citation

Preprint

DOI: 10.48550/arXiv.2605.30963

Openness

Unclassified
Restrictive license on core components

Tags

diffusion, fold_classification, foundation_model, function_prediction, language_model, multimodal, protein_design, proteomics, zero_shot

Resources

GitHub Repository / Research Paper / Official Website