Shanghai AI Laboratory / Tsinghua University / Fudan University / City University of Hong Kong / Chinese University of Hong Kong, Shenzhen
A protein-text foundation model embedding sequences and natural language in a shared token space, enabling protein understanding and de novo design from one checkpoint.
AMix-2 is a protein–text foundation model that places amino-acid sequences and natural-language descriptions into a single shared token space, allowing one pretrained checkpoint to handle both protein understanding (answering questions about a protein, predicting its function or fold) and protein generation (designing new sequences conditioned on a textual specification). Rather than training separate encoders for protein and language and stitching them together, AMix-2 treats both modalities as sequences of tokens consumed by a unified backbone, so the same model can read a protein, read a prompt about it, and write a new protein in response.
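To make the idea of a single shared token space concrete, the following is a minimal sketch and not the tokenizer described in the preprint. It assumes a toy whitespace tokenizer for text, per-residue tokens for proteins, and hypothetical special tokens (`<protein>`, `</protein>`, `<mask>`, `<pad>`); every name here is invented for illustration.

```python
# Hypothetical sketch: one ID space shared by text tokens and amino-acid tokens.
# Token names and layout are assumptions, not the AMix-2 vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SPECIAL = ["<pad>", "<mask>", "<protein>", "</protein>"]


def build_vocab(text_words):
    """Assign consecutive IDs to special, amino-acid, and text tokens."""
    # Residues get an "<aa:X>" form so that residue A never collides with the word "a".
    tokens = SPECIAL + [f"<aa:{a}>" for a in AMINO_ACIDS] + sorted(set(text_words))
    return {tok: i for i, tok in enumerate(tokens)}


def encode(prompt, sequence, vocab):
    """Encode a text prompt followed by a delimited protein sequence."""
    ids = [vocab[w] for w in prompt.lower().split()]
    ids.append(vocab["<protein>"])
    ids.extend(vocab[f"<aa:{a}>"] for a in sequence)
    ids.append(vocab["</protein>"])
    return ids


vocab = build_vocab("design a zinc binding hydrolase".split())
ids = encode("design a hydrolase", "MKT", vocab)
print(ids)
```

Because both modalities end up as integers drawn from the same dictionary, a single sequence model can consume a prompt and a protein in one pass without a separate encoder per modality.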
The model was introduced in a 2026 preprint (arXiv:2605.30963) by Keyue Qiu, Yixin Wu, Lihao Wang and colleagues at Shanghai AI Laboratory, in collaboration with groups at Tsinghua University (GenSI / AIR), Fudan University, City University of Hong Kong, and the Chinese University of Hong Kong, Shenzhen. It sits alongside a growing class of multimodal protein–language systems (for example ProtST, ProtLLM, and instruction-tuned protein models) but distinguishes itself by using a single diffusion language-model backbone for both understanding and generation, rather than a retrieval- or adapter-based bridge between separate models.
To measure these capabilities, the authors also introduce ProteinArena, a benchmark suite spanning protein question answering, enzyme-commission (EC) prediction, CATH fold classification, and function-conditioned sequence design. The paper states that both AMix-2 and ProteinArena will be released.
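EC numbers are four-field hierarchical identifiers (class, subclass, sub-subclass, serial number), so one plausible way to score EC prediction is accuracy at each level of the hierarchy. The sketch below illustrates that idea only; ProteinArena's actual metrics are not specified here, and the function names and example pairs are hypothetical.

```python
# Hypothetical scoring sketch; not the ProteinArena evaluation code.
def ec_match_depth(predicted, reference):
    """Count how many leading EC fields (0 to 4) agree between two EC numbers."""
    depth = 0
    for p, r in zip(predicted.split("."), reference.split(".")):
        # A "-" field means "unspecified" and never counts as a match.
        if p != r or p == "-":
            break
        depth += 1
    return depth


def accuracy_at_level(pairs, level):
    """Fraction of (predicted, reference) pairs agreeing on the first `level` fields."""
    if not pairs:
        return 0.0
    hits = sum(ec_match_depth(p, r) >= level for p, r in pairs)
    return hits / len(pairs)


pairs = [
    ("3.4.21.4", "3.4.21.4"),  # exact match
    ("3.4.21.5", "3.4.21.4"),  # same sub-subclass, different serial number
    ("2.7.11.1", "3.4.21.4"),  # different top-level class
    ("3.4.-.-", "3.4.21.4"),   # partial prediction, correct to two fields
]
for level in range(1, 5):
    print(level, accuracy_at_level(pairs, level))
```

Reporting all four levels separates a model that gets the broad enzyme class right from one that also identifies the specific reaction.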
AMix-2 is built on a block-wise diffusion language-model backbone that operates over a unified token space covering both amino-acid sequences and natural language. Pretraining draws on UniRef50 for protein sequences, together with text and annotations from UniProtKB (including its manually reviewed Swiss-Prot section) and InterPro, aligning sequence content with the descriptions, functions, and family/domain annotations attached to those entries. Evaluation uses the accompanying ProteinArena benchmark (protein question answering, EC-number prediction, CATH fold classification, and function-conditioned design), with results reported in zero- and few-shot regimes. The preprint does not state the parameter count or the number of released model sizes, so the model's scale remains unspecified pending the official release.
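As a rough illustration of what "block-wise" can mean for a masked diffusion language model, the sketch below builds one training example: the sequence is cut into fixed-size blocks, earlier blocks are kept clean as context, and a fraction of the current block is replaced by a mask token for the model to recover. This is a generic corruption step under assumed settings; the block size, mask rate, and function names are hypothetical, and the preprint's actual noise schedule and objective are not reproduced here.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def split_blocks(tokens, block_size):
    """Cut a token list into consecutive blocks of at most `block_size` tokens."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def corrupt_block(tokens, block_size, block_index, mask_rate, rng):
    """Return (model_input, target) for one block.

    Blocks before `block_index` are kept clean as context, each token of the
    chosen block is masked with probability `mask_rate`, and later blocks are
    omitted because they have not been generated yet.
    """
    blocks = split_blocks(tokens, block_size)
    context = [t for block in blocks[:block_index] for t in block]
    target = blocks[block_index]
    noisy = [MASK if rng.random() < mask_rate else t for t in target]
    return context + noisy, target


rng = random.Random(0)
tokens = list("MKTAYIAKQR")
model_input, target = corrupt_block(tokens, block_size=4, block_index=1,
                                    mask_rate=0.5, rng=rng)
print(model_input, target)
```

At generation time the same structure would run in reverse: a fully masked block is appended to the clean context and iteratively unmasked before moving on to the next block.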
Because it unifies understanding and design, AMix-2 is aimed at protein engineers and computational biologists who want to move directly from a natural-language specification of a desired protein to candidate sequences, or to query an unfamiliar protein for its likely function and fold. Potential use cases include function-conditioned de novo design, rapid functional annotation of uncharacterized sequences, enzyme-class prediction, and fold-aware hypothesis generation, all from a single model rather than a pipeline of specialized tools. Any designed candidates would still require downstream structural and wet-lab validation before experimental use.
AMix-2 contributes to the trend of treating protein modeling as a multimodal language problem, where one foundation model spans both interpreting and generating biological sequences. Its paired release with the ProteinArena benchmark could give the community a common yardstick for protein understanding and function-conditioned design across QA, EC prediction, and fold classification. As of this writing, the work is a preprint and has not yet been peer reviewed. Pretrained weights are not yet publicly downloadable, and no runnable code repository has been posted (the linked project page is currently static HTML only), so independent reproduction and adoption will depend on the promised release of the model and benchmark.