An enhanced T5-based encoder-decoder that unifies molecule, protein, and text understanding via IUPAC integration and multi-task instruction tuning.
BioT5+ is a generalized biological language model that jointly processes molecules, proteins, and natural language text within a single encoder-decoder framework. Published in the Findings of ACL 2024, it extends the earlier BioT5 model (EMNLP 2023) by addressing two concrete limitations: an inability to interpret IUPAC chemical names — the systematic nomenclature used throughout the biochemical literature — and poor generalization across heterogeneous task types. The result is a model capable of seamlessly switching between classifying molecular properties, predicting numerical quantities such as binding affinities, and generating molecular structures or protein descriptions on demand.
The model was developed by Qizhi Pei, Lijun Wu, and colleagues at Microsoft Research Asia and Peking University, building on the broader trend of cross-modal biological foundation models that bridge the gap between molecular representations and human-readable scientific text. Where most protein language models treat sequence as the sole input modality and most cheminformatics models operate purely on structural representations, BioT5+ treats the problem as a unified text-to-text task, reformulating every biological question — regardless of input type — as a sequence generation problem.
BioT5+ achieved first place in the Text-based Molecule Generation track and second place in the Molecular Captioning track at the Language + Molecules @ ACL 2024 shared task competition, providing independent validation of its cross-modal capabilities against specialized competing systems.
BioT5+ is built on the T5-v1.1-base architecture, an encoder-decoder transformer with approximately 252 million parameters. The encoder ingests mixed-modality input sequences — combinations of natural language, SELFIES molecular representations, IUPAC names, and protein FASTA sequences — while the decoder autoregressively generates outputs in any of these formats. The vocabulary is extended beyond standard T5 to include SELFIES tokens and single-amino-acid protein tokens, with each namespace delimited by special prefix markers: protein sequences carry a <p> prefix tag to distinguish them from standard text, while molecules are expressed in the dedicated SELFIES token set. This explicit modality tagging allows the shared encoder-decoder to handle heterogeneous inputs without confusion between alphabetic protein sequences and standard English words.
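As a rough illustration of this modality tagging, the sketch below assembles a mixed-modality prompt in Python, converting a SMILES string to SELFIES with the `selfies` package. Apart from the <p> protein tag mentioned above, the marker strings and prompt wording are illustrative assumptions, not the exact conventions of the released model.

```python
import selfies as sf

# Hypothetical modality tags; the exact marker strings used by BioT5+ may differ.
MOL_TAG, PROT_TAG = "<m>", "<p>"

def wrap_molecule(smiles: str) -> str:
    """Convert a SMILES string to SELFIES and prefix it with the molecule tag."""
    return f"{MOL_TAG} {sf.encoder(smiles)}"

def wrap_protein(fasta_seq: str) -> str:
    """Prefix a raw amino-acid sequence with the protein tag so it is kept
    separate from ordinary English tokens in the shared vocabulary."""
    return f"{PROT_TAG} {fasta_seq}"

# Example mixed-modality prompt combining text, a molecule, and a protein.
prompt = (
    "Does the following compound bind the target protein? "
    f"Compound: {wrap_molecule('CC(=O)Oc1ccccc1C(=O)O')} "
    f"Target: {wrap_protein('MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ')}"
)
print(prompt)
```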
Pre-training is organized into eight tasks distributed across four categories: modality-specific T5 objectives (molecule SELFIES with paired IUPAC names, standalone molecule SELFIES, protein FASTA sequences, and general text), T5 objectives on wrapped biological text, T5 objectives on bioRxiv literature, and bidirectional translation tasks between molecule SELFIES and text as well as between protein FASTA and text. Training data is drawn from bioRxiv preprints (covering biological literature at scale), PubChem (providing molecule-text pairs with IUPAC annotations), and protein sequence databases. This multi-source, multi-objective pre-training strategy enables the model to learn consistent representations across modalities before downstream instruction tuning is applied.
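The T5 objectives in these categories all follow the standard span-corruption recipe: random spans are replaced by sentinel tokens in the input and reproduced after the matching sentinels in the target. The toy sketch below illustrates that recipe on a SELFIES token sequence; the span length and number of masked spans are placeholder values, not the hyperparameters reported for BioT5+.

```python
import random

def t5_span_corruption(tokens, num_spans=2, span_len=3, seed=0):
    """Toy T5 span-corruption: each masked span becomes a sentinel in the
    input and is reproduced after the matching sentinel in the target."""
    rng = random.Random(seed)
    # Pick non-overlapping, span-aligned start positions to mask.
    starts = sorted(rng.sample(range(0, len(tokens) - span_len, span_len), num_spans))
    inp, tgt, cursor = [], [], 0
    for i, start in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[cursor:start])            # keep tokens before the span
        inp.append(sentinel)                        # replace the span with a sentinel
        tgt.append(sentinel)
        tgt.extend(tokens[start:start + span_len])  # masked span goes to the target
        cursor = start + span_len
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(starts)}>")         # closing sentinel
    return " ".join(inp), " ".join(tgt)

# A short SELFIES token sequence stands in for any of the pre-training corpora.
selfies_tokens = "[C] [C] [=Branch1] [C] [=O] [O] [C] [=C] [C] [=C] [C] [=C]".split()
corrupted_input, target = t5_span_corruption(selfies_tokens)
print(corrupted_input)
print(target)
```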
Downstream fine-tuning reframes each task as a text-to-text problem with a natural language instruction prefix, consistent with the T5 text-to-text paradigm. This includes tasks such as drug-target interaction (DTI) prediction, molecular property prediction (e.g., BBBP, HIV, SIDER), molecule captioning (ChEBI-20), and text-guided molecule generation.
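Concretely, every task reduces to an (instruction + input, target) string pair; the model stays the same and only the prompt changes. The formulations below are hypothetical paraphrases meant to convey the shape of the data, not the exact prompt templates used by the authors.

```python
# Hypothetical instruction-prefixed task formulations for a text-to-text model.
examples = [
    {   # Classification (e.g. BBBP): the answer is generated as a word.
        "input": "Task: property prediction (BBBP). Does this molecule cross "
                 "the blood-brain barrier? Molecule: <m> [C][C][O] ...",
        "target": "Yes",
    },
    {   # Regression (e.g. binding affinity): the number is generated as text.
        "input": "Task: drug-target interaction. Predict the binding affinity. "
                 "Drug: <m> [C][=C][C] ... Target: <p> MKTAYIAK ...",
        "target": "7.2",
    },
    {   # Generation (ChEBI-20 style): a molecule is generated from a description.
        "input": "Task: text-based molecule generation. Description: "
                 "the simplest aromatic carboxylic acid.",
        "target": "<m> [C][=C][C][=C][C][=C] ...",
    },
]
```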
BioT5+ is designed for researchers working at the interface of cheminformatics, protein science, and natural language processing. Drug discovery teams can use the model to predict molecular properties from text descriptions, generate candidate molecules from textual specifications, or retrieve structured information about compounds referenced by IUPAC name in the literature. Computational biologists can apply the drug-target interaction capabilities to prioritize protein-ligand pairs before expensive wet-lab assays. The model's instruction-following design makes it particularly accessible for rapid prototyping: a single fine-tuned checkpoint can be queried with different instruction prefixes to address classification, regression, or generation tasks without modifying the model architecture or retraining from scratch. Fine-tuned checkpoints for specific tasks such as ChEBI-20 molecule captioning, DTI prediction on BIOSNAP, and the Mol-Instructions protein task suite are available on HuggingFace.
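Under the usual Hugging Face transformers API, querying a fine-tuned checkpoint looks roughly like the sketch below; the repository identifier and prompt template are assumptions and should be checked against the authors' model pages.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed checkpoint name; verify the exact identifier on the authors' HuggingFace page.
ckpt = "QizhiPei/biot5-plus-base"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt)

# Illustrative instruction-style prompt asking for a molecule caption.
prompt = (
    "Definition: Describe the input molecule. "
    "Input: <m> [C][C][O] Output:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```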
BioT5+ represents a meaningful step toward unified biological foundation models that treat language, molecular structures, and protein sequences as interoperable modalities within a single framework. Its publication at ACL 2024 Findings signals growing recognition in the NLP community that biological sequences are a first-class language domain deserving of state-of-the-art text generation methods. The competitive results at the Language + Molecules @ ACL 2024 shared task — securing first place in molecule generation against specialized systems — demonstrate that a generalist multi-task model can match or exceed purpose-built approaches on well-defined benchmarks. A key limitation is model scale: the base-size T5 backbone (~252M parameters) is modest compared to larger biological language models, and performance on tasks requiring deep structural reasoning may be constrained by this capacity. The framework is nonetheless well-suited for resource-constrained settings and serves as a practical foundation for teams seeking a single model spanning molecular and textual biological domains.