An enhanced T5-based encoder-decoder that unifies molecule, protein, and text understanding via IUPAC integration and multi-task instruction tuning.
BioT5+ is a generalized biological language model that jointly processes molecules, proteins, and natural language text within a single encoder-decoder framework. Published in the Findings of ACL 2024, it extends the earlier BioT5 model (EMNLP 2023) by addressing two concrete limitations: an inability to interpret IUPAC chemical names — the systematic nomenclature used throughout the biochemical literature — and poor generalization across heterogeneous task types. The result is a model capable of seamlessly switching between classifying molecular properties, predicting numerical quantities such as binding affinities, and generating molecular structures or protein descriptions on demand.
The model was developed by Qizhi Pei, Lijun Wu, and colleagues at Microsoft Research Asia and Peking University, building on the broader trend of cross-modal biological foundation models that bridge the gap between molecular representations and human-readable scientific text. Where most protein language models treat sequence as the sole input modality and most cheminformatics models operate purely on structural representations, BioT5+ treats the problem as a unified text-to-text task, reformulating every biological question — regardless of input type — as a sequence generation problem.
BioT5+ achieved first place in the Text-based Molecule Generation track and second place in the Molecular Captioning track at the Language + Molecules @ ACL 2024 shared task competition, providing independent validation of its cross-modal capabilities against specialized competing systems.
BioT5+ is built on the T5-v1.1-base architecture, an encoder-decoder transformer with approximately 252 million parameters. The encoder ingests mixed-modality input sequences — combinations of natural language, SELFIES molecular representations, IUPAC names, and protein FASTA sequences — while the decoder autoregressively generates outputs in any of these formats. The vocabulary is extended beyond standard T5 to include SELFIES tokens and single-amino-acid protein tokens, with each namespace delimited by special prefix markers: protein sequences carry a <p> prefix tag to distinguish them from standard text, while molecules are expressed in the dedicated SELFIES token set. This explicit modality tagging allows the shared encoder-decoder to handle heterogeneous inputs without confusion between alphabetic protein sequences and standard English words.
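As a rough illustration of this modality tagging, the sketch below assembles a mixed-modality prompt in Python, converting a SMILES string to SELFIES with the `selfies` package. Apart from the <p> protein tag mentioned above, the marker strings and prompt wording are illustrative assumptions, not the exact conventions of the released model.

```python
import selfies as sf

# Hypothetical modality tags; the exact marker strings used by BioT5+ may differ.
MOL_TAG, PROT_TAG = "<m>", "<p>"

def wrap_molecule(smiles: str) -> str:
    """Convert a SMILES string to SELFIES and prefix it with the molecule tag."""
    return f"{MOL_TAG} {sf.encoder(smiles)}"

def wrap_protein(fasta_seq: str) -> str:
    """Prefix a raw amino-acid sequence with the protein tag so it is kept
    separate from ordinary English tokens in the shared vocabulary."""
    return f"{PROT_TAG} {fasta_seq}"

# Example mixed-modality prompt combining text, a molecule, and a protein.
prompt = (
    "Does the following compound bind the target protein? "
    f"Compound: {wrap_molecule('CC(=O)Oc1ccccc1C(=O)O')} "
    f"Target: {wrap_protein('MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ')}"
)
print(prompt)
```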
Pre-training is organized into eight tasks distributed across four categories: modality-specific T5 objectives (molecule SELFIES with paired IUPAC names, standalone molecule SELFIES, protein FASTA sequences, and general text), T5 objectives on wrapped biological text, T5 objectives on bioRxiv literature, and bidirectional translation tasks between molecule SELFIES and text as well as between protein FASTA and text. Training data is drawn from bioRxiv preprints (covering biological literature at scale), PubChem (providing molecule-text pairs with IUPAC annotations), and protein sequence databases. This multi-source, multi-objective pre-training strategy enables the model to learn consistent representations across modalities before downstream instruction tuning is applied.
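The T5 objectives in these categories all follow the standard span-corruption recipe: random spans are replaced by sentinel tokens in the input and reproduced after the matching sentinels in the target. The toy sketch below illustrates that recipe on a SELFIES token sequence; the span length and number of masked spans are placeholder values, not the hyperparameters reported for BioT5+.

```python
import random

def t5_span_corruption(tokens, num_spans=2, span_len=3, seed=0):
    """Toy T5 span-corruption: each masked span becomes a sentinel in the
    input and is reproduced after the matching sentinel in the target."""
    rng = random.Random(seed)
    # Pick non-overlapping, span-aligned start positions to mask.
    starts = sorted(rng.sample(range(0, len(tokens) - span_len, span_len), num_spans))
    inp, tgt, cursor = [], [], 0
    for i, start in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[cursor:start])            # keep tokens before the span
        inp.append(sentinel)                        # replace the span with a sentinel
        tgt.append(sentinel)
        tgt.extend(tokens[start:start + span_len])  # masked span goes to the target
        cursor = start + span_len
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(starts)}>")         # closing sentinel
    return " ".join(inp), " ".join(tgt)

# A short SELFIES token sequence stands in for any of the pre-training corpora.
selfies_tokens = "[C] [C] [=Branch1] [C] [=O] [O] [C] [=C] [C] [=C] [C] [=C]".split()
corrupted_input, target = t5_span_corruption(selfies_tokens)
print(corrupted_input)
print(target)
```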
Downstream fine-tuning reframes each task as a text-to-text problem with a natural language instruction prefix, consistent with the T5 text-to-text paradigm. This includes tasks such as drug-target interaction (DTI) prediction, molecular property prediction (e.g., BBBP, HIV, SIDER), molecule captioning (ChEBI-20), and text-guided molecule generation.
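Concretely, every task reduces to an (instruction + input, target) string pair; the model stays the same and only the prompt changes. The formulations below are hypothetical paraphrases meant to convey the shape of the data, not the exact prompt templates used by the authors.

```python
# Hypothetical instruction-prefixed task formulations for a text-to-text model.
examples = [
    {   # Classification (e.g. BBBP): the answer is generated as a word.
        "input": "Task: property prediction (BBBP). Does this molecule cross "
                 "the blood-brain barrier? Molecule: <m> [C][C][O] ...",
        "target": "Yes",
    },
    {   # Regression (e.g. binding affinity): the number is generated as text.
        "input": "Task: drug-target interaction. Predict the binding affinity. "
                 "Drug: <m> [C][=C][C] ... Target: <p> MKTAYIAK ...",
        "target": "7.2",
    },
    {   # Generation (ChEBI-20 style): a molecule is generated from a description.
        "input": "Task: text-based molecule generation. Description: "
                 "the simplest aromatic carboxylic acid.",
        "target": "<m> [C][=C][C][=C][C][=C] ...",
    },
]
```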
BioT5+ is designed for researchers working at the interface of cheminformatics, protein science, and natural language processing. Drug discovery teams can use the model to predict molecular properties from text descriptions, generate candidate molecules from textual specifications, or retrieve structured information about compounds referenced by IUPAC name in the literature. Computational biologists can apply the drug-target interaction capabilities to prioritize protein-ligand pairs before expensive wet-lab assays. The model's instruction-following design makes it particularly accessible for rapid prototyping: a single fine-tuned checkpoint can be queried with different instruction prefixes to address classification, regression, or generation tasks without modifying the model architecture or retraining from scratch. Fine-tuned checkpoints for specific tasks such as ChEBI-20 molecule captioning, DTI prediction on BIOSNAP, and the Mol-Instructions protein task suite are available on HuggingFace.
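Under the usual Hugging Face transformers API, querying a fine-tuned checkpoint looks roughly like the sketch below; the repository identifier and prompt template are assumptions and should be checked against the authors' model pages.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed checkpoint name; verify the exact identifier on the authors' HuggingFace page.
ckpt = "QizhiPei/biot5-plus-base"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt)

# Illustrative instruction-style prompt asking for a molecule caption.
prompt = (
    "Definition: Describe the input molecule. "
    "Input: <m> [C][C][O] Output:"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```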
BioT5+ represents a meaningful step toward unified biological foundation models that treat language, molecular structures, and protein sequences as interoperable modalities within a single framework. Its publication at ACL 2024 Findings signals growing recognition in the NLP community that biological sequences are a first-class language domain deserving of state-of-the-art text generation methods. The competitive results at the Language + Molecules @ ACL 2024 shared task — securing first place in molecule generation against specialized systems — demonstrate that a generalist multi-task model can match or exceed purpose-built approaches on well-defined benchmarks. A key limitation is model scale: the base-size T5 backbone (~252M parameters) is modest compared to larger biological language models, and performance on tasks requiring deep structural reasoning may be constrained by this capacity. The framework is nonetheless well-suited for resource-constrained settings and serves as a practical foundation for teams seeking a single model spanning molecular and textual biological domains.