Pre-training framework bridging molecules, proteins, and natural language using T5 with SELFIES representations for cross-modal biological understanding.
BioT5 is a pre-training framework designed to unify the representation of molecules, proteins, and natural language within a single encoder-decoder architecture. Developed by Qizhi Pei, Rui Yan, and collaborators at Renmin University of China and Microsoft Research, and published at EMNLP 2023, the model addresses a fundamental challenge in computational drug discovery: most prior approaches treat molecules, proteins, and biological text as separate modalities that cannot easily be combined or reasoned over jointly.
The central insight of BioT5 is that rich contextual associations between these modalities already exist in the scientific literature: papers, abstracts, and databases routinely pair molecule identifiers with natural language descriptions, and protein names with functional annotations. Rather than building a separate specialized model for each modality, BioT5 learns unified representations by training simultaneously on four categories of data: (1) single-modal sequences (molecules, proteins, text); (2) biologically annotated text in which molecular and protein entities are detected and wrapped in SELFIES or FASTA notation, as sketched below; (3) paired molecule-description and protein-description examples; and (4) general biomedical text from PubMed Central and bioRxiv.
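To make the wrapped-text idea concrete, the following is a minimal sketch of entity wrapping using the selfies package. The NER output shape, the toy protein fragment, and the <bom>/<eom> and <bop>/<eop> marker tokens are simplifying assumptions for illustration; the actual pipeline (BERN2 tagging plus database lookup, described below) is more involved.

```python
# Illustrative sketch of entity-wrapped text construction. The NER output
# format and the <bom>/<eom>, <bop>/<eop> marker tokens are assumptions
# made for demonstration purposes only.
import selfies as sf

# Hypothetical NER hits: (surface form, entity type, structure string)
ner_hits = [
    ("aspirin", "molecule", "CC(=O)OC1=CC=CC=C1C(=O)O"),  # SMILES from lookup
    ("P12345", "protein", "MKTAYIAKQR"),                  # toy FASTA fragment
]

sentence = "Compound aspirin was assayed against target P12345."

for surface, ent_type, structure in ner_hits:
    if ent_type == "molecule":
        # SMILES -> SELFIES; every SELFIES string is valid by construction.
        wrapped = f"<bom>{sf.encoder(structure)}<eom>"
    else:
        wrapped = f"<bop>{structure}<eop>"  # FASTA residues kept as-is
    # Keep the textual mention and append the structural notation after it.
    sentence = sentence.replace(surface, f"{surface} {wrapped}")

print(sentence)
```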
A follow-up model, BioT5+ (ACL 2024 Findings, arXiv:2402.17810), extends the original with IUPAC name integration, numerical tokenization, and multi-task instruction tuning across 15 task types and 21 benchmark datasets, further improving generalization and achieving top rankings at the Language + Molecules @ ACL 2024 shared task.
BioT5 is built on the T5 encoder-decoder transformer architecture with approximately 220 million parameters in the base configuration. The tokenizer is extended to handle SELFIES strings and protein FASTA sequences alongside natural language, so a single vocabulary represents all three modalities. Pre-training uses four data streams: molecule SELFIES from PubChem and ZINC20; protein sequences from UniRef50; general English text from the Colossal Clean Crawled Corpus (C4); and biomedical text from PubMed Central full-text articles and bioRxiv preprints. Entity-aware wrapped text is constructed by running the BERN2 named-entity recognition system over PubMed abstracts to identify molecular and protein mentions and substitute them with their SELFIES or FASTA representations, respectively.
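A minimal sketch of the shared vocabulary in practice, assuming the publicly released base checkpoint (the HuggingFace repo id QizhiPei/biot5-base and its tokenizer behavior are taken from the public release and should be treated as assumptions, not guarantees):

```python
# Minimal sketch, assuming the public checkpoint id "QizhiPei/biot5-base"
# and a sentencepiece tokenizer extended with SELFIES symbol tokens.
import selfies as sf
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("QizhiPei/biot5-base", model_max_length=512)

molecule = sf.encoder("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin SMILES -> SELFIES
text = "aspirin irreversibly inhibits cyclooxygenase"

# The same tokenizer handles molecular notation and English text; wrapping
# the SELFIES in <bom>/<eom> mirrors the convention assumed above.
print(tokenizer.tokenize(f"<bom>{molecule}<eom>")[:10])
print(tokenizer.tokenize(text)[:10])
```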
Pre-training applies three objectives: (1) the standard T5 span-corruption objective applied independently to each modality; (2) the same objective applied to entity-enriched biological text; and (3) a cross-modal translation objective on structured molecule-text and protein-text pairs. On downstream benchmarks, BioT5 outperforms strong baselines on drug-target interaction prediction (BioSNAP, BindingDB) as measured by AUROC and AUPRC, and achieves substantially higher BLEU and exact-match scores than prior models on the ChEBI-20 molecule captioning and text-to-molecule benchmarks, with the SELFIES representation enabling 100% valid molecule outputs by construction.
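The validity-by-construction property is straightforward to demonstrate: any sequence over the SELFIES alphabet decodes to a syntactically valid molecule, so a decoder whose output vocabulary is restricted to that alphabet cannot emit an unparseable structure. A small sketch with the selfies package:

```python
# Demonstration of SELFIES validity-by-construction.
import selfies as sf

# An arbitrary, hallucinated-looking token sequence...
garbled = "[C][=C][Branch1][Ring2][O][C][N]"

# ...still decodes to a well-formed SMILES string. Out-of-alphabet symbols
# would raise selfies.DecoderError, but a model whose output vocabulary is
# the SELFIES alphabet can never produce one.
print(sf.decoder(garbled))
```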
BioT5 is applicable across the early stages of the drug discovery pipeline. Medicinal chemists and computational biologists can use the molecule captioning capability to generate natural language descriptions of novel compounds, aiding in understanding structure-activity relationships. The text-to-molecule generation mode enables retrieval or design of candidate molecules from free-text descriptions of desired properties. For target engagement studies, the drug-target interaction prediction fine-tuned variant (available separately on HuggingFace) can screen compound-protein pairs. Researchers working on protein function annotation can leverage the protein-text alignment for property prediction and functional similarity tasks. The model's unified vocabulary also makes it well-suited for multi-step workflows that require reasoning across molecular structures and textual knowledge simultaneously.
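As an end-to-end illustration of the text-to-molecule mode, the sketch below prompts a fine-tuned checkpoint and decodes the generated SELFIES back to SMILES. The repo id QizhiPei/biot5-base-text2mol and the instruction-style prompt are assumptions paraphrased from the public fine-tuned releases; consult the released model cards for the exact format before relying on it.

```python
# Hedged sketch of text-to-molecule generation. Checkpoint id and prompt
# format are assumptions based on the public releases; verify against the
# model card before use.
import selfies as sf
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "QizhiPei/biot5-base-text2mol"  # assumed fine-tuned checkpoint id
tokenizer = T5Tokenizer.from_pretrained(name, model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained(name)

prompt = (
    "Definition: You are given a molecule description in English. "
    "Your job is to generate the molecule SELFIES that fits the description.\n\n"
    "Now complete the following example -\n"
    "Input: The molecule is a monocarboxylic acid found in vinegar.\n"
    "Output: "
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids=input_ids, max_length=512, num_beams=5)

# Generated SELFIES tokens are assumed space-separated; strip spaces, then
# decode to SMILES, which is guaranteed parseable by construction.
selfies_out = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(sf.decoder(selfies_out))
```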
BioT5 established an early strong baseline for unified cross-modal representation learning in biology, demonstrating that a single pre-trained model could compete with specialized architectures across structurally diverse tasks. Its adoption of SELFIES over SMILES influenced subsequent work on generative molecular models to prioritize validity-by-construction representations. The follow-up BioT5+ model, published in ACL 2024 Findings, extended the framework substantially and achieved first place in the text-based molecule generation track at the Language + Molecules @ ACL 2024 competition. The codebase (MIT-licensed) and multiple fine-tuned model checkpoints are publicly available on GitHub and HuggingFace, supporting reuse and fine-tuning by the wider community. A notable limitation of the base model is that it does not explicitly model three-dimensional molecular geometry, which restricts its accuracy for tasks where shape and binding conformation are critical; structure-aware extensions remain an open direction.