Pre-training framework bridging molecules, proteins, and natural language using T5 with SELFIES representations for cross-modal biological understanding.
BioT5 is a pre-training framework designed to unify the representation of molecules, proteins, and natural language within a single encoder-decoder architecture. Developed by Qizhi Pei, Rui Yan, and collaborators at Renmin University of China and Microsoft Research, and published at EMNLP 2023, the model addresses a fundamental challenge in computational drug discovery: most prior approaches treat molecules, proteins, and biological text as separate modalities that cannot easily be combined or reasoned over jointly.
The central insight of BioT5 is that rich contextual associations between these modalities already exist in the scientific literature: papers, abstracts, and databases routinely pair molecule identifiers with natural language descriptions, and protein names with functional annotations. Rather than building a separate specialized model for each modality, BioT5 learns unified representations by training simultaneously on four categories of data: (1) single-modal sequences (molecules, proteins, text); (2) biologically annotated text in which molecular and protein entities are detected and wrapped in SELFIES or FASTA notation, as sketched below; (3) paired molecule-description and protein-description examples; and (4) general biomedical text from PubMed Central and bioRxiv.
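To make the wrapped-text idea concrete, the following is a minimal sketch of entity wrapping using the selfies package. The NER output shape, the toy protein fragment, and the <bom>/<eom> and <bop>/<eop> marker tokens are simplifying assumptions for illustration; the actual pipeline (BERN2 tagging plus database lookup, described below) is more involved.

```python
# Illustrative sketch of entity-wrapped text construction. The NER output
# format and the <bom>/<eom>, <bop>/<eop> marker tokens are assumptions
# made for demonstration purposes only.
import selfies as sf

# Hypothetical NER hits: (surface form, entity type, structure string)
ner_hits = [
    ("aspirin", "molecule", "CC(=O)OC1=CC=CC=C1C(=O)O"),  # SMILES from lookup
    ("P12345", "protein", "MKTAYIAKQR"),                  # toy FASTA fragment
]

sentence = "Compound aspirin was assayed against target P12345."

for surface, ent_type, structure in ner_hits:
    if ent_type == "molecule":
        # SMILES -> SELFIES; every SELFIES string is valid by construction.
        wrapped = f"<bom>{sf.encoder(structure)}<eom>"
    else:
        wrapped = f"<bop>{structure}<eop>"  # FASTA residues kept as-is
    # Keep the textual mention and append the structural notation after it.
    sentence = sentence.replace(surface, f"{surface} {wrapped}")

print(sentence)
```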
A follow-up model, BioT5+ (ACL 2024 Findings, arXiv:2402.17810), extends the original with IUPAC name integration, numerical tokenization, and multi-task instruction tuning across 15 task types and 21 benchmark datasets, further improving generalization and achieving top rankings at the Language + Molecules @ ACL 2024 shared task.
BioT5 is built on the T5 encoder-decoder transformer architecture with approximately 220 million parameters in the base configuration. The tokenizer is extended to handle SELFIES strings and protein FASTA sequences alongside natural language, so a single vocabulary represents all three modalities. Pre-training uses four data streams: molecule SELFIES from PubChem and ZINC20; protein sequences from UniRef50; general English text from the Colossal Clean Crawled Corpus (C4); and biomedical text from PubMed Central full-text articles and bioRxiv preprints. Entity-aware wrapped text is constructed by running the BERN2 named-entity recognition system over PubMed abstracts to identify molecular and protein mentions and substitute them with their SELFIES or FASTA representations, respectively.
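A minimal sketch of the shared vocabulary in practice, assuming the publicly released base checkpoint (the HuggingFace repo id QizhiPei/biot5-base and its tokenizer behavior are taken from the public release and should be treated as assumptions, not guarantees):

```python
# Minimal sketch, assuming the public checkpoint id "QizhiPei/biot5-base"
# and a sentencepiece tokenizer extended with SELFIES symbol tokens.
import selfies as sf
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("QizhiPei/biot5-base", model_max_length=512)

molecule = sf.encoder("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin SMILES -> SELFIES
text = "aspirin irreversibly inhibits cyclooxygenase"

# The same tokenizer handles molecular notation and English text; wrapping
# the SELFIES in <bom>/<eom> mirrors the convention assumed above.
print(tokenizer.tokenize(f"<bom>{molecule}<eom>")[:10])
print(tokenizer.tokenize(text)[:10])
```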
Pre-training applies three objectives: (1) the standard T5 span-corruption objective applied independently to each modality; (2) the same objective applied to entity-enriched biological text; and (3) a cross-modal translation objective on structured molecule-text and protein-text pairs. On downstream benchmarks, BioT5 outperforms strong baselines on drug-target interaction prediction (BioSNAP, BindingDB) as measured by AUROC and AUPRC, and achieves substantially higher BLEU and exact-match scores than prior models on the ChEBI-20 molecule captioning and text-to-molecule benchmarks, with the SELFIES representation enabling 100% valid molecule outputs by construction.
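The validity-by-construction property is straightforward to demonstrate: any sequence over the SELFIES alphabet decodes to a syntactically valid molecule, so a decoder whose output vocabulary is restricted to that alphabet cannot emit an unparseable structure. A small sketch with the selfies package:

```python
# Demonstration of SELFIES validity-by-construction.
import selfies as sf

# An arbitrary, hallucinated-looking token sequence...
garbled = "[C][=C][Branch1][Ring2][O][C][N]"

# ...still decodes to a well-formed SMILES string. Out-of-alphabet symbols
# would raise selfies.DecoderError, but a model whose output vocabulary is
# the SELFIES alphabet can never produce one.
print(sf.decoder(garbled))
```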
BioT5 is applicable across the early stages of the drug discovery pipeline. Medicinal chemists and computational biologists can use the molecule captioning capability to generate natural language descriptions of novel compounds, aiding in understanding structure-activity relationships. The text-to-molecule generation mode enables retrieval or design of candidate molecules from free-text descriptions of desired properties. For target engagement studies, the drug-target interaction prediction fine-tuned variant (available separately on HuggingFace) can screen compound-protein pairs. Researchers working on protein function annotation can leverage the protein-text alignment for property prediction and functional similarity tasks. The model's unified vocabulary also makes it well-suited for multi-step workflows that require reasoning across molecular structures and textual knowledge simultaneously.
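As an end-to-end illustration of the text-to-molecule mode, the sketch below prompts a fine-tuned checkpoint and decodes the generated SELFIES back to SMILES. The repo id QizhiPei/biot5-base-text2mol and the instruction-style prompt are assumptions paraphrased from the public fine-tuned releases; consult the released model cards for the exact format before relying on it.

```python
# Hedged sketch of text-to-molecule generation. Checkpoint id and prompt
# format are assumptions based on the public releases; verify against the
# model card before use.
import selfies as sf
from transformers import T5Tokenizer, T5ForConditionalGeneration

name = "QizhiPei/biot5-base-text2mol"  # assumed fine-tuned checkpoint id
tokenizer = T5Tokenizer.from_pretrained(name, model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained(name)

prompt = (
    "Definition: You are given a molecule description in English. "
    "Your job is to generate the molecule SELFIES that fits the description.\n\n"
    "Now complete the following example -\n"
    "Input: The molecule is a monocarboxylic acid found in vinegar.\n"
    "Output: "
)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids=input_ids, max_length=512, num_beams=5)

# Generated SELFIES tokens are assumed space-separated; strip spaces, then
# decode to SMILES, which is guaranteed parseable by construction.
selfies_out = tokenizer.decode(outputs[0], skip_special_tokens=True).replace(" ", "")
print(sf.decoder(selfies_out))
```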
BioT5 established an early strong baseline for unified cross-modal representation learning in biology, demonstrating that a single pre-trained model could compete with specialized architectures across structurally diverse tasks. Its adoption of SELFIES over SMILES influenced subsequent work on generative molecular models to prioritize validity-by-construction representations. The follow-up BioT5+ model, published in ACL 2024 Findings, extended the framework substantially and achieved first place in the text-based molecule generation track at the Language + Molecules @ ACL 2024 competition. The codebase (MIT-licensed) and multiple fine-tuned model checkpoints are publicly available on GitHub and HuggingFace, supporting reuse and fine-tuning by the wider community. A notable limitation of the base model is that it does not explicitly model three-dimensional molecular geometry, which restricts its accuracy for tasks where shape and binding conformation are critical; structure-aware extensions remain an open direction.