IBM Research
Multi-modal, multi-task biological foundation model trained on 2 billion samples spanning proteins, small molecules, and single-cell gene expression.
MAMMAL (Molecular Aligned Multi-Modal Architecture and Language) is a 458-million-parameter biological foundation model developed by IBM Research and released as a preprint in October 2024. It directly confronts a structural limitation that has constrained the computational drug discovery field: the proliferation of narrow, task-specific models that excel on individual prediction problems but cannot share information across biological modalities or generalize to the diverse multi-step reasoning that drug discovery pipelines actually require. MAMMAL instead implements a single unified architecture that simultaneously handles proteins, small molecules, and single-cell gene expression data within one model, one tokenizer, and one training procedure.
The central technical contribution of MAMMAL is a structured prompt syntax that allows any combination of biological inputs — protein sequences, SMILES strings, expression vectors, or scalar-valued metadata — to be composed into a single tokenized input regardless of the downstream task type. This design enables the model to perform classification, regression, and generation tasks within the same architecture by switching between encoder-only and encoder-decoder computation modes. Crucially, scalar numerical values (such as binding affinities, IC50 values, or expression levels) are embedded as learned projections directly into the model's token space rather than being discretized or appended as fixed vocabulary items, preserving numerical precision across orders of magnitude. Pretraining on two billion samples sourced from six large biological databases across seven distinct task types enables the model to build cross-domain representations that capture relationships between molecular, protein, and transcriptomic spaces.
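To make the prompt idea concrete, the sketch below composes a mixed-modality input string. The tag names (`<PROTEIN>`, `<SMILES>`, `<SCALAR>`) are hypothetical stand-ins for illustration, not MAMMAL's actual special tokens, and the real model applies modality-specific tokenizers to each segment.

```python
def compose_prompt(protein=None, smiles=None, scalars=None):
    """Compose a single mixed-modality prompt string.

    Tag names here are illustrative placeholders, not MAMMAL's real
    token vocabulary.
    """
    parts = []
    if protein:
        parts.append(f"<PROTEIN>{protein}</PROTEIN>")
    if smiles:
        parts.append(f"<SMILES>{smiles}</SMILES>")
    for name, value in (scalars or {}).items():
        # Scalar values stay continuous; in the model they are embedded
        # via a learned projection rather than tokenized as text.
        parts.append(f"<SCALAR name={name}>{value}</SCALAR>")
    return "".join(parts)

# A drug-target interaction style prompt: protein + compound + a scalar label.
prompt = compose_prompt(
    protein="MKTAYIAK",          # toy amino acid sequence
    smiles="CCO",                # ethanol, as a toy compound
    scalars={"ic50_nM": 12.5},   # hypothetical measured activity
)
```

Because every task is expressed this way, switching from, say, binding-affinity regression to antibody infilling changes only the prompt contents, not the architecture.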
Evaluated on eleven diverse downstream tasks spanning the drug discovery pipeline, MAMMAL achieves state-of-the-art results on nine and comparable performance on the remaining two — all within a single model checkpoint rather than a collection of specialized models. This breadth of coverage, combined with publicly available weights and fine-tuning code, makes MAMMAL a practical foundation for multi-task biological prediction.
MAMMAL is implemented as a hybrid transformer with 458 million parameters, operating in either encoder-only or encoder-decoder mode depending on the task. The architecture builds on a T5-style encoder-decoder backbone that has been extended with a modular tokenization system accommodating heterogeneous biological inputs. Protein sequences use amino acid character tokenization, small molecules use SMILES character tokenization with special handling for ring closures, and single-cell expression profiles are represented as sets of gene-value pairs with learned per-gene embeddings. Numerical scalar values are projected into the token embedding dimension via a two-layer MLP, allowing continuous quantities to be processed natively without vocabulary expansion.
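The scalar-projection idea can be sketched as follows: a two-layer MLP maps one continuous value to a single vector in the token embedding space. The hidden width (64) and embedding width (768) are assumptions for illustration, and the random weights stand in for parameters that MAMMAL learns during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 768, 64  # hypothetical widths, not MAMMAL's actual sizes

# Random stand-ins for the learned two-layer MLP parameters.
W1, b1 = rng.standard_normal((d_hidden, 1)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_model, d_hidden)), np.zeros(d_model)

def embed_scalar(x: float) -> np.ndarray:
    """Project one continuous scalar into the token embedding space."""
    h = np.maximum(W1 @ np.array([x]) + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2  # one d_model-dimensional "token" embedding

# An IC50 of 12.5 nM becomes one embedding vector, inserted into the
# input sequence alongside ordinary token embeddings.
tok = embed_scalar(12.5)
```

Because the scalar enters as a continuous input to the MLP rather than as a binned vocabulary item, values spanning many orders of magnitude map to distinct embeddings without any loss from discretization.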
Pretraining was conducted on two billion samples drawn from six datasets: protein sequences from UniProt, small molecule bioactivity data from ChEMBL and PubChem, antibody-antigen binding data, drug-target interaction pairs, and single-cell expression data from CELLxGENE. Seven distinct pretraining objectives were used, including within-modality masked reconstruction, cross-domain alignment tasks that map molecular representations to associated protein activity data, and scalar regression objectives. Key reported results on the eleven downstream benchmarks include:

- Cell type annotation: F1 of 0.763 (a 7.5% improvement over the prior best)
- Drug toxicity prediction (ClinTox): AUROC of 0.986 (a 4.0% improvement)
- Antibody CDR-H3 infilling: amino acid recovery of 0.446 (a 19% improvement)
- Protein-protein interaction ΔΔG: Pearson correlation of 0.852 (a 28.5% improvement)
- Drug-target interaction: NRMSE of 0.906

On antibody-antigen and nanobody-antigen binding classification benchmarks, MAMMAL significantly outperformed AlphaFold 3 on three of the four tested targets.
MAMMAL is suited for computational teams working across the drug discovery pipeline who need a single versatile model rather than a portfolio of specialized tools. In target identification, the cell type annotation and gene expression capabilities can help characterize disease-relevant cell populations from single-cell datasets. In hit identification and lead optimization, the drug-target interaction and molecular property prediction capabilities support virtual screening and ADMET profiling. The antibody CDR design and protein-protein interaction affinity modules are directly applicable to biologics development workflows. Because MAMMAL accepts mixed-modality prompts, it is particularly well-suited to tasks that require joint reasoning over molecular structure and biological context — for example, predicting a compound's activity in a specific cell type rather than in a generic biochemical assay. Researchers can fine-tune the publicly available checkpoint on custom labeled datasets with relatively modest compute given the 458M parameter scale.
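One common low-compute adaptation pattern is to freeze the pretrained backbone and train only a small task head on pooled encoder embeddings. The sketch below illustrates that pattern with synthetic stand-in embeddings and a logistic-regression head; it does not use MAMMAL's actual API, and in practice the feature matrix would come from the released checkpoint.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n = 768, 200  # hypothetical embedding width and dataset size

# Stand-ins for pooled encoder embeddings of a labeled custom dataset.
X = rng.standard_normal((n, d_model))
w_true = rng.standard_normal(d_model)
y = (X @ w_true > 0).astype(float)  # synthetic binary labels

# Train a logistic-regression task head on the frozen features.
w = np.zeros(d_model)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / n         # gradient step on cross-entropy

acc = ((X @ w > 0) == (y == 1)).mean()   # training accuracy of the head
```

Full fine-tuning of all 458M parameters remains feasible on a single modern GPU, but a frozen-backbone head like this is often a reasonable first baseline when labeled data is scarce.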
MAMMAL represents a significant step toward unified biological foundation models that can span the full scope of drug discovery reasoning rather than excelling only on isolated subtasks. By achieving state-of-the-art results on nine of eleven diverse downstream benchmarks from a single pretrained checkpoint without task-specific architecture modifications, the model challenges the assumption that specialization is necessary for top performance. IBM Research's decision to release model weights, training code, and multiple fine-tuned task-specific checkpoints on HuggingFace and GitHub has enabled rapid community adoption and benchmarking. MAMMAL is part of IBM Research's broader BioMedical Foundation Models (BMFM) program alongside MoLFormer, BioMed Multi-View, and BioMed Multi-Omic, reflecting a sustained institutional effort to build general-purpose biological AI. A current limitation is that MAMMAL does not model 3D molecular geometry explicitly; extension with structural encoders or diffusion-based 3D generation remains an open direction.