IBM Research
Multi-modal, multi-task biological foundation model trained on 2 billion samples spanning proteins, small molecules, and single-cell gene expression.
MAMMAL (Molecular Aligned Multi-Modal Architecture and Language) is a 458-million-parameter biological foundation model developed by IBM Research and released as a preprint in October 2024. It directly confronts a structural limitation that has constrained the computational drug discovery field: the proliferation of narrow, task-specific models that excel on individual prediction problems but cannot share information across biological modalities or generalize to the diverse multi-step reasoning that drug discovery pipelines actually require. MAMMAL instead implements a single unified architecture that simultaneously handles proteins, small molecules, and single-cell gene expression data within one model, one tokenizer, and one training procedure.
The central technical contribution of MAMMAL is a structured prompt syntax that allows any combination of biological inputs — protein sequences, SMILES strings, expression vectors, or scalar-valued metadata — to be composed into a single tokenized input regardless of the downstream task type. This design enables the model to perform classification, regression, and generation tasks within the same architecture by switching between encoder-only and encoder-decoder computation modes. Crucially, scalar numerical values (such as binding affinities, IC50 values, or expression levels) are embedded as learned projections directly into the model's token space rather than being discretized or appended as fixed vocabulary items, preserving numerical precision across orders of magnitude. Pretraining on two billion samples sourced from six large biological databases across seven distinct task types enables the model to build cross-domain representations that capture relationships between molecular, protein, and transcriptomic spaces.
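To make the prompt idea concrete, the sketch below composes a mixed-modality input string. The tag names (`<PROTEIN>`, `<SMILES>`, `<SCALAR>`) are hypothetical stand-ins for illustration, not MAMMAL's actual special tokens, and the real model applies modality-specific tokenizers to each segment.

```python
def compose_prompt(protein=None, smiles=None, scalars=None):
    """Compose a single mixed-modality prompt string.

    Tag names here are illustrative placeholders, not MAMMAL's real
    token vocabulary.
    """
    parts = []
    if protein:
        parts.append(f"<PROTEIN>{protein}</PROTEIN>")
    if smiles:
        parts.append(f"<SMILES>{smiles}</SMILES>")
    for name, value in (scalars or {}).items():
        # Scalar values stay continuous; in the model they are embedded
        # via a learned projection rather than tokenized as text.
        parts.append(f"<SCALAR name={name}>{value}</SCALAR>")
    return "".join(parts)

# A drug-target interaction style prompt: protein + compound + a scalar label.
prompt = compose_prompt(
    protein="MKTAYIAK",          # toy amino acid sequence
    smiles="CCO",                # ethanol, as a toy compound
    scalars={"ic50_nM": 12.5},   # hypothetical measured activity
)
```

Because every task is expressed this way, switching from, say, binding-affinity regression to antibody infilling changes only the prompt contents, not the architecture.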
Evaluated on eleven diverse downstream tasks spanning the drug discovery pipeline, MAMMAL achieves state-of-the-art results on nine and comparable performance on the remaining two — all within a single model checkpoint rather than a collection of specialized models. This breadth of coverage, combined with publicly available weights and fine-tuning code, makes MAMMAL a practical foundation for multi-task biological prediction.
MAMMAL is implemented as a hybrid transformer with 458 million parameters, operating in either encoder-only or encoder-decoder mode depending on the task. The architecture builds on a T5-style encoder-decoder backbone that has been extended with a modular tokenization system accommodating heterogeneous biological inputs. Protein sequences use amino acid character tokenization, small molecules use SMILES character tokenization with special handling for ring closures, and single-cell expression profiles are represented as sets of gene-value pairs with learned per-gene embeddings. Numerical scalar values are projected into the token embedding dimension via a two-layer MLP, allowing continuous quantities to be processed natively without vocabulary expansion.
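The scalar-projection idea can be sketched as follows: a two-layer MLP maps one continuous value to a single vector in the token embedding space. The hidden width (64) and embedding width (768) are assumptions for illustration, and the random weights stand in for parameters that MAMMAL learns during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 768, 64  # hypothetical widths, not MAMMAL's actual sizes

# Random stand-ins for the learned two-layer MLP parameters.
W1, b1 = rng.standard_normal((d_hidden, 1)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_model, d_hidden)), np.zeros(d_model)

def embed_scalar(x: float) -> np.ndarray:
    """Project one continuous scalar into the token embedding space."""
    h = np.maximum(W1 @ np.array([x]) + b1, 0.0)  # ReLU hidden layer
    return W2 @ h + b2  # one d_model-dimensional "token" embedding

# An IC50 of 12.5 nM becomes one embedding vector, inserted into the
# input sequence alongside ordinary token embeddings.
tok = embed_scalar(12.5)
```

Because the scalar enters as a continuous input to the MLP rather than as a binned vocabulary item, values spanning many orders of magnitude map to distinct embeddings without any loss from discretization.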
Pretraining was conducted on two billion samples drawn from six datasets: protein sequences from UniProt, small molecule bioactivity data from ChEMBL and PubChem, antibody-antigen binding data, drug-target interaction pairs, and single-cell expression data from CELLxGENE. Seven distinct pretraining objectives were used, including within-modality masked reconstruction, cross-domain alignment tasks that map molecular representations to associated protein activity data, and scalar regression objectives. Key reported results on the eleven downstream benchmarks include:

- Cell type annotation: F1 of 0.763 (a 7.5% improvement over the prior best)
- Drug toxicity prediction (ClinTox): AUROC of 0.986 (a 4.0% improvement)
- Antibody CDR-H3 infilling: amino acid recovery of 0.446 (a 19% improvement)
- Protein-protein interaction ΔΔG: Pearson correlation of 0.852 (a 28.5% improvement)
- Drug-target interaction: NRMSE of 0.906

On antibody-antigen and nanobody-antigen binding classification benchmarks, MAMMAL significantly outperformed AlphaFold 3 on three of the four tested targets.
MAMMAL is suited for computational teams working across the drug discovery pipeline who need a single versatile model rather than a portfolio of specialized tools. In target identification, the cell type annotation and gene expression capabilities can help characterize disease-relevant cell populations from single-cell datasets. In hit identification and lead optimization, the drug-target interaction and molecular property prediction capabilities support virtual screening and ADMET profiling. The antibody CDR design and protein-protein interaction affinity modules are directly applicable to biologics development workflows. Because MAMMAL accepts mixed-modality prompts, it is particularly well-suited to tasks that require joint reasoning over molecular structure and biological context — for example, predicting a compound's activity in a specific cell type rather than in a generic biochemical assay. Researchers can fine-tune the publicly available checkpoint on custom labeled datasets with relatively modest compute given the 458M parameter scale.
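One common low-compute adaptation pattern is to freeze the pretrained backbone and train only a small task head on pooled encoder embeddings. The sketch below illustrates that pattern with synthetic stand-in embeddings and a logistic-regression head; it does not use MAMMAL's actual API, and in practice the feature matrix would come from the released checkpoint.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n = 768, 200  # hypothetical embedding width and dataset size

# Stand-ins for pooled encoder embeddings of a labeled custom dataset.
X = rng.standard_normal((n, d_model))
w_true = rng.standard_normal(d_model)
y = (X @ w_true > 0).astype(float)  # synthetic binary labels

# Train a logistic-regression task head on the frozen features.
w = np.zeros(d_model)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / n         # gradient step on cross-entropy

acc = ((X @ w > 0) == (y == 1)).mean()   # training accuracy of the head
```

Full fine-tuning of all 458M parameters remains feasible on a single modern GPU, but a frozen-backbone head like this is often a reasonable first baseline when labeled data is scarce.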
MAMMAL represents a significant step toward unified biological foundation models that can span the full scope of drug discovery reasoning rather than excelling only on isolated subtasks. By achieving state-of-the-art results on nine of eleven diverse downstream benchmarks from a single pretrained checkpoint without task-specific architecture modifications, the model challenges the assumption that specialization is necessary for top performance. IBM Research's decision to release model weights, training code, and multiple fine-tuned task-specific checkpoints on HuggingFace and GitHub has enabled rapid community adoption and benchmarking. MAMMAL is part of IBM Research's broader BioMedical Foundation Models (BMFM) program alongside MoLFormer, BioMed Multi-View, and BioMed Multi-Omic, reflecting a sustained institutional effort to build general-purpose biological AI. A current limitation is that MAMMAL does not model 3D molecular geometry explicitly; extension with structural encoders or diffusion-based 3D generation remains an open direction.