National Institute of Environmental Health Sciences
A transformer foundation model that encodes Tox21 high-throughput concentration-response curves and assay metadata into reusable 768-dimensional representations.
High-throughput screening programs such as Tox21 generate enormous volumes of quantitative concentration-response data, but those measurements are typically analyzed one curve or one assay at a time, with hand-engineered summaries such as activity calls and potency estimates (AC50). This makes it difficult to compare, organize, and interpret toxicological signals across the hundreds of biological endpoints and thousands of compounds that the program covers. Tox21mer addresses this gap by learning a single, reusable representation of concentration-response curves directly from the data, rather than relying on per-assay curve-fitting heuristics.
Released in June 2026 by researchers at the National Institute of Environmental Health Sciences (NIEHS), part of the U.S. National Institutes of Health, Tox21mer is a transformer pretrained on the Tox21 quantitative high-throughput screening (qHTS) corpus. Each concentration-response curve is encoded together with its assay metadata into a 768-dimensional embedding, producing a common feature space in which curves from different protocols, compounds, and targets can be directly compared.
Conceptually, Tox21mer is to toxicological screening data what protein and genomic language models are to sequence data: a foundation representation learned by self-supervision that downstream tasks can build on. Rather than training a bespoke model for each endpoint, researchers can extract frozen Tox21mer embeddings and attach lightweight probes for classification or regression tasks.
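The frozen-embedding-plus-probe workflow can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the embeddings are synthetic stand-ins (two Gaussian clusters, 16-dimensional rather than 768 so that pure Python stays fast), and `make_synthetic_embeddings`, `train_linear_probe`, and `predict` are hypothetical helper names. The point it shows is that only the probe's weights are trained, while the embeddings stay fixed.

```python
import math
import random


def make_synthetic_embeddings(n, dim, seed=0):
    """Stand-in for frozen embeddings: one Gaussian cluster per class (assumption)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(center, 1.0) for _ in range(dim)]
        data.append((vec, label))
    rng.shuffle(data)
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(data, dim, lr=0.1, epochs=50):
    """Fit logistic regression with SGD; the embeddings themselves are never updated."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Return 1 (active) when the logit is positive, else 0 (inactive)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


DIM = 16  # toy dimension; the real embeddings are described as 768-dimensional
train = make_synthetic_embeddings(200, DIM, seed=0)
test = make_synthetic_embeddings(100, DIM, seed=1)
w, b = train_linear_probe(train, DIM)
accuracy = sum(predict(w, b, x) == y for x, y in test) / len(test)
print(f"probe accuracy on synthetic embeddings: {accuracy:.2f}")
```

With real embeddings, the same pattern would apply to any endpoint: swap in the extracted vectors and labels, and keep the probe small so that performance reflects the representation rather than the head.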
Tox21mer is a 43.5M-parameter transformer pretrained on roughly 2.5 million concentration-response curves drawn from 102 Tox21 assay protocols spanning 6,727 compounds. Each curve is tokenized into a learnable [CLS] token and per-concentration tokens that fuse assay metadata, log-concentration, and normalized response; the [CLS] embedding provides the 768-dimensional curve representation. Pretraining uses masked-response reconstruction as the main objective, supplemented by low-weight auxiliary supervision on assay outcome and AC50. Evaluated with linear/probe heads on frozen embeddings, the model reports a macro-F1 of 0.985 on three-class outcome prediction, a binary F1 of 0.994 for active/inactive classification, and an R² of 0.87 for predicting log10(AC50), indicating that the pretrained representation captures both qualitative activity and quantitative potency. The preprint is released under a CC0 license; at the time of writing no public code repository or pretrained model weights have been identified.
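To make the input construction and the masked-response objective concrete, here is a small sketch. Everything specific in it is an assumption for illustration: the `ConcToken` fields, the integer `assay_id` standing in for assay metadata, the 40% mask rate, the mean-squared-error loss, and the `baseline_reconstruct` placeholder that stands where the transformer would be. The preprint's actual fusion scheme, mask rate, and loss are not reproduced here.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConcToken:
    assay_id: int              # assumed integer code standing in for assay metadata
    log_conc: float            # log10 of the tested concentration
    response: Optional[float]  # normalized response; None when masked


def tokenize_curve(assay_id, concs, responses):
    """One token per concentration; a real model would also prepend a learnable [CLS]."""
    return [ConcToken(assay_id, math.log10(c), r) for c, r in zip(concs, responses)]


def mask_responses(tokens, mask_rate, rng):
    """Hide a random subset of responses; return the masked tokens and the hidden targets."""
    n_mask = max(1, round(mask_rate * len(tokens)))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    targets = {i: tokens[i].response for i in masked_idx}
    masked = [
        ConcToken(t.assay_id, t.log_conc, None if i in targets else t.response)
        for i, t in enumerate(tokens)
    ]
    return masked, targets


def baseline_reconstruct(masked, i):
    """Placeholder for the transformer: predict the mean of the visible responses."""
    visible = [t.response for t in masked if t.response is not None]
    return sum(visible) / len(visible)


def masked_mse(masked, targets, predictor):
    """Mean squared error computed only over the masked positions."""
    errors = [(predictor(masked, i) - y) ** 2 for i, y in targets.items()]
    return sum(errors) / len(errors)


concs = [0.01, 0.1, 1.0, 10.0, 100.0]     # illustrative concentrations (micromolar)
responses = [0.0, 0.05, 0.30, 0.80, 1.0]  # illustrative normalized responses
tokens = tokenize_curve(assay_id=7, concs=concs, responses=responses)
masked, targets = mask_responses(tokens, mask_rate=0.4, rng=random.Random(0))
loss = masked_mse(masked, targets, baseline_reconstruct)
print(f"masked positions: {sorted(targets)}, reconstruction MSE: {loss:.4f}")
```

Restricting the loss to masked positions is what makes the task self-supervised: the model must infer the hidden responses from the visible points of the curve and the assay context, without any activity label.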
Tox21mer is positioned as a reusable backbone for computational toxicology and chemical safety assessment. Its embeddings provide a common coordinate system for comparing and clustering qHTS results across protocols, compounds, and biological targets, supporting tasks such as activity classification, potency regression, and quality review of dose-response data. With appropriate downstream adaptation, the representation could support prospective screening of external chemicals across Tox21 endpoints, aiding hazard prioritization, read-across, and drug safety evaluation for toxicologists, regulatory scientists, and pharmaceutical researchers.
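One way to use a shared embedding space for comparison is nearest-neighbor lookup by cosine similarity. The sketch below is illustrative only: the four-dimensional vectors, the compound and assay labels, and the `nearest_curves` helper are all made up for the example, and cosine similarity is an assumed choice of metric rather than one stated in the preprint.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_curves(query_key, embeddings, k=2):
    """Return the k curves most similar to the query, excluding the query itself."""
    query = embeddings[query_key]
    scored = [
        (cosine_similarity(query, vec), key)
        for key, vec in embeddings.items()
        if key != query_key
    ]
    scored.sort(reverse=True)
    return [(key, score) for score, key in scored[:k]]


# Hypothetical curve embeddings keyed by (compound, assay); real ones would be 768-dimensional.
embeddings = {
    ("compound_A", "assay_1"): [0.9, 0.1, 0.0, 0.2],
    ("compound_B", "assay_1"): [0.8, 0.2, 0.1, 0.1],
    ("compound_A", "assay_2"): [0.1, 0.9, 0.3, 0.0],
    ("compound_C", "assay_2"): [0.0, 0.8, 0.4, 0.1],
}

neighbors = nearest_curves(("compound_A", "assay_1"), embeddings, k=2)
for key, score in neighbors:
    print(key, round(score, 3))
```

The same similarity scores could feed a clustering step or flag curves whose nearest neighbors disagree with their assigned activity call, which is one plausible form of the quality review mentioned above.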
Tox21mer demonstrates that self-supervised foundation-model methodology can be applied to high-throughput concentration-response data, an area historically dominated by per-assay curve fitting and bespoke predictive models. By delivering a single representation that performs strongly across outcome classification and AC50 regression from frozen embeddings, it offers a template for unifying the analysis of the Tox21 library and, more broadly, large screening datasets. Its near-term reach is constrained by the lack of a released code repository or downloadable weights, which limits independent reproduction and reuse; the permissive CC0 license on the preprint nonetheless signals openness toward community adoption.
Li, L., et al. (2026) Tox21mer, A transformer foundation model for Tox21 high-throughput concentration–response curves data. bioRxiv.
DOI: 10.64898/2026.06.15.732308