bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

Small molecule foundation models

Tox21mer

National Institute of Environmental Health Sciences

A transformer foundation model that encodes Tox21 high-throughput concentration-response curves and assay metadata into reusable 768-dimensional representations.

Released: June 2026
Parameters: 43.5 million

High-throughput screening programs such as Tox21 generate enormous volumes of quantitative concentration-response data, but those measurements are typically analyzed one curve or one assay at a time, with hand-engineered summaries such as activity calls and potency estimates (AC50). This makes it difficult to compare, organize, and interpret toxicological signals across the hundreds of biological endpoints and thousands of compounds that the program covers. Tox21mer addresses this gap by learning a single, reusable representation of concentration-response curves directly from the data, rather than relying on per-assay curve-fitting heuristics.

Released in June 2026 by researchers at the National Institute of Environmental Health Sciences (NIEHS), part of the U.S. National Institutes of Health, Tox21mer is a transformer pretrained on the Tox21 quantitative high-throughput screening (qHTS) corpus. Each concentration-response curve is encoded together with its assay metadata into a 768-dimensional embedding, producing a common feature space in which curves from different protocols, compounds, and targets can be directly compared.

Conceptually, Tox21mer is to toxicological screening data what protein and genomic language models are to sequence data: a foundation representation learned by self-supervision that downstream tasks can build on. Rather than training a bespoke model for each endpoint, researchers can extract frozen Tox21mer embeddings and attach lightweight probes for classification or regression tasks.
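The frozen-embedding workflow can be illustrated with a minimal sketch. No public Tox21mer code or weights have been identified, so the embeddings below are random stand-ins (8 dimensions rather than 768), and the names `train_linear_probe`, `predict`, and `fake_embedding` are hypothetical. The point is only that the probe's weights are trained while the embeddings stay fixed.

```python
import math
import random

def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe by gradient descent.

    The embeddings are treated as fixed inputs; only the probe's
    weights and bias are updated.
    """
    dim = len(embeddings[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(embeddings)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(active)
            err = p - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 (active) or 0 (inactive) for one embedding."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

# Assumption: synthetic stand-ins for frozen curve embeddings.
rng = random.Random(0)
DIM = 8

def fake_embedding(active):
    center = 1.0 if active else -1.0
    return [center + rng.gauss(0.0, 0.3) for _ in range(DIM)]

labels = [i % 2 for i in range(40)]
embeddings = [fake_embedding(bool(y)) for y in labels]

weights, bias = train_linear_probe(embeddings, labels)
accuracy = sum(
    predict(weights, bias, x) == y for x, y in zip(embeddings, labels)
) / len(labels)
print(f"probe accuracy on toy data: {accuracy:.2f}")
```

In practice the probe would be scored on held-out curves rather than on its own training data, as done here for brevity.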

#Key Features

  • Curve-level foundation representation: Tox21mer encodes an entire concentration-response curve plus assay context into one 768-dimensional vector, giving a transferable representation rather than a single scalar summary.
  • Masked-response pretraining: The primary self-supervised objective is masked-response reconstruction—predicting held-out response values along the dose axis—so the model learns the shape and structure of dose-response behavior.
  • Auxiliary supervision: Low-weight auxiliary heads predict assay outcome (active/inactive class) and AC50 during pretraining, gently aligning the representation with toxicologically meaningful quantities without dominating it.
  • Token-based curve encoding: Each curve is represented as a learnable [CLS] token plus per-concentration tokens that combine assay context, log-concentration, and normalized response, letting the transformer attend across the dose axis.
  • Frozen-embedding probing: Strong downstream performance from simple probes on frozen embeddings demonstrates that the learned features are broadly reusable without fine-tuning the backbone.
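The token layout and masking objective described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `CurveToken` class, the integer `assay_id` standing in for assay context, the divide-by-100 normalization, and the 30% mask fraction are all hypothetical choices that the source does not specify.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class CurveToken:
    assay_id: int               # assumed stand-in for assay context
    log_conc: float             # log10 of the tested concentration
    response: Optional[float]   # normalized response; None when masked

def tokenize_curve(assay_id, concentrations, responses, scale=100.0):
    """Build a [CLS] marker followed by one token per concentration."""
    tokens = ["[CLS]"]
    for conc, resp in zip(concentrations, responses):
        tokens.append(CurveToken(assay_id, math.log10(conc), resp / scale))
    return tokens

def mask_responses(tokens, mask_fraction, rng):
    """Hide a random subset of responses; return masked tokens and targets."""
    positions = list(range(1, len(tokens)))  # position 0 is [CLS]
    n_mask = max(1, round(mask_fraction * len(positions)))
    masked = list(tokens)
    targets = {}
    for pos in sorted(rng.sample(positions, n_mask)):
        tok = tokens[pos]
        targets[pos] = tok.response
        masked[pos] = CurveToken(tok.assay_id, tok.log_conc, None)
    return masked, targets

def masked_mse(predictions, targets):
    """Mean squared reconstruction error over masked positions only."""
    return sum((predictions[p] - t) ** 2 for p, t in targets.items()) / len(targets)

# Hypothetical six-point curve (concentrations in micromolar, responses in %).
rng = random.Random(0)
concs = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
resps = [0.0, -2.0, -10.0, -45.0, -80.0, -95.0]

tokens = tokenize_curve(assay_id=7, concentrations=concs, responses=resps)
masked, targets = mask_responses(tokens, mask_fraction=0.3, rng=rng)

# A trivial baseline that predicts 0.0 everywhere; a real model would
# predict the hidden responses from the unmasked tokens.
baseline = {pos: 0.0 for pos in targets}
loss = masked_mse(baseline, targets)
```

A full training loss would presumably add the low-weight auxiliary terms for outcome class and AC50 to this reconstruction term, but the source does not give the weights.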

#Technical Details

Tox21mer is a 43.5M-parameter transformer pretrained on roughly 2.5 million concentration-response curves drawn from 102 Tox21 assay protocols spanning 6,727 compounds. Each curve is tokenized into a learnable [CLS] token and per-concentration tokens that fuse assay metadata, log-concentration, and normalized response; the [CLS] embedding provides the 768-dimensional curve representation. Pretraining uses masked-response reconstruction as the main objective, supplemented by low-weight auxiliary supervision on assay outcome and AC50. Evaluated with simple probe heads on frozen embeddings, the model reports a macro-F1 of 0.985 on three-class outcome prediction, a binary F1 of 0.994 for active/inactive classification, and an R² of 0.87 for predicting log10(AC50). These results indicate that the learned representation captures both qualitative activity and quantitative potency, with the caveat that the auxiliary heads were trained on outcome and AC50 labels during pretraining, so the representation is not purely unsupervised. The preprint is released under a CC0 license; at the time of writing, no public code repository or pretrained model weights have been identified.
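For readers less familiar with the reported metrics, the sketch below shows the standard definitions of macro-F1 and R² on small made-up inputs. The class names and values are hypothetical and unrelated to the paper's data; the source does not name the three outcome classes.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical three-class outcome labels.
outcome_true = ["active", "inactive", "inconclusive",
                "active", "inactive", "inconclusive"]
outcome_pred = ["active", "inactive", "inconclusive",
                "active", "inactive", "active"]
outcome_macro_f1 = macro_f1(outcome_true, outcome_pred)

# Hypothetical log10(AC50) values (molar).
potency_true = [-6.0, -5.5, -5.0, -4.5]
potency_pred = [-5.9, -5.6, -5.1, -4.4]
potency_r2 = r_squared(potency_true, potency_pred)
```

Macro averaging matters for screening data because inactive curves typically far outnumber active ones, and a plain accuracy figure would be dominated by the majority class.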

#Applications

Tox21mer is positioned as a reusable backbone for computational toxicology and chemical safety assessment. Its embeddings provide a common coordinate system for comparing and clustering qHTS results across protocols, compounds, and biological targets, supporting tasks such as activity classification, potency regression, and quality review of dose-response data. With appropriate downstream adaptation, the representation could support prospective screening of external chemicals across Tox21 endpoints, aiding hazard prioritization, read-across, and drug safety evaluation for toxicologists, regulatory scientists, and pharmaceutical researchers.
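As a rough illustration of using embeddings as a common coordinate system, the sketch below ranks curves by cosine similarity to a query curve. The four-dimensional vectors and the curve identifiers are invented for the example; real Tox21mer embeddings would be 768-dimensional, and cosine similarity is one reasonable choice of distance rather than one prescribed by the source.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_curves(query_id, embeddings, k=2):
    """Return the ids of the k curves most similar to the query curve."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), curve_id)
        for curve_id, vec in embeddings.items()
        if curve_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [curve_id for _, curve_id in scored[:k]]

# Hypothetical embeddings keyed by "compound/assay" identifiers.
embeddings = {
    "compound_A/assay_1": [0.9, 0.1, 0.0, 0.2],
    "compound_B/assay_1": [0.8, 0.2, 0.1, 0.1],
    "compound_C/assay_2": [-0.7, 0.6, 0.1, 0.0],
    "compound_D/assay_2": [0.1, -0.9, 0.3, 0.0],
}

neighbors = nearest_curves("compound_A/assay_1", embeddings, k=2)
```

The same similarity function could feed a clustering step or a read-across lookup, since curves from different assays share one vector space.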

#Impact

Tox21mer demonstrates that self-supervised foundation-model methodology can be applied to high-throughput concentration-response data, an area historically dominated by per-assay curve fitting and bespoke predictive models. By delivering a single representation that performs strongly across outcome classification and AC50 regression from frozen embeddings, it offers a template for unifying the analysis of the Tox21 library and, more broadly, large screening datasets. Its near-term reach is constrained by the lack of a released code repository or downloadable weights, which limits independent reproduction and reuse; the permissive CC0 license on the preprint nonetheless signals openness toward community adoption.

Citation

Tox21mer, A transformer foundation model for Tox21 high-throughput concentration–response curves data

Li, L., et al. (2026) Tox21mer, A transformer foundation model for Tox21 high-throughput concentration–response curves data. bioRxiv.

DOI: 10.64898/2026.06.15.732308

Citations

Total citations: 0
Influential: 0
References: 32

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Score: 23 (Closed)
Usability (can I run it?): 15
Reproducibility (can I retrain it?): 18
Model Openness Framework: Unclassified (missing required components)

Tags

embeddings, foundation_model, representation_learning, self_supervised, transformer

Resources

Research Paper