bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.


GlycanGT

Nagoya University

A pretrained graph transformer foundation model for glycan structures, trained by masked language modeling and reaching state-of-the-art on eight glycan benchmark tasks.

Released: December 2025

GlycanGT is a graph transformer foundation model for glycans, the branched carbohydrate structures that decorate proteins and lipids and play central roles in immunity, cell signaling, and disease. Unlike linear biomolecules such as proteins and nucleic acids, glycans are tree-like graphs of monosaccharides joined by glycosidic linkages, which makes sequence-based language models a poor fit and motivates a graph-native approach. GlycanGT addresses this by treating each glycan as a graph and learning general-purpose representations that transfer across many downstream tasks.

The model was developed by Akihiro Kitani, Bingyuan Zhang, Koichi Himori, and Yusuke Matsui at Nagoya University, spanning the Department of Integrated Health Sciences and the Systems Biology Division of the Institute for Glyco-core Research (iGCORE). It was released as a bioRxiv preprint in December 2025 and subsequently published in Bioinformatics in 2026.

GlycanGT fills a gap in glycoscience, a field that has lagged behind protein and genomic modeling in adopting large pretrained models. By providing reusable embeddings and a fine-tunable backbone, it offers the glycobiology community a foundation comparable to what ESM and similar models provide for proteins.

#Key Features

  • Graph-native tokenization: Built on the Tokenized Graph Transformer (TokenGT) framework, GlycanGT represents every monosaccharide as a node token and every glycosidic linkage as an edge token, applying full multi-head self-attention across the whole glycan graph rather than a fixed message-passing neighborhood.
  • Masked language modeling pretraining: The model is pretrained self-supervised by masking and reconstructing both node and edge tokens, learning glycan grammar without task-specific labels.
  • State-of-the-art across eight benchmarks: GlycanGT outperforms prior methods on eight glycan classification tasks, including taxonomy, glycosylation type, and immunogenicity.
  • Recovery of incomplete structures: It predicts ambiguous monosaccharides and linkages in partially characterized glycans, maintaining over 80% top-5 accuracy even under high masking.
  • Open weights and code: Pretrained weights are released on Hugging Face and code on GitHub under the Apache-2.0 license, with multiple model scales available.
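The graph-native tokenization in the first bullet can be illustrated with a minimal sketch. It is not the GlycanGT code: the `Token` class, the `tokenize_glycan` function, the linkage label format, and the parent-to-child edge direction are all assumptions made for illustration. It only shows the general TokenGT-style idea of emitting one token per monosaccharide, one token per linkage, and a prepended `[graph]` token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record (not part of the GlycanGT codebase)."""
    kind: str    # "graph", "node", or "edge"
    label: str   # monosaccharide name or linkage type
    ends: tuple  # (u, u) for a node, (u, v) for an edge, () for [graph]


def tokenize_glycan(monosaccharides, linkages):
    """Flatten a glycan graph into a token sequence.

    monosaccharides: list of residue names; the list index is the node id.
    linkages: list of (parent, child, linkage_label) triples (assumed format).
    """
    n = len(monosaccharides)
    tokens = [Token("graph", "[graph]", ())]
    for i, name in enumerate(monosaccharides):
        # A node token carries its own node id twice.
        tokens.append(Token("node", name, (i, i)))
    for u, v, label in linkages:
        if not (0 <= u < n and 0 <= v < n):
            raise ValueError(f"linkage ({u}, {v}) refers to an unknown node")
        # An edge token carries the ids of both endpoints.
        tokens.append(Token("edge", label, (u, v)))
    return tokens


# Toy input: the N-glycan core pentasaccharide (Man3GlcNAc2).
residues = ["GlcNAc", "GlcNAc", "Man", "Man", "Man"]
links = [(0, 1, "b1-4"), (1, 2, "b1-4"), (2, 3, "a1-3"), (2, 4, "a1-6")]
tokens = tokenize_glycan(residues, links)
```

Because every node and edge becomes a token in one flat sequence, self-attention can relate any residue to any linkage directly, which is the contrast with fixed message-passing neighborhoods drawn in the bullet above.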

#Technical Details

GlycanGT uses a pure graph transformer encoder in which monosaccharide nodes and glycosidic edges are independent tokens. Each token is augmented with orthogonal random features for node identification and with type embeddings, and a [graph] token is prepended whose final embedding serves downstream tasks. Pretraining uses masked language modeling on node and edge tokens with a 35% masking ratio, cross-entropy loss, and an edge-loss weight of 0.5.

The model was trained on 83,739 glycans curated from the GlyCosmos/GlyTouCan databases, filtered from 244,842 entries to remove ambiguous structures and prevent leakage into downstream evaluation sets. Four model scales (ss, small, medium, large) are provided, with the large configuration used for downstream evaluation. On benchmarks, GlycanGT reaches a Macro-F1 of 0.932 on glycosylation-type prediction and an AUPRC of 0.844 on immunogenicity classification, exceeding baselines such as GlycanAA (0.705) and RGCN (0.695), and achieves the best Macro-F1 at six of eight hierarchical taxonomy levels.
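The pretraining objective can be sketched as follows. The 35% masking ratio and the 0.5 edge-loss weight come from the description above. Everything else is an assumption: the function names are hypothetical, positions are sampled uniformly, and the two loss terms are combined as node loss plus 0.5 times edge loss, which is one plausible reading rather than a confirmed detail of the published training code.

```python
import math
import random

MASK_RATIO = 0.35        # masking ratio stated in the text
EDGE_LOSS_WEIGHT = 0.5   # edge-loss weight stated in the text


def choose_masked(n_tokens, ratio=MASK_RATIO, rng=random):
    """Pick token positions to mask (uniform sampling is an assumption)."""
    if n_tokens == 0:
        return []
    k = max(1, round(n_tokens * ratio))  # mask at least one token
    return sorted(rng.sample(range(n_tokens), k))


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under `probs`."""
    return -math.log(probs[target])


def mean_ce(preds):
    """Mean cross-entropy over (probs, target) pairs; 0.0 if empty."""
    if not preds:
        return 0.0
    return sum(cross_entropy(p, t) for p, t in preds) / len(preds)


def mlm_loss(node_preds, edge_preds, edge_weight=EDGE_LOSS_WEIGHT):
    """Assumed combination: node loss + edge_weight * edge loss.

    node_preds / edge_preds: lists of (predicted_probs, target_index)
    pairs for the masked node and edge positions.
    """
    return mean_ce(node_preds) + edge_weight * mean_ce(edge_preds)


# Toy usage with made-up predicted distributions.
masked = choose_masked(20, rng=random.Random(0))
loss = mlm_loss(
    node_preds=[([0.5, 0.5], 0)],
    edge_preds=[([0.25, 0.75], 0)],
)
```

Down-weighting the edge term keeps the smaller linkage vocabulary from dominating the gradient relative to monosaccharide prediction, though the source does not state the rationale for the value 0.5.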

#Applications

GlycanGT supports glycobiology research wherever glycan structures must be analyzed, classified, or completed. Researchers can extract embeddings for clustering and exploratory analysis, fine-tune the backbone for prediction tasks such as taxonomy, glycosylation type, and immunogenicity, or use the model to infer missing monosaccharides and linkages in incompletely characterized glycans from experimental glycomics workflows. These capabilities benefit groups working on glycan-based biomarkers, vaccine and therapeutic immunogenicity assessment, and large-scale annotation of glycan databases.
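As an illustration of the exploratory use of embeddings mentioned above, the following sketch ranks glycans by cosine similarity to a query. The embedding vectors and glycan identifiers are made-up toy values, and `cosine` and `nearest` are hypothetical helpers. In practice the vectors would be the [graph] token embeddings produced by the model, and how to obtain them is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_id, embeddings, k=2):
    """Return the ids of the k glycans most similar to `query_id`."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, vec), gid)
        for gid, vec in embeddings.items()
        if gid != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [gid for _, gid in scored[:k]]


# Toy embeddings (3-dimensional, invented for illustration only).
toy_embeddings = {
    "glycan_A": [1.0, 0.0, 0.2],
    "glycan_B": [0.9, 0.1, 0.3],
    "glycan_C": [0.0, 1.0, 0.0],
}
neighbors = nearest("glycan_A", toy_embeddings, k=1)
```

The same similarity scores could feed a clustering routine or a nearest-neighbor annotation step for uncharacterized database entries.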

#Impact

GlycanGT brings the foundation-model paradigm to glycoscience, a domain historically underserved by deep learning because of the graph-structured, branched nature of glycans. By releasing open weights, code, and pretrained representations that beat task-specific baselines across eight benchmarks, it lowers the barrier for the glycobiology community to build on a shared backbone rather than training models from scratch. Its publication in Bioinformatics and availability on Hugging Face position it as a reference point for future glycan representation learning, though, like other foundation models, its performance depends on the coverage and quality of the curated GlyCosmos/GlyTouCan training data.

Citation

DOI: 10.64898/2025.12.14.694171

Openness

bio.rodeo openness: Fully open · usable and reproducible
Score: 82 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 62
Model Openness Framework: Unclassified (missing required components)

Tags

foundation_model, glycobiology, glycomics, graph_transformer, representation_learning, self_supervised, structure_prediction, transformer

Resources

GitHub Repository · Research Paper · HuggingFace Model · Dataset