GlycanGT

Graph transformer foundation model for glycans, learning reusable embeddings of branched carbohydrate structures for glycomics prediction tasks.

Released: December 2025

GlycanGT is a graph transformer foundation model for glycans, the branched carbohydrate structures that decorate proteins and lipids and play central roles in immunity, cell signaling, and disease. Unlike linear biomolecules such as proteins and nucleic acids, glycans are tree-like graphs of monosaccharides joined by glycosidic linkages, which makes sequence-based language models a poor fit and motivates a graph-native approach. GlycanGT addresses this by treating each glycan as a graph and learning general-purpose representations that transfer across many downstream tasks.

The model was developed by Akihiro Kitani, Bingyuan Zhang, Koichi Himori, and Yusuke Matsui at Nagoya University, spanning the Department of Integrated Health Sciences and the Systems Biology Division of the Institute for Glyco-core Research (iGCORE). It was released as a bioRxiv preprint in December 2025 and subsequently published in Bioinformatics in 2026.

GlycanGT fills a gap in glycoscience, a field that has lagged behind protein and genomic modeling in adopting large pretrained models. By providing reusable embeddings and a fine-tunable backbone, it offers the glycobiology community a foundation comparable to what ESM and similar models provide for proteins.

Key Features

Graph-native tokenization: Built on the Tokenized Graph Transformer (TokenGT) framework, GlycanGT represents every monosaccharide as a node token and every glycosidic linkage as an edge token, applying full multi-head self-attention across the whole glycan graph rather than a fixed message-passing neighborhood.
Masked language modeling pretraining: The model is pretrained self-supervised by masking and reconstructing both node and edge tokens, learning glycan grammar without task-specific labels.
State-of-the-art across eight benchmarks: GlycanGT outperforms prior methods on eight glycan classification tasks, including taxonomy, glycosylation type, and immunogenicity.
Recovery of incomplete structures: It predicts ambiguous monosaccharides and linkages in partially characterized glycans, maintaining over 80% top-5 accuracy even under high masking.
Open weights and code: Pretrained weights are released on Hugging Face and code on GitHub under the Apache-2.0 license, with multiple model scales available.
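The graph-native tokenization described above can be illustrated with a small sketch. This is not the GlycanGT codebase: the `Token` class, the `tokenize_glycan` function, and the edge-list input format are hypothetical names chosen for illustration. It shows only the general TokenGT-style idea that every monosaccharide and every linkage becomes its own token, alongside a prepended [graph] token.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    kind: str         # "graph", "node", or "edge"
    label: str        # monosaccharide name or linkage type
    endpoints: tuple  # (u, u) for a node, (u, v) for an edge, () for [graph]

def tokenize_glycan(monosaccharides, linkages):
    """Turn a glycan graph into a flat token list (hypothetical helper).

    monosaccharides: list of residue names, indexed by node id
    linkages: list of (child, parent, linkage_label) triples
    """
    tokens = [Token("graph", "[graph]", ())]
    for i, name in enumerate(monosaccharides):
        tokens.append(Token("node", name, (i, i)))
    for child, parent, label in linkages:
        tokens.append(Token("edge", label, (child, parent)))
    return tokens

# N-glycan core: Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc
residues = ["GlcNAc", "GlcNAc", "Man", "Man", "Man"]
bonds = [(1, 0, "b1-4"), (2, 1, "b1-4"), (3, 2, "a1-3"), (4, 2, "a1-6")]
tokens = tokenize_glycan(residues, bonds)
```

Because the branch point (node 2) simply appears as the parent of two edge tokens, no linearization of the tree is needed, and self-attention can relate any pair of tokens directly.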

Technical Details

GlycanGT uses a pure graph transformer encoder in which monosaccharide nodes and glycosidic edges are independent tokens. The tokens are augmented with orthogonal random features for node identification and with type embeddings, and a prepended [graph] token supplies the final embedding used for downstream tasks.

Pretraining uses masked language modeling on node and edge tokens with a 35% masking ratio, cross-entropy loss, and an edge-loss weight of 0.5. The model was trained on 83,739 glycans curated from the GlyCosmos/GlyTouCan databases, filtered from 244,842 entries to remove ambiguous structures and prevent leakage into downstream evaluation sets. Four model scales (ss, small, medium, large) are provided, with the large configuration used for downstream evaluation.

On benchmarks, GlycanGT reaches a Macro-F1 of 0.932 on glycosylation-type prediction and an AUPRC of 0.844 on immunogenicity classification, exceeding baselines such as GlycanAA (0.705) and RGCN (0.695), and achieves the best Macro-F1 at six of eight hierarchical taxonomy levels.
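A minimal sketch of the pretraining objective described above, in plain Python. The 35% masking ratio and the 0.5 edge-loss weight come from the text; everything else is an assumption made for illustration, including the function names, uniform random selection of masked positions, and combining the terms as node loss plus weighted edge loss.

```python
import math
import random

MASK_RATIO = 0.35       # masking ratio stated in the text
EDGE_LOSS_WEIGHT = 0.5  # edge-loss weight stated in the text

def choose_masked(num_tokens, ratio=MASK_RATIO, rng=None):
    """Pick token positions to mask (uniform sampling is an assumption)."""
    rng = rng or random.Random(0)
    k = max(1, round(num_tokens * ratio))
    return sorted(rng.sample(range(num_tokens), k))

def cross_entropy(logits, target):
    """Cross-entropy of one prediction, via a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mean_ce(preds):
    """Average cross-entropy over (logits, target) pairs; 0.0 if empty."""
    if not preds:
        return 0.0
    return sum(cross_entropy(l, t) for l, t in preds) / len(preds)

def mlm_loss(node_preds, edge_preds, edge_weight=EDGE_LOSS_WEIGHT):
    """Assumed combination: node loss + edge_weight * edge loss."""
    return mean_ce(node_preds) + edge_weight * mean_ce(edge_preds)

masked = choose_masked(20)
loss = mlm_loss([([0.0, 0.0], 0)], [([0.0, 0.0], 1)])
```

With two equally likely classes each term is log 2, so the example loss is 1.5 × log 2; down-weighting the edge term keeps the comparatively small linkage vocabulary from dominating the gradient, which is one plausible motivation for a weight below 1.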

Applications

GlycanGT supports glycobiology research wherever glycan structures must be analyzed, classified, or completed. Researchers can extract embeddings for clustering and exploratory analysis, fine-tune the backbone for prediction tasks such as taxonomy, glycosylation type, and immunogenicity, or use the model to infer missing monosaccharides and linkages in incompletely characterized glycans from experimental glycomics workflows. These capabilities benefit groups working on glycan-based biomarkers, vaccine and therapeutic immunogenicity assessment, and large-scale annotation of glycan databases.
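As a sketch of the exploratory use of embeddings mentioned above, the snippet below ranks glycans by cosine similarity. The vectors are made-up placeholders rather than real GlycanGT outputs, and `cosine_similarity` and `nearest_neighbors` are hypothetical helpers; in practice the vectors would be the [graph] token embeddings obtained from the released weights.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_id, embeddings, k=2):
    """Return the ids of the k glycans most similar to query_id."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), gid)
        for gid, vec in embeddings.items()
        if gid != query_id
    ]
    scored.sort(reverse=True)
    return [gid for _, gid in scored[:k]]

# Placeholder 3-dimensional embeddings (illustrative values only).
embeddings = {
    "glycan_A": [1.0, 0.0, 0.0],
    "glycan_B": [0.9, 0.1, 0.0],
    "glycan_C": [0.0, 1.0, 0.0],
}
neighbors = nearest_neighbors("glycan_A", embeddings, k=1)
```

The same similarity scores can feed standard clustering or visualization tools once real embeddings are substituted for the placeholders.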

Impact

GlycanGT brings the foundation-model paradigm to glycoscience, a domain historically underserved by deep learning because of the graph-structured, branched nature of glycans. By releasing open weights, code, and pretrained representations that beat task-specific baselines across eight benchmarks, it lowers the barrier for the glycobiology community to build on a shared backbone rather than training models from scratch. Its publication in Bioinformatics and availability on Hugging Face position it as a reference point for future glycan representation learning, though, like other foundation models, its performance depends on the coverage and quality of the curated GlyCosmos/GlyTouCan training data.

Citation

GlycanGT: A Foundation Model for Glycan Graphs with Pretrained Representation and Generative Learning

Kitani, A., et al. (2025) GlycanGT: A Foundation Model for Glycan Graphs with Pretrained Representation and Generative Learning. bioRxiv.

DOI: 10.64898/2025.12.14.694171


Citations

Total citations: 0
Influential: 0
References: 20

GitHub

Stars: 3
Forks: 0
Open issues: 0
Contributors: 1
Last push: 3 months ago
Language: Python
License: Apache-2.0

HuggingFace

Downloads: 0
Likes: 0
Last modified: 3 months ago


Openness

bio.rodeo openness: Fully open (usable and reproducible)
Score: 82 (Open)
Usability (can I run it?): 100
Reproducibility (can I retrain it?): 62
Model Openness Framework: Unclassified (missing required components)
