A pretrained graph transformer foundation model for glycan structures, trained by masked language modeling and reaching state-of-the-art performance on eight glycan benchmark tasks.
GlycanGT is a graph transformer foundation model for glycans, the branched carbohydrate structures that decorate proteins and lipids and play central roles in immunity, cell signaling, and disease. Unlike linear biomolecules such as proteins and nucleic acids, glycans are tree-like graphs of monosaccharides joined by glycosidic linkages, which makes sequence-based language models a poor fit and motivates a graph-native approach. GlycanGT addresses this by treating each glycan as a graph and learning general-purpose representations that transfer across many downstream tasks.
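To make the "glycan as graph" framing concrete, the sketch below encodes the common N-glycan core pentasaccharide (Man3GlcNAc2) as monosaccharide nodes joined by linkage-labeled edges. The `GlycanGraph` class, its method names, and the linkage strings are illustrative assumptions for this example only; they are not GlycanGT's actual data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GlycanGraph:
    """Toy glycan graph (hypothetical helper, not the GlycanGT input format)."""
    nodes: List[str] = field(default_factory=list)
    # Each edge is (child index, parent index, glycosidic linkage label).
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def add_node(self, monosaccharide: str) -> int:
        """Append a monosaccharide and return its node index."""
        self.nodes.append(monosaccharide)
        return len(self.nodes) - 1

    def add_edge(self, child: int, parent: int, linkage: str) -> None:
        """Attach `child` to `parent` through the given linkage."""
        self.edges.append((child, parent, linkage))

    def children(self, parent: int) -> List[int]:
        """Return the indices of all residues attached to `parent`."""
        return [c for c, p, _ in self.edges if p == parent]


def build_n_glycan_core() -> GlycanGraph:
    """Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc, rooted at the reducing end."""
    g = GlycanGraph()
    glcnac_root = g.add_node("GlcNAc")
    glcnac_2 = g.add_node("GlcNAc")
    man_core = g.add_node("Man")
    man_a3 = g.add_node("Man")
    man_a6 = g.add_node("Man")
    g.add_edge(glcnac_2, glcnac_root, "b1-4")
    g.add_edge(man_core, glcnac_2, "b1-4")
    g.add_edge(man_a3, man_core, "a1-3")
    g.add_edge(man_a6, man_core, "a1-6")
    return g


core = build_n_glycan_core()
# A node with more than one child is a branch point, which a flat sequence cannot express directly.
branch_points = [i for i in range(len(core.nodes)) if len(core.children(i)) > 1]
print(core.nodes, core.edges, branch_points)
```

The central mannose carries two children, so any linear string for this structure needs bracket notation to record the branch, whereas the graph form stores it directly.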
The model was developed by Akihiro Kitani, Bingyuan Zhang, Koichi Himori, and Yusuke Matsui at Nagoya University, spanning the Department of Integrated Health Sciences and the Systems Biology Division of the Institute for Glyco-core Research (iGCORE). It was released as a bioRxiv preprint in December 2025 and subsequently published in Bioinformatics in 2026.
GlycanGT fills a gap in glycoscience, a field that has lagged behind protein and genomic modeling in adopting large pretrained models. By providing reusable embeddings and a fine-tunable backbone, it offers the glycobiology community a foundation comparable to what ESM and similar models provide for proteins.
GlycanGT uses a pure graph transformer encoder in which monosaccharide nodes and glycosidic
edges are independent tokens, augmented with orthogonal random features for node
identification, type embeddings, and a prepended [graph] token whose final embedding serves
downstream tasks. Pretraining uses masked language modeling on node and edge tokens with a
35% masking ratio, cross-entropy loss, and an edge-loss weight of 0.5. The model was trained
on 83,739 glycans curated from the GlyCosmos/GlyTouCan databases, filtered from an initial
244,842 entries to remove ambiguous structures and prevent leakage into downstream
evaluation sets. Four model scales (ss, small, medium, large) were provided, with the large
configuration used for downstream evaluation. On benchmarks, GlycanGT reaches a Macro-F1 of
0.932 on glycosylation-type prediction and an AUPRC of 0.844 on immunogenicity classification,
exceeding baselines such as GlycanAA (0.705) and RGCN (0.695), and achieves the best Macro-F1
at six of eight hierarchical taxonomy levels.
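The pretraining objective described above can be sketched in a few lines of plain Python. The example flattens a small glycan into a token list with a prepended [graph] token, masks 35% of the node and edge tokens, and combines per-type cross-entropy losses with an edge weight of 0.5. Only the masking ratio, the loss type, and the edge weight come from the description; the `[mask]` placeholder, the toy vocabularies, the helper names, the choice to average each loss over its masked tokens, and the implicit node weight of 1 are assumptions made for illustration. Orthogonal random features and type embeddings are omitted, and a uniform-logit stub stands in for the transformer.

```python
import math
import random

MASK = "[mask]"          # assumed placeholder token
MASK_RATIO = 0.35        # masking ratio stated in the description
EDGE_LOSS_WEIGHT = 0.5   # edge-loss weight stated in the description


def tokenize(nodes, edges):
    """Return (kind, label) tokens: [graph] first, then nodes, then edges."""
    tokens = [("graph", "[graph]")]
    tokens += [("node", name) for name in nodes]
    tokens += [("edge", linkage) for _, _, linkage in edges]
    return tokens


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of node/edge labels with MASK; never mask [graph]."""
    candidates = [i for i, (kind, _) in enumerate(tokens) if kind != "graph"]
    n_mask = max(1, round(ratio * len(candidates)))
    masked_idx = sorted(rng.sample(candidates, n_mask))
    corrupted = [
        (kind, MASK if i in masked_idx else label)
        for i, (kind, label) in enumerate(tokens)
    ]
    return corrupted, masked_idx


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean(values):
    return sum(values) / len(values) if values else 0.0


def mlm_loss(tokens, masked_idx, logits_fn, node_vocab, edge_vocab):
    """Node loss plus weighted edge loss over masked positions (assumed combination)."""
    node_losses, edge_losses = [], []
    for i in masked_idx:
        kind, label = tokens[i]
        vocab = node_vocab if kind == "node" else edge_vocab
        loss = cross_entropy(logits_fn(i, kind), vocab.index(label))
        (node_losses if kind == "node" else edge_losses).append(loss)
    return mean(node_losses) + EDGE_LOSS_WEIGHT * mean(edge_losses)


# Toy input: the N-glycan core, with (child, parent, linkage) edges.
nodes = ["GlcNAc", "GlcNAc", "Man", "Man", "Man"]
edges = [(1, 0, "b1-4"), (2, 1, "b1-4"), (3, 2, "a1-3"), (4, 2, "a1-6")]
node_vocab = ["GlcNAc", "Man", "Gal", "Fuc"]   # illustrative vocabulary
edge_vocab = ["a1-3", "a1-6", "b1-4"]          # illustrative vocabulary

tokens = tokenize(nodes, edges)
corrupted, masked_idx = mask_tokens(tokens, MASK_RATIO, random.Random(0))


def uniform_logits(position, kind):
    """Stand-in for the encoder: equal logits for every vocabulary entry."""
    return [0.0] * len(node_vocab if kind == "node" else edge_vocab)


loss = mlm_loss(tokens, masked_idx, uniform_logits, node_vocab, edge_vocab)
print(corrupted)
print(loss)
```

With uniform logits, each masked token contributes the log of its vocabulary size, which gives a convenient untrained reference value. A real encoder would replace `uniform_logits` with per-position predictions conditioned on the unmasked tokens.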
GlycanGT supports glycobiology research wherever glycan structures must be analyzed, classified, or completed. Researchers can extract embeddings for clustering and exploratory analysis, fine-tune the backbone for prediction tasks such as taxonomy, glycosylation type, and immunogenicity, or use the model to infer missing monosaccharides and linkages in incompletely characterized glycans from experimental glycomics workflows. These capabilities benefit groups working on glycan-based biomarkers, vaccine and therapeutic immunogenicity assessment, and large-scale annotation of glycan databases.
GlycanGT brings the foundation-model paradigm to glycoscience, a domain historically underserved by deep learning because of the graph-structured, branched nature of glycans. By releasing open weights, code, and pretrained representations that beat task-specific baselines across eight benchmarks, it lowers the barrier for the glycobiology community to build on a shared backbone rather than training models from scratch. Its publication in Bioinformatics and availability on Hugging Face position it as a reference point for future glycan representation learning, though, like other foundation models, its performance depends on the coverage and quality of the curated GlyCosmos/GlyTouCan training data.