Skolkovo Institute of Science and Technology
Multimodal protein language model extending ESM-2 and SaProt with a Structure Adapter that encodes backbone and side-chain torsion angles for improved function prediction.
Modern protein language models trained on amino acid sequences alone have proven remarkably effective at learning evolutionarily meaningful representations. However, protein function is ultimately determined by three-dimensional structure, and residue-level biochemical properties — electrostatics, hydrogen-bond geometry, side-chain packing — are only partially recoverable from sequence statistics. MULAN (MULtimodal Protein LANguage Model) addresses this gap by extending existing sequence-based protein language models with a lightweight structural encoding module, enabling the combined representation of both sequence identity and backbone/side-chain geometry without requiring retraining from scratch.
MULAN was developed by Daria Frolova, Marina Pak, Anna Litvin, Ilya Sharov, Dmitry Ivankov, and Ivan Oseledets at the Center for Artificial Intelligence Technology and the Center for Molecular and Cellular Biology at the Skolkovo Institute of Science and Technology (Skoltech) in Moscow. First posted to bioRxiv in May 2024 and published in Bioinformatics Advances in May 2025, MULAN introduces the Structure Adapter — a parameter-efficient module that processes residue torsion angles and fuses them with the sequence representations produced by an underlying protein language model. Crucially, the Structure Adapter is designed to be grafted onto existing pretrained models (ESM-2 or SaProt) rather than requiring a large multimodal model to be trained from scratch, making MULAN both computationally accessible and compatible with established model ecosystems.
The broader context for MULAN is the rapid proliferation of protein language models with varying degrees of structural awareness. SaProt, for example, encodes protein sequences as combined sequence-structure tokens using Foldseek's 3Di alphabet, introducing structural information at the tokenization level. ESM-3 jointly reasons over sequence, structure, and function modalities through a unified generative framework. MULAN occupies a complementary niche: it takes any pretrained sequence model and cheaply augments it with continuous geometric information derived from torsion angles, achieving improved downstream task performance without the scale or training cost of a fully multimodal architecture. The model was evaluated on nine protein function prediction benchmarks spanning diverse task types, consistently outperforming both sequence-only ESM-2 and the structure-tokenized SaProt on most tasks.
Structure Adapter module: MULAN introduces a small trainable module that accepts backbone phi/psi dihedral angles and up to five side-chain chi angles per residue as input, projects them into the embedding space of the underlying language model, and adds the resulting structural embeddings to the sequence embeddings before each transformer block. This additive fusion preserves the pretrained sequence representations while injecting local geometric context.
Rotation- and translation-invariant structural encoding: Torsion angles are intrinsically invariant to rigid-body transformations of the protein — they describe internal degrees of freedom rather than absolute coordinates. This makes torsion angle representation a natural choice for injecting structural information into transformers, which lack built-in spatial awareness, without requiring coordinate-frame alignment or equivariant network components.
Compatible with ESM-2 and SaProt backbones: MULAN is designed as a general augmentation layer rather than a fixed architecture. The pretrained backbone can be either the ESM-2 family (which uses the standard 20-amino-acid vocabulary) or SaProt (which uses a combined sequence-structure vocabulary derived from Foldseek's 3Di structural alphabet). For SaProt-based MULAN, structural information enters through both the tokenization scheme and the Structure Adapter, providing complementary structural signals.
Multiple model sizes: Model checkpoints are available at three scales — MULAN-small (approximately 9M parameters), MULAN-ESM2-35M, and MULAN-ESM2-650M — enabling researchers to match model capacity to their computational budget. The medium (35M) model provides a particularly favorable balance of performance and resource consumption on most benchmarks evaluated.
Structural awareness without AlphaFold overhead at inference: While MULAN requires structural information (torsion angles) as input, these can be derived rapidly from AlphaFold 2 or ESMFold predictions rather than experimental structures (see the sketch after this list). For the many proteins now covered by AlphaFold DB, structural annotations are immediately available. For new sequences, ESMFold can produce backbone coordinates in seconds, making MULAN's augmented representations broadly applicable.
Continued pretraining with torsion angle prediction: Beyond task-specific fine-tuning, MULAN undergoes continued pretraining from the ESM-2 initialization using a joint objective that includes masked amino acid prediction and torsion angle prediction. This pretraining step teaches the model to correlate sequence patterns with geometric constraints before any task-specific supervision, improving the quality of structural representations for downstream use.
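As noted above, the torsion angles MULAN consumes can be computed directly from predicted structure files. The sketch below shows one generic way to extract backbone phi/psi angles from a PDB file using Biopython; it is an independent illustration of the workflow, not the preprocessing code shipped with MULAN, the file name is hypothetical, and side-chain chi angles would need additional handling (for example via Biopython's internal-coordinates module).

```python
# Sketch: extract backbone phi/psi torsion angles from a predicted structure
# (e.g. an AlphaFold DB or ESMFold PDB file). Illustrative only; MULAN's own
# preprocessing may differ, and chi angles require extra handling.
import math
from Bio.PDB import PDBParser, PPBuilder

parser = PDBParser(QUIET=True)
structure = parser.get_structure("query", "AF-P12345-F1-model_v4.pdb")  # hypothetical file

ppb = PPBuilder()
torsions = []  # one (phi, psi) pair per residue, in radians; None at chain termini
for peptide in ppb.build_peptides(structure):
    torsions.extend(peptide.get_phi_psi_list())

# Replace undefined angles at termini with 0.0 so every residue has a fixed-size feature
angles = [(phi or 0.0, psi or 0.0) for phi, psi in torsions]
print(f"{len(angles)} residues, first residue phi/psi (deg):",
      [round(math.degrees(a), 1) for a in angles[0]])
```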
MULAN's architecture is built around the ESM-2 transformer encoder (available in 35M and 650M configurations). The Structure Adapter consists of a small feedforward network that takes concatenated torsion angle values — four backbone angles (phi, psi, omega, and tau) plus up to five side-chain chi angles — encodes them into a continuous vector, and adds this vector to the residue's sequence embedding. The augmented embedding is then passed through the standard ESM-2 transformer layers. The Structure Adapter introduces only a small fraction of new parameters relative to the backbone, consistent with a parameter-efficient design philosophy.
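The additive fusion described above can be summarized in a minimal PyTorch sketch. The hidden sizes, the use of sin/cos angle encoding, the two-layer MLP, and the angle count are illustrative assumptions rather than the published implementation; the embedding width of 480 matches the 35M-scale ESM-2 backbone.

```python
# Minimal sketch of Structure Adapter-style additive fusion (illustrative;
# layer structure, angle encoding, and sizes are assumptions, not MULAN's exact code).
import torch
import torch.nn as nn

class StructureAdapter(nn.Module):
    def __init__(self, n_angles: int = 9, d_model: int = 480):
        super().__init__()
        # Each angle is represented as a (sin, cos) pair, so the input width is 2 * n_angles.
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_angles, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, seq_emb: torch.Tensor, angles: torch.Tensor, mask: torch.Tensor):
        # seq_emb: (batch, length, d_model) token embeddings from the pretrained backbone
        # angles:  (batch, length, n_angles) torsion angles in radians
        # mask:    (batch, length, n_angles) 1.0 where the angle is defined, else 0.0
        feats = torch.cat([torch.sin(angles) * mask, torch.cos(angles) * mask], dim=-1)
        return seq_emb + self.mlp(feats)  # additive fusion preserves the pretrained embedding

# Toy usage: 4 backbone + 5 chi angles per residue, 35M-scale embedding width
adapter = StructureAdapter(n_angles=9, d_model=480)
seq_emb = torch.randn(2, 100, 480)
angles = torch.rand(2, 100, 9) * 2 * torch.pi - torch.pi
mask = torch.ones(2, 100, 9)
print(adapter(seq_emb, angles, mask).shape)  # torch.Size([2, 100, 480])
```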
For pretraining, MULAN is initialized from a pretrained ESM-2 or SaProt checkpoint and then further pretrained on a subset of the AlphaFold Database, using AlphaFold-predicted structures as the source of torsion angle supervision. This approach allows the model to leverage the large-scale structural coverage of AlphaFold DB without requiring experimental structure data. The torsion angle prediction loss is combined with the standard masked language modeling objective, with loss weighting tuned to balance the two supervision signals.
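A schematic of how such a joint objective can be combined is sketched below. The angle head output format (sin/cos pairs), the squared-error angle loss, and the weighting coefficient are illustrative assumptions; the paper's exact formulation and tuned weights may differ.

```python
# Schematic joint objective: masked language modeling + torsion angle regression.
# The (sin, cos) regression target and the weight `lambda_angle` are assumptions.
import torch
import torch.nn.functional as F

def joint_loss(token_logits, token_targets, angle_pred, angle_true, angle_mask,
               lambda_angle: float = 0.5):
    # token_logits: (batch, length, vocab); token_targets: (batch, length),
    # with -100 marking unmasked positions (standard MLM convention).
    mlm = F.cross_entropy(token_logits.transpose(1, 2), token_targets, ignore_index=-100)

    # angle_pred / angle_true: (batch, length, n_angles, 2) as (sin, cos) pairs;
    # angle_mask: (batch, length, n_angles) marking angles that are defined.
    sq_err = ((angle_pred - angle_true) ** 2).sum(-1)          # per-angle squared error
    angle = (sq_err * angle_mask).sum() / angle_mask.sum().clamp(min=1.0)

    return mlm + lambda_angle * angle
```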
Benchmark evaluation covers nine downstream tasks: protein-protein interaction (PPI) prediction, enzyme commission (EC) number classification, gene ontology (GO) biological process and molecular function annotation, subcellular localization prediction, protein stability prediction, fluorescence prediction, remote homology detection, and contact prediction. The primary reported result is that MULAN models consistently outperform both ESM-2 and SaProt at matched parameter counts, with the most pronounced improvement on PPI prediction (up to 0.12 improvement in AUROC), a task that directly depends on understanding surface geometry and interface residue chemistry. Performance gains relative to baseline are consistent across the small and medium model sizes, and MULAN is competitive with substantially larger models such as Ankh, ESM-3, and ProstT5 on several benchmarks despite using fewer parameters. Rotary positional embeddings were added in updated model checkpoints released in March 2025.
MULAN is particularly well-suited for computational biologists who need improved protein function predictions for proteins with available structural models. The most direct use cases include PPI network analysis, where MULAN's superior interface-aware representations can help distinguish interacting from non-interacting protein pairs in proteome-wide screens; enzyme function annotation, where the combination of sequence and torsion angle information helps capture the active-site geometry that determines catalytic specificity; and subcellular localization prediction, where structural compactness and surface properties influence trafficking signals. Because MULAN can use AlphaFold DB predictions as structural inputs, it is immediately applicable to any protein in the human proteome or the Swiss-Prot/TrEMBL reference databases. Structural and wet-lab biologists can use MULAN to prioritize candidates for co-immunoprecipitation or yeast two-hybrid validation from long lists of computationally predicted interaction partners. The availability of models on GitHub with permissive licensing makes integration into existing Python-based analysis pipelines straightforward.
MULAN contributes a practically important design principle to the protein machine learning field: that cheap structural augmentation of existing pretrained language models — adding torsion angles through a lightweight adapter — reliably improves downstream task performance relative to both sequence-only and structure-tokenized baselines, across a diverse benchmark panel. This is significant because it demonstrates that continuous geometric information carries complementary signal to what is captured by either sequence statistics or discrete structural tokens, and that this complementarity can be exploited without expensive multimodal pretraining from scratch. The work also highlights torsion angles as an effective and architecturally simple structural representation for transformers, encouraging further exploration of geometry-aware augmentation strategies. A key limitation is that MULAN inherits its structural information from predicted structures (typically AlphaFold 2), meaning that prediction errors propagate into MULAN's structural embeddings; for proteins where AlphaFold confidence is low (disordered regions, novel folds), the Structure Adapter may provide noisy rather than informative signal. The published version in Bioinformatics Advances and the publicly released checkpoints position MULAN as a ready-to-use tool for the community, and the GitHub repository has been actively maintained with updates through 2025.