A multimodal protein pre-training framework that jointly learns sequence, 3D structure, and surface representations using implicit neural representations.
ProteinINR is a multimodal pre-training framework developed by Kakao Brain that addresses a long-standing gap in protein representation learning: the underutilization of molecular surface information. Protein surfaces govern interactions with other molecules — including drug candidates, binding partners, and cellular receptors — yet most representation learning approaches at the time of its publication relied exclusively on amino acid sequences or backbone atomic coordinates. ProteinINR proposes integrating all three levels of protein description — sequence, 3D structure, and surface — into a unified pre-training pipeline.
The core technical contribution is the application of Implicit Neural Representations (INRs) to protein molecular surfaces. Rather than discretizing surfaces into fixed-resolution meshes (which introduces resolution artifacts and limits generalization), INRs parameterize the surface as a continuous function of spatial coordinates. A small neural network learns to map any 3D query point to its corresponding surface property, enabling resolution-independent surface reconstruction. This allows ProteinINR to capture fine-grained geometric and chemical features of protein surfaces that are not recoverable from backbone coordinates alone.
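The coordinate-to-property mapping at the heart of an INR can be sketched with a tiny multilayer perceptron. The sketch below uses sine activations in the style of the SIREN family of INRs; ProteinINR's actual decoder is latent-conditioned and transformer-based, so the network shape, sizes, and `omega` frequency here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_inr(in_dim=3, hidden=64, out_dim=1):
    # Random weights for a small coordinate MLP (untrained, illustrative).
    return {
        "W1": rng.normal(0.0, 1.0 / in_dim, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 1.0 / hidden, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def query_surface(params, xyz, omega=30.0):
    # Continuous map from 3D coordinates to a scalar surface value
    # (e.g. an occupancy or signed-distance logit). Sine activations
    # follow the SIREN style of INR; treat this as the core idea only.
    h = np.sin(omega * (xyz @ params["W1"] + params["b1"]))
    return h @ params["W2"] + params["b2"]

params = init_inr()
# Resolution independence: the same network answers queries at any
# density, coarse or fine, with no fixed mesh in between.
coarse = query_surface(params, rng.uniform(-1.0, 1.0, (8, 3)))
fine = query_surface(params, rng.uniform(-1.0, 1.0, (10_000, 3)))
```

Because the surface is a function rather than a mesh, "reconstruction resolution" is just the density of query points chosen at inference time.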
ProteinINR was published as a conference paper at ICLR 2024 by Youhan Lee, Hasun Yu, Jaemyung Lee, and Jaehoon Kim. The work demonstrates that incorporating surface pre-training consistently improves downstream task performance compared to methods trained on sequence and structure alone, providing empirical support for the hypothesis that surface geometry encodes functional information complementary to what is captured by sequence and backbone encoders.
ProteinINR builds on a two-encoder architecture: a sequence encoder (typically a protein language model backbone) processes amino acid sequences, and a structure encoder processes 3D coordinates. The surface learning component uses a transformer encoder to produce a latent embedding of the protein instance, which is then decoded by an INR decoder that maps 3D query coordinates to surface occupancy or property values. The decoder introduces a locality inductive bias — attending preferentially to spatially proximate surface features — that substantially improves reconstruction of fine surface detail over generic INR decoders such as TransINR.
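The locality inductive bias can be pictured as distance-penalized attention: when decoding a query point, latent tokens anchored near that point in 3D space dominate the pooled feature. The squared-distance penalty below is a hypothetical rendering of that idea; the paper realizes it inside a transformer decoder, and the function and parameter names here are not from the paper.

```python
import numpy as np

def locality_biased_decode(query_xyz, tokens, token_xyz, content_logits,
                           scale=1.0):
    """Decode a feature at one 3D query point from latent tokens,
    down-weighting tokens whose anchor coordinates lie far from the
    query (a sketch of a locality inductive bias, not ProteinINR's
    exact decoder)."""
    d2 = np.sum((token_xyz - query_xyz) ** 2, axis=-1)  # (T,) distances
    logits = content_logits - scale * d2                # nearby tokens win
    w = np.exp(logits - logits.max())
    w /= w.sum()                                        # softmax weights
    return w @ tokens                                   # pooled (D,) feature

# Two tokens: one anchored at the query point, one 5 units away.
token_xyz = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
out = locality_biased_decode(np.zeros(3), tokens, token_xyz,
                             content_logits=np.zeros(2), scale=1.0)
# out is dominated by the nearby token's feature [1, 0].
```

A generic INR decoder such as TransINR attends to all latent tokens by content alone; the locality term is what lets fine local surface detail survive the pooling.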
Pre-training proceeds in three stages. First, the sequence encoder is pre-trained on protein sequence data. Second, the structure encoder is pre-trained on molecular surfaces using the ProteinINR objective, which trains the model to reconstruct protein surfaces from the learned structural embeddings. Third, the two encoders are jointly fine-tuned with multi-view contrastive learning across the sequence and structure modalities. Downstream evaluation covers standard protein representation benchmarks: fold classification, enzyme reaction classification, gene ontology term prediction, and protein function annotation. Presented as a poster at ICLR 2024, the paper benchmarks against GearNet and other leading structure-based pre-training methods of the era.
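The third stage's multi-view contrastive alignment can be sketched with a symmetric InfoNCE loss: in a batch of paired sequence/structure embeddings, each matched pair is a positive and all other pairings are negatives. This is a generic formulation of multi-view contrastive learning, assumed for illustration; the paper's exact loss and temperature may differ.

```python
import numpy as np

def info_nce(seq_emb, struct_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired sequence/structure
    embeddings (rows of the two arrays are matched pairs). A generic
    multi-view contrastive sketch, not ProteinINR's exact objective."""
    seq = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    st = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
    logits = (seq @ st.T) / temperature  # (B, B) cosine similarities
    n = logits.shape[0]

    def xent(lg):
        # Cross-entropy with the matched pair (diagonal) as the target.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(logp[np.arange(n), np.arange(n)])

    # Average both retrieval directions: sequence->structure and back.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
a = rng.normal(size=(16, 32))
# Aligned views (small perturbation) should score a lower loss than
# unrelated embeddings.
aligned = info_nce(a, a + 0.01 * rng.normal(size=(16, 32)))
mismatched = info_nce(a, rng.normal(size=(16, 32)))
```

Minimizing this loss pulls the sequence and structure embeddings of the same protein together while pushing apart embeddings of different proteins, which is what ties the two pre-trained encoders into one shared representation space.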
ProteinINR is particularly relevant to tasks where protein surface geometry determines biological outcome: protein-protein interaction prediction, binding site identification, drug-protein docking, and functional annotation of structurally characterized proteins. Researchers working on computational drug discovery can leverage ProteinINR representations to better model surface complementarity between targets and potential ligands. The surface-informed embeddings also benefit enzyme engineering workflows where the active site geometry — a surface property — dictates catalytic specificity and efficiency.
ProteinINR contributes to a broader movement in computational biology toward multimodal protein representations that capture information at multiple scales of biological description. By demonstrating that molecular surface geometry can be encoded in a generalizable, resolution-independent way and incorporated into pre-training, the work establishes a new modality for protein representation learning alongside sequences and backbone structures. The surface-as-modality framing has subsequently influenced related multimodal models. A practical limitation is the absence of a public code release at the time of publication, which constrains independent replication and downstream adoption. The work also relies on pre-computed surface representations, which adds an additional preprocessing step compared to sequence- or structure-only baselines.