Heinrich Heine University Düsseldorf
A 3D CNN that voxelizes every heavy atom for rotation-robust protein structural representations, matching or beating protein language model encoders on function benchmarks.
FoldVision is a structure-based protein encoder developed at Heinrich Heine University Düsseldorf and released as a preprint in January 2026. It offers an alternative to the dominant paradigm in protein representation learning, where protein language models (PLMs) such as ESM-2 learn from amino-acid sequences and implicitly capture structural signal. FoldVision instead learns directly from 3D structure by voxelizing every heavy atom of a protein into a volumetric grid and processing it with a three-dimensional convolutional neural network, producing representations designed to be robust to the protein's arbitrary orientation in space.
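The voxelization step described above can be illustrated with a minimal sketch. The source does not state FoldVision's grid size, resolution, or channel layout, so the element channels, the 32-cell grid, the 1 Å resolution, and the centroid centering below are all assumptions chosen for illustration, and `voxelize` is a hypothetical helper rather than the authors' code.

```python
import math
from collections import Counter

# Hypothetical channel layout: one channel per common heavy element,
# plus a catch-all. The source does not specify FoldVision's channels.
CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}
OTHER = 4

def voxelize(atoms, grid_size=32, resolution=1.0):
    """Bin heavy atoms into a sparse occupancy grid.

    atoms: iterable of (element, x, y, z), coordinates in angstroms.
    Returns a Counter mapping (channel, i, j, k) -> number of atoms.
    """
    # Heavy atoms are all non-hydrogen atoms.
    heavy = [a for a in atoms if a[0] != "H"]
    if not heavy:
        return Counter()

    # Center the grid on the centroid of the heavy atoms (an assumption).
    n = len(heavy)
    cx = sum(a[1] for a in heavy) / n
    cy = sum(a[2] for a in heavy) / n
    cz = sum(a[3] for a in heavy) / n

    half = grid_size / 2
    grid = Counter()
    for element, x, y, z in heavy:
        i = math.floor((x - cx) / resolution + half)
        j = math.floor((y - cy) / resolution + half)
        k = math.floor((z - cz) / resolution + half)
        # Atoms falling outside the box are dropped in this sketch.
        if all(0 <= idx < grid_size for idx in (i, j, k)):
            grid[(CHANNELS.get(element, OTHER), i, j, k)] += 1
    return grid
```

A dense tensor of shape (channels, grid_size, grid_size, grid_size) for a 3D CNN could be filled from this sparse mapping; the sparse form simply keeps the example dependency-free.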
The motivation is that protein function is largely determined by three-dimensional shape and the spatial arrangement of atoms, particularly at binding sites and active sites. By representing proteins explicitly in 3D rather than as sequences, FoldVision aims to encode geometric information that sequence-based encoders must infer indirectly. The model contains roughly 123 million parameters and is pretrained on more than 500,000 AlphaFold-2 predicted structures, which gives it broad coverage of protein fold space.
FoldVision is positioned alongside structure-aware methods such as Foldseek, EquiFold, and the structure modules of ESMFold, but it is an encoder for downstream function prediction rather than a structure predictor or generator. The authors report that FoldVision matches or outperforms PLM-based encoders on several function-prediction benchmarks and that ensembling it with PLMs yields further gains, which suggests that the structural and sequence views are complementary.
FoldVision is a 123-million-parameter 3D convolutional neural network that consumes a voxelized representation of all heavy atoms in a protein structure. The voxel grid and convolutional design are chosen to yield rotation-robust embeddings, addressing the orientation-dependence that complicates learning from raw 3D coordinates. The model is pretrained on a corpus of over 500,000 AlphaFold-2 predicted structures. On downstream evaluation, the preprint reports that FoldVision matches or outperforms protein language model encoders across enzyme-substrate, transporter-substrate, drug-kinase, and drug-target interaction benchmarks, and that ensembling FoldVision with PLMs improves results further. Detailed benchmark numbers and ablations are in the preprint and should be checked against the peer-reviewed version.
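A voxel grid is not invariant to rotation: the same protein in a different orientation generally fills different cells, so robustness has to be learned or built into the model. The source does not say how FoldVision achieves this. One common approach, shown here purely as an assumption and not as the authors' method, is to apply a random rotation to the atomic coordinates before voxelization during training. The helper names below are hypothetical.

```python
import math
import random

def random_rotation_matrix(rng):
    """Return a 3x3 rotation matrix drawn uniformly from SO(3)."""
    # A normalized 4D Gaussian sample is a uniformly distributed unit quaternion.
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    # Standard unit-quaternion to rotation-matrix conversion.
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def rotate(coords, rot):
    """Apply rotation matrix `rot` to a list of (x, y, z) points."""
    return [
        tuple(sum(rot[r][c] * p[c] for c in range(3)) for r in range(3))
        for p in coords
    ]
```

Because a rotation preserves all interatomic distances, the augmented structure is physically identical to the original; only its placement in the grid changes. Coordinates would typically be centered before rotating so the protein stays inside the voxel box.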
FoldVision is intended for researchers building predictive models of protein function and interactions, especially for tasks where geometry matters: enzyme-substrate specificity, transporter-substrate matching, and drug-target or drug-kinase interaction prediction. As a drop-in structural encoder, it can be used on its own or combined with sequence-based PLM embeddings to improve accuracy. This makes it relevant to enzyme engineering, functional annotation, and computational drug discovery pipelines, particularly when AlphaFold-predicted structures are available but experimental structures are not.
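Two simple ways of combining a structural encoder with a PLM are sketched below: concatenating the two embeddings as input to a downstream model, and averaging the predictions of two separately trained models. The source does not describe how the preprint performs its ensembling, so both functions, their names, and the equal default weighting are assumptions for illustration only.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def fuse_embeddings(struct_emb, seq_emb):
    """Early fusion: normalize each view, then concatenate.

    Normalizing first keeps one encoder from dominating the other
    just because its embeddings have a larger scale.
    """
    return l2_normalize(struct_emb) + l2_normalize(seq_emb)

def ensemble_scores(struct_scores, seq_scores, weight=0.5):
    """Late fusion: weighted average of per-example prediction scores.

    weight is the share given to the structural model (hypothetical default).
    """
    if len(struct_scores) != len(seq_scores):
        raise ValueError("score lists must have the same length")
    return [
        weight * a + (1 - weight) * b
        for a, b in zip(struct_scores, seq_scores)
    ]
```

Early fusion lets a single downstream model learn interactions between the two views, while late fusion needs no retraining of either model; which works better for a given task is an empirical question.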
FoldVision contributes evidence that explicit 3D structural encoders can be competitive with, and complementary to, the sequence-based protein language models that currently dominate the field, and it may revive interest in voxel-based convolutional approaches to protein representation learning. Its lasting impact will hinge on independent benchmarking and adoption. The main limitation to flag is availability: at the time of writing, no public code or model weights have been released. The preprint itself is distributed under a permissive CC BY license, which would facilitate reuse once artifacts become available.