FoldVision

Structure-based protein encoder that voxelizes every heavy atom into a 3D grid, learning orientation-robust representations for protein function.

Released: January 2026

Parameters: 123 Million

FoldVision is a structure-based protein encoder developed at Heinrich Heine University Düsseldorf and released as a preprint in January 2026. It offers an alternative to the dominant paradigm in protein representation learning, where protein language models (PLMs) such as ESM-2 learn from amino-acid sequences and implicitly capture structural signal. FoldVision instead learns directly from 3D structure by voxelizing every heavy atom of a protein into a volumetric grid and processing it with a three-dimensional convolutional neural network, producing representations designed to be robust to the protein's arbitrary orientation in space.

The motivation is that function is ultimately determined by three-dimensional shape and the spatial arrangement of atoms, particularly at binding sites and active sites. By representing proteins explicitly in 3D rather than as sequences, FoldVision aims to encode geometric information that sequence-based encoders must infer indirectly. The model is pretrained on more than 500,000 AlphaFold-2 predicted structures, giving it broad coverage of protein fold space, and contains roughly 123 million parameters.

FoldVision is positioned alongside structure-aware methods such as Foldseek, EquiFold, and the structure modules of ESMFold, but it is an encoder for downstream function prediction rather than a structure predictor or generator. The authors report that FoldVision matches or outperforms PLM-based encoders on several function-prediction benchmarks and that ensembling it with PLMs yields further gains, which suggests the structural and sequence views are complementary.

Key Features

All-heavy-atom voxelization: FoldVision represents proteins by voxelizing every heavy atom into a 3D grid, capturing fine-grained spatial detail rather than abstracting structure into residue-level features.
Rotation-robust 3D CNN: The convolutional architecture is designed to produce representations that are robust to the protein's arbitrary orientation, so predictions do not depend on how the structure happens to be aligned.
Pretrained on AlphaFold-2 structures: Training on more than 500,000 AlphaFold-2 predicted structures gives the encoder broad coverage of fold space without requiring experimentally solved structures.
Competitive with protein language models: FoldVision matches or exceeds PLM-based encoders on enzyme-substrate, transporter-substrate, drug-kinase, and drug-target benchmarks.
Complementary in ensembles: Combining FoldVision with PLM embeddings produces further performance gains, indicating the structural representation adds information not captured by sequence alone.
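
The orientation robustness named in the feature list can be illustrated with a generic technique: applying a random rigid rotation to the atom coordinates before voxelization, so that a network sees each structure in many poses. This is a minimal sketch of that idea only. Whether FoldVision obtains its robustness through augmentation, architecture, or both is not stated here, and the function names `random_rotation` and `rotate_about_centroid` are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random proper rotation matrix (orthogonal, determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # remove the sign ambiguity of the QR factorization
    if np.linalg.det(q) < 0:      # a reflection: flip one axis to make it a rotation
        q[:, 0] = -q[:, 0]
    return q

def rotate_about_centroid(coords, rotation):
    """Rotate an (N, 3) coordinate array about its own centroid."""
    coords = np.asarray(coords, dtype=np.float64)
    centroid = coords.mean(axis=0)
    return (coords - centroid) @ rotation.T + centroid

# Hypothetical usage: three atoms, one random pose.
rng = np.random.default_rng(0)
atoms = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.0]])
rotated = rotate_about_centroid(atoms, random_rotation(rng))
```

Because the rotation is rigid, interatomic distances and the centroid are unchanged; only the pose relative to the voxel grid differs.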

Technical Details

FoldVision is a 123-million-parameter 3D convolutional neural network that consumes a voxelized representation of all heavy atoms in a protein structure. The voxel grid and convolutional design are chosen to yield rotation-robust embeddings, addressing the orientation-dependence that complicates learning from raw 3D coordinates. The model is pretrained on a corpus of over 500,000 AlphaFold-2 predicted structures. On downstream evaluation, the preprint reports that FoldVision matches or outperforms protein language model encoders across enzyme-substrate, transporter-substrate, drug-kinase, and drug-target interaction benchmarks, and that ensembling FoldVision with PLMs improves results further. Detailed benchmark numbers and ablations are in the preprint and should be checked against the peer-reviewed version.
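
To make the voxelized input concrete, the following is a minimal sketch of how heavy atoms might be binned into a multi-channel 3D grid. All specifics are assumptions for illustration, not details from the preprint: the element-to-channel mapping `ELEMENT_CHANNELS`, the 32-voxel grid, the 1 Å resolution, and the simple count-per-voxel occupancy are hypothetical choices.

```python
import numpy as np

# Hypothetical channel layout: one channel per common heavy-atom element.
ELEMENT_CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}

def voxelize(coords, elements, grid_size=32, resolution=1.0):
    """Bin atoms into a (channels, D, D, D) count grid centred on the centroid.

    coords: (N, 3) positions in angstroms; elements: N element symbols.
    Atoms whose element is not in ELEMENT_CHANNELS (e.g. hydrogens) or that
    fall outside the grid are skipped.
    """
    grid = np.zeros((len(ELEMENT_CHANNELS),) + (grid_size,) * 3, dtype=np.float32)
    coords = np.asarray(coords, dtype=np.float64)
    if coords.size == 0:
        return grid
    centred = coords - coords.mean(axis=0)
    # Convert to integer voxel indices, with the centroid in the middle of the grid.
    idx = np.floor(centred / resolution).astype(int) + grid_size // 2
    for (i, j, k), element in zip(idx, elements):
        channel = ELEMENT_CHANNELS.get(element)
        if channel is None:
            continue
        if 0 <= i < grid_size and 0 <= j < grid_size and 0 <= k < grid_size:
            grid[channel, i, j, k] += 1.0
    return grid

# Hypothetical usage: a carbon and a nitrogen 2 angstroms apart.
grid = voxelize([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], ["C", "N"])
```

A grid of this shape is the kind of tensor a 3D convolutional network consumes. A fixed grid also means large proteins must be cropped, tiled, or coarsened, a trade-off this sketch handles only by dropping out-of-range atoms.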

Applications

FoldVision is intended for researchers building predictive models of protein function and interactions, especially tasks where geometry matters: enzyme-substrate specificity, transporter-substrate matching, and drug-target or drug-kinase interaction prediction. As a drop-in structural encoder, it can be used on its own or combined with sequence-based PLM embeddings to improve accuracy. This makes it relevant to enzyme engineering, functional annotation, and computational drug discovery pipelines, particularly when AlphaFold-predicted structures are available but experimental structures are not.
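
Combining a structural encoder with a PLM can be done in more than one way, and the page does not say which scheme the authors used. The sketch below shows two generic options under that caveat: concatenating normalized embeddings before a downstream predictor, and averaging the predicted probabilities of two separately trained predictors. The function names and the `weight` parameter are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so neither embedding dominates by magnitude."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def concat_embeddings(struct_emb, plm_emb):
    """Feature-level fusion: join (N, d1) and (N, d2) embeddings into (N, d1 + d2)."""
    return np.concatenate([l2_normalize(struct_emb), l2_normalize(plm_emb)], axis=-1)

def average_probabilities(struct_probs, plm_probs, weight=0.5):
    """Prediction-level fusion: weighted mean of two models' predicted probabilities."""
    struct_probs = np.asarray(struct_probs, dtype=np.float64)
    plm_probs = np.asarray(plm_probs, dtype=np.float64)
    return weight * struct_probs + (1.0 - weight) * plm_probs

# Hypothetical usage with toy values.
fused = concat_embeddings([[3.0, 4.0]], [[0.0, 2.0, 0.0]])
blended = average_probabilities([0.2, 0.8], [0.6, 0.4])
```

Feature-level fusion lets a single downstream model learn interactions between the two views, while prediction-level fusion keeps the two pipelines independent and is simpler to add to an existing PLM-based workflow.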

Impact

FoldVision contributes evidence that explicit 3D structural encoders can be competitive with, and complementary to, the sequence-based protein language models that currently dominate the field, and it may renew interest in voxel-based convolutional approaches to protein representation learning. Its lasting impact will depend on independent benchmarking and adoption. The main limitation is availability: at the time of writing, no public code or model weights have been released. The preprint itself is distributed under a permissive CC BY license, which would facilitate reuse once artifacts become available.

Citation

FoldVision: A compute-efficient atom-level 3D protein encoder

Kroll, A., et al. (2026) FoldVision: A compute-efficient atom-level 3D protein encoder. bioRxiv.

DOI: 10.64898/2026.01.23.701326

Citations

Total Citations: 0

Influential: 0

References: 42

Openness

bio.rodeo openness: Closed · low usability and reproducibility

20 (Closed)

Usability (can I run it?): 15

Reproducibility (can I retrain it?): 10

Model Openness Framework

Unclassified

Missing required components

Resources

Research Paper
