bio.rodeo

© 2026 Pulsatance. All rights reserved.

FoldVision

Heinrich Heine University Düsseldorf

A 3D CNN that voxelizes every heavy atom for rotation-robust protein structural representations, matching or beating protein language model encoders on function benchmarks.

Released: January 2026
Parameters: 123 million

FoldVision is a structure-based protein encoder developed at Heinrich Heine University Düsseldorf and released as a preprint in January 2026. It offers an alternative to the dominant paradigm in protein representation learning, where protein language models (PLMs) such as ESM-2 learn from amino-acid sequences and implicitly capture structural signal. FoldVision instead learns directly from 3D structure by voxelizing every heavy atom of a protein into a volumetric grid and processing it with a three-dimensional convolutional neural network, producing representations designed to be robust to the protein's arbitrary orientation in space.

The motivation is that function is ultimately determined by three-dimensional shape and the spatial arrangement of atoms, particularly at binding sites and active sites. By representing proteins explicitly in 3D rather than as sequences, FoldVision aims to encode geometric information that sequence-based encoders must infer indirectly. The model is pretrained on more than 500,000 AlphaFold-2 predicted structures, giving it broad coverage of protein fold space, and contains roughly 123 million parameters.

FoldVision is positioned alongside structure-aware methods such as Foldseek, EquiFold, and the structure modules of ESMFold, but it is an encoder for downstream function prediction rather than a structure predictor or generator. The authors report that FoldVision matches or outperforms PLM-based encoders on several function-prediction benchmarks and that ensembling it with PLMs yields further gains, which suggests the structural and sequence views are complementary.

#Key Features

  • All-heavy-atom voxelization: FoldVision represents proteins by voxelizing every heavy atom into a 3D grid, capturing fine-grained spatial detail rather than abstracting structure into residue-level features.
  • Rotation-robust 3D CNN: The convolutional architecture is designed to produce representations that are robust to the protein's arbitrary orientation, so predictions do not depend on how the structure happens to be aligned.
  • Pretrained on AlphaFold-2 structures: Training on more than 500,000 AlphaFold-2 predicted structures gives the encoder broad coverage of fold space without requiring experimentally solved structures.
  • Competitive with protein language models: FoldVision matches or exceeds PLM-based encoders on enzyme-substrate, transporter-substrate, drug-kinase, and drug-target benchmarks.
  • Complementary in ensembles: Combining FoldVision with PLM embeddings produces further performance gains, indicating the structural representation adds information not captured by sequence alone.
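As a rough illustration of the all-heavy-atom voxelization listed above, the sketch below bins atoms into a sparse occupancy grid centered on the structure's centroid. The preprint's grid size, resolution, and channel scheme are not stated here, so `voxel_size`, `grid_dim`, and the element-to-channel mapping are assumptions, and `voxelize_heavy_atoms` is a hypothetical helper, not FoldVision's code.

```python
from collections import defaultdict
from math import floor

# Assumed channel layout: one channel per common heavy element, plus "other".
CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}
OTHER_CHANNEL = 4


def voxelize_heavy_atoms(atoms, voxel_size=1.0, grid_dim=32):
    """Bin heavy atoms into a sparse occupancy grid centered on their centroid.

    atoms: iterable of (element, x, y, z) tuples, coordinates in angstroms.
    Returns {(channel, i, j, k): atom_count}. Hydrogens are skipped and
    atoms that fall outside the grid are dropped.
    """
    heavy = [(e.upper(), x, y, z) for e, x, y, z in atoms
             if e.upper() not in ("H", "D")]
    if not heavy:
        return {}

    n = len(heavy)
    centroid = tuple(sum(a[axis] for a in heavy) / n for axis in (1, 2, 3))
    half = grid_dim / 2  # shifts the centroid to the middle of the grid

    grid = defaultdict(int)
    for element, *xyz in heavy:
        idx = tuple(floor((c - c0) / voxel_size + half)
                    for c, c0 in zip(xyz, centroid))
        if all(0 <= v < grid_dim for v in idx):
            channel = CHANNELS.get(element, OTHER_CHANNEL)
            grid[(channel,) + idx] += 1
    return dict(grid)


# Usage: a carbon, a hydrogen (ignored), and a nitrogen along the x axis.
example = voxelize_heavy_atoms([("C", 0, 0, 0), ("H", 1, 0, 0), ("N", 2, 0, 0)])
```

A dense tensor for a 3D CNN could be filled from this dictionary; the sparse form is used here only to keep the example dependency-free.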

#Technical Details

FoldVision is a 123-million-parameter 3D convolutional neural network that consumes a voxelized representation of all heavy atoms in a protein structure. The voxel grid and convolutional design are chosen to yield rotation-robust embeddings, addressing the orientation-dependence that complicates learning from raw 3D coordinates. The model is pretrained on a corpus of over 500,000 AlphaFold-2 predicted structures. On downstream evaluation, the preprint reports that FoldVision matches or outperforms protein language model encoders across enzyme-substrate, transporter-substrate, drug-kinase, and drug-target interaction benchmarks, and that ensembling FoldVision with PLMs improves results further. Detailed benchmark numbers and ablations are in the preprint and should be checked against the peer-reviewed version.
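Because no code or weights are public, rotation robustness can only be probed from the outside. The sketch below shows one possible check: rotate a structure about its centroid with random rotations and compare the resulting embeddings to the embedding of the original pose by cosine similarity. Here `embed_fn` stands in for any encoder, `toy_invariant_embed` is a placeholder that is rotation-invariant by construction, and none of these names come from the preprint.

```python
import math
import random


def random_rotation(rng):
    """Return a 3x3 rotation matrix built from a random unit quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(v * v for v in q))
    w, x, y, z = (v / norm for v in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def centroid(coords):
    n = len(coords)
    return [sum(p[i] for p in coords) / n for i in range(3)]


def rotate(coords, matrix):
    """Rotate points about their centroid."""
    c = centroid(coords)
    out = []
    for p in coords:
        d = [p[i] - c[i] for i in range(3)]
        out.append(tuple(c[r] + sum(matrix[r][k] * d[k] for k in range(3))
                         for r in range(3)))
    return out


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def rotation_robustness(embed_fn, coords, n_rotations=8, seed=0):
    """Mean cosine similarity between original-pose and rotated-pose embeddings."""
    rng = random.Random(seed)
    reference = embed_fn(coords)
    sims = [cosine(reference, embed_fn(rotate(coords, random_rotation(rng))))
            for _ in range(n_rotations)]
    return sum(sims) / len(sims)


def toy_invariant_embed(coords):
    """Placeholder encoder: sorted atom-to-centroid distances."""
    c = centroid(coords)
    return sorted(math.sqrt(sum((p[i] - c[i]) ** 2 for i in range(3)))
                  for p in coords)


points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
score = rotation_robustness(toy_invariant_embed, points)
```

A score near 1 means the embedding barely changes under rotation; an encoder that reads raw coordinates without any invariance would generally score lower.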

#Applications

FoldVision is intended for researchers building predictive models of protein function and interactions, especially tasks where geometry matters: enzyme-substrate specificity, transporter-substrate matching, and drug-target or drug-kinase interaction prediction. As a drop-in structural encoder, it can be used on its own or combined with sequence-based PLM embeddings to improve accuracy. This makes it relevant to enzyme engineering, functional annotation, and computational drug discovery pipelines, particularly when AlphaFold-predicted structures are available but experimental structures are not.
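This page does not say how structural and PLM representations are combined, so the sketch below shows two common options rather than the authors' method: late fusion, which averages the scores of two separately trained predictors, and early fusion, which concatenates L2-normalized embeddings before a downstream model. All function names and the default weight are assumptions.

```python
import math


def ensemble_scores(structure_scores, plm_scores, structure_weight=0.5):
    """Late fusion: weighted average of per-example scores from two predictors."""
    if len(structure_scores) != len(plm_scores):
        raise ValueError("score lists must have the same length")
    if not 0.0 <= structure_weight <= 1.0:
        raise ValueError("structure_weight must be in [0, 1]")
    w = structure_weight
    return [w * s + (1.0 - w) * p
            for s, p in zip(structure_scores, plm_scores)]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0 else [v / norm for v in vec]


def concat_embeddings(structure_emb, plm_emb):
    """Early fusion: normalize each view so neither dominates, then concatenate."""
    return l2_normalize(structure_emb) + l2_normalize(plm_emb)


# Usage with made-up numbers: two interaction scores per model,
# and a 2-dimensional embedding from each view.
fused_scores = ensemble_scores([0.5, 1.0], [0.5, 0.0])
fused_embedding = concat_embeddings([3.0, 4.0], [0.0, 2.0])
```

Normalizing before concatenation is a design choice that keeps embeddings with different scales from swamping each other; in practice the fusion weight would be tuned on validation data.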

#Impact

FoldVision adds evidence that explicit 3D structural encoders can be competitive with, and complementary to, the sequence-based protein language models that currently dominate the field, and it may revive interest in voxel-based convolutional approaches to protein representation learning. Its lasting impact will depend on independent benchmarking and adoption. The main limitation is availability: at the time of writing, no public code or model weights have been released. The preprint itself is distributed under a permissive CC BY license, which would facilitate reuse once artifacts become available.

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Score: 20 (Closed)
Usability (can I run it?): 15
Reproducibility (can I retrain it?): 10
Model Openness Framework: Unclassified (missing required components)

Tags

representation_learning, drug_discovery, protein_function_prediction, cnn, self_supervised, transfer_learning, proteomics, enzymes

Resources

Research Paper