A family of CNN-based protein language models showing that convolutions match transformer performance on sequence pretraining while scaling linearly with sequence length.
CARP (Convolutional Autoencoding Representations of Proteins) is a family of CNN-based protein language models developed by Kevin Yang, Alex Lu, and Nicolo Fusi at Microsoft Research New England. The central finding is a challenge to a prevailing assumption in the field: that transformer architectures are inherently superior to convolutional neural networks for learning representations from protein sequences. CARP demonstrates that CNNs trained with the same masked language modeling objective as transformer-based models — on the same dataset — achieve competitive or better performance across a broad set of downstream biological prediction tasks.
The work appeared first as a bioRxiv preprint in May 2022 and was subsequently published in Cell Systems in March 2024. At the time of its release, the dominant protein language models were transformer-based (ESM-1b, ESM-1v, ProtBERT), and the transformer's self-attention mechanism was widely credited with enabling long-range residue dependency modeling. CARP questions whether this architectural choice is as decisive as commonly assumed, at least for sequence-level representation learning.
A practically significant advantage of the convolutional approach is its scaling behavior. Transformer self-attention scales quadratically with sequence length, which imposes hard limits on the sequence lengths that can be processed without architectural modifications or memory-intensive workarounds. CARP's ByteNet-based architecture scales linearly with sequence length and, crucially, does not rely on positional embeddings, allowing it to generalize to sequences of arbitrary length at inference time — tested up to 4,096 residues. This property is particularly relevant for proteins with very long sequences that exceed the limits of transformer-based models like ESM-1b.
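To make the asymptotics concrete, the sketch below compares rough per-layer FLOP counts for self-attention and a dilated convolution. The kernel width and hidden dimension are illustrative assumptions, not CARP's published hyperparameters; in practice the O(L²) attention matrix is also what exhausts memory first on long sequences.

```python
# Back-of-envelope per-layer cost, not a benchmark. Self-attention pays
# ~2*L^2*d for the attention map and the weighted sum over values; a
# width-k 1-D convolution pays ~L*k*d^2. k=5 and d=1280 are assumed
# illustrative values, not CARP's exact hyperparameters.
def attention_flops(L: int, d: int) -> int:
    return 2 * L**2 * d

def conv_flops(L: int, d: int, k: int = 5) -> int:
    return L * k * d**2

for L in (512, 1024, 4096, 16384):
    ratio = attention_flops(L, 1280) / conv_flops(L, 1280)
    # Ratio = 2L/(k*d): the quadratic term dominates once L exceeds ~k*d/2,
    # which is exactly the long-sequence regime where CNNs pull ahead.
    print(f"L={L:6d}  attention/conv FLOP ratio = {ratio:.2f}")
```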
CARP models use a ByteNet-style encoder: a stack of dilated convolutional blocks originally developed for neural machine translation and adapted here as a non-causal, bidirectional encoder for protein sequences. Exponentially increasing dilation rates let the receptive field grow exponentially with network depth, capturing long-range dependencies without the memory cost of full self-attention. The largest model, CARP-640M, contains approximately 640 million parameters, directly comparable to ESM-1b's 650 million. All models are pretrained on the March 2020 release of UniRef50 with a BERT-style masked language modeling (MLM) objective: a fraction of input residues is masked, and the model is trained to recover the original amino acid at each masked position from the remaining context.
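The following is a minimal sketch of these two ingredients, exponentially dilated non-causal convolutions and BERT-style masking, not the actual CARP implementation (which lives in the microsoft/protein-sequence-models repository); all sizes, the vocabulary, and the dilation schedule are assumed for illustration.

```python
import torch
import torch.nn as nn

VOCAB = 21    # 20 amino acids + <mask>; real tokenizers add more specials
MASK_ID = 20

class DilatedBlock(nn.Module):
    """Non-causal dilated 1-D conv block with a residual connection."""
    def __init__(self, d_model: int, dilation: int, kernel: int = 5):
        super().__init__()
        pad = (kernel - 1) // 2 * dilation   # "same" padding keeps length fixed
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=pad, dilation=dilation)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, length, d_model)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + torch.relu(h))

class TinyCNNLM(nn.Module):
    def __init__(self, d_model: int = 128, n_layers: int = 8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)   # no positional embeddings
        self.blocks = nn.Sequential(*[
            DilatedBlock(d_model, dilation=2 ** (i % 4))  # 1, 2, 4, 8, 1, ...
            for i in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                # tokens: (batch, length)
        return self.lm_head(self.blocks(self.embed(tokens)))

# BERT-style MLM step: mask ~15% of residues, predict the originals.
tokens = torch.randint(0, 20, (2, 300))       # toy batch of sequences
mask = torch.rand(tokens.shape) < 0.15
corrupted = tokens.masked_fill(mask, MASK_ID)
logits = TinyCNNLM()(corrupted)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```

Because nothing in the model depends on absolute position, the same weights apply unchanged to sequences of any length at inference time.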
Evaluated against ESM models on downstream benchmarks spanning fluorescence, stability, thermostability, and fitness prediction, CARP-640M is competitive with ESM-1b across the suite and comparable to ESM-1v on zero-shot mutation scoring. For structure-related tasks such as secondary structure and contact prediction, linear probes on CARP representations achieve results in the same range as equivalently sized ESM models, underscoring that long-range structural information is accessible from convolutional representations. Inference with CARP-640M is fast and memory-efficient relative to transformer baselines of similar parameter count, particularly on long sequences, where the quadratic cost of attention dominates.
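As a rough illustration of what a linear probe means here, the sketch below fits a per-residue 3-state secondary structure classifier on frozen embeddings. The `embed(seq) -> (len(seq), d)` function is a hypothetical wrapper around a pretrained checkpoint, not the repository's actual API, and this is not the paper's evaluation code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ss_probe(seqs, labels, embed):
    """Linear probe: one logistic-regression layer on frozen embeddings.

    seqs   -- list of protein sequences (strings)
    labels -- list of per-residue arrays with values in {0, 1, 2} (H/E/C)
    embed  -- hypothetical frozen encoder: seq -> (len(seq), d) NumPy array
    """
    X = np.concatenate([embed(s) for s in seqs])   # (n_residues, d)
    y = np.concatenate(labels)                     # (n_residues,)
    probe = LogisticRegression(max_iter=2000)
    probe.fit(X, y)
    return probe

def per_residue_accuracy(probe, seqs, labels, embed):
    X = np.concatenate([embed(s) for s in seqs])
    return probe.score(X, np.concatenate(labels))
```

The point of probing rather than fine-tuning is that accuracy then reflects what the frozen pretrained representation already encodes.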
CARP is used in protein engineering workflows where efficient sequence embedding is needed, particularly for scanning large mutant libraries or working with unusually long proteins. The zero-shot mutation effect prediction capability makes it directly applicable to variant effect scoring for enzymes, antibodies, and other engineered proteins without requiring labeled training data for each target. The availability of multiple model sizes (600k through 640M parameters) allows practitioners to balance speed and accuracy: smaller CARP models are practical for rapid screening of millions of variants, while CARP-640M is suitable for applications where maximum prediction accuracy matters. Researchers working with long protein sequences — such as full-length filamentous proteins, multi-domain receptors, or non-standard organisms with unusually long coding sequences — benefit from CARP's lack of sequence length constraints.
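For variant scoring, a common recipe is the masked-marginal log-odds rule popularized with ESM-1v; the CARP paper scores mutations from MLM outputs in a similar spirit. The sketch below assumes hypothetical `model` and `tokenize` wrappers around a pretrained checkpoint, not the repository's exact API.

```python
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"

@torch.no_grad()
def score_mutation(model, tokenize, wt_seq, pos, mut_aa, mask_id):
    """Masked-marginal score: log p(mut | context) - log p(wt | context).

    model    -- hypothetical MLM wrapper: (1, L) tokens -> (1, L, vocab) logits
    tokenize -- hypothetical tokenizer: string -> (1, L) integer tensor
    pos      -- 0-based index of the mutated residue in wt_seq
    """
    tokens = tokenize(wt_seq).clone()
    wt_id = tokenize(wt_seq[pos])[0, 0]      # token id of the wild-type residue
    mut_id = tokenize(mut_aa)[0, 0]          # token id of the substituted residue
    tokens[0, pos] = mask_id                 # mask the site being scored
    log_probs = model(tokens).log_softmax(dim=-1)
    return (log_probs[0, pos, mut_id] - log_probs[0, pos, wt_id]).item()
```

A positive score means the model assigns the substitution higher likelihood than the wild-type residue in that context; no labeled data for the target protein is required.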
CARP made an important methodological contribution by demonstrating that the dominance of transformers in protein language modeling is not absolute. By holding the pretraining objective and dataset fixed while varying only the architecture, Yang et al. provided clean evidence that CNN-based representations can match transformer-based ones on standard benchmarks. This has influenced thinking about the principles underlying protein language model performance and motivated follow-on work examining what aspects of architecture matter most for biological sequence modeling. A notable limitation is that CARP, like all single-sequence protein language models, does not incorporate evolutionary information from MSAs during inference, which constrains its ability to match MSA-augmented models like ESM-MSA-1b on contact prediction tasks. Additionally, while CARP generalizes to long sequences, it was evaluated primarily on single-chain proteins and its behavior on complex multi-domain or intrinsically disordered sequences remains less thoroughly benchmarked than in transformer-based successors.