A family of CNN-based protein language models showing that convolutions match transformer performance on sequence pretraining while scaling linearly with sequence length.
CARP (Convolutional Autoencoding Representations of Proteins) is a family of CNN-based protein language models developed by Kevin Yang, Alex Lu, and Nicolo Fusi at Microsoft Research New England. The central finding is a challenge to a prevailing assumption in the field: that transformer architectures are inherently superior to convolutional neural networks for learning representations from protein sequences. CARP demonstrates that CNNs trained with the same masked language modeling objective as transformer-based models — on the same dataset — achieve competitive or better performance across a broad set of downstream biological prediction tasks.
The work appeared first as a bioRxiv preprint in May 2022 and was subsequently published in Cell Systems in March 2024. At the time of its release, the dominant protein language models were transformer-based (ESM-1b, ESM-1v, ProtBERT), and the transformer's self-attention mechanism was widely credited with enabling long-range residue dependency modeling. CARP questions whether this architectural choice is as decisive as commonly assumed, at least for sequence-level representation learning.
A practically significant advantage of the convolutional approach is its scaling behavior. Transformer self-attention scales quadratically with sequence length, which imposes hard limits on the sequence lengths that can be processed without architectural modifications or memory-intensive workarounds. CARP's ByteNet-based architecture scales linearly with sequence length and, crucially, does not rely on positional embeddings, allowing it to generalize to sequences of arbitrary length at inference time — tested up to 4,096 residues. This property is particularly relevant for proteins with very long sequences that exceed the limits of transformer-based models like ESM-1b.
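To make the asymptotics concrete, the sketch below compares rough per-layer FLOP counts for self-attention and a dilated convolution. The kernel width and hidden dimension are illustrative assumptions, not CARP's published hyperparameters; in practice the O(L²) attention matrix is also what exhausts memory first on long sequences.

```python
# Back-of-envelope per-layer cost, not a benchmark. Self-attention pays
# ~2*L^2*d for the attention map and the weighted sum over values; a
# width-k 1-D convolution pays ~L*k*d^2. k=5 and d=1280 are assumed
# illustrative values, not CARP's exact hyperparameters.
def attention_flops(L: int, d: int) -> int:
    return 2 * L**2 * d

def conv_flops(L: int, d: int, k: int = 5) -> int:
    return L * k * d**2

for L in (512, 1024, 4096, 16384):
    ratio = attention_flops(L, 1280) / conv_flops(L, 1280)
    # Ratio = 2L/(k*d): the quadratic term dominates once L exceeds ~k*d/2,
    # which is exactly the long-sequence regime where CNNs pull ahead.
    print(f"L={L:6d}  attention/conv FLOP ratio = {ratio:.2f}")
```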
CARP models use a ByteNet-style encoder: a stack of dilated convolutional blocks originally developed for neural machine translation and adapted here as a non-causal, bidirectional encoder for protein sequences. Exponentially increasing dilation rates let the receptive field grow exponentially with network depth, capturing long-range dependencies without the memory cost of full self-attention. The largest model, CARP-640M, contains approximately 640 million parameters, directly comparable to ESM-1b's 650 million. All models are pretrained on the March 2020 release of UniRef50 with a BERT-style masked language modeling (MLM) objective: a fraction of input residues is masked, and the model is trained to recover the original amino acid at each masked position from the remaining context.
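The following is a minimal sketch of these two ingredients, exponentially dilated non-causal convolutions and BERT-style masking, not the actual CARP implementation (which lives in the microsoft/protein-sequence-models repository); all sizes, the vocabulary, and the dilation schedule are assumed for illustration.

```python
import torch
import torch.nn as nn

VOCAB = 21    # 20 amino acids + <mask>; real tokenizers add more specials
MASK_ID = 20

class DilatedBlock(nn.Module):
    """Non-causal dilated 1-D conv block with a residual connection."""
    def __init__(self, d_model: int, dilation: int, kernel: int = 5):
        super().__init__()
        pad = (kernel - 1) // 2 * dilation   # "same" padding keeps length fixed
        self.conv = nn.Conv1d(d_model, d_model, kernel,
                              padding=pad, dilation=dilation)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, length, d_model)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(x + torch.relu(h))

class TinyCNNLM(nn.Module):
    def __init__(self, d_model: int = 128, n_layers: int = 8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)   # no positional embeddings
        self.blocks = nn.Sequential(*[
            DilatedBlock(d_model, dilation=2 ** (i % 4))  # 1, 2, 4, 8, 1, ...
            for i in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                # tokens: (batch, length)
        return self.lm_head(self.blocks(self.embed(tokens)))

# BERT-style MLM step: mask ~15% of residues, predict the originals.
tokens = torch.randint(0, 20, (2, 300))       # toy batch of sequences
mask = torch.rand(tokens.shape) < 0.15
corrupted = tokens.masked_fill(mask, MASK_ID)
logits = TinyCNNLM()(corrupted)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```

Because nothing in the model depends on absolute position, the same weights apply unchanged to sequences of any length at inference time.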
Evaluated against ESM models on downstream benchmarks spanning fluorescence, stability, thermostability, and fitness prediction, CARP-640M is competitive with ESM-1b across the suite and comparable to ESM-1v on zero-shot mutation scoring. For structure-related tasks such as secondary structure and contact prediction, linear probes on CARP representations achieve results in the same range as equivalently sized ESM models, underscoring that long-range structural information is accessible from convolutional representations. Inference with CARP-640M is fast and memory-efficient relative to transformer baselines of similar parameter count, particularly on long sequences, where the quadratic cost of attention dominates.
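As a rough illustration of what a linear probe means here, the sketch below fits a per-residue 3-state secondary structure classifier on frozen embeddings. The `embed(seq) -> (len(seq), d)` function is a hypothetical wrapper around a pretrained checkpoint, not the repository's actual API, and this is not the paper's evaluation code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ss_probe(seqs, labels, embed):
    """Linear probe: one logistic-regression layer on frozen embeddings.

    seqs   -- list of protein sequences (strings)
    labels -- list of per-residue arrays with values in {0, 1, 2} (H/E/C)
    embed  -- hypothetical frozen encoder: seq -> (len(seq), d) NumPy array
    """
    X = np.concatenate([embed(s) for s in seqs])   # (n_residues, d)
    y = np.concatenate(labels)                     # (n_residues,)
    probe = LogisticRegression(max_iter=2000)
    probe.fit(X, y)
    return probe

def per_residue_accuracy(probe, seqs, labels, embed):
    X = np.concatenate([embed(s) for s in seqs])
    return probe.score(X, np.concatenate(labels))
```

The point of probing rather than fine-tuning is that accuracy then reflects what the frozen pretrained representation already encodes.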
CARP is used in protein engineering workflows where efficient sequence embedding is needed, particularly for scanning large mutant libraries or working with unusually long proteins. The zero-shot mutation effect prediction capability makes it directly applicable to variant effect scoring for enzymes, antibodies, and other engineered proteins without requiring labeled training data for each target. The availability of multiple model sizes (600k through 640M parameters) allows practitioners to balance speed and accuracy: smaller CARP models are practical for rapid screening of millions of variants, while CARP-640M is suitable for applications where maximum prediction accuracy matters. Researchers working with long protein sequences — such as full-length filamentous proteins, multi-domain receptors, or non-standard organisms with unusually long coding sequences — benefit from CARP's lack of sequence length constraints.
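For variant scoring, a common recipe is the masked-marginal log-odds rule popularized with ESM-1v; the CARP paper scores mutations from MLM outputs in a similar spirit. The sketch below assumes hypothetical `model` and `tokenize` wrappers around a pretrained checkpoint, not the repository's exact API.

```python
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"

@torch.no_grad()
def score_mutation(model, tokenize, wt_seq, pos, mut_aa, mask_id):
    """Masked-marginal score: log p(mut | context) - log p(wt | context).

    model    -- hypothetical MLM wrapper: (1, L) tokens -> (1, L, vocab) logits
    tokenize -- hypothetical tokenizer: string -> (1, L) integer tensor
    pos      -- 0-based index of the mutated residue in wt_seq
    """
    tokens = tokenize(wt_seq).clone()
    wt_id = tokenize(wt_seq[pos])[0, 0]      # token id of the wild-type residue
    mut_id = tokenize(mut_aa)[0, 0]          # token id of the substituted residue
    tokens[0, pos] = mask_id                 # mask the site being scored
    log_probs = model(tokens).log_softmax(dim=-1)
    return (log_probs[0, pos, mut_id] - log_probs[0, pos, wt_id]).item()
```

A positive score means the model assigns the substitution higher likelihood than the wild-type residue in that context; no labeled data for the target protein is required.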
CARP made an important methodological contribution by demonstrating that the dominance of transformers in protein language modeling is not absolute. By holding the pretraining objective and dataset fixed while varying only the architecture, Yang et al. provided clean evidence that CNN-based representations can match transformer-based ones on standard benchmarks. This has influenced thinking about the principles underlying protein language model performance and motivated follow-on work examining what aspects of architecture matter most for biological sequence modeling. A notable limitation is that CARP, like all single-sequence protein language models, does not incorporate evolutionary information from MSAs during inference, which constrains its ability to match MSA-augmented models like ESM-MSA-1b on contact prediction tasks. Additionally, while CARP generalizes to long sequences, it was evaluated primarily on single-chain proteins and its behavior on complex multi-domain or intrinsically disordered sequences remains less thoroughly benchmarked than in transformer-based successors.