Benchmark suite of five biologically relevant tasks for evaluating protein sequence representation learning, covering structure, homology, and engineering.
Tasks Assessing Protein Embeddings (TAPE) is a benchmark suite introduced at NeurIPS 2019 by Roshan Rao, Nicholas Bhattacharya, Neil Thomas, and colleagues from UC Berkeley. It was developed to address a fundamental gap in computational protein biology: despite rapid growth in semi-supervised learning approaches for protein sequences, the field lacked a standardized framework for fairly comparing methods across meaningful biological tasks. TAPE filled this gap by defining five curated downstream tasks, fixed train/validation/test splits, a shared pretraining corpus, and reference model implementations.
The benchmark spans three broad areas of protein biology — structure prediction, evolutionary analysis, and protein engineering — chosen to reflect real scientific questions rather than arbitrary held-out sets. Each task is designed so that improvements on it correspond to genuine biological capability, not overfitting to a particular dataset's quirks. This design philosophy made TAPE an influential reference point for the wave of protein language models that followed, from ESM to ProtTrans.
TAPE also provided pretrained weights for five models (LSTM, Transformer, ResNet, UniRep, and a Bepler embedding model) alongside the evaluation code and datasets, making it possible for researchers to benchmark new architectures quickly without rebuilding infrastructure from scratch.
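For example, the released tape-proteins package exposes the pretrained models through a small Python API. The snippet below follows the usage pattern shown in the TAPE repository's README; the model name 'bert-base' and the IUPAC vocabulary are as documented there, though details may vary across package versions.

```python
# Embedding a protein sequence with a TAPE pretrained transformer,
# following the pattern in the tape-proteins README (pip install tape_proteins).
import torch
from tape import ProteinBertModel, TAPETokenizer

model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')  # amino-acid vocabulary used by the models

sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'
token_ids = torch.tensor([tokenizer.encode(sequence)])
sequence_output, pooled_output = model(token_ids)
# sequence_output: per-residue embeddings, shape (1, length + 2, hidden_dim)
# pooled_output: a single whole-sequence embedding
```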
TAPE is not itself a single neural network but an evaluation framework encompassing multiple model families. The five reference architectures range from recurrent networks (LSTM, UniRep) to a transformer and a residual convolutional network, all pretrained on Pfam sequences with a masked or autoregressive language modeling objective. Parameter counts vary by architecture; the transformer is a 12-layer BERT-style encoder, somewhat smaller than BERT-Base.
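To make the pretraining objective concrete, here is a minimal sketch of masked language modeling over amino-acid tokens. The vocabulary, special-token ids, and 15% masking rate are illustrative assumptions (and the BERT-style random/unchanged token replacements are omitted), not TAPE's exact settings.

```python
# Simplified masked language modeling loss for protein sequences.
# Vocabulary and special tokens below are illustrative assumptions.
import torch
import torch.nn.functional as F

PAD, MASK = 20, 21      # hypothetical special-token ids
VOCAB_SIZE = 22          # 20 amino acids + PAD + MASK

def masked_lm_loss(model, tokens, mask_rate=0.15):
    """Mask a random subset of residues and score the model's reconstruction."""
    labels = tokens.clone()
    mask = (torch.rand(tokens.shape) < mask_rate) & (tokens != PAD)
    corrupted = tokens.masked_fill(mask, MASK)
    logits = model(corrupted)             # (batch, length, VOCAB_SIZE)
    labels[~mask] = -100                  # score only the masked positions
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
```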
The five tasks use distinct metrics suited to each problem type: per-residue accuracy for secondary structure (evaluated on CB513), precision at L/5 for long-range contact prediction (ProteinNet data), fold classification accuracy for remote homology (1,195 SCOP fold classes), and Spearman rank correlation for the two protein engineering regression tasks (GFP fluorescence from Sarkisyan et al. and protein stability from Rocklin et al.). The original benchmarking results showed that pretraining improved performance across nearly all model-task combinations, in some cases more than doubling scores, but that alignment-based features still outperformed pretrained neural representations on contact prediction and remote homology detection, underscoring that the co-evolution signals captured by multiple sequence alignments were not yet fully replicated by single-sequence language models.
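The two less standard metrics are easy to state in code. The sketch below gives illustrative implementations of precision at L/5 and the Spearman correlation used for the regression tasks; the sequence-separation cutoff defining "long range" contacts is an assumption following common convention, not necessarily TAPE's exact evaluation code.

```python
# Illustrative implementations of two TAPE metrics.
import numpy as np
from scipy.stats import spearmanr

def precision_at_l5(contact_probs, true_contacts, min_separation=24):
    """Precision of the L/5 highest-scoring long-range residue pairs.

    contact_probs, true_contacts: (L, L) arrays of predicted probabilities
    and binary ground-truth contacts.
    """
    L = contact_probs.shape[0]
    i, j = np.triu_indices(L, k=min_separation)    # long-range pairs only
    order = np.argsort(contact_probs[i, j])[::-1]  # most confident first
    top = order[: max(L // 5, 1)]
    return true_contacts[i[top], j[top]].mean()

def engineering_metric(predicted, measured):
    """Spearman rank correlation between predicted and measured effects."""
    return spearmanr(predicted, measured).correlation
```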
TAPE is primarily a tool for researchers developing or evaluating protein sequence models. A new protein language model can be plugged into the TAPE evaluation harness and benchmarked against published baselines without designing new experiments. This is directly useful during model development, ablation studies, and peer review. Beyond model comparison, the individual TAPE tasks have practical relevance: secondary structure and contact prediction inform structural biology workflows, remote homology detection underpins functional annotation of novel sequences, and the protein engineering tasks evaluate whether models can guide directed evolution campaigns by accurately ranking mutation effects.
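In practice, benchmarking a new encoder typically means freezing or fine-tuning it and training a small task-specific head on each task's fixed split. The sketch below shows one way to do this for a regression task such as fluorescence prediction; the encoder interface and head design are assumptions for illustration, not TAPE's exact downstream heads.

```python
# A hedged sketch of evaluating a new encoder on a TAPE-style regression task.
# The encoder is assumed to map token ids to (batch, length, embed_dim) embeddings.
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Mean-pool residue embeddings and regress a scalar target."""
    def __init__(self, embed_dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, residue_embeddings: torch.Tensor) -> torch.Tensor:
        return self.mlp(residue_embeddings.mean(dim=1)).squeeze(-1)

def train_step(encoder, head, optimizer, tokens, targets):
    """One supervised step on a downstream task, with the encoder frozen."""
    with torch.no_grad():                  # encoder weights stay fixed
        embeddings = encoder(tokens)
    loss = nn.functional.mse_loss(head(embeddings), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```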
TAPE established a common language for evaluating protein representation learning at a formative period in the field. Published in 2019, it preceded and directly influenced the design of successor benchmarks and the evaluation strategies used by prominent models including ESM, ProtTrans, and ProteinBERT. The benchmark's central finding — that self-supervised pretraining helps but alignment-based methods remain competitive — motivated subsequent work to close that gap, eventually contributing to the development of models capable of matching or exceeding MSA-based baselines without explicit evolutionary inputs. TAPE's codebase and datasets remain in active use, and the benchmark has accumulated hundreds of citations. Its main limitation as a benchmark is that it predates many of the most capable modern models, and top-performing systems now saturate several of its tasks, reducing its power to discriminate among state-of-the-art methods.