Technical University of Munich
Optimized protein language model that surpasses state-of-the-art performance with fewer than 10% of the parameters of comparable models.
Ankh is an optimized protein language model (PLM) developed by Ahmed Elnaggar, Burkhard Rost, and colleagues at the Technical University of Munich. First released as a preprint in January 2023 and published in the Computational and Structural Biotechnology Journal in 2024, Ankh was designed around a central challenge in the PLM field: the assumption that performance requires ever-larger models. By systematically investigating masking strategies, architecture design, and training-data composition across more than twenty experimental configurations, the team produced a model that achieves state-of-the-art results across a wide range of protein prediction tasks while using fewer than 10% of the trainable parameters of comparable models such as ESM-2 (15B).
The name Ankh references the Egyptian hieroglyph symbolizing life, reflecting the model's goal of unlocking general-purpose biological modelling. The Ankh Large variant (approximately 1.15 billion parameters) improves average performance across the evaluated benchmark tasks by 4.8% over prior state-of-the-art models, while the smaller Ankh Base variant (approximately 450 million parameters) yields a 3.4% improvement with only about 3% of the trainable parameters of leading alternatives. This efficiency translates directly into accessibility: Ankh Large runs on a single A100 40 GB GPU, compared with the four A100 80 GB GPUs required for ESM-2 inference at equivalent sequence lengths.
A subsequent Ankh2 series (including Ankh2 Large, approximately 2 billion parameters) extended the original work with additional training epochs and architectural refinements, maintaining the same philosophy of protein-specific optimization over brute-force scaling. All model variants are released on HuggingFace under the ElnaggarLab organization.
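As a concrete illustration of that release format, the following sketch loads an Ankh checkpoint with the HuggingFace transformers library and extracts per-residue embeddings. The checkpoint id, the residue-by-residue tokenization, and the mean pooling are assumptions based on common usage of T5-style PLMs, not a prescribed recipe.

```python
# Minimal embedding-extraction sketch. Ankh is a T5-style encoder-decoder,
# so the encoder can be loaded standalone for feature extraction.
import torch
from transformers import AutoTokenizer, T5EncoderModel

model_id = "ElnaggarLab/ankh-base"  # assumed id; Large: "ElnaggarLab/ankh-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5EncoderModel.from_pretrained(model_id).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# Pass the protein as a list of single residues so each amino acid
# maps to one token.
inputs = tokenizer([list(sequence)], is_split_into_words=True,
                   add_special_tokens=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings: (batch, residues + special tokens, d_model).
residue_embeddings = outputs.last_hidden_state
# Mean-pool over residues for a fixed-size per-protein representation.
protein_embedding = residue_embeddings.mean(dim=1)
```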
Ankh Large has an embedding dimension of 1536, 48 encoder layers, 24 decoder layers, 16 attention heads, and a feed-forward dimension of 3840. Ankh Base uses a smaller configuration with an embedding dimension of 768, 12 attention heads, and a feed-forward dimension of 3072. Both variants use Gated-GELU activations and relative positional embeddings. Pre-training used UniRef50 (approximately 45.6 million sequences), a choice justified empirically: lower-redundancy databases produced better downstream representations than UniRef90 or UniRef100, and BFD-augmented training did not consistently improve results. Training was conducted on Google TPU-v4 pods.
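To make these hyperparameters concrete, the snippet below builds a transformers T5Config mirroring the stated Ankh Large dimensions. The released checkpoints ship their own configuration files, so this is an illustrative sketch rather than the official configuration.

```python
# Hypothetical T5Config approximating the Ankh Large hyperparameters
# listed above (not the checkpoint's shipped config).
from transformers import T5Config

ankh_large_like = T5Config(
    d_model=1536,                    # embedding dimension
    d_ff=3840,                       # feed-forward dimension
    num_layers=48,                   # encoder layers
    num_decoder_layers=24,           # decoder layers
    num_heads=16,                    # attention heads
    feed_forward_proj="gated-gelu",  # Gated-GELU activation
)
# T5-style models encode position through learned relative attention
# buckets, matching Ankh's use of relative positional embeddings.
print(ankh_large_like)
```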
On CASP12 secondary structure prediction (Q3 accuracy), Ankh Large achieves 83.6%, and the later Ankh3 XL variant reaches 84.4%. On SCOPe remote homology fold classification (1,194-class accuracy), Ankh Large reaches 61.0%. The model also improves on the TAPE solubility and fluorescence benchmarks, and excels at embedding-based contact prediction relative to the attention-map-based approaches used in earlier models. The Ankh2 Large variant was trained from the Ankh Large checkpoint for 45 epochs using the Adafactor optimizer with linear warmup, and substitutes SiLU for GELU activations.
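The embedding-based contact prediction mentioned above can be sketched as follows: per-residue embeddings are combined into pairwise features and scored by a lightweight classifier. The feature construction (concatenation plus elementwise product) and the logistic-regression head are illustrative assumptions, not the paper's exact downstream architecture, and the random arrays stand in for real embeddings and contact labels.

```python
# Sketch of embedding-based contact prediction on top of PLM embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_features(emb):
    """emb: (L, d) per-residue embeddings -> (L*L, 3d) pair features."""
    L, _ = emb.shape
    a = np.repeat(emb, L, axis=0)  # residue i features for pair (i, j)
    b = np.tile(emb, (L, 1))       # residue j features for pair (i, j)
    return np.concatenate([a, b, a * b], axis=1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(64, 768))           # stand-in for Ankh Base output
labels = rng.integers(0, 2, size=64 * 64)  # stand-in contact labels

clf = LogisticRegression(max_iter=500)
clf.fit(pairwise_features(emb), labels)
contact_probs = clf.predict_proba(pairwise_features(emb))[:, 1].reshape(64, 64)
```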
Ankh is designed as a general-purpose protein representation model suitable for fine-tuning or fixed-feature extraction across a wide range of downstream tasks. Researchers use it for secondary and tertiary structure prediction, protein function annotation, fitness landscape modeling, and variant effect prediction. The efficient inference profile makes it practical for large-scale proteomics workflows where embeddings must be generated for hundreds of thousands of sequences. Parameter-efficient fine-tuning via LoRA has been demonstrated for solubility and fluorescence prediction tasks. Ankh is also relevant to protein engineering applications where learning evolutionary conservation and mutation trends is important for generating diverse yet functionally coherent sequence variants.
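A hedged sketch of the LoRA-style parameter-efficient fine-tuning mentioned above is shown below, using the peft library. Targeting the "q" and "v" projections follows common practice for T5-style models; the module names, hyperparameters, and regression head are illustrative assumptions, not the published fine-tuning recipe.

```python
# LoRA fine-tuning sketch: freeze the Ankh encoder and train small
# low-rank adapters plus a task head (e.g. fluorescence regression).
import torch.nn as nn
from transformers import T5EncoderModel
from peft import LoraConfig, get_peft_model

encoder = T5EncoderModel.from_pretrained("ElnaggarLab/ankh-base")
d_model = encoder.config.d_model

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q", "v"])  # assumed T5 module names
encoder = get_peft_model(encoder, lora_cfg)
encoder.print_trainable_parameters()  # only the LoRA adapters train

# Illustrative single-output regression head on pooled embeddings.
head = nn.Linear(d_model, 1)
```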
Ankh represents a significant methodological contribution to the argument that protein-specific inductive biases, rather than raw scale, can drive PLM performance. By demonstrating competitive or superior performance against models more than ten times its size, it has helped shift community attention toward efficient training strategies and knowledge-guided architecture choices. The model family is actively maintained, with Ankh2 and Ankh3 variants continuing to extend the original work. Its accessibility on commodity GPU hardware has made it a practical choice for research groups without access to large-scale compute, broadening participation in PLM-driven protein science. A notable limitation is that Ankh, like other sequence-only PLMs, does not incorporate structural information during pre-training; for tasks with known structural context, structure-aware models may therefore offer complementary strengths.