Technical University of Munich
Optimized protein language model that surpasses state-of-the-art performance with fewer than 10% of the parameters of comparable models.
Ankh is an optimized protein language model (PLM) developed by Ahmed Elnaggar, Burkhard Rost, and colleagues at the Technical University of Munich. First released as a preprint in January 2023 and published in the Computational and Structural Biotechnology Journal in 2024, Ankh was designed around a central challenge in the PLM field: the assumption that performance requires ever-larger models. By systematically investigating masking strategies, architecture design, and training-data composition across more than twenty experimental configurations, the team produced a model that achieves state-of-the-art results across a wide range of protein prediction tasks while using fewer than 10% of the trainable parameters of comparable models such as ESM-2 (15B).
The name Ankh references the Egyptian hieroglyph symbolizing life, reflecting the model's goal of unlocking general-purpose biological modelling. The Ankh Large variant (approximately 1.15 billion parameters) improves average performance across the evaluated benchmark tasks by 4.8% over prior state-of-the-art models, while the smaller Ankh Base variant (approximately 450 million parameters) yields a 3.4% improvement with only about 3% of the trainable parameters of leading alternatives. This efficiency translates directly into accessibility: Ankh Large runs on a single A100 40 GB GPU, compared with the four A100 80 GB GPUs required for ESM-2 inference at equivalent sequence lengths.
A subsequent Ankh2 series (including Ankh2 Large, approximately 2 billion parameters) extended the original work with additional training epochs and architectural refinements, maintaining the same philosophy of protein-specific optimization over brute-force scaling. All model variants are released on HuggingFace under the ElnaggarLab organization.
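As a concrete illustration of that release format, the following sketch loads an Ankh checkpoint with the HuggingFace transformers library and extracts per-residue embeddings. The checkpoint id, the residue-by-residue tokenization, and the mean pooling are assumptions based on common usage of T5-style PLMs, not a prescribed recipe.

```python
# Minimal embedding-extraction sketch. Ankh is a T5-style encoder-decoder,
# so the encoder can be loaded standalone for feature extraction.
import torch
from transformers import AutoTokenizer, T5EncoderModel

model_id = "ElnaggarLab/ankh-base"  # assumed id; Large: "ElnaggarLab/ankh-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5EncoderModel.from_pretrained(model_id).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# Pass the protein as a list of single residues so each amino acid
# maps to one token.
inputs = tokenizer([list(sequence)], is_split_into_words=True,
                   add_special_tokens=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings: (batch, residues + special tokens, d_model).
residue_embeddings = outputs.last_hidden_state
# Mean-pool over residues for a fixed-size per-protein representation.
protein_embedding = residue_embeddings.mean(dim=1)
```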
Ankh Large has an embedding dimension of 1536, 48 encoder layers, 24 decoder layers, 16 attention heads, and a feed-forward dimension of 3840. Ankh Base uses a smaller configuration with an embedding dimension of 768, 12 attention heads, and a feed-forward dimension of 3072. Both variants use Gated-GELU activations and relative positional embeddings. Pre-training used UniRef50 (approximately 45.6 million sequences), a choice justified empirically: lower-redundancy databases produced better downstream representations than UniRef90 or UniRef100, and BFD-augmented training did not consistently improve results. Training was conducted on Google TPU-v4 pods.
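To make these hyperparameters concrete, the snippet below builds a transformers T5Config mirroring the stated Ankh Large dimensions. The released checkpoints ship their own configuration files, so this is an illustrative sketch rather than the official configuration.

```python
# Hypothetical T5Config approximating the Ankh Large hyperparameters
# listed above (not the checkpoint's shipped config).
from transformers import T5Config

ankh_large_like = T5Config(
    d_model=1536,                    # embedding dimension
    d_ff=3840,                       # feed-forward dimension
    num_layers=48,                   # encoder layers
    num_decoder_layers=24,           # decoder layers
    num_heads=16,                    # attention heads
    feed_forward_proj="gated-gelu",  # Gated-GELU activation
)
# T5-style models encode position through learned relative attention
# buckets, matching Ankh's use of relative positional embeddings.
print(ankh_large_like)
```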
On CASP12 secondary structure prediction (Q3 accuracy), Ankh Large achieves 83.6%, and the later Ankh3 XL variant reaches 84.4%. On SCOPe remote homology fold classification (1,194-class accuracy), Ankh Large reaches 61.0%. The model also improves on the TAPE solubility and fluorescence benchmarks, and excels at embedding-based contact prediction relative to the attention-map-based approaches used in earlier models. The Ankh2 Large variant was trained from the Ankh Large checkpoint for 45 epochs using the Adafactor optimizer with linear warmup, and substitutes SiLU for GELU activations.
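The embedding-based contact prediction mentioned above can be sketched as follows: per-residue embeddings are combined into pairwise features and scored by a lightweight classifier. The feature construction (concatenation plus elementwise product) and the logistic-regression head are illustrative assumptions, not the paper's exact downstream architecture, and the random arrays stand in for real embeddings and contact labels.

```python
# Sketch of embedding-based contact prediction on top of PLM embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_features(emb):
    """emb: (L, d) per-residue embeddings -> (L*L, 3d) pair features."""
    L, _ = emb.shape
    a = np.repeat(emb, L, axis=0)  # residue i features for pair (i, j)
    b = np.tile(emb, (L, 1))       # residue j features for pair (i, j)
    return np.concatenate([a, b, a * b], axis=1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(64, 768))           # stand-in for Ankh Base output
labels = rng.integers(0, 2, size=64 * 64)  # stand-in contact labels

clf = LogisticRegression(max_iter=500)
clf.fit(pairwise_features(emb), labels)
contact_probs = clf.predict_proba(pairwise_features(emb))[:, 1].reshape(64, 64)
```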
Ankh is designed as a general-purpose protein representation model suitable for fine-tuning or fixed-feature extraction across a wide range of downstream tasks. Researchers use it for secondary and tertiary structure prediction, protein function annotation, fitness landscape modeling, and variant effect prediction. The efficient inference profile makes it practical for large-scale proteomics workflows where embeddings must be generated for hundreds of thousands of sequences. Parameter-efficient fine-tuning via LoRA has been demonstrated for solubility and fluorescence prediction tasks. Ankh is also relevant to protein engineering applications where learning evolutionary conservation and mutation trends is important for generating diverse yet functionally coherent sequence variants.
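A hedged sketch of the LoRA-style parameter-efficient fine-tuning mentioned above is shown below, using the peft library. Targeting the "q" and "v" projections follows common practice for T5-style models; the module names, hyperparameters, and regression head are illustrative assumptions, not the published fine-tuning recipe.

```python
# LoRA fine-tuning sketch: freeze the Ankh encoder and train small
# low-rank adapters plus a task head (e.g. fluorescence regression).
import torch.nn as nn
from transformers import T5EncoderModel
from peft import LoraConfig, get_peft_model

encoder = T5EncoderModel.from_pretrained("ElnaggarLab/ankh-base")
d_model = encoder.config.d_model

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q", "v"])  # assumed T5 module names
encoder = get_peft_model(encoder, lora_cfg)
encoder.print_trainable_parameters()  # only the LoRA adapters train

# Illustrative single-output regression head on pooled embeddings.
head = nn.Linear(d_model, 1)
```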
Ankh represents a significant methodological contribution to the argument that protein-specific inductive biases, rather than raw scale, can drive PLM performance. By demonstrating competitive or superior performance against models more than ten times its size, it has helped shift community attention toward efficient training strategies and knowledge-guided architecture choices. The model family is actively maintained, with Ankh2 and Ankh3 variants continuing to extend the original work. Its accessibility on commodity GPU hardware has made it a practical choice for research groups without access to large-scale compute, broadening participation in PLM-driven protein science. A notable limitation is that Ankh, like other sequence-only PLMs, does not incorporate structural information during pre-training; for tasks with known structural context, structure-aware models may therefore offer complementary strengths.