University of Birmingham
LoRA fine-tuning framework for ESM-2 with multi-head attention pooling and contact map enhancement for sequence-only protein property prediction across diverse tasks.
Protein language models (PLMs) have become foundational tools in computational biology, encoding decades of evolutionary information into dense vector representations that support a wide variety of downstream prediction tasks. Yet adapting these large pretrained models to specific prediction problems — predicting enzymatic function, thermostability, subcellular localization, or fitness landscapes — requires careful fine-tuning strategies that balance adaptation quality against overfitting risk, computational cost, and the typically limited size of task-specific labeled datasets. Full fine-tuning of models at the scale of ESM-2 (a family spanning roughly 8M to 15B parameters) is frequently impractical on standard academic hardware, especially when only a few thousand labeled proteins are available, and naive linear probing (freezing all pretrained weights and training only a classification head) often leaves substantial predictive performance on the table.
SeqProFT, developed by Shuo Zhang and Jian K. Liu at the School of Computer Science, University of Birmingham, proposes a comprehensive framework that addresses these trade-offs systematically. The name stands for Sequence-only Protein property prediction with Fine-Tuning, reflecting the core design decision to use amino acid sequences as the sole input modality — no structural information, no multiple sequence alignments — and to adapt the ESM-2 model family using Low-Rank Adaptation (LoRA) for parameter-efficient end-to-end fine-tuning. First posted to arXiv in November 2024 (arXiv:2411.11530) and published in IEEE Transactions on Artificial Intelligence in 2025, SeqProFT introduces a family of downstream prediction heads with progressively richer pooling mechanisms, and demonstrates that contact map information — predicted directly from the ESM-2 representations — can be used to modulate attention during aggregation, effectively injecting an implicit structural signal without requiring an explicit structure as input.
The contribution of SeqProFT is primarily methodological: it provides a carefully engineered and comprehensively benchmarked workflow for ESM-2 fine-tuning that outperforms both simpler fine-tuning baselines and more complex approaches across a panel of ten diverse protein property prediction tasks. By making this workflow publicly available on GitHub with complete training scripts and configuration files, SeqProFT gives the broader computational biology community a reliable, off-the-shelf starting point for applying ESM-2 to new protein annotation problems.
End-to-end LoRA fine-tuning of ESM-2: SeqProFT applies LoRA to the query, key, and value projection matrices within the ESM-2 attention layers, with a default rank of 32 and scale factor alpha of 32. This configuration updates fewer than 2% of total model parameters while enabling the backbone to adapt its representations toward the prediction task, striking a favorable balance between adaptation quality and overfitting resistance on datasets ranging from a few hundred to tens of thousands of proteins.
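A minimal sketch of this configuration with the Hugging Face PEFT library might look as follows; the target module names follow the Hugging Face ESM implementation, and the dropout value is an illustrative assumption not stated above.

```python
# Hedged sketch: LoRA on the ESM-2 Q/K/V projections via Hugging Face PEFT.
# Module names ("query", "key", "value") follow the Hugging Face ESM
# implementation; lora_dropout is an illustrative assumption.
from transformers import EsmModel
from peft import LoraConfig, get_peft_model

backbone = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")

lora_cfg = LoraConfig(
    r=32,                                      # default rank reported above
    lora_alpha=32,                             # default scaling factor alpha
    lora_dropout=0.05,                         # assumed; not specified in the text
    target_modules=["query", "key", "value"],  # attention projections only
    bias="none",
)

model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()             # expect well under 2% trainable
```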
Three downstream head architectures of increasing sophistication: SeqProFT evaluates three prediction head designs — (1) a Simple MLP Head (SMH) combining linear layers and feedforward blocks with an attention pooling layer, (2) a Multihead Attention Head (MAH) with explicit query/key/value attention over the per-residue ESM-2 embeddings, and (3) a Contact Map-Enhanced MAH (CM-MAH) where the attention weights are modulated by the protein's predicted contact map. This progression allows the framework to be matched to task complexity and dataset characteristics.
Contact map enhancement without structural input: The CM-MAH head is the most distinctive component of SeqProFT. ESM-2 attention maps contain implicit co-evolutionary contact information, and contact map prediction from ESM-2 is a well-established capability. SeqProFT uses the predicted contact map as a spatial attention bias that encourages the pooling mechanism to weight residues based on their structural neighborhood relationships. This injects an implicit structural prior into the aggregation step without requiring AlphaFold or experimental structure as input, keeping the model strictly sequence-only while leveraging structural information that is already latent in the pretrained model.
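A brief sketch of how such a predicted contact map can be obtained directly from the pretrained model is shown below, here using the fair-esm package rather than the SeqProFT repository itself; the example sequence is arbitrary.

```python
# Hedged sketch: per-residue embeddings and a predicted contact map from ESM-2,
# using the fair-esm package (illustrative; SeqProFT may obtain contacts via a
# different code path).
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# A single example sequence (arbitrary, for illustration only).
data = [("example", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)

per_residue = out["representations"][33]   # (1, tokens, 1280) hidden states
contact_map = out["contacts"]              # (1, L, L) contact probabilities
```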
Benchmarked across ten diverse protein property tasks: SeqProFT is evaluated on enzyme commission (EC) number classification, gene ontology (GO) biological process and molecular function prediction, remote homology detection, secondary structure prediction, subcellular localization, fluorescence prediction, stability prediction, GB1 fitness landscape prediction, and eSOL solubility prediction. This breadth of evaluation provides a reliable picture of the framework's general applicability rather than performance on a narrow set of tasks.
Practical hardware accessibility: All experiments reported in the paper are conducted on a single NVIDIA A100 40 GB GPU. The LoRA configuration ensures that ESM-2-650M — the primary model evaluated — can be fine-tuned within this single-GPU budget for all tasks, making the framework reproducible in standard academic computing environments without access to multi-GPU clusters.
Systematic hyperparameter study for LoRA rank: The paper includes an ablation study varying LoRA rank across values of 1, 2, 4, 8, 16, and 32, showing that performance generally improves up to rank 32 for larger datasets while lower ranks are preferable for smaller datasets to control overfitting. This guidance helps practitioners select appropriate LoRA configurations for their specific dataset sizes without exhaustive tuning.
SeqProFT's backbone is the ESM-2 model, available in 35M, 150M, and 650M parameter configurations (the 650M version is the primary evaluation model). LoRA adapters are inserted into the attention projection layers using the standard PEFT library, with the rank parameter r and scaling factor alpha both set to 32 as the default configuration. The remainder of the backbone (feedforward layers, layer norms, positional embeddings) remains frozen during training. The three head architectures each take the per-residue ESM-2 hidden states (dimension 1280 for the 650M model) as input and produce a protein-level embedding through aggregation, followed by a final prediction layer.
In the SMH, aggregation is performed by a learned attention pooling layer that produces a single weighted average of the per-residue representations, which is then passed through two feedforward blocks before the final prediction. In the MAH, the protein-level representation is computed through standard multi-head attention over the per-residue embeddings, with learned query vectors that gather complementary information from different residue subsets. In the CM-MAH, the attention weights computed by the MAH are additively biased by the contact probability matrix predicted from the ESM-2 attention maps, so the pooling gives higher weight to residue pairs predicted to be spatially proximate: a form of structure-informed global pooling that remains compatible with sequence-only inference.
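A hedged PyTorch sketch of this contact-biased pooling idea is given below; the layer sizes, the mean-pooling step, and the exact way the contact probabilities enter the attention logits are illustrative assumptions rather than the authors' reference implementation.

```python
# Hedged sketch of contact-map-biased attention pooling (CM-MAH-style).
# Adding contact probabilities to the logits of a residue-to-residue attention
# layer and then mean-pooling is an illustrative assumption, not the paper's
# reference implementation.
import torch
import torch.nn as nn

class ContactBiasedAttnPool(nn.Module):
    def __init__(self, d_model: int = 1280, n_heads: int = 8, n_classes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, hidden: torch.Tensor, contacts: torch.Tensor) -> torch.Tensor:
        # hidden:   (B, L, d_model) per-residue ESM-2 hidden states
        # contacts: (B, L, L) predicted contact probabilities in [0, 1]
        n_heads = self.attn.num_heads
        # nn.MultiheadAttention accepts a float attn_mask of shape
        # (B * n_heads, L, L) that is added to the attention logits, so the
        # contact probabilities act as an additive structural bias.
        bias = contacts.repeat_interleave(n_heads, dim=0)
        attended, _ = self.attn(hidden, hidden, hidden, attn_mask=bias)
        pooled = attended.mean(dim=1)   # aggregate to a protein-level embedding
        return self.classifier(pooled)

# Example usage with random tensors standing in for ESM-2 outputs.
head = ContactBiasedAttnPool()
logits = head(torch.randn(2, 120, 1280), torch.rand(2, 120, 120))
```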
Training uses the AdamW optimizer with gradient accumulation over 16 samples and mixed-precision (fp16) computation. Task-dependent learning rates range from 5×10⁻⁶ to 5×10⁻⁴, with cosine learning rate decay. Training runs for 10–50 epochs depending on dataset size. For EC classification (one of the larger evaluated datasets, with tens of thousands of labeled enzymes), SeqProFT-650M achieves an F1 score of 0.887, while for GO biological process it achieves 0.460 (the task is intrinsically harder due to the sparse and noisy nature of GO annotations). On the GB1 fitness landscape dataset (a protein engineering benchmark measuring mutational fitness), the model achieves a Spearman correlation of 0.958, demonstrating strong generalization to protein fitness prediction. On secondary structure prediction, accuracy reaches 82.5%, and on subcellular localization classification it achieves 83.0% accuracy. The CM-MAH head provides the most consistent improvements over SMH and MAH on tasks where local structural context is functionally informative, notably EC classification and remote homology detection.
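Expressed as Hugging Face TrainingArguments, these optimization settings might look roughly as follows; the concrete learning rate, epoch count, and batch size are placeholders within the stated ranges, not per-task values from the paper.

```python
# Hedged sketch of the optimization settings described above, expressed as
# Hugging Face TrainingArguments; concrete values are placeholders within the
# stated ranges, not the paper's per-task hyperparameters.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="seqproft_run",        # hypothetical output directory
    per_device_train_batch_size=1,    # assumed; combined with accumulation below
    gradient_accumulation_steps=16,   # accumulate gradients over 16 samples
    learning_rate=5e-5,               # within the stated 5e-6 to 5e-4 range
    lr_scheduler_type="cosine",       # cosine learning rate decay
    num_train_epochs=20,              # within the stated 10-50 epoch range
    fp16=True,                        # mixed-precision training
    optim="adamw_torch",              # AdamW optimizer
)
```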
SeqProFT is designed to be a general-purpose fine-tuning framework for any researcher who needs to predict protein properties from sequence. Its primary application domain is functional annotation: given a newly sequenced or hypothetical protein, SeqProFT can rapidly predict its enzyme commission classification, gene ontology annotations, subcellular localization, and thermostability, enabling automated annotation pipelines for proteome-scale analysis. For protein engineers and directed evolution practitioners, the fitness prediction capability on the GB1 landscape suggests that SeqProFT can be applied to predict mutational fitness effects and guide library design. Pharmaceutical researchers can use the stability and solubility prediction models (the latter trained on eSOL data) to screen protein therapeutic candidates for developability properties before committing to experimental expression and characterization. The contact map-enhanced head is particularly valuable in applications where local structural context matters — active site prediction, allosteric residue identification — even when no structure is available. The publicly available GitHub repository includes example configuration files for several of the benchmark tasks, enabling practitioners to adapt the framework to new tasks by modifying training data and hyperparameters rather than reimplementing the core methodology.
SeqProFT makes a practical contribution to the protein ML field by providing a rigorously benchmarked, hardware-accessible, and publicly released workflow for LoRA-based ESM-2 fine-tuning across a diverse set of biologically meaningful prediction tasks. The contact map-enhanced pooling head represents a novel architectural idea — using predicted contact maps as attention biases within the downstream head — that effectively bridges sequence-only and structure-aware prediction paradigms without incurring the cost of explicit structure prediction. The systematic evaluation across ten tasks provides the field with a useful benchmark baseline and demonstrates that parameter-efficient fine-tuning with appropriately designed prediction heads consistently outperforms simpler approaches across diverse protein property types.
A meaningful limitation is that SeqProFT was evaluated without comparison to the most capable fully fine-tuned or specialized models for each individual task (e.g., specialized enzyme function predictors or dedicated thermostability models), meaning the magnitude of the improvement relative to task-specific state of the art is not fully characterized in the paper. The framework also does not extend to multi-task learning — each prediction head is trained independently on each task, so the model does not benefit from transfer between related tasks. The work predates the availability of ESM-3 and ESM-C (also called ESM Cambrian), which offer substantially more capable protein representations than ESM-2; adapting the SeqProFT framework to these newer backbones is a natural extension. Published in IEEE Transactions on Artificial Intelligence in 2025, SeqProFT is a peer-reviewed contribution that provides a trustworthy methodological reference for researchers entering the protein property prediction space.