
SFM-Protein

Microsoft Research

A transformer protein language model using integrative co-evolutionary pre-training to capture both short-range and long-range residue interactions from sequence alone.

Released: 2024
Parameters: 3,000,000,000

Overview

Protein language models (pLMs) have emerged as a powerful paradigm for learning rich representations of amino acid sequences without requiring explicit structural labels. However, most existing models, including the widely used ESM family, are pre-trained with masked language modeling objectives that predict each masked residue largely independently of the others, leaving residue-residue co-evolutionary signals (the basis of much structural biology) underexploited. SFM-Protein, introduced by researchers at Microsoft Research AI for Science in October 2024, addresses this gap with a pre-training strategy explicitly designed to capture how residues co-evolve across both short and long ranges in a protein chain.

The central innovation is a dual pre-training objective that combines a local span prediction task with a global pairwise residue interaction task. The span prediction head recovers masked BPE (byte-pair encoded) tokens, capturing short-range dependencies relevant to secondary structure formation. The pairwise prediction head uses outer product operations over hidden states to model long-range residue-residue co-occurrence patterns analogous to those mined from multiple sequence alignments (MSAs) in tools like AlphaFold2—but derived entirely from single sequences. This design allows SFM-Protein to approximate co-evolutionary information without the computational overhead of constructing MSAs at inference time.
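
The sketch below illustrates this dual objective in PyTorch. It is a minimal reconstruction from the description above, not released code: no official SFM-Protein repository exists, so all module and variable names are hypothetical, and a compact elementwise pair interaction stands in for the paper's full outer-product operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualObjectiveHeads(nn.Module):
    """Local masked-token head plus global pairwise head over encoder states."""

    def __init__(self, hidden_dim: int, vocab_size: int, pair_dim: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha
        self.token_head = nn.Linear(hidden_dim, vocab_size)  # local span prediction
        self.pair_proj = nn.Linear(hidden_dim, pair_dim)     # global pairwise head
        self.pair_head = nn.Linear(pair_dim, 1)

    def forward(self, hidden, token_targets, pair_targets, mask):
        # hidden: (B, L, hidden_dim) encoder outputs; mask: (B, L) bool, True at
        # masked positions; pair_targets: (B, L, L) float co-occurrence targets.
        logits = self.token_head(hidden)
        local_loss = F.cross_entropy(logits[mask], token_targets[mask])

        # Pairwise interaction features: a compact stand-in for the outer-product
        # map the paper describes for modeling long-range co-evolution.
        z = self.pair_proj(hidden)                             # (B, L, pair_dim)
        pair_feats = z.unsqueeze(2) * z.unsqueeze(1)           # (B, L, L, pair_dim)
        pair_logits = self.pair_head(pair_feats).squeeze(-1)   # (B, L, L)
        global_loss = F.binary_cross_entropy_with_logits(pair_logits, pair_targets)

        # Composite objective: L = alpha * L_global + (1 - alpha) * L_local.
        return self.alpha * global_loss + (1 - self.alpha) * local_loss
```

The key design point is that the pairwise head is supervised on residue pairs rather than individual tokens, which is what pushes the encoder to store contact-level couplings rather than only local context.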

SFM-Protein is part of Microsoft's broader Science Foundation Model (SFM) initiative, which aims to build large-scale foundation models across multiple scientific domains. It was benchmarked at two scales—650 million and 3 billion parameters—demonstrating that both the architecture and training objective contribute to performance gains over ESM2 models of comparable size across a diverse set of downstream tasks.

Key Features

  • Dual co-evolutionary pre-training objective: Combines a local span prediction loss and a global pairwise residue interaction loss through a composite objective (L = alpha * L_global + (1-alpha) * L_local), enabling the model to capture both secondary structure-level and tertiary contact-level information from sequence data alone.
  • No MSA required at inference: Unlike AlphaFold2 or structure-prediction models that depend on multiple sequence alignments, SFM-Protein encodes co-evolutionary information during pre-training, making it fully applicable to orphan or poorly characterized proteins with few known homologs.
  • Scalable transformer encoder with RoPE: The architecture uses a bidirectional transformer encoder with Rotary Positional Embeddings (RoPE), enabling effective capture of positional relationships across long sequences packed into 8192-token context windows; a standard RoPE sketch follows this list.
  • Broad benchmark coverage: Evaluated on function prediction (Gene Ontology, Enzyme Commission), fitness landscape modeling (fluorescence, stability), solubility classification, and antibody CDR-H3 design, establishing generalizability across diverse protein biology tasks.
  • Two model scales: Available at 650M and 3B parameters, providing flexibility between computational cost and downstream accuracy, with the 3B model consistently achieving the highest benchmark scores.
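
For reference, the sketch below shows the standard RoPE formulation from the literature, not code from SFM-Protein itself. RoPE rotates each feature pair by a position-dependent angle, so attention scores depend on relative rather than absolute position.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (batch, seq_len, dim)."""
    _, seq_len, dim = x.shape
    half = dim // 2  # dim must be even
    # Frequency for each feature pair: 1 / base^(2i/dim).
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("s,d->sd", pos, inv_freq)   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair; dot products between rotated queries and keys
    # then encode relative offsets, which scales gracefully to long contexts.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```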

Technical Details

SFM-Protein is a bidirectional transformer encoder pre-trained on UniRef50, a database of protein sequences clustered at 50% sequence identity, comprising over 62 million sequences and 17 billion residues. During pre-training, sequences are packed into 8192-token chunks with a 30% masking ratio applied for the span prediction task. The pairwise prediction head computes residue-pair representations via outer products of token-level hidden states and maps these to co-occurrence targets, supervised against patterns derived from the pre-training corpus itself. The composite loss weight alpha balances the contribution of global versus local objectives during training.
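
A short sketch of this data preparation, assuming a stream of already-tokenized sequences; the token IDs and BPE details are placeholders, since the actual pipeline has not been published.

```python
import random
from typing import Iterable, List

CONTEXT_LEN = 8192   # tokens per packed chunk
MASK_RATIO = 0.30    # fraction of positions masked for span prediction
MASK_ID = 1          # placeholder mask-token id

def pack_and_mask(tokenized_seqs: Iterable[List[int]]):
    """Yield (input_ids, target_ids, masked_positions) for each packed chunk."""
    buffer: List[int] = []
    for seq in tokenized_seqs:
        buffer.extend(seq)
        # Emit fixed-length chunks as soon as enough tokens accumulate.
        while len(buffer) >= CONTEXT_LEN:
            chunk, buffer = buffer[:CONTEXT_LEN], buffer[CONTEXT_LEN:]
            targets = list(chunk)
            inputs = list(chunk)
            positions = random.sample(range(CONTEXT_LEN), int(MASK_RATIO * CONTEXT_LEN))
            for p in positions:
                inputs[p] = MASK_ID   # loss is computed only at masked positions
            yield inputs, targets, positions
```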

At 3B parameters, SFM-Protein achieves Spearman correlations of 0.823 on protein stability prediction and 0.683 on fluorescence regression from the FLIP benchmarks. On enzyme function prediction (EC numbers), the 3B model reaches 0.869 F1-max and 0.893 AUPRC. On Gene Ontology molecular function prediction, it achieves 0.673 F1-max. For antibody CDR-H3 design by sequence completion, the 650M model attains 54.6% amino acid recovery. These results are competitive with or exceed ESM2 models of equivalent parameter count across most evaluated tasks.
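
For readers unfamiliar with these metrics, the sketch below shows how Spearman correlation, AUPRC, and F1-max (the maximum F1 score over all decision thresholds) are commonly computed with SciPy and scikit-learn. The data here are placeholders, not model outputs.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import precision_recall_curve, auc

y_true = np.array([1, 0, 1, 1, 0])              # placeholder binary labels
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3])   # placeholder model scores

# Spearman correlation: the regression metric for stability/fluorescence
# (real use pairs continuous targets with continuous predictions).
rho, _ = spearmanr(y_score, y_true)

# The precision-recall curve yields both AUPRC and F1-max.
precision, recall, _ = precision_recall_curve(y_true, y_score)
auprc = auc(recall, precision)
f1_max = np.max(2 * precision * recall / np.clip(precision + recall, 1e-9, None))
print(f"spearman={rho:.3f}  auprc={auprc:.3f}  f1max={f1_max:.3f}")
```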

Applications

SFM-Protein is suited for researchers who need high-quality protein sequence embeddings as input to downstream predictive models—particularly in cases where MSA construction is impractical due to sequence novelty or computational constraints. It is applicable to enzyme function annotation, stability and fitness landscape screening for protein engineering, solubility prediction in expression optimization workflows, and antibody sequence design. The model's single-sequence inference mode makes it straightforward to deploy at scale in high-throughput screening pipelines.
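
A sketch of that embeddings-to-downstream-task workflow. Since no public SFM-Protein API exists, embed() below is a stand-in for any single-sequence pLM embedding function, and the ridge regressor is a generic downstream model for a task such as stability prediction.

```python
import numpy as np
from sklearn.linear_model import Ridge

def embed(sequence: str) -> np.ndarray:
    """Placeholder for a mean-pooled pLM embedding of a single sequence."""
    rng = np.random.default_rng(abs(hash(sequence)) % (2**32))
    return rng.standard_normal(1280)   # hypothetical embedding dimension

# Toy data: sequences paired with, e.g., measured stability scores.
sequences = ["MKTAYIAKQR", "MSILVTRPSP", "MADEEKLPPG"]
labels = np.array([0.8, -0.2, 0.4])

# Embed once, then fit a lightweight head; this is the pattern that scales
# to high-throughput screening, since embeddings can be precomputed.
X = np.stack([embed(s) for s in sequences])
model = Ridge(alpha=1.0).fit(X, labels)
print(model.predict(X[:1]))
```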

Impact

SFM-Protein demonstrates that co-evolutionary information—long considered to require MSA construction—can be meaningfully encoded into a protein language model through appropriately designed pre-training objectives. This has practical implications for studying proteins in under-characterized proteomes and for rapid computational screening where MSA construction would be a bottleneck. As a preprint from Microsoft Research AI for Science, the work contributes to growing evidence that scaling both model size and training objective sophistication yields compounding benefits for protein representation learning. It also aligns with the broader SFM initiative's goal of building scientific foundation models that generalize across domains. No public code repository has been released as of the time of writing, which limits immediate reproducibility and community adoption.

Citation

SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation

Preprint

He, L., et al. (2024). SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation. arXiv preprint arXiv:2410.24022.

DOI: 10.48550/arXiv.2410.24022

Metrics

Citations

Total citations: 3
Influential citations: 0
References: 50

Tags

co-evolution, embeddings, foundation model

Resources

Research Paper