Prescient Design / Genentech
Efficient protein language model library from Prescient Design that delivers high-quality sequence representations and fitness prediction from models pre-trained in just 24 GPU hours.
Training state-of-the-art protein language models has historically required computational resources far beyond the reach of most academic and industry research groups. Models such as ESM2-3B consumed hundreds of thousands of GPU hours during pre-training, concentrating foundational capability development at a handful of well-resourced institutions and making it difficult to experiment with novel architectures, training objectives, or data mixtures at scale. LOBSTER — Language models for Biological Sequence Transformation and Evolutionary Representation — is an open-source library and accompanying set of pre-trained models from the Frey Lab at Prescient Design, Genentech, built to confront this bottleneck directly.
Released in May 2024, LOBSTER defines a "cramming" challenge for protein language models: train the best possible model within a fixed computational budget of 24 hours on a single GPU. By systematically re-examining every aspect of the pre-training pipeline, including tokenization, architecture depth, attention mechanisms, learning rate schedules, and data preprocessing, the team trained a 67-million-parameter model that performs comparably to ESM2-3B on protein fitness landscape inference benchmarks, even though ESM2-3B required more than 15,000 times as many GPU hours to pre-train. This result reframes the question of what constitutes a "good" protein language model and opens the door for the research community to iterate on modeling choices that would otherwise be prohibitively expensive to explore.
LOBSTER is a living, batteries-included library: it ships not just model weights but the full training, fine-tuning, and inference stack needed to reproduce and extend its results. The library was developed by Nathan Frey, Taylor Joren, Aya Abdelsalam Ismail, Allen Goodman, Richard Bonneau, Kyunghyun Cho, and Vladimir Gligorijević, in a collaboration between Prescient Design's Frey Lab and New York University. The project treats protein language modeling as an ongoing scientific investigation rather than a one-time engineering artifact, releasing updated code and pre-trained checkpoints as the understanding of optimal modeling choices evolves. This design philosophy positions LOBSTER as a platform for the community to run controlled experiments on fundamental questions in biological sequence modeling, questions that are otherwise intractable because of the GPU cost of the incumbent models.
The core LOBSTER model is a BERT-style encoder-only transformer trained with masked language modeling (MLM) on protein sequences. The 67M-parameter variant uses a standard transformer architecture with multi-head self-attention, and training is carried out on a curated subset of UniRef sequences that fits within the 24-hour budget on a single GPU. The authors systematically evaluated the impact of design choices including sequence tokenization (per-residue character encoding versus k-mer schemes), learning rate schedules (cosine decay with warmup), batch size scaling, and the trade-off between model depth and width. Their key finding is that many of the architectural decisions carried over from natural language processing into protein language modeling — including very large model sizes — are not necessary to achieve competitive downstream task performance.
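To make that recipe concrete, here is a minimal sketch in plain PyTorch of the ingredients named above: per-residue tokenization, a small BERT-style encoder with an MLM head, 15% random masking, and AdamW with linear warmup into cosine decay. Every module name, hyperparameter, and sequence in this sketch is an illustrative placeholder rather than LOBSTER's actual configuration.

```python
# Illustrative MLM pre-training step; placeholders only, not LOBSTER's code.
import math
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, MASK = 20, 21                       # special token ids
VOCAB = len(AMINO_ACIDS) + 2

def tokenize(seq: str) -> torch.Tensor:
    """Per-residue character encoding: one token per amino acid."""
    return torch.tensor([AMINO_ACIDS.index(a) for a in seq], dtype=torch.long)

class TinyProteinEncoder(nn.Module):
    """BERT-style encoder-only transformer with a masked-language-modeling head."""
    def __init__(self, d_model=512, n_layers=6, n_heads=8, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model, padding_idx=PAD)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        h = self.encoder(self.tok(tokens) + self.pos(pos),
                         src_key_padding_mask=tokens.eq(PAD))
        return self.mlm_head(h)              # (batch, length, vocab) logits

def mask_tokens(tokens, mask_prob=0.15):
    """Standard MLM corruption: hide a random 15% of residues behind [MASK]."""
    labels = tokens.clone()
    hidden = (torch.rand(tokens.shape) < mask_prob) & tokens.ne(PAD)
    labels[~hidden] = -100                   # only masked positions contribute to the loss
    return tokens.masked_fill(hidden, MASK), labels

model = TinyProteinEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
total_steps, warmup = 10_000, 500            # linear warmup into cosine decay
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda s: (s + 1) / warmup if s < warmup
    else 0.5 * (1 + math.cos(math.pi * (s - warmup) / (total_steps - warmup))))

# One optimization step on a toy batch (a real run would stream UniRef sequences).
batch = torch.stack([tokenize("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")] * 8)
corrupted, labels = mask_tokens(batch)
logits = model(corrupted)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
loss.backward(); opt.step(); sched.step(); opt.zero_grad()
```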
Evaluation was conducted on protein fitness landscape inference benchmarks drawn from the ProteinGym substitution suite, which provides a standardized framework for measuring how well model representations predict the functional consequences of single and multiple amino acid substitutions. On these benchmarks, the 67M-parameter LOBSTER model matches or approaches the performance of ESM2-3B (3 billion parameters), a result that highlights how inefficient prior scaling has been in the absence of a systematic re-examination of architecture and training recipe. The library also ships two variants of the base model. The concept bottleneck variant (LobsterCBMPMLM) extends the base architecture with a structured intermediate representation layer in which model activations correspond to user-specified biophysical concepts such as hydrophobicity, charge, and secondary structure propensity, enabling interpretable control during protein design. The causal decoder variant (LobsterPCLM) uses a Llama-style architecture for autoregressive sequence generation, broadening applicability to de novo protein design workflows that require sampling from a generative model rather than computing representations of existing sequences.
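On the fitness benchmarks themselves, a common way to score a substitution with a masked language model (an approach popularized with ESM-style models and applicable to ProteinGym-style data) is the masked-marginal log-likelihood ratio: mask the mutated position and compare the log-probabilities the model assigns to the mutant and wild-type residues. The sketch below illustrates that rule using the toy encoder and helpers from the previous snippet; it is a generic illustration of the technique, not LOBSTER's evaluation code.

```python
# Zero-shot substitution scoring via masked marginals; reuses tokenize(),
# AMINO_ACIDS, and MASK from the training sketch above.
import torch

@torch.no_grad()
def masked_marginal_score(model, wt_seq: str, mutation: str) -> float:
    """Score a substitution written as e.g. 'A42G' (wild-type, 1-based position, mutant)."""
    wt_aa, pos, mut_aa = mutation[0], int(mutation[1:-1]) - 1, mutation[-1]
    assert wt_seq[pos] == wt_aa, "mutation string disagrees with the wild-type sequence"
    tokens = tokenize(wt_seq).unsqueeze(0)       # (1, length)
    tokens[0, pos] = MASK                        # hide the mutated site
    log_probs = model(tokens)[0, pos].log_softmax(dim=-1)
    # Positive score: the model prefers the mutant residue at this position.
    return (log_probs[AMINO_ACIDS.index(mut_aa)]
            - log_probs[AMINO_ACIDS.index(wt_aa)]).item()

# Typical benchmark usage: score a deep mutational scanning table and report the
# Spearman correlation between these scores and the measured fitness values, e.g.
# model.eval()
# scores = [masked_marginal_score(model, wt_seq, m) for m in dms_mutations]
```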
The LOBSTER codebase is implemented in PyTorch with a modular design that separates the pre-training data pipeline, model architecture, and downstream evaluation into independent components. Pre-trained checkpoints are distributed through the Hugging Face model hub, allowing researchers to pull model weights and begin fine-tuning or inference without setting up a full training environment. The library includes training configurations for GPU memory budgets ranging from single consumer-grade GPUs to multi-GPU clusters, with documented scaling behavior for each configuration.
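As a sketch of that hub workflow, the snippet below pulls a checkpoint with the Hugging Face transformers API and extracts mean-pooled per-sequence embeddings, assuming the checkpoint exposes a standard masked-LM interface. The model identifier is a placeholder; the published checkpoint names and any library-specific loading utilities should be taken from the LOBSTER repository itself.

```python
# Hypothetical hub workflow: the model id below is a placeholder, not a real
# checkpoint name, and compatibility with AutoModelForMaskedLM is assumed.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "example-org/lobster-24h-67M"     # placeholder identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    # Final-layer hidden states serve as per-residue representations.
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = hidden.mean(dim=1)               # mean-pool to one vector per sequence
```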
LOBSTER is designed for research groups that need high-quality protein representations but cannot afford to train or even fine-tune very large models. The primary use case is protein fitness prediction: given a set of protein variants, LOBSTER embeddings can be combined with simple supervised heads to predict functional properties such as binding affinity, thermostability, catalytic activity, or expression yield. This makes the model directly applicable to the analysis of deep mutational scanning datasets, where researchers measure the fitness of thousands of protein variants simultaneously and need a model that can generalize to unseen sequences. The concept bottleneck variant is specifically targeted at therapeutic protein design, where regulatory and scientific credibility requires interpretable models whose predictions can be traced to specific biophysical features. Beyond fitness prediction, LOBSTER's causal variant supports de novo sequence generation tasks, including the design of protein libraries with specified compositional properties. The library also serves as an educational and experimental platform for the protein machine learning community: because the cramming framework is explicitly designed for rapid iteration, LOBSTER is well-suited for graduate courses and research groups exploring new pre-training objectives without needing access to a compute cluster.
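The "embeddings plus simple supervised head" workflow can be sketched as a ridge regression over mean-pooled per-sequence embeddings, evaluated with the held-out Spearman correlation customary for deep mutational scanning data. The arrays below are synthetic stand-ins: in practice X would hold LOBSTER embeddings of the variant sequences and y the measured fitness values.

```python
# Supervised fitness-prediction head over precomputed embeddings (synthetic data).
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins: replace with per-variant embeddings and measured fitness values.
X = rng.normal(size=(1000, 512))                              # (n_variants, embedding_dim)
y = X[:, :8].sum(axis=1) + rng.normal(scale=0.1, size=1000)   # synthetic fitness signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
head = Ridge(alpha=1.0).fit(X_tr, y_tr)                       # simple supervised head
rho, _ = spearmanr(head.predict(X_te), y_te)
print(f"held-out Spearman correlation: {rho:.3f}")
```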
LOBSTER's central contribution is demonstrating that the protein language modeling field has been leaving significant performance on the table by scaling model size without first optimizing the pre-training recipe. By showing that a 67M-parameter model trained in one GPU-day can approach the benchmark performance of a 3B-parameter model trained over months, the work challenges the assumption that bigger is always better and creates a pathway for resource-constrained researchers to participate in protein foundation model development. The open-source release has been adopted as a research platform by the protein machine learning community, and the concept bottleneck extension has generated interest in interpretable AI approaches for therapeutic protein engineering. The Frey Lab has continued to extend the library with new architectures and training objectives, and the repository serves as a living record of systematic improvements to efficient protein language model training. A key limitation of the current approach is that the cramming regime necessarily involves a trade-off: while LOBSTER matches ESM2-3B on fitness benchmarks, very large models trained on much more data may still hold advantages on tasks that require broader protein universe coverage, such as remote homology detection or zero-shot prediction of structurally diverse protein families.