Google Research
Sparse attention transformer extending BERT to sequences up to 8x longer via random, local, and global attention patterns, with demonstrated applications in genomic sequence modeling.
Big Bird is a sparse attention transformer model developed by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed at Google Research. Published at NeurIPS 2020, Big Bird addresses the quadratic memory and compute scaling of standard transformer self-attention with respect to sequence length, a fundamental bottleneck that limits BERT-style models to roughly 512 tokens, by replacing full attention with a sparse combination of random, local, and global attention patterns that scales linearly. The model was proposed primarily as a general-purpose NLP advance, but the paper explicitly demonstrated applications to genomic sequences, where regulatory elements can act over tens of thousands of base pairs, far beyond the context windows of previous transformer-based methods.
The attention bottleneck in standard transformers is particularly limiting for genomic applications. DNA sequences are long: a typical gene may span tens to hundreds of kilobases, and the enhancers that regulate it may lie hundreds of kilobases away. Standard transformer models like BERT process at most 512 tokens, corresponding to roughly 512 base pairs under single-nucleotide tokenization, far less than the regulatory context required to capture enhancer-promoter interactions. Big Bird's sparse attention mechanism, by contrast, can process sequences up to 8x longer than previously feasible on comparable hardware, opening a path toward transformer-based genomic sequence modeling at biologically relevant scales.
Big Bird's sparse attention combines three complementary attention patterns: random attention (each query attends to a set of randomly selected keys), local window attention (each query attends to a sliding window of neighboring keys), and global token attention (a small set of designated global tokens attend to and are attended to by all positions). This combination is theoretically motivated: the authors prove that the sparse pattern is a universal approximator of sequence functions and is Turing complete, preserving the theoretical expressiveness of full attention while reducing computational complexity from O(n²) to O(n) in the sequence length n. The model was applied to genomic sequence tasks including promoter-region prediction and chromatin profile classification, where its extended context window yielded state-of-the-art performance over previous transformer- and CNN-based genomic models.
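The combined pattern can be made concrete with a small sketch. The following illustrative Python snippet is not the paper's implementation; the sequence length, window size, and random/global counts are arbitrary toy values. It builds the union of the random, window, and global patterns as a boolean mask and shows that the number of attended query-key pairs grows linearly with sequence length rather than quadratically.

```python
# Illustrative sketch of the Big Bird attention pattern as a boolean mask.
# All sizes (seq_len, window, n_random, n_global) are toy values for illustration.
import numpy as np

def bigbird_attention_mask(seq_len=64, window=3, n_random=2, n_global=2, seed=0):
    """Return a (seq_len, seq_len) boolean mask where mask[i, j] means
    query position i may attend to key position j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local window attention: each query sees `window` neighbors on each side.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Random attention: each query sees a few randomly chosen keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True

    # Global attention: the first `n_global` tokens attend everywhere
    # and are attended to by every position.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

mask = bigbird_attention_mask()
# Each query attends to O(window + n_random + n_global) keys, so the total
# number of attended pairs grows linearly in seq_len rather than quadratically.
print(mask.sum(), "attended pairs out of", mask.size)
```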
Big Bird implements sparse attention through a block sparse attention mechanism, in which the sequence is divided into blocks of equal size and attention is computed between selected block pairs rather than all position pairs. Specifically, each block of queries attends to: (1) a fixed set of r random blocks, (2) the w/2 adjacent blocks on each side (the sliding window), and (3) a set of g global tokens that participate in all attention computations. Global tokens are prepended to or inserted into the sequence and serve as a compressed summary of global context, enabling all positions to exchange information through the global-token intermediary without direct all-pairs communication. For genomic pre-training, sequences of up to 4,096 nucleotide tokens (roughly 4 kilobases of DNA) were used, which, while still modest compared to the full extent of long-range regulatory interactions, represented an 8x improvement over the 512-token context window of BERT-style models. The masked language model pre-training objective masked 15% of nucleotide tokens and trained the model to reconstruct them from sparse-attention context. On the genomics benchmarks reported in the original NeurIPS 2020 paper, Big Bird achieved state-of-the-art performance on promoter-region prediction (distinguishing human promoters from non-promoter sequences) and chromatin profile classification (predicting 919 chromatin features from the DeepSEA benchmark), outperforming both CNN-based models and standard BERT trained on shorter sequence windows. A model with approximately 124 million parameters was used for the genomics experiments.
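As a rough illustration of how such a setup can be configured, the sketch below uses the Hugging Face transformers implementation of Big Bird with its block sparse attention. The model dimensions, vocabulary, token ids, and masking code are toy assumptions chosen for illustration; they do not reproduce the original paper's training configuration.

```python
# Minimal sketch of a masked-language-model setup over nucleotide tokens using
# the Hugging Face `transformers` BigBird implementation. Hyperparameters and
# token ids are illustrative assumptions, not the paper's configuration.
import torch
from transformers import BigBirdConfig, BigBirdForMaskedLM

config = BigBirdConfig(
    vocab_size=16,                  # toy nucleotide vocabulary (A/C/G/T plus special tokens)
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=4096,   # 4,096-token context, as in the genomics experiments
    attention_type="block_sparse",  # random + window + global block attention
    block_size=64,                  # tokens per attention block
    num_random_blocks=3,            # r random blocks per query block
)
model = BigBirdForMaskedLM(config)

# One toy batch: 4,096-token sequences with roughly 15% of positions masked.
input_ids = torch.randint(5, 9, (2, 4096))   # stand-in token ids for A/C/G/T
labels = input_ids.clone()
mask = torch.rand(input_ids.shape) < 0.15
input_ids[mask] = 4                          # stand-in [MASK] token id
labels[~mask] = -100                         # compute loss only on masked positions

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()
```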
Big Bird is applicable to any genomic prediction task that benefits from extended sequence context beyond what standard BERT-scale models can process. In regulatory genomics, the model's extended context window enables it to capture promoter-distal regulatory elements that influence transcription initiation, alternative splicing signals distributed across large pre-mRNA spans, and long noncoding RNA sequence features. The model has been applied to promoter classification, chromatin state prediction, and transcription factor binding site identification in settings where long-range context improves discriminative accuracy. Beyond specific genomic tasks, Big Bird's architecture has been influential as a technical foundation: HyenaDNA and other genomic long-sequence models that followed were partly motivated by demonstrating advantages over Big Bird's sparse attention approach for very long sequences (up to millions of base pairs). In the NLP domain, Big Bird remains a practical solution for document-level tasks requiring context beyond 512 tokens, with deployed applications in long-document question answering and summarization.
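A fine-tuning setup for one such task, binary promoter classification, could look roughly like the sketch below. It reuses the toy configuration from the pre-training sketch above, with randomly generated inputs standing in for tokenized DNA windows; it is an assumption-laden illustration rather than the authors' pipeline.

```python
# Hypothetical fine-tuning sketch for a promoter-vs-non-promoter classifier
# using the Hugging Face BigBird sequence classification head. All sizes and
# data are illustrative assumptions.
import torch
from transformers import BigBirdConfig, BigBirdForSequenceClassification

config = BigBirdConfig(
    vocab_size=16,
    hidden_size=256,
    intermediate_size=1024,
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=4096,
    attention_type="block_sparse",
    block_size=64,
    num_random_blocks=3,
    num_labels=2,                   # promoter vs non-promoter
)
model = BigBirdForSequenceClassification(config)
# In practice the encoder weights would be loaded from the MLM checkpoint
# rather than initialized from scratch.

input_ids = torch.randint(5, 9, (2, 4096))   # two tokenized 4,096-token DNA windows
labels = torch.tensor([1, 0])                # 1 = promoter, 0 = non-promoter
out = model(input_ids=input_ids, labels=labels)
out.loss.backward()
```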
Big Bird was a landmark contribution to the broader problem of scaling transformers to long sequences, a challenge that has become central to large language model development across domains. In genomics specifically, it demonstrated that transformer architectures could be adapted for DNA sequence modeling at biologically relevant scales, a key proof of concept that motivated the subsequent wave of genomic language models including HyenaDNA, Nucleotide Transformer, DNABERT-2, and Caduceus. The NeurIPS 2020 paper has been cited thousands of times across NLP and biology, and the Google Research GitHub repository has been widely used as a reference implementation for sparse attention. A key limitation for genomics applications is that even Big Bird's extended context window of approximately 4,096 tokens falls far short of the 40-524 kilobase windows used by Basenji, Enformer, and Borzoi, which achieve substantially better regulatory prediction performance by accepting inputs one to two orders of magnitude longer. State-space model architectures such as Mamba and convolutional approaches such as HyenaDNA have since achieved even more efficient long-sequence scaling than sparse attention for the specific use case of DNA modeling. Nevertheless, Big Bird's theoretical framework and empirical genomics results were important catalysts for the field's engagement with long-sequence modeling as a central challenge in biological AI.