Beijing Institute of Technology
An integrated platform implementing 155 biological language models for analyzing DNA, RNA, and protein sequences across residue-level and sequence-level tasks.
BioSeq-BLM is a unified computational platform developed by Hong-Liang Li, Yi-He Pang, and Bin Liu at the Beijing Institute of Technology's School of Computer Science and Technology. Published in Nucleic Acids Research in December 2021, the platform addresses a fundamental challenge in computational biology: the lack of a systematic, integrated framework for applying the full spectrum of natural language processing techniques to biological sequence analysis. Prior to BioSeq-BLM, researchers working on DNA, RNA, or protein classification tasks faced a fragmented landscape of tools, each supporting only a narrow slice of available feature representations and machine learning methods.
The platform treats biological sequences as natural language and organizes 155 biological language models (BLMs) into four complementary families: Biological Grammar Language Models (BGLMs), Biological Statistical Language Models (BSLMs), Biological Neural Language Models (BNLMs), and Biological Semantic Similarity Language Models (BSSLMs). Each family captures distinct properties of sequence data, from syntactic k-mer rules and word co-occurrence statistics to deep neural embeddings and sequence similarity scores. This comprehensive taxonomy allows researchers to benchmark representations systematically and select the most appropriate model for a given biological prediction task without switching between disparate software environments.
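The linguistic analogy underpinning all four families is that overlapping k-mers play the role of words and a sequence plays the role of a sentence. A minimal sketch of that tokenization step (the helper name `kmer_tokenize` is our own, not part of BioSeq-BLM's API):

```python
def kmer_tokenize(sequence, k=3):
    """Split a biological sequence into overlapping k-mer 'words'.

    Illustrative only: treats each k-mer as a word and the full
    sequence as a sentence, mirroring the analogy BioSeq-BLM's
    language-model families are built on.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# A short DNA "sentence" becomes a list of 3-mer "words":
words = kmer_tokenize("ATGCGA", k=3)
# → ['ATG', 'TGC', 'GCG', 'CGA']
```

Once sequences are tokenized this way, standard NLP machinery (bag-of-words counts, TF-IDF weighting, word embeddings) applies directly, which is what the BSLM and BNLM families exploit.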
BioSeq-BLM extends the earlier BioSeq-Analysis2.0 platform from the same group, nearly doubling the number of supported grammar models, introducing statistical and neural language model families absent from the predecessor, and adding GPU-accelerated deep learning classifiers. A web server at bliulab.net/BioSeq-BLM and a downloadable standalone package make the platform accessible to both bioinformaticians comfortable with command-line tools and biologists preferring a graphical interface.
BioSeq-BLM is implemented in Python (98% of the codebase) and is compatible with Python 3.7 or later, with optional CUDA 10.0 and cuDNN 7.4+ support for GPU acceleration. The 155 BLMs are organized as follows: BGLMs comprise 29 syntax rule-based models and 29 word property-based models that encode sequence composition and physicochemical properties; BSLMs include 12 bag-of-words, 12 TF-IDF, 12 TextRank, and 12 topic models (LSA, PLSA, LDA, Labeled-LDA); BNLMs include 36 word embedding models and 5 automatic feature extraction architectures; and BSSLMs provide 8 models based on pairwise sequence similarity scores. Optionally, the platform integrates with external tools — BLAST for homology search, PSIPRED and SPIDER2 for protein secondary structure, ViennaRNA for RNA folding, and rate4site for evolutionary rate estimation — to augment neural representations with domain-specific biological features.
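To make the BSLM family concrete, here is a minimal, self-contained TF-IDF computation over k-mer "words". This is a sketch of the general idea behind TF-IDF-based BSLMs, not BioSeq-BLM's actual implementation; the function name and weighting details (raw term frequency, `log(n/df)` inverse document frequency) are our own choices:

```python
import math
from collections import Counter

def tfidf_features(sequences, k=3):
    """Toy TF-IDF over overlapping k-mers, treating each sequence as a
    document. Returns the sorted k-mer vocabulary and one feature
    vector per sequence."""
    docs = [[s[i:i + k] for i in range(len(s) - k + 1)] for s in sequences]
    n = len(docs)
    # Document frequency: how many sequences contain each k-mer.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # tf-idf = (term frequency) * log(n / document frequency)
        vectors.append([(tf[w] / total) * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

vocab, vecs = tfidf_features(["ATGCGA", "ATGATG"], k=3)
# k-mers shared by every sequence get weight 0, since log(n/df) = 0
```

The resulting fixed-length vectors can be fed to any downstream classifier, which is the role such representations play inside the platform's pipelines.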
On benchmark tasks, predictors constructed with BioSeq-BLM matched or exceeded contemporary state-of-the-art methods. For RNA-binding protein identification, the platform achieved AUC improvements of 6% to 13% over TriPepSVM, RNAPred, and RBPPred. For intrinsically disordered region detection, it improved AUC by 8.7% to 12.6% over the best BioSeq-Analysis2.0 predictors. DNA-binding protein prediction reached 81.58% accuracy, exceeding PseDNA-Pro, and microRNA precursor classification matched the performance of iMcRNA.
BioSeq-BLM is designed for researchers building binary or multi-class predictors from raw biological sequences without extensive feature engineering expertise. Typical use cases include functional site identification in DNA (e.g., DNase I hypersensitive sites, transcription factor binding sites), RNA classification tasks (distinguishing real microRNA precursors from pseudo-hairpins, splice site prediction), and protein function annotation (DNA-binding protein identification, RNA-binding protein classification, intrinsically disordered region detection). The platform is particularly valuable when multiple feature representations must be compared head-to-head to determine which BLM family best captures the signal for a given target, a step that would otherwise require implementing and validating each representation independently. It integrates naturally into bioinformatics workflows built around FASTA-format sequence inputs and standard classification benchmarking protocols.
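Since the platform's workflows start from FASTA files, the front of such a pipeline can be sketched with a minimal reader. This parser is our own illustration (BioSeq-BLM ships its own input handling); the function name and the choice to key records by the first word of each header are assumptions:

```python
def read_fasta(path):
    """Minimal FASTA reader: maps each record ID (first token of the
    '>' header line) to its concatenated sequence. Illustrative only."""
    records = {}
    header = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:].split()[0]
                records[header] = []
            else:
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

From here, each sequence would be tokenized into k-mers, converted to a feature vector by the chosen BLM family, and passed to a classifier for benchmarking.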
BioSeq-BLM contributes a systematized vocabulary for applying language model concepts to biology at a time when the field was rapidly adopting NLP methods but lacked consolidated benchmarking frameworks. By unifying 155 models under a single API, it reduced the barrier to rigorously comparing representation strategies and helped establish the linguistic analogy — treating k-mers as words and sequences as sentences — as a broadly applicable framing for sequence analysis. The platform spawned a successor, BioSeq-Diabolo (2023, PLOS Computational Biology), which focuses specifically on biological sequence similarity analysis and can be chained with BioSeq-BLM in multi-stage pipelines. A notable limitation is that BioSeq-BLM predates the large pretrained protein and DNA language models that emerged from 2022 onward (ESM-2, Nucleotide Transformer, etc.); its neural component covers earlier embedding approaches rather than billion-parameter foundation models. Researchers requiring modern transformer-scale representations may use BioSeq-BLM's statistical and grammar-based features as complementary inputs alongside those newer architectures.
Li, H.-L., Pang, Y.-H., and Liu, B. (2021) BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Research.
DOI: 10.1093/nar/gkab829