bio.rodeo

BioSeq-BLM

Beijing Institute of Technology

An integrated platform implementing 155 biological language models for analyzing DNA, RNA, and protein sequences across residue-level and sequence-level tasks.

Released: December 2021

BioSeq-BLM is a unified computational platform developed by Hong-Liang Li, Yi-He Pang, and Bin Liu at the Beijing Institute of Technology's School of Computer Science and Technology. Published in Nucleic Acids Research in December 2021, the platform addresses a fundamental challenge in computational biology: the lack of a systematic, integrated framework for applying the full spectrum of natural language processing techniques to biological sequence analysis. Prior to BioSeq-BLM, researchers working on DNA, RNA, or protein classification tasks faced a fragmented landscape of tools, each supporting only a narrow slice of available feature representations and machine learning methods.

The platform treats biological sequences as natural language and organizes 155 biological language models (BLMs) into four complementary families: Biological Grammar Language Models (BGLMs), Biological Statistical Language Models (BSLMs), Biological Neural Language Models (BNLMs), and Biological Semantic Similarity Language Models (BSSLMs). Each family captures distinct properties of sequence data, from syntactic k-mer rules and word co-occurrence statistics to deep neural embeddings and sequence similarity scores. This comprehensive taxonomy allows researchers to benchmark representations systematically and select the most appropriate model for a given biological prediction task without switching between disparate software environments.
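The "sequence as language" framing above can be illustrated with a minimal sketch that splits a sequence into overlapping k-mer "words" and counts them. The function names `kmer_words` and `kmer_counts` are hypothetical and are not part of BioSeq-BLM; the overlapping, stride-1 tokenization is an assumption chosen for illustration.

```python
from collections import Counter
from typing import List


def kmer_words(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mer 'words' (stride 1)."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count how often each k-mer word occurs in the sequence."""
    return Counter(kmer_words(sequence, k))


# Hypothetical toy input, not taken from any benchmark dataset.
words = kmer_words("ATGCGAT", k=3)
counts = kmer_counts("ATATAT", k=2)
print(words)
print(counts)
```

Word counts of this kind are the raw material for grammar-style and statistical representations; the platform's own models may tokenize and weight sequences differently.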

BioSeq-BLM extends the earlier BioSeq-Analysis2.0 platform from the same group, nearly doubling the number of supported grammar models, introducing statistical and neural language model families absent from the predecessor, and adding GPU-accelerated deep learning classifiers. A web server at bliulab.net/BioSeq-BLM and a downloadable standalone package make the platform accessible to both bioinformaticians comfortable with command-line tools and biologists preferring a graphical interface.

#Key Features

  • 155 integrated biological language models: The platform spans four model families — 58 grammar-based, 48 statistical, 41 neural (including word2vec, GloVe, fastText, and transformer variants), and 8 semantic similarity models — providing the most comprehensive BLM collection available in a single tool as of its publication.
  • Unified multi-molecule support: A single workflow handles DNA, RNA, and protein sequences, with analysis at both the residue level (per-position labeling) and the sequence level (whole-sequence classification), removing the need for molecule-specific tools.
  • Integrated machine learning pipeline: Built-in classifiers include SVM, random forest, CNN, LSTM, GRU, and transformer architectures. Feature selection (chi-square, mutual information, tree-based), dimensionality reduction (PCA, kernel PCA, TSVD), and imbalanced-dataset correction (SMOTE, Tomek links) are all available within the same pipeline.
  • Comprehensive evaluation framework: Supports nine binary classification metrics (accuracy, MCC, AUC, sensitivity, specificity, precision, F-measure, balanced accuracy, G-mean) plus multi-class accuracy, with cross-validation and independent test set evaluation.
  • GPU acceleration and multithreading: The standalone package supports CUDA-enabled GPU inference and parallel processing, enabling batch analysis of large sequence datasets that would be impractical on a web server.
  • Web server and standalone package: Users can access a point-and-click web interface for exploratory work or download the full Python package for reproducible, automated pipelines with command-line control.
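The binary classification metrics named in the evaluation feature can be sketched from first principles. This is a minimal illustration using standard textbook definitions (F-measure as F1, G-mean as the geometric mean of sensitivity and specificity, AUC as the fraction of correctly ranked positive/negative pairs); whether BioSeq-BLM uses exactly these formulas is an assumption. The function names `binary_metrics` and `auc` are hypothetical.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute threshold-based metrics from 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero on degenerate inputs.
        return num / den if den else 0.0

    sens = ratio(tp, tp + fn)
    spec = ratio(tn, tn + fp)
    prec = ratio(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "sensitivity": sens,
        "specificity": spec,
        "precision": prec,
        "f_measure": ratio(2 * prec * sens, prec + sens),
        "balanced_accuracy": (sens + spec) / 2,
        "g_mean": math.sqrt(sens * spec),
    }


def auc(y_true, scores):
    """Rank-based AUC: share of (positive, negative) pairs ordered correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels and scores for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
print(binary_metrics(y_true, y_pred))
print(auc(y_true, scores))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch but not for large datasets.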

#Technical Details

BioSeq-BLM is implemented in Python (98% of the codebase) and is compatible with Python 3.7 or later, with optional CUDA 10.0 and cuDNN 7.4+ support for GPU acceleration. The 155 BLMs are organized as follows: BGLMs comprise 29 syntax rule-based models and 29 word property-based models that encode sequence composition and physicochemical properties; BSLMs include 12 bag-of-words, 12 TF-IDF, 12 TextRank, and 12 topic models (LSA, PLSA, LDA, Labeled-LDA); BNLMs include 36 word embedding models and 5 automatic feature extraction architectures; and BSSLMs provide 8 models based on pairwise sequence similarity scores. Optionally, the platform integrates with external tools — BLAST for homology search, PSIPRED and SPIDER2 for protein secondary structure, ViennaRNA for RNA folding, and rate4site for evolutionary rate estimation — to augment neural representations with domain-specific biological features.
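To make the statistical family more concrete, here is a minimal sketch of a TF-IDF representation over k-mer words. It assumes overlapping k-mers as the vocabulary and the plain, unsmoothed weighting tf × ln(N / df); the weighting used by BioSeq-BLM's TF-IDF models may differ. The names `kmer_words` and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def kmer_words(sequence, k):
    """Overlapping k-mer 'words' of a sequence (stride 1)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(sequences, k=2):
    """Return (vocabulary, vectors) with one TF-IDF vector per sequence."""
    docs = [Counter(kmer_words(s, k)) for s in sequences]
    n_docs = len(docs)

    # Document frequency: in how many sequences each word appears.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    vocab = sorted(df)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append([
            (doc[w] / total) * math.log(n_docs / df[w]) if total else 0.0
            for w in vocab
        ])
    return vocab, vectors


# Hypothetical toy sequences for illustration only.
vocab, vectors = tfidf(["AAAA", "AATT"], k=2)
print(vocab)
print(vectors)
```

Words that occur in every sequence (here `AA`) receive zero weight, which is the property that distinguishes TF-IDF from a plain bag-of-words count.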

On benchmark tasks, predictors constructed with BioSeq-BLM matched or exceeded contemporary state-of-the-art methods. For RNA-binding protein identification, the platform achieved 6 to 13 percent AUC improvements over TriPepSVM, RNAPred, and RBPPred. For intrinsically disordered region detection, it improved AUC by 8.7 to 12.6 percent over the best BioSeq-Analysis2.0 predictors. DNA-binding protein prediction reached 81.58% accuracy, exceeding PseDNA-Pro, and microRNA precursor classification matched the performance of iMcRNA.

#Applications

BioSeq-BLM is designed for researchers building binary or multi-class predictors from raw biological sequences without extensive feature engineering expertise. Typical use cases include functional site identification in DNA (e.g., DNase I hypersensitive sites, transcription factor binding sites), RNA classification tasks (microRNA precursor vs. hairpin discrimination, splice site prediction), and protein function annotation (DNA-binding protein identification, RNA-binding protein classification, intrinsically disordered region detection). The platform is particularly valuable when multiple feature representations need to be compared head-to-head to determine which BLM family best captures the signal for a given target, a step that would otherwise require implementing and validating each representation independently. It fits naturally into bioinformatics workflows that use FASTA-format sequence inputs and standard classification benchmarking protocols.
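Since the workflows described above start from FASTA-format input, a minimal reader may help show what that input looks like in code. This is a generic sketch, not BioSeq-BLM's own parser; the function name `read_fasta` and the example records are hypothetical.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may wrap; join them into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical in-memory FASTA text standing in for a file on disk.
example = io.StringIO(">seq1 positive\nATGC\nGGTA\n>seq2 negative\nTTAA\n")
records = read_fasta(example)
print(records)
```

The same function accepts an open file object, so `read_fasta(open(path))` would work for a real file under the same assumptions.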

#Impact

BioSeq-BLM contributes a systematized vocabulary for applying language model concepts to biology at a time when the field was rapidly adopting NLP methods but lacked consolidated benchmarking frameworks. By unifying 155 models in a single platform, it lowered the barrier to rigorously comparing representation strategies and helped establish the linguistic analogy (k-mers as words, sequences as sentences) as a broadly applicable framing for sequence analysis. The platform was followed by BioSeq-Diabolo (2023, PLOS Computational Biology), which focuses specifically on biological sequence similarity analysis and can be chained with BioSeq-BLM in multi-stage pipelines. A notable limitation is that BioSeq-BLM predates the large pretrained protein and DNA language models that emerged from 2022 onward (ESM-2, Nucleotide Transformer, etc.); its neural component covers earlier embedding approaches rather than billion-parameter foundation models. Researchers who need modern transformer-scale representations can still use BioSeq-BLM's statistical and grammar-based features as complementary inputs alongside those newer architectures.

Citation

Li, H., et al. (2021) BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models. Nucleic Acids Research.

DOI: 10.1093/nar/gkab829

Citations

Total Citations: 221
Influential: 3
References: 108

GitHub

Stars: 14
Forks: 4
Open Issues: 3
Contributors: 1
Last Push: 3y ago
Language: Python
License: BSD-2-Clause

Openness

bio.rodeo openness: 48 (Partial)
Usability (can I run it?): 56
Reproducibility (can I retrain it?): 54
Model Openness Framework: Unclassified (restrictive license on core components)

Tags

dna, foundation_model, sequence_analysis

Resources

GitHub Repository, Research Paper, Official Website, Documentation, Dataset