All Competitors

Every biological foundation model, evaluated and ranked by the bio.rodeo team


Showing 1–24 of 31 filtered models

Single-cell

mLLMCelltype

Texas A&M University

Multi-LLM consensus framework for automated cell type annotation in scRNA-seq data, outperforming prior methods by ~15% in mean accuracy.

6407
See the scorecard
Protein

Pinal

Westlake University

A 16B-parameter framework for de novo protein design from natural language, converting text descriptions into functional protein sequences via two-stage structure-conditioned generation.

9321
See the scorecard
Protein

Evolla

Westlake University

An 80B-parameter multimodal protein-language model that decodes protein function through natural language dialogue, integrating sequence, structure, and evolutionary context.

671941
See the scorecard
RNA

RhoFold+

ml4bio

End-to-end RNA 3D structure prediction combining the RNA-FM language model with Invariant Point Attention, achieving state-of-the-art results on RNA-Puzzles and CASP15.

227186
See the scorecard
Single-cell

Cell2Sentence

Yale University

Framework that converts single-cell gene expression profiles into ranked gene-name sequences, enabling standard LLMs to generate, annotate, and analyze cells.

85466
See the scorecard
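The rank transformation behind Cell2Sentence is simple enough to sketch: order a cell's genes by descending expression and emit their names as a space-separated "sentence" a standard LLM can read. A minimal illustration, where the gene names and counts are made up for the example and the real pipeline operates on full expression matrices:

```python
# Toy sketch of the cell-to-sentence idea: rank genes by expression
# (highest first), drop zero-count genes, and join the names into text.
# Gene names and values below are illustrative, not from the paper.

def cell_to_sentence(expression, top_k=None):
    """Return gene names ordered by descending expression, zeros removed."""
    ranked = sorted(
        (gene for gene, value in expression.items() if value > 0),
        key=lambda gene: -expression[gene],
    )
    return " ".join(ranked[:top_k])

cell = {"CD3D": 12.0, "GNLY": 3.2, "CD8A": 7.5, "MS4A1": 0.0}
print(cell_to_sentence(cell))  # CD3D CD8A GNLY
```

Because the output is plain text, generation can be inverted: a model that emits a ranked gene list has effectively generated a cell.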
Protein

Compute-Optimal PLM

BioMap

Scaling law study for protein language models that identifies compute-optimal training regimes for CLM and MLM architectures using 939M protein sequences.

1134
See the scorecard
Single-cell

CellPLM

OmicsML

Single-cell transformer that treats cells as tokens and tissues as sentences, encoding cell-cell relationships with 100x faster inference than prior pre-trained models.

10274
See the scorecard
Protein

PLMSearch

Fudan University

Protein language model-based sequence search that detects remote homologs with threefold higher sensitivity than MMseqs2 at comparable speed.

7967
See the scorecard
Single-cell

GPTCelltype

Columbia University / Duke University

An R package that uses GPT-4 to annotate cell types in scRNA-seq data from marker genes, matching expert accuracy across hundreds of cell types and tissues.

224202
See the scorecard
RNA

ERNIE-RNA

Tsinghua University

A structure-enhanced RNA language model that incorporates base-pairing constraints into self-attention, achieving state-of-the-art RNA structure and function prediction.

4127
See the scorecard
DNA & Gene

Caduceus

Kuleshov Lab

Bidirectional, reverse-complement equivariant DNA language models built on Mamba SSMs. Outperforms models 10x larger on long-range variant effect prediction.

232
See the scorecard
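The reverse-complement equivariance noted above has a concrete meaning: running the model on the reverse complement of a strand should give the reversed per-position output of the original strand. A toy check of that property, where the per-position "model" is a stand-in for illustration, not Caduceus itself:

```python
# Reverse-complement (RC) equivariance sketch. The toy per-position model
# scores 1 for G/C and 0 for A/T; complementation preserves GC identity,
# so RC of the input corresponds to reversing the output.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def toy_model(seq):
    # Stand-in per-position feature extractor (hypothetical, RC-equivariant).
    return [1 if base in "GC" else 0 for base in seq]

seq = "ACGGT"
assert toy_model(reverse_complement(seq)) == toy_model(seq)[::-1]
```

Caduceus builds this symmetry into the architecture, so it holds exactly rather than being learned approximately from data augmentation.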
RNA

RiNALMo

LBCB Sci

650M-parameter RNA language model pre-trained on 36M non-coding RNA sequences. Achieves state-of-the-art generalization on secondary structure prediction across unseen RNA families.

161100
See the scorecard
Protein

ProLLaMA

PKU-YuanGroup

A 7B-parameter protein language model built on LLaMA-2 that performs both protein sequence generation and superfamily classification in a unified framework.

See the scorecard
Single-cell

scMulan

Tsinghua University

A 368M-parameter generative language model for single-cell transcriptomics, enabling zero-shot cell type annotation, batch integration, and conditional cell generation.

617
See the scorecard
RNA

RNA-MSM

Peking University / Griffith University

Unsupervised RNA language model using multiple sequence alignments to predict secondary structure and solvent accessibility from evolutionary information.

See the scorecard
Protein

IgLM

GrayLab

Generative language model trained on 558 million antibody sequences for infilling-based design of CDR loops and full-length immunoglobulin sequences.

188104
See the scorecard
Protein

ProGen2

Salesforce

Family of autoregressive protein language models (151M–6.4B parameters) trained on over a billion sequences for protein generation and zero-shot fitness prediction.

699
See the scorecard
Multimodalities

BioT5

Renmin University of China

Pre-training framework bridging molecules, proteins, and natural language using T5 with SELFIES representations for cross-modal biological understanding.

125
See the scorecard
Multimodalities

DARWIN Series

MasterAI EAM

Domain-specific large language models for natural science, fine-tuned on physics, chemistry, and materials science literature using automated instruction generation.

24748
See the scorecard
DNA & Gene

DNABERT-2

MAGICS Lab

Multi-species genomic foundation model replacing k-mer tokenization with BPE, achieving state-of-the-art performance with 21x fewer parameters than prior leading models.

See the scorecard
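The switch from fixed k-mer tokenization to byte-pair encoding (BPE) mentioned above can be illustrated with a single merge step: BPE repeatedly fuses the most frequent adjacent token pair into one vocabulary entry. A minimal sketch, assuming a toy corpus; real BPE vocabularies are learned over a large genomic corpus, and this shows only one iteration:

```python
# One byte-pair-encoding merge step on a DNA string: find the most
# frequent adjacent pair of tokens, then fuse every occurrence of it.
from collections import Counter

def most_frequent_pair(tokens):
    """Most common adjacent token pair (ties broken by first appearance)."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with a single token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("ACGACGTT")
pair = most_frequent_pair(tokens)  # most frequent adjacent pair
print(merge_pair(tokens, pair))
```

Iterating this process yields variable-length tokens that adapt to genomic statistics, in contrast to overlapping k-mers of a fixed length.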
Protein

Ankh

Technical University of Munich

Optimized protein language model that surpasses state-of-the-art performance using less than 10% of the parameters of comparable models.

24656
See the scorecard
Protein

ReprogBERT

IBM

Reprograms a frozen English BERT model for antibody CDR sequence infilling via learnable cross-domain projection matrices, without training a new protein language model.

2411
See the scorecard
RNA

SpliceBERT

Biomed AI

A BERT-based RNA language model pre-trained on 2M+ pre-mRNA sequences from 72 vertebrate species for splicing prediction and variant effect analysis.

548
See the scorecard
Multimodalities

Galactica

Meta AI

A large language model trained on 48 million scientific papers and knowledge bases to store, combine, and reason about scientific knowledge.

2.7K / 3.5K
See the scorecard