Shanghai Jiao Tong University / Microsoft
A multi-task EEG foundation model that treats brain signals as a foreign language, pairing a text-aligned neural tokenizer with a GPT-2 backbone.
NeuroLM is a foundation model for electroencephalography (EEG) that reframes brain-signal analysis as a language-modeling problem: raw EEG is converted into discrete tokens and processed by a large language model, allowing a single network to handle many decoding tasks without a separate classifier head per dataset. It was introduced by Wei-Bang Jiang and Bao-Liang Lu at Shanghai Jiao Tong University together with Yansen Wang and Dongsheng Li at Microsoft Research Asia, first posted to arXiv in August 2024 and accepted at ICLR 2025.
Most prior EEG deep-learning models are trained and evaluated on one task at a time, so a model built for sleep staging cannot also flag abnormal recordings or classify emotion without retraining. NeuroLM instead treats EEG as a "foreign language" that is aligned to text, then uses multi-task instruction tuning to teach one model to follow natural-language prompts across heterogeneous EEG benchmarks. To the authors' knowledge it is the first multi-task EEG foundation model, and it builds directly on the team's earlier neural tokenizer work, LaBraM.
The result is a unified system that ingests multi-channel EEG and emits predictions for tasks as different as abnormality detection, event classification, and emotion recognition, all driven by the same instruction-tuned backbone rather than task-specific fine-tuning pipelines.
NeuroLM couples a vector-quantized (VQ) neural tokenizer with a GPT-2 language-model backbone. The tokenizer is trained to reconstruct both the temporal and frequency domains of EEG while adversarial domain classifiers align its codebook with a text embedding space; the LLM is then trained with multi-channel autoregressive objectives and adapted via instruction tuning. Pretraining used roughly 25,000 hours of EEG, dominated by the Temple University EEG corpus (~24,000 hours) and supplemented by datasets such as SEED and BCI Competition IV. The largest variant, NeuroLM-XL, has about 1.7 billion parameters (1,696M).
Evaluation spans six benchmarks: TUAB (abnormal detection), TUEV (event classification), SEED (emotion recognition), HMC (sleep staging), Workload (cognitive load), and TUSL (slowing events). A single instruction-tuned NeuroLM handles all six tasks, which otherwise require separately trained baselines, though on individual tasks dedicated models can still edge it out: on TUAB, NeuroLM-XL reaches a balanced accuracy of 0.797 versus 0.814 for the single-task LaBraM-Base.
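The vector-quantization step at the heart of the tokenizer can be illustrated with a minimal sketch. This is not NeuroLM's implementation: it assumes the EEG patches have already been turned into embeddings by some encoder, and it only shows the generic nearest-codebook lookup that converts continuous vectors into discrete token ids. The function names (`quantize`, `dequantize`), the two-dimensional embeddings, and the three-entry codebook are all hypothetical toy choices.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each embedding to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(vec, codebook[k]))
        for vec in embeddings
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token id."""
    return [list(codebook[t]) for t in tokens]


# Toy values for illustration only (hypothetical, not taken from NeuroLM).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
embeddings = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.4, 0.4]]

tokens = quantize(embeddings, codebook)
print(tokens)  # [0, 1, 2, 0]
```

In a real VQ tokenizer the codebook is learned jointly with the encoder and decoder, and the resulting token ids are what the language model consumes in place of raw signal values.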
NeuroLM targets clinical and neuroscience workflows where a lab must run many different EEG analyses from a shared backbone rather than maintaining a fleet of task-specific models: screening recordings for abnormality, classifying epileptiform or slowing events, staging sleep, gauging cognitive workload, or decoding affective state. By exposing tasks as natural-language instructions, it lowers the engineering burden of adding a new EEG task and makes it feasible to deploy one model across the diagnostic and brain-computer-interface settings that previously each required bespoke training.
NeuroLM is notable as the first EEG foundation model to unify multiple decoding tasks under a single instruction-following LLM, demonstrating that the tokenize-then-language-model recipe that reshaped protein and genomic modeling can extend to neural time series. Its ICLR 2025 acceptance, open MIT-licensed code, and released checkpoints have made it a reference point for "EEG-as-language" research and multi-task biosignal foundation models. The main limitation is that per-task accuracy does not yet consistently surpass strong single-task specialists such as LaBraM, leaving headroom for better tokenization and alignment in future work.
Jiang, W., et al. (2024) NeuroLM: A Universal Multi-task Foundation Model for Bridging the Gap between Language and EEG Signals. International Conference on Learning Representations.
DOI: 10.48550/arXiv.2409.00101