bio.rodeo
© 2026 Pulsatance. All rights reserved.

Galactica

Meta AI

A large language model trained on 48 million scientific papers and knowledge bases to store, combine, and reason about scientific knowledge.

Released: November 2022
Parameters: 120 Billion

Galactica is a large language model developed by Meta AI and Papers with Code, designed specifically to store, combine, and reason about scientific knowledge. Released on November 16, 2022, the model was trained on a curated corpus of 48 million scientific papers, textbooks, reference materials, lecture notes, encyclopedias, and molecular and protein databases — making it one of the first foundation models to target the scientific domain at scale.

The motivation behind Galactica was the growing difficulty researchers face in navigating an exponentially expanding body of scientific literature. Rather than a general-purpose language model, Galactica was conceived as a scientific interface: a system capable of summarizing papers, solving mathematical problems, annotating molecules and proteins, generating scientific code, and synthesizing knowledge across disciplines within a single unified model.

Galactica was released in five sizes ranging from 125 million to 120 billion parameters. Its public demo was withdrawn just three days after launch amid widespread criticism over hallucinated content — including fabricated citations attributed to real authors — highlighting fundamental challenges in applying large language models to domains where factual accuracy is non-negotiable. Despite this, the model weights and code were made publicly available and the work has influenced subsequent scientific AI development.

Key Features

  • Scientific training corpus: Trained on 48 million curated documents spanning peer-reviewed papers, textbooks, lecture notes, encyclopedias, chemical compound databases, and protein sequence repositories, giving the model broad coverage of structured scientific knowledge.
  • Multi-task scientific reasoning: A single model handles diverse tasks including academic paper summarization, LaTeX equation solving, SMILES-based molecule annotation, protein sequence analysis, Wikipedia-style article generation, and scientific code generation.
  • Specialized tokenization: Custom tokenization handles scientific notation, mathematical expressions in LaTeX, chemical formulas in SMILES format, protein sequences, and inline citations — encoding domain-specific structure that standard tokenizers discard.
  • Scalable model family: Released in five sizes (125M, 1.3B, 6.7B, 30B, and 120B parameters), enabling deployment across a wide range of compute environments from local inference to large-scale research clusters.
  • Benchmark performance on technical tasks: Achieved 68.2% accuracy on LaTeX equation probes (versus GPT-3's 49.0%), 41.3% on mathematical MMLU (versus Chinchilla's 35.7%), and 77.6% on PubMedQA.
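The specialized tokenization described in the list above can be illustrated with a small prompt-formatting sketch. The marker strings used here ([START_SMILES], [START_AMINO], [START_REF], and <work>) follow the conventions described in the Galactica paper rather than anything stated on this page, and the helper names are hypothetical. The protein fragment is an arbitrary example string. Treat this as an illustration of the idea, not the project's own preprocessing code.

```python
# Marker strings as described in the Galactica paper (an assumption here;
# this page does not list them).
MARKERS = {
    "smiles": ("[START_SMILES]", "[END_SMILES]"),
    "protein": ("[START_AMINO]", "[END_AMINO]"),
    "citation": ("[START_REF]", "[END_REF]"),
}


def wrap(kind: str, content: str) -> str:
    """Wrap content in the start/end markers for the given modality."""
    start, end = MARKERS[kind]
    return f"{start}{content}{end}"


def citation_prompt(context: str) -> str:
    """End a prompt with an open citation marker so the model completes a reference."""
    return f"{context} {MARKERS['citation'][0]}"


def reasoning_prompt(question: str) -> str:
    """Append the <work> token, which requests step-by-step working."""
    return f"{question}\n\n<work>"


if __name__ == "__main__":
    print(wrap("smiles", "CC(=O)OC1=CC=CC=C1C(=O)O"))  # SMILES string for aspirin
    print(wrap("protein", "MKTAYIAKQR"))  # arbitrary amino-acid fragment
    print(citation_prompt("The Transformer architecture"))
    print(reasoning_prompt("What is the derivative of x^2?"))
```

Any reference the model emits after an open citation marker still needs to be checked against a real bibliography, since fabricated citations were the main criticism of the public demo.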

Technical Details

Galactica uses a decoder-only transformer architecture closely related to GPT-3, with a context window of 2,048 tokens. The training corpus comprises approximately 106 billion tokens of scientific text, with data weighted to favor high-quality, peer-reviewed sources. Scientific notation, mathematical content, and structured sequences are handled through specialized tokens and formatting conventions built into both the training data and the tokenizer. Unlike general-purpose language models trained on web-crawled text, Galactica's training data was deliberately curated to exclude noisy sources in favor of structured scientific content.

The 120B-parameter flagship is the subject of most benchmark evaluations reported in the paper. Training was conducted on A100 GPUs using Meta's research infrastructure. At the time of release, the paper reported state-of-the-art or competitive results on several scientific reasoning benchmarks: 20.4% on MATH (versus 8.8% for PaLM 540B), 52.9% on MedMCQA, and strong performance on chemical and biological annotation tasks.
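To give a sense of what the five model sizes mean in hardware terms, the following back-of-the-envelope sketch estimates the memory needed to hold the weights alone. The parameter counts come from this page. The 80 GB per-GPU figure is an assumption (A100 cards ship in 40 GB and 80 GB variants, and the page does not say which was used). The estimate ignores activations, the key-value cache, and framework overhead, so real requirements are higher.

```python
import math

# Parameter counts for the five released sizes listed on this page.
PARAMS = {
    "125M": 125e6,
    "1.3B": 1.3e9,
    "6.7B": 6.7e9,
    "30B": 30e9,
    "120B": 120e9,
}

# Storage cost per parameter for common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gb(n_params: float, dtype: str = "float16") -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus(n_params: float, dtype: str = "float16", gpu_gb: float = 80.0) -> int:
    """Lower bound on GPUs needed just to hold the weights (gpu_gb is an assumption)."""
    return math.ceil(weight_gb(n_params, dtype) / gpu_gb)


if __name__ == "__main__":
    for name, n in PARAMS.items():
        print(f"{name:>5}: {weight_gb(n):7.2f} GB in float16, at least {min_gpus(n)} GPU(s)")
```

Under these assumptions the 120B model needs 240 GB for float16 weights, so at least three 80 GB devices, while the 125M model fits comfortably on a laptop.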

Applications

Galactica targets workflows that require synthesis or generation of scientific text. Computational biology researchers can use it to annotate protein sequences or chemical compounds, query biological knowledge, or generate summaries of molecular function. It is also suited to mathematical and theoretical work, where it can produce step-by-step solutions in LaTeX. Science writers and educators can use it to draft accessible explanations or wiki-style overviews of technical topics. The openly available model weights allow integration into custom pipelines for literature mining, automated hypothesis generation, and scientific information retrieval, particularly in research environments with enough compute to host the larger models.
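As a minimal sketch of the pipeline integration described above, the snippet below loads a checkpoint through the Hugging Face transformers library and generates a continuation. The model identifier facebook/galactica-125m and the use of OPTForCausalLM follow the public Hugging Face model card rather than anything stated on this page, so treat both as assumptions. The new_token_budget helper is hypothetical and only keeps prompt plus output inside the 2,048-token context window.

```python
CONTEXT_WINDOW = 2048  # tokens, per the technical details on this page


def new_token_budget(prompt_tokens: int, requested: int,
                     context_window: int = CONTEXT_WINDOW) -> int:
    """Clamp the number of generated tokens so prompt + output fits the context window."""
    if prompt_tokens >= context_window:
        raise ValueError("prompt already fills the context window")
    return min(requested, context_window - prompt_tokens)


def generate(prompt: str, model_id: str = "facebook/galactica-125m",
             max_new_tokens: int = 64) -> str:
    """Generate a continuation; the output must be fact-checked before use."""
    # Imported lazily so the budget helper works without transformers installed.
    from transformers import AutoTokenizer, OPTForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = OPTForCausalLM.from_pretrained(model_id)
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    budget = new_token_budget(input_ids.shape[1], max_new_tokens)
    output_ids = model.generate(input_ids, max_new_tokens=budget)
    return tokenizer.decode(output_ids[0])


if __name__ == "__main__":
    # Downloads the smallest checkpoint on first run; requires torch and transformers.
    print(generate("Title: A survey of protein language models\n\nAbstract:"))
```

Starting with the smallest checkpoint keeps experimentation cheap. Whatever the model size, generated summaries and references should be verified against the source literature before they enter a downstream pipeline.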

Impact

Galactica demonstrated that training language models on curated scientific corpora rather than indiscriminate web text could yield measurable improvements on domain-specific benchmarks. Alongside contemporaneous efforts such as Minerva, it helped establish the scientific large language model as a distinct research direction, informing later scientifically focused foundation models such as BioMedLM. However, the swift public backlash following its demo launch remains a cautionary data point in responsible AI deployment: even with high-quality training data, decoder-only autoregressive models can produce confident and fluent but factually incorrect scientific content. This limitation remains an open problem for the field. The galai Python library and model weights hosted on HuggingFace have enabled ongoing academic research into scientific AI, even as the broader community has grappled with the reliability challenges Galactica's launch made visible.

Citation

Galactica: A Large Language Model for Science

Preprint

Taylor, R., et al. (2022) Galactica: A Large Language Model for Science. arXiv.

DOI: 10.48550/arXiv.2211.09085

Citations

Total Citations: 1K
Influential: 94
References: 107

GitHub

Stars: 2.7K
Forks: 264
Open Issues: 30
Contributors: 5
Last Push: 3y ago
Language: Jupyter Notebook
License: Apache-2.0

HuggingFace

Downloads: 1.4K
Likes: 157
Last Modified: 3y ago
Pipeline: text-generation

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)
Score: 46 (Partial)
Usability (can I run it?): 72
Reproducibility (can I retrain it?): 7
Model Openness Framework: Unclassified
Restrictive license on core components

Tags

foundation_model, language_model, multimodal

Resources

GitHub Repository, Research Paper, HuggingFace Model, Documentation