A large language model trained on 48 million scientific papers and knowledge bases to store, combine, and reason about scientific knowledge.
Galactica is a large language model developed by Meta AI and Papers with Code, designed specifically to store, combine, and reason about scientific knowledge. Released on November 16, 2022, the model was trained on a curated corpus of 48 million scientific papers, textbooks, reference materials, lecture notes, encyclopedias, and molecular and protein databases — making it one of the first foundation models to target the scientific domain at scale.
The motivation behind Galactica was the growing difficulty researchers face in navigating an exponentially expanding body of scientific literature. Rather than a general-purpose language model, Galactica was conceived as a scientific interface: a system capable of summarizing papers, solving mathematical problems, annotating molecules and proteins, generating scientific code, and synthesizing knowledge across disciplines within a single unified model.
Galactica was released in five sizes ranging from 125 million to 120 billion parameters. Its public demo was withdrawn just three days after launch amid widespread criticism over hallucinated content — including fabricated citations attributed to real authors — highlighting fundamental challenges in applying large language models to domains where factual accuracy is non-negotiable. Despite this, the model weights and code were made publicly available and the work has influenced subsequent scientific AI development.
Galactica uses a decoder-only transformer architecture closely related to GPT-3, trained with a context window of 2,048 tokens. The model was trained on approximately 106 billion tokens drawn from the scientific corpus, with data weighted to favor high-quality, peer-reviewed sources. Scientific notation, mathematical content, and structured sequences are handled through specialized tokens and formatting conventions baked into both the training data and the tokenizer. Unlike general-purpose language models trained on web-crawled text, Galactica's training data was deliberately curated to exclude noisy sources in favor of structured scientific content.
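The paper spells out several of these formatting conventions: citations are wrapped in [START_REF]…[END_REF] markers, SMILES chemical strings and amino-acid sequences get their own start/end tokens, and a special <work> token precedes step-by-step reasoning. A minimal sketch of that markup, with illustrative helper functions that are not part of any released tooling:

```python
# Illustrative helpers for the special-token markup Galactica's corpus uses
# for structured scientific sequences. The token strings follow the
# conventions described in the Galactica paper; the wrapper functions
# themselves are hypothetical.

def wrap_smiles(smiles: str) -> str:
    """Mark up a SMILES chemical string with Galactica's sequence tokens."""
    return f"[START_SMILES]{smiles}[END_SMILES]"

def wrap_protein(sequence: str) -> str:
    """Mark up an amino-acid sequence with Galactica's protein tokens."""
    return f"[START_AMINO]{sequence}[END_AMINO]"

def with_work_token(question: str) -> str:
    """Append the <work> token the paper uses to elicit step-by-step reasoning."""
    return f"{question}\n\n<work>"

print(wrap_smiles("CCO"))  # [START_SMILES]CCO[END_SMILES]
```

Because these markers appear throughout the training data, prompts that reuse them steer the model toward the corresponding behavior, such as emitting a worked derivation after <work>.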
The 120B parameter flagship model was the largest in the family and the subject of most benchmark evaluations reported in the paper. Training was conducted on A100 GPUs using Meta's research infrastructure. The model achieves state-of-the-art or competitive results across a number of scientific reasoning benchmarks: 20.4% on MATH (versus PaLM 540B's 8.8%), 52.9% on MedMCQA, and strong performance on chemical and biological annotation tasks.
Galactica is applicable to a wide range of workflows involving the synthesis or generation of scientific text. Computational biology researchers can use it to annotate protein sequences or chemical compounds, query biological knowledge, or generate summaries of molecular function. It is also suited to mathematical and theoretical work, where it can provide step-by-step solutions in LaTeX. Science writers and educators can use it to draft accessible explanations or wiki-style overviews of technical topics. The openly released model weights allow integration into custom pipelines for literature mining, automated hypothesis generation, and scientific information retrieval, particularly in research environments with the GPU resources to serve the larger checkpoints.
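The released checkpoints were published on the HuggingFace hub under the facebook/ organization, with the five sizes named mini through huge in the galai library. A minimal sketch for selecting a checkpoint in a custom pipeline, assuming those repository names (verify them against the hub before use):

```python
# Hypothetical helper mapping Galactica's five model sizes to their
# HuggingFace repository names. The facebook/galactica-<size> naming and
# the mini/base/standard/large/huge labels are assumptions based on the
# public release; confirm on the hub before depending on them.

CHECKPOINTS = {
    "mini": "facebook/galactica-125m",
    "base": "facebook/galactica-1.3b",
    "standard": "facebook/galactica-6.7b",
    "large": "facebook/galactica-30b",
    "huge": "facebook/galactica-120b",
}

def checkpoint_for(size: str) -> str:
    """Return the hub repository name for a given model size label."""
    return CHECKPOINTS[size]

# Typical use with the transformers library (requires downloading weights,
# so shown here as a comment):
#   from transformers import AutoTokenizer, OPTForCausalLM
#   tokenizer = AutoTokenizer.from_pretrained(checkpoint_for("mini"))
#   model = OPTForCausalLM.from_pretrained(checkpoint_for("mini"))
```

The smaller checkpoints run on a single consumer GPU, while the 120B flagship requires a multi-GPU server, which is why the larger sizes are practical mainly in well-resourced labs.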
Galactica demonstrated that training language models on curated scientific corpora rather than indiscriminate web text could yield measurable improvements on domain-specific benchmarks. It helped establish the scientific large language model as a distinct research direction, sitting alongside contemporaneous efforts such as Minerva and informing later scientifically focused models such as BioMedLM. However, the swift public backlash following its demo launch stands as a cautionary data point in responsible AI deployment: even with high-quality training data, decoder-only autoregressive models can produce confident, fluent, but factually incorrect scientific content, and this remains an open problem for the field. The galai Python library and the model weights hosted on HuggingFace have enabled ongoing academic research into scientific AI, even as the broader community has grappled with the reliability challenges Galactica's launch made visible.