MIMIC

Generative multimodal foundation model spanning DNA, RNA, and protein, with any-to-any inference across genome, transcriptome, and proteome.

Released: April 2026

Parameters: ~1 billion

MIMIC is a generative multimodal foundation model that jointly represents the molecules of the central dogma—DNA, RNA, and protein—together with the regulatory, evolutionary, structural, and contextual signals that constrain them. Most biological foundation models are trained on a single modality (a protein language model, a genomic sequence model, an RNA structure predictor) and therefore cannot reason about how a coding change propagates into altered splicing, structure, or function. MIMIC instead conditions on arbitrary subsets of observed modalities and reconstructs or generates the missing components of a molecular state, allowing any-to-any inference across the genome, transcriptome, and proteome.

The model was developed by Polymathic AI, a research collaboration based at the Flatiron Institute, with computation supported by the Simons Foundation and Schmidt Sciences' AI2050 program. It was released as an arXiv preprint in April 2026 by Siavash Golkar, Shirley Ho, and roughly 30 collaborators. MIMIC is positioned as a unifying alternative to the fragmented landscape of modality-specific biological models, and the authors report that cross-modal supervision improves performance over sequence-only training.

A central claim of the work is that coupled constraints across sequence, structure, regulation, evolution, and cellular context are best learned jointly. By aligning these modalities during pretraining, MIMIC produces representations that transfer to RNA and protein downstream tasks and that enable constrained generative design rather than prediction alone.

Key Features

Multimodal conditioning: MIMIC conditions on any subset of observed modalities—sequence, splice junctions, conservation, chromatin accessibility, RNA chemical probing, protein structure, abundance, and functional text—and reconstructs the rest, consistently improving over sequence-only inputs.
State-of-the-art splicing prediction: On held-out human transcripts the model outperforms SpliceAI and AlphaGenome on splice-site prediction, and its isoform-aware generative formulation uniquely supports inverse design of desired splice patterns.
Generative design: MIMIC produces diverse, high-confidence protein sequences with strong support for target binding and supports constrained tasks such as RNA editing identification and binding-site optimization.
Context-dependent modeling: It models assay-dependent RNA chemical probing by conditioning on experimental context, capturing how measurements vary across conditions.
Transferable representations: Learned features rank first or near-first on 7/11 PFMBench protein tasks and match or exceed baselines on 6/7 mRNABench RNA tasks.

Technical Details

MIMIC is a roughly one-billion-parameter split-track encoder-decoder transformer. Inputs are organized into distinct track groups by biological coordinate system rather than concatenated, with localized RoPE position indices that reset at track boundaries and learnable register tokens that aggregate context across tracks. Training uses around 25 distinct pathways to ensure rare modality combinations are represented, along with a staged curriculum that scales context windows from 1,000 to 10,000 tokens. The model is trained on LORE, a newly curated cross-modal dataset linking nucleic-acid, protein, evolutionary, structural, regulatory, and semantic modalities—comprising 13 million RNA transcripts, 15.5 million proteins, over 4 billion natural-language tokens, and more than 6,000 organisms.
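The localized position indexing described above can be illustrated with a minimal sketch. It is not MIMIC's implementation (the code had not been released at the time of writing); the function name `localized_positions` and the toy track lengths are assumptions made for illustration, and register tokens are left out entirely. The sketch shows only the idea that position indices restart at every track boundary instead of running continuously over a concatenated input.

```python
from typing import List, Sequence


def localized_positions(track_lengths: Sequence[int]) -> List[int]:
    """Return per-token position indices that restart at 0 for each track.

    `track_lengths` holds the number of tokens in each track, in order.
    Hypothetical helper: the name and interface are illustrative only.
    """
    positions: List[int] = []
    for length in track_lengths:
        if length < 0:
            raise ValueError("track lengths must be non-negative")
        # Each track gets its own local coordinate system: 0, 1, ..., length - 1.
        positions.extend(range(length))
    return positions


def global_positions(track_lengths: Sequence[int]) -> List[int]:
    """Contrast case: one running index over the concatenated tracks."""
    return list(range(sum(track_lengths)))


if __name__ == "__main__":
    lengths = [4, 3]  # e.g. a 4-token track followed by a 3-token track
    print(localized_positions(lengths))
    print(global_positions(lengths))
```

With local indices, the first token of every track sits at position 0, so a rotary embedding applied to these indices encodes distance within a track and not an arbitrary offset created by the order in which tracks happen to be laid out.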

Applications

MIMIC targets researchers studying gene regulation, RNA biology, and protein function who need a single model spanning the central dogma. Demonstrated use cases include identifying RNA editing in clinically relevant mutations using evolutionary and structural signals, designing proteins with multimodal conditioning for target binding, predicting and inversely designing splice patterns, and modeling experimental-context-dependent RNA reactivity. Because it conditions on partial observations, it fits naturally into workflows where some modalities are measured and others must be inferred or designed.
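As a rough illustration of the partial-observation workflow, the sketch below splits a sample into the modalities that were measured (the conditioning set) and those left to be inferred or designed. It does not call MIMIC and does not reflect its API, which was unreleased at the time of writing. The function `plan_inference` and the snake_case keys are hypothetical; the modality names themselves follow the list given under Key Features.

```python
from typing import Dict, List, Optional, Tuple

# Modalities named in the Key Features list; the snake_case keys are
# illustrative labels, not identifiers from the MIMIC codebase.
MODALITIES: Tuple[str, ...] = (
    "sequence",
    "splice_junctions",
    "conservation",
    "chromatin_accessibility",
    "rna_chemical_probing",
    "protein_structure",
    "abundance",
    "functional_text",
)


def plan_inference(sample: Dict[str, Optional[object]]) -> Tuple[List[str], List[str]]:
    """Split modalities into (observed, missing) for one sample.

    A modality counts as observed when it is present with a non-None value.
    Hypothetical helper for illustration only.
    """
    unknown = sorted(set(sample) - set(MODALITIES))
    if unknown:
        raise ValueError(f"unknown modalities: {unknown}")
    observed = [m for m in MODALITIES if sample.get(m) is not None]
    missing = [m for m in MODALITIES if sample.get(m) is None]
    if not observed:
        raise ValueError("at least one observed modality is needed to condition on")
    return observed, missing


if __name__ == "__main__":
    # Toy sample: only sequence and a per-base conservation track were measured.
    sample = {"sequence": "ACGU", "conservation": [0.9, 0.1, 0.5, 0.7]}
    observed, missing = plan_inference(sample)
    print("condition on:", observed)
    print("reconstruct or generate:", missing)
```

In an any-to-any setting the same routine applies whichever modalities are present: the observed list defines the conditioning inputs and the missing list defines what the model would be asked to reconstruct or generate.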

Impact

MIMIC offers early evidence that jointly modeling the central dogma yields better downstream performance and richer generative capabilities than modality-specific models, particularly for splicing where it surpasses established baselines such as SpliceAI and AlphaGenome. As a unifying any-to-any framework it suggests a path toward integrated biological foundation models. A key limitation at release is availability: the authors state that code, weights, and LORE assets are in preparation for public release on the Polymathic AI GitHub but are not yet downloadable, and results are reported in a non-peer-reviewed preprint, so independent reproduction remains pending.

Citation

Golkar, S., et al. (2026). MIMIC: A Generative Multimodal Foundation Model for Biomolecules. Preprint. DOI: 10.48550/arXiv.2604.24506

Citations

Total Citations: 25
Influential: 2
References: 99

GitHub

Stars: 36
Forks: 1
Open Issues: 0
Contributors: 1
Last Push: 3 days ago
Language: Python
License: MIT

Openness

bio.rodeo openness: Closed (low usability and reproducibility)
Score: 16 (Closed)
Usability (can I run it?): 15
Reproducibility (can I retrain it?): 0 (not reproducible)
Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository Research Paper Official Website
