bio.rodeo

The authoritative source for evaluating biological foundation models. No hype, just honest analysis.

© 2026 Pulsatance. All rights reserved.
RNA, Protein, DNA & Gene

MIMIC

Polymathic AI

Generative multimodal foundation model that jointly models DNA, RNA, protein, and cellular context across six biological modalities, with SOTA splicing prediction.

Released: April 2026
Parameters: ~1 billion

MIMIC is a generative multimodal foundation model that jointly represents the molecules of the central dogma—DNA, RNA, and protein—together with the regulatory, evolutionary, structural, and contextual signals that constrain them. Most biological foundation models are trained on a single modality (a protein language model, a genomic sequence model, an RNA structure predictor) and therefore cannot reason about how a coding change propagates into altered splicing, structure, or function. MIMIC instead conditions on arbitrary subsets of observed modalities and reconstructs or generates the missing components of a molecular state, allowing any-to-any inference across the genome, transcriptome, and proteome.

The model was developed by Polymathic AI, a research collaboration based at the Flatiron Institute, with computation supported by the Simons Foundation and Schmidt Sciences' AI2050 program. It was released as an arXiv preprint in April 2026 by Siavash Golkar, Shirley Ho, and roughly 30 collaborators. MIMIC is positioned as a unifying alternative to the fragmented landscape of modality-specific biological models, demonstrating that cross-modal supervision improves performance over sequence-only training.

A central claim of the work is that coupled constraints across sequence, structure, regulation, evolution, and cellular context are best learned jointly. By aligning these modalities during pretraining, MIMIC produces representations that transfer to RNA and protein downstream tasks and that enable constrained generative design rather than prediction alone.

#Key Features

  • Multimodal conditioning: MIMIC conditions on any subset of observed modalities—sequence, splice junctions, conservation, chromatin accessibility, RNA chemical probing, protein structure, abundance, and functional text—and reconstructs the rest, consistently improving over sequence-only inputs.
  • State-of-the-art splicing prediction: On held-out human transcripts the model outperforms SpliceAI and AlphaGenome on splice-site prediction, and its isoform-aware generative formulation uniquely supports inverse design of desired splice patterns.
  • Generative design: MIMIC generates diverse, high-confidence protein sequences with strong support for target binding, and it also handles constrained tasks such as RNA editing identification and binding-site optimization.
  • Context-dependent modeling: It models assay-dependent RNA chemical probing by conditioning on experimental context, capturing how measurements vary across conditions.
  • Transferable representations: Learned features rank first or near-first on 7 of 11 PFMBench protein tasks and match or exceed baselines on 6 of 7 mRNABench RNA tasks.
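
The multimodal conditioning described in the first bullet can be pictured as splitting whatever modalities a record contains into observed inputs and targets to reconstruct. The following is a minimal sketch under stated assumptions, not the authors' implementation: the modality keys mirror the list above, but the record layout, the function name `split_observed_targets`, and the uniform random split are all hypothetical.

```python
import random

# Modality names mirror the feature list above; the dict-based record
# layout is an assumption made for illustration only.
MODALITIES = [
    "sequence", "splice_junctions", "conservation", "chromatin_accessibility",
    "rna_chemical_probing", "protein_structure", "abundance", "functional_text",
]


def split_observed_targets(record, rng):
    """Randomly partition the modalities present in `record` into
    conditioning inputs and reconstruction targets (hypothetical helper)."""
    present = [m for m in MODALITIES if record.get(m) is not None]
    if len(present) < 2:
        raise ValueError("need at least two modalities to form a split")
    # Keep at least one modality observed and at least one as a target.
    k = rng.randint(1, len(present) - 1)
    observed = set(rng.sample(present, k))
    inputs = {m: record[m] for m in present if m in observed}
    targets = {m: record[m] for m in present if m not in observed}
    return inputs, targets


# Example record with one unmeasured modality (functional_text is None).
example = {
    "sequence": "ACGUACGU",
    "conservation": [0.9, 0.8, 0.1, 0.2, 0.9, 0.9, 0.3, 0.4],
    "abundance": 3.2,
    "functional_text": None,
}
inputs, targets = split_observed_targets(example, random.Random(0))
```

Modalities that were never measured are simply left out of both sides, which is one way to represent the partially observed records the feature list describes.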

#Technical Details

MIMIC is a split-track encoder-decoder transformer with roughly one billion parameters. Rather than concatenating all inputs into one sequence, it organizes them into distinct track groups by biological coordinate system. Position indices are localized: the RoPE indices reset at track boundaries, and learnable register tokens aggregate context across tracks. Training uses around 25 distinct pathways to ensure rare modality combinations are represented, along with a staged curriculum that scales context windows from 1,000 to 10,000 tokens. The model is trained on LORE, a newly curated cross-modal dataset linking nucleic-acid, protein, evolutionary, structural, regulatory, and semantic modalities. LORE comprises 13 million RNA transcripts, 15.5 million proteins, and over 4 billion natural-language tokens, spanning more than 6,000 organisms.
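
To make the idea of position indices that reset at track boundaries concrete, here is a minimal sketch assuming tokens are laid out track by track. The function names, the track lengths, and the conventional RoPE base of 10000 are assumptions for illustration. The preprint's actual indexing scheme and its handling of register tokens may differ, and register tokens are omitted here.

```python
def localized_position_ids(track_lengths):
    """Return per-token positions that restart at 0 for every track,
    plus a parallel list of track ids (hypothetical helper)."""
    positions, track_ids = [], []
    for track_id, length in enumerate(track_lengths):
        positions.extend(range(length))          # restart at each boundary
        track_ids.extend([track_id] * length)
    return positions, track_ids


def rope_angles(position, head_dim, base=10000.0):
    """Standard RoPE rotation angles for one position:
    angle_i = position * base ** (-2 * i / head_dim)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [position * base ** (-2.0 * i / head_dim)
            for i in range(head_dim // 2)]


# Two tracks of lengths 3 and 2:
# localized_position_ids([3, 2]) -> ([0, 1, 2, 0, 1], [0, 0, 0, 1, 1])
positions, track_ids = localized_position_ids([3, 2])

# Tokens at the same local offset in different tracks get identical angles.
angles_track0 = rope_angles(positions[1], head_dim=4)
angles_track1 = rope_angles(positions[4], head_dim=4)
```

Because the rotation depends only on the local offset, tokens in different tracks that share a coordinate receive the same positional encoding. In this sketch the track id is the only thing that tells them apart.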

#Applications

MIMIC targets researchers studying gene regulation, RNA biology, and protein function who need a single model spanning the central dogma. Demonstrated use cases include identifying RNA editing in clinically relevant mutations using evolutionary and structural signals, designing proteins with multimodal conditioning for target binding, predicting and inversely designing splice patterns, and modeling experimental-context-dependent RNA reactivity. Because it conditions on partial observations, it fits naturally into workflows where some modalities are measured and others must be inferred or designed.

#Impact

MIMIC offers early evidence that jointly modeling the central dogma yields better downstream performance and richer generative capabilities than modality-specific models, particularly for splicing, where it surpasses established baselines such as SpliceAI and AlphaGenome. As a unifying any-to-any framework, it suggests a path toward integrated biological foundation models. A key limitation at release is availability: the authors state that code, weights, and LORE assets are being prepared for public release on the Polymathic AI GitHub but are not yet downloadable. The results are also reported only in a non-peer-reviewed preprint, so independent reproduction remains pending.

Tags

structure_prediction, protein_design, variant_effect_prediction, transformer, foundation_model, multimodal, generative, splicing, genomics