PlasmidLM

Promptable DNA language model that generates multi-kilobase plasmid sequences from plain-language component specs, refined with verifiable rewards.

Released: May 2026

Parameters: 19.3 Million

PlasmidLM is a promptable DNA language model for plasmid design, developed by McClain Thiel and Chris P. Barnes at University College London (UCL) and released as a bioRxiv preprint in May 2026. Plasmids are the workhorse vectors of molecular biology and synthetic biology, encoding origins of replication, selectable markers, promoters, and payloads across multiple kilobases. Designing a functional plasmid traditionally requires manual assembly of validated parts and careful checking of sequence-level constraints, a process PlasmidLM aims to automate by generating complete constructs directly from a natural-language description of the desired components.

The model's central innovation is the application of verifiable-reward post-training to DNA generation. Rather than relying purely on likelihood maximization over a sequence corpus, PlasmidLM refines a pretrained autoregressive base model using Group Relative Policy Optimization (GRPO), where the reward is computed from a curated registry of sequence motifs. This connects recent advances in reinforcement-learning-based alignment of language models to the concrete, checkable requirements of a useful plasmid, such as the presence of a specified resistance gene or reporter.
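The group-relative scoring described above can be illustrated with a minimal sketch. It assumes a reward equal to the fraction of requested components whose motif appears on either strand, and the mean and standard-deviation normalization commonly used in GRPO. The two-entry `TOY_REGISTRY`, its placeholder sequences, and the function names are hypothetical; they do not reproduce the 660-entry registry or the authors' actual reward function.

```python
from statistics import mean, pstdev

# Hypothetical two-entry registry (assumption): short placeholder strings,
# not validated motifs from the real 660-entry registry.
TOY_REGISTRY = {
    "kanR": "ATGAGCCATATTCAACGG",
    "EGFP": "ATGGTGAGCAAGGGCGAG",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_reward(seq: str, requested: list[str], registry=TOY_REGISTRY) -> float:
    """Fraction of requested components found on either strand (assumed reward)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    hits = sum(
        1 for name in requested
        if registry[name] in seq or registry[name] in rc
    )
    return hits / len(requested)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled sequences (toy data).
requested = ["kanR", "EGFP"]
kan, egfp = TOY_REGISTRY["kanR"], TOY_REGISTRY["EGFP"]
group = [
    kan + "AAAA" + egfp,        # both components present
    kan + "CCCC",               # only kanR
    "ACGTACGT",                 # neither
    reverse_complement(egfp),   # EGFP on the opposite strand
]
rewards = [motif_reward(s, requested) for s in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

In a full GRPO step these advantages would weight the policy update for each sampled sequence; that part is omitted here.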

PlasmidLM builds on PlasmidGPT, a related autoregressive base model pretrained on roughly 153,000 engineered plasmids from Addgene. It inherits that base and adds a reward-driven post-training stage, positioning it alongside other generative genomic models while targeting the specific, practically constrained problem of vector construction. It follows directly from earlier work by the same UCL group (Cunningham et al., bioRxiv, December 2025), which fine-tuned PlasmidGPT on a curated PlasmidScope and Addgene library. That study used bioinformatic filtering to narrow 1,000 generations down to 16 candidates, then synthesised and experimentally validated three AI-generated E. coli plasmid backbones in vivo, demonstrating growth, antibiotic resistance, and GFP expression. PlasmidLM's verifiable-reward post-training extends that line of work by steering generation toward specified components rather than relying on post-hoc filtering.

Key Features

Promptable design: Accepts human-readable component specifications (e.g., "high-copy E. coli vector with kanamycin resistance and EGFP reporter") and generates corresponding multi-kilobase plasmid sequences.
Verifiable-reward post-training: Uses GRPO with a 660-entry sequence-motif registry as the reward signal, rewarding generated sequences that contain the requested functional elements.
Pretrained autoregressive base: Initialized from PlasmidGPT, trained on approximately 153,000 engineered Addgene plasmids, providing a strong prior over real-world vector architecture.
Compact and reproducible: A 19.3M-parameter model that runs from a fixed checkpoint, making inference lightweight relative to much larger genomic models and results easier to reproduce.
Benchmarked success rates: Achieves a 48.5% useful-plasmid rate single-shot and 89.7% with best-of-4 sampling on a held-out benchmark.

Technical Details

PlasmidLM is a 19.3M-parameter autoregressive transformer that operates over DNA sequence. The base model, PlasmidGPT, is pretrained on roughly 153,000 engineered plasmids sourced from Addgene, learning the statistical structure of real vector backbones and payloads. Post-training applies Group Relative Policy Optimization (GRPO), a verifiable-reward reinforcement-learning method, in which candidate sequences are scored against a registry of 660 sequence motifs that encode the requested functional components; sequences satisfying more of the specified constraints receive higher reward. The model is distributed as a fixed checkpoint. On a 1,000-prompt held-out benchmark, PlasmidLM produces a useful plasmid 48.5% of the time in a single shot, rising to 89.7% when the best of four sampled sequences is selected, demonstrating that modest oversampling substantially improves the rate of constraint-satisfying constructs.
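As a rough sanity check on the two reported rates, the sketch below computes what best-of-4 success would be if each of the four samples succeeded independently at the single-shot rate. The independence assumption and the function name are illustrative and are not taken from the paper.

```python
def best_of_k_if_independent(p_single: float, k: int) -> float:
    """Probability that at least one of k independent draws succeeds."""
    return 1.0 - (1.0 - p_single) ** k


predicted = best_of_k_if_independent(0.485, 4)
reported = 0.897  # best-of-4 rate quoted above

print(f"independent-draw prediction: {predicted:.3f}")  # 0.930
print(f"reported best-of-4:          {reported:.3f}")
```

The reported figure sits slightly below this idealized prediction. One possible reading, offered here only as an interpretation, is that success is correlated within a prompt, so some prompts remain hard across all four samples.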

Applications

PlasmidLM is aimed at molecular biologists and synthetic biology engineers who need to assemble plasmid vectors from a high-level description rather than manually curating parts. By translating specifications such as copy number, host organism, resistance marker, and reporter into candidate full-length sequences, it can accelerate early-stage construct design, support rapid iteration over vector variants, and lower the expertise barrier for routine cloning workflows. The best-of-N sampling strategy makes it practical to generate several candidates and select one that meets the requested constraints before downstream synthesis and validation.
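The generate-then-select workflow mentioned above might look like the following sketch. The `generate` callable stands in for a call to the model, and `fraction_present` is a simplified exact-match check; both are hypothetical and are not the released PlasmidLM interface or its benchmark scoring.

```python
from typing import Callable, Sequence, Tuple


def fraction_present(seq: str, required: Sequence[str]) -> float:
    """Share of required motifs found in seq by exact substring match."""
    return sum(m in seq for m in required) / len(required)


def best_of_n(
    generate: Callable[[], str],
    required: Sequence[str],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidates and return the highest-scoring one with its score."""
    candidates = [generate() for _ in range(n)]
    scored = [(fraction_present(c, required), c) for c in candidates]
    best_score, best_seq = max(scored, key=lambda pair: pair[0])
    return best_seq, best_score


# Toy stand-in for the model: replays a fixed list of "generations".
toy_outputs = iter(["ACGTACGT", "TTGACAGGATCC", "GGATCCAAATTGACA", "ACGT"])
required_motifs = ["TTGACA", "GGATCC"]  # placeholder motifs (assumption)

picked, score = best_of_n(lambda: next(toy_outputs), required_motifs, n=4)
print(picked, score)
```

A candidate chosen this way has only passed a sequence-level check; downstream synthesis and validation are still needed before it can be treated as functional.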

Impact

PlasmidLM demonstrates that verifiable-reward post-training—an approach popularized for aligning general-purpose language models—can be transferred to genomic sequence generation, where success is defined by checkable biological constraints rather than human preference. By coupling a domain-specific pretrained base (PlasmidGPT) with a motif-based reward, it offers a template for steering generative DNA models toward functional, specification-compliant outputs. As a compact, openly described model with released code and weights, it provides a reproducible starting point for further work on controllable plasmid and vector design. The licensing terms for the released weights were not confirmed at the time of writing.

Citations

PlasmidLM: A Promptable DNA Language Model via Verifiable-Reward Post-Training

Thiel, M. & Barnes, C. P. (2026) PlasmidLM: A Promptable DNA Language Model via Verifiable-Reward Post-Training. bioRxiv.

DOI: 10.64898/2026.05.19.725242

Generative design and construction of functional plasmids with a DNA language model

Cunningham, A. G., et al. (2025) Generative design and construction of functional plasmids with a DNA language model. bioRxiv.

DOI: 10.64898/2025.12.06.692736

Citations

Total Citations: 0

Influential: 0

References: 37

GitHub

Stars: 2

Forks: 1

Open Issues: 0

Contributors: 1

Last Push: 1mo ago

Language: Python

License: MIT

Openness

bio.rodeo openness: Open weights (open weights, closed recipe)

Score: 49 (Partial)

Usability (can I run it?): 60

Reproducibility (can I retrain it?): 27

Model Openness Framework: Unclassified (missing required components)

Resources

GitHub Repository Research Paper Research Paper HuggingFace Model
