There’s a lot of excitement these days about large language models (LLMs) and the new opportunities—and risks—they bring. As someone who’s spent a fair bit of time in the NLP and LLM trenches, I’ve seen firsthand how creative people can get when trying to “puppeteer” these models into doing things they shouldn’t.

Recently, I built a small tool called Puppetry Detector. It’s not a silver bullet, and it’s certainly not the only approach out there, but I think it fills a useful niche: detecting policy puppetry and prompt injection attempts using nothing more than plain Python and regular expressions.

Why Not Just Use an LLM to Catch LLM Attacks?

It’s tempting to throw a big model at every problem. But sometimes, the simplest solution is the best place to start. Regex patterns are:

Fast (milliseconds, not seconds)
Transparent (you can see exactly what’s being matched)
Easy to audit and extend (no black boxes)
Lightweight (no GPU, no cloud, no fuss)

I wanted something that could be dropped into a pipeline, run on a Raspberry Pi, or be used as a building block for more complex systems. If you need more power, you can always layer on spaCy, or even call out to an LLM for a second opinion. But for a lot of cases, simple pattern matching gets you surprisingly far.

How It Works

Puppetry Detector scans prompts for patterns that look like attempts to override roles, bypass security, or otherwise manipulate the model’s behavior. It’s inspired by research from HiddenLayer and others, and is designed to be:

Configurable: You can add or tweak patterns as new attacks emerge.
Composable: Use it on its own, or as part of a larger moderation pipeline.

Integration with Rebuff

One of the things I appreciate about the Python ecosystem is how easy it is to make tools work together. Puppetry Detector comes with a simple adapter for Rebuff, a prompt moderation framework. This means you can use it as a filter in your Rebuff pipeline, combining it with other detectors for layered defense.

What’s Next?

This project is a starting point, not an endpoint. There’s plenty of room to grow:

Smarter Patterns: As attackers get more creative, so can our regexes.
Deeper Analysis: Plug in spaCy or other NLP tools for more nuanced checks.
LLM-Assisted Moderation: For the really tricky cases, call out to a model to estimate intent or even spot subtle lies.

I’m sharing this not because I think it’s perfect, but because I believe in the value of small, understandable tools. If you’re working in LLM safety, prompt engineering, or just want to see how far you can get with a few lines of Python, I hope you’ll find something useful here.

If you have ideas, feedback, or want to collaborate, I’d love to hear from you. The code is open source (GitHub repo), and contributions are always welcome.

Thanks for reading. Here’s to building safer, more transparent AI—one small tool at a time.

Detecting Policy Puppetry in LLM Prompts: A Simple, Transparent Approach

Table of contents

Why Not Just Use an LLM to Catch LLM Attacks?

How It Works

Integration with Rebuff

What’s Next?

Subscribe to my newsletter

Alex Alexapolsky

Alex Alexapolsky

Detecting Policy Puppetry in LLM Prompts: A Simple, Transparent Approach

Table of contents

Why Not Just Use an LLM to Catch LLM Attacks?

How It Works

Integration with Rebuff

What’s Next?

Why Share This?

Subscribe to my newsletter

Alex Alexapolsky

Alex Alexapolsky