Detecting Policy Puppetry in LLM Prompts: A Simple, Transparent Approach

There’s a lot of excitement these days about large language models (LLMs) and the new opportunities—and risks—they bring. As someone who’s spent a fair bit of time in the NLP and LLM trenches, I’ve seen firsthand how creative people can get when trying to “puppeteer” these models into doing things they shouldn’t.

Recently, I built a small tool called Puppetry Detector. It’s not a silver bullet, and it’s certainly not the only approach out there, but I think it fills a useful niche: detecting policy puppetry and prompt injection attempts using nothing more than plain Python and regular expressions.

Why Not Just Use an LLM to Catch LLM Attacks?

It’s tempting to throw a big model at every problem. But sometimes, the simplest solution is the best place to start. Regex patterns are:

  • Fast (milliseconds, not seconds)

  • Transparent (you can see exactly what’s being matched)

  • Easy to audit and extend (no black boxes)

  • Lightweight (no GPU, no cloud, no fuss)

I wanted something that could be dropped into a pipeline, run on a Raspberry Pi, or be used as a building block for more complex systems. If you need more power, you can always layer on spaCy, or even call out to an LLM for a second opinion. But for a lot of cases, simple pattern matching gets you surprisingly far.

How It Works

Puppetry Detector scans prompts for patterns that look like attempts to override roles, bypass security, or otherwise manipulate the model’s behavior. It’s inspired by research from HiddenLayer and others, and is designed to be:

  • Configurable: You can add or tweak patterns as new attacks emerge.

  • Composable: Use it on its own, or as part of a larger moderation pipeline.

Integration with Rebuff

One of the things I appreciate about the Python ecosystem is how easy it is to make tools work together. Puppetry Detector comes with a simple adapter for Rebuff, a prompt moderation framework. This means you can use it as a filter in your Rebuff pipeline, combining it with other detectors for layered defense.

What’s Next?

This project is a starting point, not an endpoint. There’s plenty of room to grow:

  • Smarter Patterns: As attackers get more creative, so can our regexes.

  • Deeper Analysis: Plug in spaCy or other NLP tools for more nuanced checks.

  • LLM-Assisted Moderation: For the really tricky cases, call out to a model to estimate intent or even spot subtle lies.

Why Share This?

I’m sharing this not because I think it’s perfect, but because I believe in the value of small, understandable tools. If you’re working in LLM safety, prompt engineering, or just want to see how far you can get with a few lines of Python, I hope you’ll find something useful here.

If you have ideas, feedback, or want to collaborate, I’d love to hear from you. The code is open source (GitHub repo), and contributions are always welcome.


Thanks for reading. Here’s to building safer, more transparent AI—one small tool at a time.

0
Subscribe to my newsletter

Read articles from Alex Alexapolsky directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Alex Alexapolsky
Alex Alexapolsky

Ukranian Python dev in Montenegro. https://www.linkedin.com/in/alexey-a-181a614/