Prompt Shaping: Measuring the Impact of Prompt Modifiers on Output Size and Format

What if you could guide large language models (LLMs) to output shorter, clearer, or more structured responses just by adjusting the prompt?

This experiment set out to answer a simple yet valuable question:

Can we shrink or reshape LLM output just by using natural language modifiers—without fine-tuning the model?


🎯 Motivation

The idea emerged from practical challenges:

  • Reducing LLM token cost in production

  • Shaping outputs to fit UI constraints

  • Enforcing structure (e.g. bullet points, headlines, Q&A)

  • Evaluating how different LLMs respond to prompts at scale

  • Curiosity about whether a decorator-like pattern could be applied to prompts

While some researchers explore fine-tuning for formatting, this project examines a “black-box” approach using only natural language instructions in the prompt.


🧠 Models Compared

All models were accessed through OpenRouter, enabling consistent, unified API calls:

| Model ID | Label |
| --- | --- |
| gpt-3.5-turbo | GPT-3.5-Turbo |
| mistralai/mistral-7b-instruct | Mistral-7B |
| anthropic/claude-3-haiku | Claude-3-Haiku |
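
Because OpenRouter exposes an OpenAI-compatible endpoint, one helper is enough to query all three models. The sketch below assumes the official openai Python client and an OPENROUTER_API_KEY environment variable; the repo's actual client code may differ.

```python
# Minimal sketch of a unified OpenRouter call (assumes OPENROUTER_API_KEY is set).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def ask(model_id: str, prompt: str) -> str:
    """Send one prompt to one model and return the plain-text response."""
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# The same call works for all three models, e.g.:
# ask("anthropic/claude-3-haiku", "How do I cook rice properly?")
```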

🧪 Prompts and Modifiers

We designed 6 diverse prompts (medical, cooking, technical, natural science, customer support, health triage), and applied 14 modifiers:

  • Length-focused: brief, brief+60-words, minimal, minimal+30-tokens

  • Tone-focused: business, casual, legalish, headline, social-post

  • Format-focused: compact-bullet, qa, telegraph

  • Clarity & control: clarity, none

Each prompt-modifier-model combination was processed and measured.
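
The exact modifier wording used in the repo isn't reproduced here; the sketch below shows one plausible way to express a few of the modifiers as instruction suffixes and enumerate the prompt × modifier × model runs.

```python
from itertools import product

# Hypothetical instruction suffixes for a few of the 14 modifiers; the exact
# phrasing used in the experiment may differ. "none" leaves the prompt as-is.
MODIFIERS = {
    "none": "",
    "brief": "Answer as briefly as possible.",
    "brief+60-words": "Answer briefly, in at most 60 words.",
    "minimal+30-tokens": "Answer with the bare minimum, in at most 30 tokens.",
    "compact-bullet": "Answer as a compact bulleted list.",
    "qa": "Answer in a Q:/A: format.",
    "headline": "Answer as a single headline.",
}

MODELS = ["gpt-3.5-turbo", "mistralai/mistral-7b-instruct", "anthropic/claude-3-haiku"]
PROMPTS = {"cooking": "How do I cook rice properly?"}  # one of the six domains

def shape(prompt: str, modifier: str) -> str:
    """Append the modifier's instruction suffix to the base prompt."""
    suffix = MODIFIERS[modifier]
    return f"{prompt}\n\n{suffix}" if suffix else prompt

# One run per prompt x modifier x model combination.
runs = [
    (domain, modifier, model, shape(prompt, modifier))
    for (domain, prompt), modifier, model in product(PROMPTS.items(), MODIFIERS, MODELS)
]
```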


📊 Chart 1: Average Token Count by Intent

➡️ Observation:

  • minimal and brief reduce output length.

  • headline and telegraph are even more effective at compression.

  • Mistral-7B tends to be more verbose than Claude-3-Haiku or GPT-3.5-Turbo.

    ℹ️ Note
    GPT-3.5-Turbo appears to produce responses of nearly identical length regardless of the modifier.
    This may reflect a token-budgeting strategy or conservative formatting defaults.

    Injecting modifiers more effectively may require clearer meta-instructions, few-shot formatting examples, or structured prompt scaffolding.

    Further exploration of GPT-3.5-Turbo formatting behavior is left as an exercise for the reader.
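
Since tiktoken appears in the project's dependencies, token counts were presumably measured roughly as follows; the choice of encoding here is my assumption.

```python
import tiktoken

# cl100k_base is the GPT-3.5/GPT-4 encoding; reusing it for Claude and Mistral
# is an approximation, since those models tokenize differently.
_enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    """Number of tokens in a response under the cl100k_base encoding."""
    return len(_enc.encode(text))

token_count("Rinse the rice, use a 1:2 rice-to-water ratio, simmer for 15 minutes.")
```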


📉 Chart 2: Average Entropy by Intent

Entropy serves as a proxy for lexical repetition and tightness of language: lower entropy means more repeated words and tighter phrasing.

➡️ Observation:

  • minimal and headline reduce entropy.

  • qa, social-post, and clarity tend to increase diversity.
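
One common way to compute such a proxy is Shannon entropy over the word-frequency distribution of a response. The sketch below is illustrative and may not match the exact metric used in the repo.

```python
import math
from collections import Counter

def lexical_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word-frequency distribution."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

lexical_entropy("rice rice rice rice")       # 0.0  (maximally repetitive)
lexical_entropy("rinse simmer drain fluff")  # 2.0  (all words distinct)
```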


🧱 Chart 3: Formatting Features Detected

We detected formatting traits like:

  • bullet (markdown-like lists)

  • numbered lists (1. 2. 3.)

  • qa patterns (Q: A:)

➡️ Observation:

  • The compact-bullet and qa modifiers consistently triggered formatting

  • Even without explicit instruction, some models apply formatting on their own
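
These traits can be flagged with simple regular expressions; the heuristics below are an assumption about how the flags were derived, not the repo's exact code.

```python
import re

def detect_formatting(text: str) -> dict:
    """Flag bullet lists, numbered lists, and Q:/A: patterns in a response."""
    return {
        "bullet": bool(re.search(r"^\s*[-*•]\s+", text, re.MULTILINE)),
        "numbered": bool(re.search(r"^\s*\d+\.\s+", text, re.MULTILINE)),
        "qa": bool(re.search(r"^\s*Q\s*:", text, re.MULTILINE))
              and bool(re.search(r"^\s*A\s*:", text, re.MULTILINE)),
    }

detect_formatting("Q: How long does rice take?\nA: About 15 minutes.")
# {'bullet': False, 'numbered': False, 'qa': True}
```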


🎯 Length Constraint Fulfillment

We measured how often responses stayed under the target length (e.g. 60 words or 30 tokens).

➡️ Observation:

  • GPT-3.5-Turbo is the best at obeying length constraints

  • Claude-3-Haiku and Mistral-7B sometimes ignore soft length instructions
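
Fulfillment is straightforward to check once a numeric target is associated with a modifier; a sketch under the same tokenizer approximation as above:

```python
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")  # same approximation as above

def within_limit(text: str, max_words: int | None = None,
                 max_tokens: int | None = None) -> bool:
    """True if the response respects its word and/or token target, if any."""
    if max_words is not None and len(text.split()) > max_words:
        return False
    if max_tokens is not None and len(_enc.encode(text)) > max_tokens:
        return False
    return True

# "brief+60-words"    -> within_limit(response, max_words=60)
# "minimal+30-tokens" -> within_limit(response, max_tokens=30)
```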


📁 Raw Output for Further Analysis

CSV and HTML exports include:

  • Model, intent, prompt, full response

  • Token count and entropy

  • Detected formatting flags

→ You can extend the analysis with BLEU or ROUGE scores, compression ratios, or human evaluation.
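
With pandas (also in the dependency list), per-run records collected in a list can be exported in one step; the field names and values below are illustrative placeholders.

```python
import pandas as pd

# One record per prompt x modifier x model run; values here are placeholders.
records = [
    {
        "model": "anthropic/claude-3-haiku",
        "intent": "brief+60-words",
        "prompt": "How do I cook rice properly?",
        "response": "...",
        "tokens": 0,
        "entropy": 0.0,
        "bullet": False,
        "numbered": False,
        "qa": False,
    },
]

df = pd.DataFrame(records)
df.to_csv("results.csv", index=False)    # spreadsheet-friendly export
df.to_html("results.html", index=False)  # browsable export
```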


Conclusion

This small-scale experiment began with a simple question:
Can prompt modifiers reduce the token size of LLM outputs without losing meaning?

What we discovered opened up broader implications:

  • ✂️ Some modifiers like brief, minimal, and token/word constraints were modestly effective at reducing response length, especially for simpler prompts.

  • 🧠 Others, like clarity, qa, or compact-bullet, didn’t always make outputs shorter — but reshaped the structure and readability in useful ways.

  • 🔍 Prompt modifiers act not only as instructions but as soft constraints — LLMs interpret them differently, and not always literally.

  • 🔄 Combining modifiers (e.g., brief + 60 words) shows the tension between style and compression — useful for prompt engineers and interface designers alike.

  • 📊 Metrics like token count, entropy, and format detection help surface subtle effects that aren’t obvious on a single run.

  • 🧹 In real applications, you might need prompt sanitization — stripping or rewriting user inputs before applying modifiers — to avoid conflicting instructions (see the sketch after this list).

  • 🎛️ This points to an emerging space of techniques we could call:

    • Prompt Reshaping

    • Instruction Re-injection

    • Output Governance
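
As a rough illustration of the prompt-sanitization point above, conflicting formatting instructions can be stripped from user input before a modifier is appended. The phrase list below is purely illustrative.

```python
import re

# Illustrative patterns for user instructions that would fight an injected modifier.
CONFLICTING = [
    r"answer in \d+ words or less",
    r"respond (in|as) bullet points",
    r"write a long,? detailed answer",
]

def sanitize(user_prompt: str) -> str:
    """Remove formatting instructions that could conflict with the injected modifier."""
    cleaned = user_prompt
    for pattern in CONFLICTING:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    return cleaned.strip()

sanitize("Explain DNS caching. Answer in 500 words or less")
# -> "Explain DNS caching."
```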

We believe prompt modifiers have potential not just for saving tokens, but for helping shape outputs for different audiences, goals, and UI formats — from legal docs to social media blurbs.

🙏 Acknowledgments

This project was made possible thanks to:

  • OpenRouter, whose unified API made experimenting across GPT, Claude, and Mistral seamless

  • The Anthropic, OpenAI, and Mistral teams for powerful APIs

  • Python OSS tools: matplotlib, pandas, tiktoken


🧪 Repo: github.com/metawake/llm-output-shaping-experimental

Feel free to fork it and try your own prompts or models!
