Prompt Shaping: Measuring the Impact of Prompt Modifiers on Output Size and Format

What if you could guide large language models (LLMs) to output shorter, clearer, or more structured responses just by adjusting the prompt?
This experiment set out to answer a simple yet valuable question:
Can we shrink or reshape LLM output just by using natural language modifiers—without fine-tuning the model?
🎯 Motivation
The idea emerged from practical challenges:
Reducing LLM token cost in production
Shaping outputs to fit UI constraints
Enforcing structure (e.g. bullet points, headlines, Q&A)
Evaluating how different LLMs respond to prompts at scale
Checking whether something like decorators can be applied to prompts (see the sketch below)
While some researchers explore fine-tuning for formatting, this project examines a “black-box” approach using only natural language instructions in the prompt.
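To make the "decorators for prompts" idea concrete, here is a minimal Python sketch. The names (MODIFIERS, apply_modifier) and the instruction texts are illustrative, not the exact ones used in the repo; the point is that a modifier is just a natural-language suffix wrapped around the base prompt.

```python
# Minimal sketch: a modifier is a natural-language suffix appended to the prompt.
# Modifier names and texts here are illustrative examples, not the repo's exact ones.
MODIFIERS = {
    "brief": "Answer briefly.",
    "brief+60-words": "Answer briefly, in at most 60 words.",
    "minimal": "Answer with the minimum necessary words.",
    "compact-bullet": "Answer as a compact bulleted list.",
    "none": "",
}

def apply_modifier(prompt: str, intent: str) -> str:
    """Wrap a base prompt with the natural-language modifier for a given intent."""
    suffix = MODIFIERS.get(intent, "")
    return f"{prompt}\n\n{suffix}".strip()

print(apply_modifier("Explain how vaccines work.", "brief+60-words"))
```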
🧠 Models Compared
All models were accessed through OpenRouter, enabling consistent, unified API calls:
| Model ID | Label |
| --- | --- |
| gpt-3.5-turbo | GPT-3.5-Turbo |
| mistralai/mistral-7b-instruct | Mistral-7B |
| anthropic/claude-3-haiku | Claude-3-Haiku |
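Because OpenRouter exposes an OpenAI-compatible chat completions endpoint, a single request shape covers all three models. A hedged sketch (the complete helper is our own name; retries and error handling are omitted):

```python
# Sketch of a unified call through OpenRouter's chat completions endpoint.
# Requires OPENROUTER_API_KEY in the environment; no retries or streaming here.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def complete(model: str, prompt: str) -> str:
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# The same call works for any of the model IDs in the table above, e.g.:
# complete("anthropic/claude-3-haiku", "Explain how vaccines work.")
```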
🧪 Prompts and Modifiers
We designed 6 diverse prompts (medical, cooking, technical, natural science, customer support, health triage), and applied 14 modifiers:
Length-focused: brief, brief+60-words, minimal, minimal+30-tokens
Tone-focused: business, casual, legalish, headline, social-post
Format-focused: compact-bullet, qa, telegraph
Clarity & control: clarity, none
Each prompt-modifier-model combination was processed and measured.
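The sweep itself is a straightforward product over prompts, modifiers, and models. A sketch reusing apply_modifier and complete from the earlier snippets (the example prompts are placeholders, not the exact six used in the experiment):

```python
# Illustrative sweep over every prompt x modifier x model combination.
import itertools

PROMPTS = {
    "medical": "Explain how vaccines work.",
    "cooking": "How do I cook a basic risotto?",
    # ...plus technical, natural science, customer support, and health triage prompts
}
MODELS = ["gpt-3.5-turbo", "mistralai/mistral-7b-instruct", "anthropic/claude-3-haiku"]

results = []
for (topic, prompt), intent, model in itertools.product(PROMPTS.items(), MODIFIERS, MODELS):
    text = complete(model, apply_modifier(prompt, intent))
    results.append({"topic": topic, "intent": intent, "model": model, "response": text})
```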
📊 Chart 1: Average Token Count by Intent
➡️ Observation:
minimal and brief reduce output length. headline and telegraph are more effective at compression. mistral-7b tends to be more verbose than claude or gpt.
ℹ️ Note
GPT-3.5-Turbo appears to output nearly identical length responses regardless of modifier. This may reflect a token budgeting strategy or conservative formatting defaults. Injecting modifiers more effectively may require clearer meta-instructions, few-shot formatting examples, or structured prompt scaffolding.
Further exploration of GPT-3.5-Turbo formatting behavior is left as an exercise for the reader.
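For reference, token counts like the ones behind Chart 1 can be measured with tiktoken. The cl100k_base encoding matches GPT-3.5-Turbo; for Claude and Mistral it is only an approximation, since their tokenizers differ.

```python
# Count tokens with tiktoken's cl100k_base encoding (exact for GPT-3.5-Turbo,
# approximate for Claude and Mistral, which use different tokenizers).
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(_enc.encode(text))

print(token_count("Vaccines train the immune system to recognise a pathogen."))
```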
📉 Chart 2: Average Entropy by Intent
Entropy is a proxy for lexical repetition and tightness of language.
➡️ Observation:
minimal and headline reduce entropy. qa, social-post, and clarity tend to increase diversity.
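One plausible way to compute such an entropy score (the repo may use a different formula) is Shannon entropy over the word distribution of a response: repetitive, tight answers score low, lexically diverse ones score high.

```python
# Shannon entropy over word frequencies: a rough proxy for lexical diversity.
import math
from collections import Counter

def word_entropy(text: str) -> float:
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(word_entropy("to be or not to be"))  # repeated words -> relatively low entropy
```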
🧱 Chart 3: Formatting Features Detected
We detected formatting traits like:
bullet (markdown-like lists)
numbered lists (1. 2. 3.)
qa patterns (Q: A:)
➡️ Observation:
compact-bullet and qa modifiers consistently triggered formatting. But even without instruction, some models use formatting naturally.
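Detection can be as simple as a few regexes over the raw response; the actual heuristics in the repo may be stricter, but a sketch looks like this:

```python
# Rough regex-based detection of formatting traits in a response.
import re

def detect_formats(text: str) -> dict:
    return {
        "bullet": bool(re.search(r"^\s*[-*•] ", text, re.MULTILINE)),
        "numbered": bool(re.search(r"^\s*\d+\. ", text, re.MULTILINE)),
        "qa": bool(re.search(r"^Q:", text, re.MULTILINE) and re.search(r"^A:", text, re.MULTILINE)),
    }

print(detect_formats("Q: What is a vaccine?\nA: A primer for the immune system."))
```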
🎯 Length Constraint Fulfillment
We wanted to see how often responses stayed under target length (e.g. 60 words, 30 tokens).
➡️ Observation:
GPT-3.5-Turbo is best at obeying length constraints. Claude and Mistral sometimes ignore soft instructions.
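Fulfillment can be checked mechanically from the modifier name: 60 words for brief+60-words, 30 tokens for minimal+30-tokens, no explicit limit otherwise. A sketch, reusing token_count from the Chart 1 snippet:

```python
# Check whether a response respects the soft limit encoded in the modifier name.
from typing import Optional

def within_constraint(text: str, intent: str) -> Optional[bool]:
    if intent == "brief+60-words":
        return len(text.split()) <= 60
    if intent == "minimal+30-tokens":
        return token_count(text) <= 30
    return None  # other modifiers carry no explicit length target
```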
📁 Raw Output for Further Analysis
CSV and HTML exports include:
Model, intent, prompt, full response
Token count and entropy
Detected formatting flags
→ You can extend the analysis to BLEU scores, ROUGE, compression ratio, or human evaluation.
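If you want to reproduce the exports, a pandas one-liner per format is enough, assuming a results list of dicts like the one built in the sweep sketch above (file names here are arbitrary):

```python
# Export per-response metrics to CSV and HTML with pandas.
import pandas as pd

df = pd.DataFrame(results)
df["tokens"] = df["response"].map(token_count)
df["entropy"] = df["response"].map(word_entropy)
df.to_csv("llm_output_shaping_results.csv", index=False)
df.to_html("llm_output_shaping_results.html", index=False)
```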
Conclusion
This small-scale experiment began with a simple question:
Can prompt modifiers reduce the token size of LLM outputs without losing meaning?
What we discovered opened up broader implications:
✂️ Some modifiers like brief, minimal, and token/word constraints were modestly effective at reducing response length, especially for simpler prompts.
🧠 Others, like clarity, qa, or compact-bullet, didn’t always make outputs shorter — but reshaped the structure and readability in useful ways.
🔍 Prompt modifiers act not only as instructions but as soft constraints — LLMs interpret them differently, and not always literally.
🔄 Combining modifiers (e.g., brief + 60 words) shows the tension between style and compression — useful for prompt engineers and interface designers alike.
📊 Metrics like token count, entropy, and format detection help surface subtle effects that aren’t obvious on a single run.
🧹 In real applications, you might need prompt sanitation — stripping or rewriting user inputs before applying modifiers — to avoid conflicting instructions.
🎛️ This points to an emerging space of techniques we could call:
Prompt Reshaping
Instruction Re-injection
Output Governance
We believe prompt modifiers have potential not just for saving tokens, but for helping shape outputs for different audiences, goals, and UI formats — from legal docs to social media blurbs.
🙏 Acknowledgements
This project was made possible thanks to:
OpenRouter for easy access to multiple LLMs through a unified API, which made experimentation across GPT, Claude, and Mistral seamless
Anthropic, OpenAI, and Mistral teams for powerful APIs
Python OSS tools: matplotlib, pandas, tiktoken
🧪 Repo: github.com/metawake/llm-output-shaping-experimental
Feel free to fork it and try your own prompts or models!