Prompt Shaping: Measuring the Impact of Prompt Modifiers on Output Size and Format

What if you could guide large language models (LLMs) to output shorter, clearer, or more structured responses just by adjusting the prompt?

This experiment set out to answer a simple yet valuable question:

Can we shrink or reshape LLM output just by using natural language modifiers—without fine-tuning the model?


🎯 Motivation

The idea emerged from practical challenges:

  • Reducing LLM token cost in production

  • Shaping outputs to fit UI constraints

  • Enforcing structure (e.g. bullet points, headlines, Q&A)

  • Evaluating how different LLMs respond to prompts at scale

  • Curiosity about whether a decorator-like pattern could be applied to prompts

While some researchers explore fine-tuning for formatting, this project examines a “black-box” approach using only natural language instructions in the prompt.


🧠 Models Compared

All models were accessed through OpenRouter, enabling consistent, unified API calls:

| Model ID | Label |
| --- | --- |
| gpt-3.5-turbo | GPT-3.5-Turbo |
| mistralai/mistral-7b-instruct | Mistral-7B |
| anthropic/claude-3-haiku | Claude-3-Haiku |
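
Because OpenRouter exposes an OpenAI-compatible endpoint, one helper is enough to query all three models. The sketch below assumes the official openai Python client and an OPENROUTER_API_KEY environment variable; the repo's actual client code may differ.

```python
# Minimal sketch of a unified OpenRouter call (assumes OPENROUTER_API_KEY is set).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

def ask(model_id: str, prompt: str) -> str:
    """Send one prompt to one model and return the plain-text response."""
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# The same call works for all three models, e.g.:
# ask("anthropic/claude-3-haiku", "How do I cook rice properly?")
```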

🧪 Prompts and Modifiers

We designed 6 diverse prompts (medical, cooking, technical, natural science, customer support, health triage), and applied 14 modifiers:

  • Length-focused: brief, brief+60-words, minimal, minimal+30-tokens

  • Tone-focused: business, casual, legalish, headline, social-post

  • Format-focused: compact-bullet, qa, telegraph

  • Clarity & control: clarity, none

Each prompt-modifier-model combination was processed and measured.
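
The exact modifier wording used in the repo isn't reproduced here; the sketch below shows one plausible way to express a few of the modifiers as instruction suffixes and enumerate the prompt × modifier × model runs.

```python
from itertools import product

# Hypothetical instruction suffixes for a few of the 14 modifiers; the exact
# phrasing used in the experiment may differ. "none" leaves the prompt as-is.
MODIFIERS = {
    "none": "",
    "brief": "Answer as briefly as possible.",
    "brief+60-words": "Answer briefly, in at most 60 words.",
    "minimal+30-tokens": "Answer with the bare minimum, in at most 30 tokens.",
    "compact-bullet": "Answer as a compact bulleted list.",
    "qa": "Answer in a Q:/A: format.",
    "headline": "Answer as a single headline.",
}

MODELS = ["gpt-3.5-turbo", "mistralai/mistral-7b-instruct", "anthropic/claude-3-haiku"]
PROMPTS = {"cooking": "How do I cook rice properly?"}  # one of the six domains

def shape(prompt: str, modifier: str) -> str:
    """Append the modifier's instruction suffix to the base prompt."""
    suffix = MODIFIERS[modifier]
    return f"{prompt}\n\n{suffix}" if suffix else prompt

# One run per prompt x modifier x model combination.
runs = [
    (domain, modifier, model, shape(prompt, modifier))
    for (domain, prompt), modifier, model in product(PROMPTS.items(), MODIFIERS, MODELS)
]
```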


📊 Chart 1: Average Token Count by Intent

➡️ Observation:

  • minimal and brief reduce output length.

  • headline and telegraph are even more effective at compression.

  • Mistral-7B tends to be more verbose than Claude-3-Haiku or GPT-3.5-Turbo.

    ℹ️ Note
    GPT-3.5-Turbo appears to produce responses of nearly identical length regardless of the modifier.
    This may reflect a token-budgeting strategy or conservative formatting defaults.

    Injecting modifiers more effectively may require clearer meta-instructions, few-shot formatting examples, or structured prompt scaffolding.

    Further exploration of GPT-3.5-Turbo formatting behavior is left as an exercise for the reader.
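
Since tiktoken appears in the project's dependencies, token counts were presumably measured roughly as follows; the choice of encoding here is my assumption.

```python
import tiktoken

# cl100k_base is the GPT-3.5/GPT-4 encoding; reusing it for Claude and Mistral
# is an approximation, since those models tokenize differently.
_enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    """Number of tokens in a response under the cl100k_base encoding."""
    return len(_enc.encode(text))

token_count("Rinse the rice, use a 1:2 rice-to-water ratio, simmer for 15 minutes.")
```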


📉 Chart 2: Average Entropy by Intent

Entropy serves as a proxy for lexical repetition and tightness of language: lower entropy means more repeated words and tighter phrasing.

➡️ Observation:

  • minimal and headline reduce entropy.

  • qa, social-post, and clarity tend to increase diversity.
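
One common way to compute such a proxy is Shannon entropy over the word-frequency distribution of a response. The sketch below is illustrative and may not match the exact metric used in the repo.

```python
import math
from collections import Counter

def lexical_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word-frequency distribution."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

lexical_entropy("rice rice rice rice")       # 0.0  (maximally repetitive)
lexical_entropy("rinse simmer drain fluff")  # 2.0  (all words distinct)
```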


🧱 Chart 3: Formatting Features Detected

We detected formatting traits like:

  • bullet (markdown-like lists)

  • numbered lists (1. 2. 3.)

  • qa patterns (Q: A:)

➡️ Observation:

  • The compact-bullet and qa modifiers consistently triggered formatting

  • Even without explicit instruction, some models apply formatting on their own
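
These traits can be flagged with simple regular expressions; the heuristics below are an assumption about how the flags were derived, not the repo's exact code.

```python
import re

def detect_formatting(text: str) -> dict:
    """Flag bullet lists, numbered lists, and Q:/A: patterns in a response."""
    return {
        "bullet": bool(re.search(r"^\s*[-*•]\s+", text, re.MULTILINE)),
        "numbered": bool(re.search(r"^\s*\d+\.\s+", text, re.MULTILINE)),
        "qa": bool(re.search(r"^\s*Q\s*:", text, re.MULTILINE))
              and bool(re.search(r"^\s*A\s*:", text, re.MULTILINE)),
    }

detect_formatting("Q: How long does rice take?\nA: About 15 minutes.")
# {'bullet': False, 'numbered': False, 'qa': True}
```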


🎯 Length Constraint Fulfillment

We measured how often responses stayed under the target length (e.g. 60 words or 30 tokens).

➡️ Observation:

  • GPT-3.5-Turbo is the best at obeying length constraints

  • Claude-3-Haiku and Mistral-7B sometimes ignore soft length instructions
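
Fulfillment is straightforward to check once a numeric target is associated with a modifier; a sketch under the same tokenizer approximation as above:

```python
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")  # same approximation as above

def within_limit(text: str, max_words: int | None = None,
                 max_tokens: int | None = None) -> bool:
    """True if the response respects its word and/or token target, if any."""
    if max_words is not None and len(text.split()) > max_words:
        return False
    if max_tokens is not None and len(_enc.encode(text)) > max_tokens:
        return False
    return True

# "brief+60-words"    -> within_limit(response, max_words=60)
# "minimal+30-tokens" -> within_limit(response, max_tokens=30)
```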


📁 Raw Output for Further Analysis

CSV and HTML exports include:

  • Model, intent, prompt, full response

  • Token count and entropy

  • Detected formatting flags

→ You can extend the analysis with BLEU or ROUGE scores, compression ratios, or human evaluation.
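
With pandas (also in the dependency list), per-run records collected in a list can be exported in one step; the field names and values below are illustrative placeholders.

```python
import pandas as pd

# One record per prompt x modifier x model run; values here are placeholders.
records = [
    {
        "model": "anthropic/claude-3-haiku",
        "intent": "brief+60-words",
        "prompt": "How do I cook rice properly?",
        "response": "...",
        "tokens": 0,
        "entropy": 0.0,
        "bullet": False,
        "numbered": False,
        "qa": False,
    },
]

df = pd.DataFrame(records)
df.to_csv("results.csv", index=False)    # spreadsheet-friendly export
df.to_html("results.html", index=False)  # browsable export
```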


Conclusion

This small-scale experiment began with a simple question:
Can prompt modifiers reduce the token size of LLM outputs without losing meaning?

What we discovered opened up broader implications:

  • ✂️ Some modifiers like brief, minimal, and token/word constraints were modestly effective at reducing response length, especially for simpler prompts.

  • 🧠 Others, like clarity, qa, or compact-bullet, didn’t always make outputs shorter — but reshaped the structure and readability in useful ways.

  • 🔍 Prompt modifiers act not only as instructions but as soft constraints — LLMs interpret them differently, and not always literally.

  • 🔄 Combining modifiers (e.g., brief + 60 words) shows the tension between style and compression — useful for prompt engineers and interface designers alike.

  • 📊 Metrics like token count, entropy, and format detection help surface subtle effects that aren’t obvious on a single run.

  • 🧹 In real applications, you might need prompt sanitization — stripping or rewriting user inputs before applying modifiers — to avoid conflicting instructions (see the sketch after this list).

  • 🎛️ This points to an emerging space of techniques we could call:

    • Prompt Reshaping

    • Instruction Re-injection

    • Output Governance
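
As a rough illustration of the prompt-sanitization point above, conflicting formatting instructions can be stripped from user input before a modifier is appended. The phrase list below is purely illustrative.

```python
import re

# Illustrative patterns for user instructions that would fight an injected modifier.
CONFLICTING = [
    r"answer in \d+ words or less",
    r"respond (in|as) bullet points",
    r"write a long,? detailed answer",
]

def sanitize(user_prompt: str) -> str:
    """Remove formatting instructions that could conflict with the injected modifier."""
    cleaned = user_prompt
    for pattern in CONFLICTING:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    return cleaned.strip()

sanitize("Explain DNS caching. Answer in 500 words or less")
# -> "Explain DNS caching."
```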

We believe prompt modifiers have potential not just for saving tokens, but for helping shape outputs for different audiences, goals, and UI formats — from legal docs to social media blurbs.

🙏 Acknowledgments

This project was made possible thanks to:

  • OpenRouter, whose unified API made experimenting across GPT, Claude, and Mistral seamless

  • The Anthropic, OpenAI, and Mistral teams for powerful APIs

  • Python OSS tools: matplotlib, pandas, tiktoken


🧪 Repo: github.com/metawake/llm-output-shaping-experimental

Feel free to fork it and try your own prompts or models!
