Semantic Prompt Compression: Reducing LLM Costs While Preserving Meaning

View the open-source project on GitHub: github.com/metawake/prompt_compressor
The Challenge: Every Token Costs
In the world of Large Language Models (LLMs), every token comes with a price tag. For organizations running thousands of prompts daily, these costs add up quickly. But what if we could reduce these costs without sacrificing the quality of interactions?
Real Results: Beyond Theory
Our experimental Semantic Prompt Compressor has shown promising results in real-world testing. Analyzing 135 diverse prompts, we achieved:
- 22.42% average compression ratio (defined in the sketch below)
- Reduction from 4,986 to 3,868 tokens
- 1,118 tokens saved while maintaining meaning
- >95% preservation of named entities and technical terms
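For clarity, the compression ratio throughout this post is tokens saved divided by original token count; a minimal sketch makes the arithmetic explicit, and it reproduces both the aggregate figure and the per-example ratios:

```python
def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens saved relative to the original prompt."""
    return (original_tokens - compressed_tokens) / original_tokens

print(f"{compression_ratio(4986, 3868):.2%}")  # 22.42% -- the aggregate above
print(f"{compression_ratio(33, 12):.2%}")      # 63.64% -- Example 1 below
```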
Example 1
Original (33 tokens):
I've been considering the role of technology in mental health treatment.
How might virtual therapy and digital interventions evolve?
I'm interested in both current applications and future possibilities.
Compressed (12 tokens):
I've been considering role of technology in mental health treatment.
Compression ratio: 63.64%
Example 2
Original (29 tokens):
All these apps keep asking for my location.
What are they actually doing with this information?
I'm curious about the balance between convenience and privacy.
Compressed (14 tokens):
apps keep asking for my location. What are they doing with information.
Compression ratio: 51.72%
The Cost Impact
Let's translate these results into a real business scenario; the arithmetic is reproduced in the sketch after the numbers:
Customer Support AI (100,000 queries/day):
- Avg. 200 tokens per query
- GPT-4 API cost: $0.03 per 1K tokens

Without compression:
- 20M tokens/day
- $600 daily cost
- $18,000 monthly cost

With 22.42% compression:
- 15.5M tokens/day
- $465 daily cost
- Monthly savings: $4,050
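The whole calculation fits in a few lines of Python (a minimal sketch; the volume and pricing figures are the scenario's assumptions, not measurements):

```python
QUERIES_PER_DAY = 100_000
TOKENS_PER_QUERY = 200
PRICE_PER_1K_TOKENS = 0.03   # assumed GPT-4 pricing from the scenario
COMPRESSION = 0.2242         # 22.42% average compression ratio

tokens_per_day = QUERIES_PER_DAY * TOKENS_PER_QUERY               # 20,000,000
daily_cost = tokens_per_day / 1000 * PRICE_PER_1K_TOKENS          # $600.00

compressed_tokens = tokens_per_day * (1 - COMPRESSION)            # ~15,516,000
compressed_cost = compressed_tokens / 1000 * PRICE_PER_1K_TOKENS  # ~$465.48

print(f"Monthly savings: ${(daily_cost - compressed_cost) * 30:,.2f}")
# Monthly savings: $4,035.60 (the $4,050 above rounds the daily cost to $465)
```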
How It Works: A Three-Layer Approach
1. Rules Layer
Instead of using a black-box ML model, we implemented a configurable rule system:
```yaml
rule_groups:
  remove_fillers:
    enabled: true
    patterns:
      - pattern: "Could you explain"
        replacement: "explain"
  remove_greetings:
    enabled: true
    patterns:
      - pattern: "Hello, I was wondering"
        replacement: "I wonder"
```
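To make the rule layer concrete, here is a minimal sketch of how such a config could be applied with plain regex substitution (the `apply_rules` helper is hypothetical; the project's actual loader may differ):

```python
import re
import yaml  # pip install pyyaml

def apply_rules(text: str, config_path: str = "rules.yaml") -> str:
    """Apply every enabled rule group from a YAML config to a prompt."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for group in config["rule_groups"].values():
        if not group.get("enabled", False):
            continue  # skip disabled rule groups
        for rule in group["patterns"]:
            text = re.sub(rule["pattern"], rule["replacement"], text)
    return text

print(apply_rules("Hello, I was wondering if you could help with YAML."))
# -> "I wonder if you could help with YAML."
```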
2. spaCy NLP Layer
We leverage spaCy's linguistic analysis for intelligent compression (a sketch follows this list):
- Named Entity Recognition to preserve key terms
- Dependency parsing for sentence structure
- POS tagging to remove non-essential parts
- Compound-word preservation for technical terms
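This is not the project's exact pipeline, but a minimal sketch of the spaCy analysis this layer builds on (requires the `en_core_web_sm` model):

```python
import spacy

# One-time setup: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("How does TCP/IP routing work between Google data centers in Berlin?")

# Named entities to preserve verbatim
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Google', 'ORG'), ('Berlin', 'GPE')]

# POS tags and dependency labels: the raw material for dropping non-essential tokens
for token in doc:
    print(token.text, token.pos_, token.dep_)
```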
3. Entity Preservation Layer
We ensure critical information is not lost (a sketch follows this list):
- Technical terms (e.g., "5G", "TCP/IP")
- Named entities (companies, people, places)
- Numerical values and measurements
- Domain-specific vocabulary
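A sketch of how such a protected set might be assembled; the `TECH_TERM` regex is a hypothetical stand-in for a fuller domain vocabulary:

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical pattern: catches tokens like "5G" and "TCP/IP"; a real system
# would use a curated, domain-specific list instead.
TECH_TERM = re.compile(r"\b(?:\d+G|[A-Z]{2,}(?:/[A-Z]{2,})*)\b")

def protected_spans(text: str) -> set[str]:
    """Collect strings the compressor must never drop."""
    doc = nlp(text)
    protected = {ent.text for ent in doc.ents}         # named entities
    protected |= {t.text for t in doc if t.like_num}   # numbers and measurements
    protected |= set(TECH_TERM.findall(text))          # technical terms
    return protected

print(protected_spans("Apple ships a 5G modem pushing 3.2 Gbps over TCP/IP."))
```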
Real-World Applications
Customer Support
- Compress user queries while maintaining context
- Preserve product-specific language
- Reduce support costs, maintain quality
Content Moderation
- Efficiently process user reports
- Maintain critical context
- Cost-effective scaling
Technical Documentation
- Compress API or doc queries
- Preserve code snippets, terms
- Cut costs without losing accuracy
Beyond Simple Compression
What makes our approach unique:
- Intelligent Preservation: maintains technical accuracy and key data
- Configurable Rules: domain-adaptable, transparent, and editable
- Transparent Processing: understandable and debuggable
Current Limitations
- Requires domain-specific tuning
- Conservative in technical contexts
- Manual rule editing still helpful
- Entity preservation may be overly cautious
Future Development
- ML-based adaptive compression
- Domain-specific profiles
- Real-time compression
- LLM platform integrations
- Custom vocabulary modules
Conclusion
The results from our testing show that intelligent semantic prompt compression is not only possible but practical.
With a 22.42% average compression ratio and high semantic preservation, LLM-based systems can reduce API costs while maintaining the quality and clarity of interactions.
Whether you're building support bots, moderation tools, or technical assistants, prompt compression could be a key layer in your stack.
Project on GitHub:
github.com/metawake/prompt_compressor
Open source, transparent, and built for experimentation.
Written by Alex Alexapolsky
Ukrainian Python dev in Montenegro. https://www.linkedin.com/in/alexey-a-181a614/