Semantic Prompt Compression: Reducing LLM Costs While Preserving Meaning



The Challenge: Every Token Costs

In the world of Large Language Models (LLMs), every token comes with a price tag. For organizations running thousands of prompts daily, these costs add up quickly. But what if we could reduce these costs without sacrificing the quality of interactions?


Real Results: Beyond Theory

Our experimental Semantic Prompt Compressor has shown promising results in real-world testing. Analyzing 135 diverse prompts, we achieved:

  • ✅ 22.42% average compression ratio

  • 📉 Reduction from 4,986 → 3,868 tokens

  • 💡 1,118 tokens saved while maintaining meaning

  • 🔒 >95% preservation of named entities and technical terms


📌 Example 1

Original (33 tokens):

I've been considering the role of technology in mental health treatment.
How might virtual therapy and digital interventions evolve?
I'm interested in both current applications and future possibilities.

Compressed (12 tokens):

I've been considering role of technology in mental health treatment.

Compression ratio: 63.64%


📌 Example 2

Original (29 tokens):

All these apps keep asking for my location.
What are they actually doing with this information?
I'm curious about the balance between convenience and privacy.

Compressed (14 tokens):

apps keep asking for my location. What are they doing with information.

Compression ratio: 51.72%
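
The ratio reported under each example is simply tokens saved divided by original tokens. Here is a minimal way to reproduce the measurement, using tiktoken purely as an assumption (the article doesn't name the tokenizer it counts with, so exact counts may differ):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer, not confirmed

def compression_ratio(original: str, compressed: str) -> float:
    """Fraction of tokens removed: 1 - compressed_tokens / original_tokens."""
    return 1 - len(enc.encode(compressed)) / len(enc.encode(original))

original = ("All these apps keep asking for my location. "
            "What are they actually doing with this information? "
            "I'm curious about the balance between convenience and privacy.")
compressed = "apps keep asking for my location. What are they doing with information."
print(f"{compression_ratio(original, compressed):.2%}")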


💰 The Cost Impact

Let's translate these results into real business scenarios (the snippet after this breakdown sanity-checks the arithmetic):

Customer Support AI (100,000 queries/day):

  • Avg. 200 tokens per query

  • GPT-4 API cost: $0.03 / 1K tokens

Without compression:

  • 20M tokens/day

  • $600 daily cost

  • $18,000 monthly cost

With 22.42% compression:

  • 15.5M tokens/day

  • $465 daily cost

  • 💸 Monthly savings: $4,050
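
The arithmetic checks out, assuming a 30-day month; the quoted $4,050 comes from rounding the compressed daily cost to $465:

# Sanity check of the scenario above (assumes a 30-day month).
queries_per_day = 100_000
tokens_per_query = 200
price_per_1k = 0.03       # GPT-4 rate quoted above, $ per 1K tokens
compression = 0.2242      # 22.42% average compression ratio

daily_tokens = queries_per_day * tokens_per_query     # 20,000,000
daily_cost = daily_tokens / 1_000 * price_per_1k      # $600.00
compressed_daily = daily_cost * (1 - compression)     # ~$465.48
print(f"Monthly savings: ${(daily_cost - compressed_daily) * 30:,.2f}")
# -> Monthly savings: $4,035.60 (≈ $4,050 with the rounded $465/day figure)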


🧠 How It Works: A Three-Layer Approach

1. Rules Layer

Instead of using a black-box ML model, we implemented a configurable rule system:

rule_groups:
  remove_fillers:
    enabled: true
    patterns:
      - pattern: "Could you explain"
        replacement: "explain"

  remove_greetings:
    enabled: true
    patterns:
      - pattern: "Hello, I was wondering"
        replacement: "I wonder"

2. spaCy NLP Layer

We leverage spaCy's linguistic analysis for intelligent compression (a toy sketch follows this list):

  • 🧠 Named Entity Recognition to preserve key terms

  • 🔗 Dependency parsing for sentence structure

  • 📟 POS tagging to remove non-essential parts

  • 🛠 Compound-word preservation for technical terms
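
As an illustration of the NER and POS signals at work, here is a toy pass, not the project's actual pipeline; the keep-list of POS tags is an assumption:

import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# POS classes treated as content-bearing in this sketch (an assumption;
# the project's real keep/drop rules may differ).
CONTENT_POS = {"NOUN", "PROPN", "VERB", "AUX", "ADJ", "NUM", "ADP", "PART", "PRON"}

def nlp_compress(text: str) -> str:
    doc = nlp(text)
    kept = []
    for token in doc:
        # Never drop a token that sits inside a named entity.
        if token.ent_type_ or token.pos_ in CONTENT_POS or token.is_punct:
            kept.append(token.text_with_ws)
    return "".join(kept).strip()

# Drops the determiner "the" while keeping content words intact:
print(nlp_compress("I've been considering the role of technology "
                   "in mental health treatment."))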


3. Entity Preservation Layer

We ensure critical information is not lost (a masking sketch follows this list):

  • 🧪 Technical terms (e.g., "5G", "TCP/IP")

  • 🧐 Named entities (companies, people, places)

  • 📏 Numerical values and measurements

  • 📚 Domain-specific vocabulary
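
One simple way to implement this layer is to mask protected terms before the rule passes run and restore them afterwards. The patterns and helpers below are illustrative, not the project's actual lists:

import re

# Illustrative regexes for must-keep technical terms (hypothetical; the
# project combines NER output with its own pattern lists).
PROTECTED_PATTERNS = [
    r"\b\d+G\b",                                  # network generations: "5G"
    r"\b[A-Z0-9]+(?:/[A-Z0-9.]+)+\b",             # slashed acronyms: "TCP/IP"
    r"\b\d+(?:\.\d+)?\s?(?:ms|GB|MB|GHz|%)\b",    # measurements: "40 ms"
]

def protect_terms(text: str):
    """Swap protected terms for placeholders so rules can't touch them."""
    vault = {}
    def stash(match):
        key = f"__TERM{len(vault)}__"
        vault[key] = match.group(0)
        return key
    for pattern in PROTECTED_PATTERNS:
        text = re.sub(pattern, stash, text)
    return text, vault

def restore_terms(text: str, vault: dict) -> str:
    """Reinsert the original terms once all compression passes have run."""
    for key, term in vault.items():
        text = text.replace(key, term)
    return text

masked, vault = protect_terms("Why does 5G fall back to TCP/IP at 40 ms latency?")
# ... compression rules run on `masked` without touching protected terms ...
print(restore_terms(masked, vault))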


🛠 Real-World Applications

πŸ§‘β€πŸ’Ό Customer Support

  • Compress user queries while maintaining context

  • Preserve product-specific language

  • Reduce support costs, maintain quality

🛑 Content Moderation

  • Efficiently process user reports

  • Maintain critical context

  • Cost-effective scaling

📚 Technical Documentation

  • Compress API or doc queries

  • Preserve code snippets, terms

  • Cut costs without losing accuracy


✨ Beyond Simple Compression

What makes our approach unique:

  • Intelligent Preservation:
    Maintains technical accuracy and key data

  • Configurable Rules:
    Domain-adaptable, transparent, and editable

  • Transparent Processing:
    Understandable and debuggable


⚠️ Current Limitations

  • Requires domain-specific tuning

  • Conservative in technical contexts

  • Manual rule editing still helpful

  • Entity preservation may be overly cautious


🔭 Future Development

  • ML-based adaptive compression

  • Domain-specific profiles

  • Real-time compression

  • LLM platform integrations

  • Custom vocabulary modules


✅ Conclusion

The results from our testing show that intelligent semantic prompt compression is not only possible but practical.

With a 22.42% average compression ratio and high semantic preservation, LLM-based systems can reduce API costs while maintaining the quality and clarity of interactions.

Whether you're building support bots, moderation tools, or technical assistants, prompt compression could be a key layer in your stack.


🧹 Project on GitHub:
👉 github.com/metawake/prompt_compressor

Open source, transparent, and built for experimentation.
