Prompt Caching - Simple Explanation

I came across the Prompt Caching technique while learning about RAG architecture.

Overview:

Prompt Caching is a technique where the system stores the repetitive sentences/words from our prompts to save time and cost.

  1. Prompts often contain largely the same words from request to request - so prompt caching stores the repetitive parts of the prompt.

  2. When we send the same prompt again - the system recognizes the repeated words and, instead of reprocessing them, uses the stored version.

Benefits of Prompt Caching:

  1. Faster API responses, since the LLM doesn't have to process all the words again.

  2. Lower cost, since the repeated words are not processed again and therefore need less processing power.

It's like giving the model a "memory" of previous computations, allowing it to quickly retrieve and use that information.
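To make this concrete, here's a tiny toy simulation in Python. It is not how any provider actually implements caching (real systems reuse the model's internal attention state, the KV cache, not strings), and the function and variable names are made up for illustration - but it shows where the time and cost savings come from: the long, repeated prefix is processed once, stored, and reused.

```python
import time

# Toy stand-in for the model processing tokens; deliberately slow so the
# savings are visible. Real providers reuse internal attention (KV) state.
def expensive_process(text: str) -> str:
    tokens = text.split()
    time.sleep(0.001 * len(tokens))            # simulate per-token work
    return f"<state for {len(tokens)} tokens>"

prefix_cache: dict[str, str] = {}              # stored results for prefixes we've seen

def run_prompt(shared_prefix: str, new_suffix: str) -> str:
    if shared_prefix in prefix_cache:          # cache hit: reuse the stored computation
        state = prefix_cache[shared_prefix]
    else:                                      # cache miss: pay the full cost once
        state = expensive_process(shared_prefix)
        prefix_cache[shared_prefix] = state
    # only the new part of the prompt still has to be processed
    return state + " + " + expensive_process(new_suffix)

doc = "System: You are a helpful assistant. User: Analyze this document: " + "word " * 500
run_prompt(doc, "User: Now summarize the key points.")   # slow: nothing cached yet
run_prompt(doc, "User: List the action items.")          # fast: the long prefix is reused
```

The second call returns almost immediately because the expensive work on the shared prefix was done only once - that's the whole benefit in miniature.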

I was curious about the "system recognizes" part: how is the system able to recognize these repetitive words without processing them again? Below are a few methods used for prompt caching, with small illustrative sketches after the list.

  • Exact Prefix Matching:

    • How it works: The system looks at the beginning (the prefix) of the incoming prompt. It compares this prefix, token by token, with prefixes it has already processed and stored in its cache from recent requests.

    • Recognition: If the entire sequence of tokens at the beginning of the new prompt exactly matches a sequence stored in the cache, it's a "cache hit." The system recognizes that this part doesn't need reprocessing.

    • This is used by providers like OpenAI and Anthropic.

    • Example:

      • Cached Prompt Prefix: "System: You are a helpful assistant. User: Analyze this document: [long document text]"

      • New Prompt: "System: You are a helpful assistant. User: Analyze this document: [long document text] User: Now summarize the key points."

      • Recognition: The system sees that the beginning of the new prompt is identical to the cached prefix up to the end of the document text. It recognizes this match.

    • Why it works: This method is simple and efficient. It guarantees that the reused part is identical, ensuring the model's state is consistent. However, even a tiny change (like a single different character or extra space) in the prefix will break the match ("cache miss").

  • Cryptographic Hashing (Often used internally for cache keys):

    • How it works: While not the primary recognition method for the content itself during comparison, hashing is often used to create unique identifiers (cache keys) for the stored prefixes. The system calculates a unique digital fingerprint (a hash) for the sequence of tokens in the cacheable prefix.

    • Recognition: When a new request comes in, the system calculates the hash for its prefix. It then quickly checks if a cache entry with that exact same hash exists. If it does, it means the content is identical.

    • Benefit: Hashing is very fast for looking up whether an exact match exists in a large cache. It can also help with privacy, since the lookup works on hashes rather than the raw text.

  • Semantic Similarity (Less common for direct prefix caching, more for related concepts):

    • How it works: This is more advanced and less commonly used for the strict "prefix caching" we've discussed, but related concepts exist (for example, GPTCache). Instead of looking for an exact match, the system might use techniques like vector embeddings to determine if a new prompt is semantically similar (i.e., has a similar meaning) to a cached one.

    • Recognition: If the meaning is close enough (above a certain threshold), it might retrieve a related cached response or state.

    • Caveat: This is generally not how standard prompt prefix caching works in major APIs like OpenAI or Anthropic for cost/latency reduction, as they rely on exact matches to guarantee identical processing states. Semantic matching is more complex and might be used in different types of caching systems.
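Here's a rough sketch of the first two ideas together - exact prefix matching, with cryptographic hashes as the cache keys. The whitespace "tokenizer", the block size, and the class name are assumptions I made up for the example; real systems match on the model's own tokens and cache attention state, but the matching logic is the same in spirit: identical prefix, token for token, or no hit at all.

```python
import hashlib

def tokenize(text: str) -> list[str]:
    # stand-in tokenizer; real systems use the model's own tokenizer
    return text.split()

def prefix_key(tokens: list[str]) -> str:
    # cryptographic hash of the token sequence, used as the cache key
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()

class PrefixCache:
    def __init__(self, block_size: int = 16):
        self.block_size = block_size           # prefixes are cached in fixed-size blocks
        self.entries: dict[str, int] = {}      # prefix hash -> number of tokens covered

    def store(self, prompt: str) -> None:
        tokens = tokenize(prompt)
        # remember every block-aligned prefix of this prompt
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.entries[prefix_key(tokens[:end])] = end

    def longest_cached_prefix(self, prompt: str) -> int:
        tokens = tokenize(prompt)
        best = 0
        # walk block by block; stop at the first block whose hash is unknown
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if prefix_key(tokens[:end]) in self.entries:
                best = end                     # exact match so far -> cache hit
            else:
                break                          # any difference breaks the match
        return best

cache = PrefixCache()
old = "System: You are a helpful assistant. User: Analyze this document: " + "word " * 200
cache.store(old)
new = old + " User: Now summarize the key points."
hit = cache.longest_cached_prefix(new)
print(f"{hit} leading tokens can be reused; only the rest needs processing")
```

Notice that changing even one early token would change the hashes of every later block, so the match would break right there - exactly the "cache miss" behaviour described above.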
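And for contrast, a toy version of the semantic-similarity idea (in the spirit of tools like GPTCache, but not its actual API). Prompts are turned into simple bag-of-words vectors and a cached response is reused when the cosine similarity crosses a threshold; a real system would use a proper embedding model, and the 0.6 threshold here is an arbitrary choice for the demo.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # toy "embedding": bag-of-words counts; real systems use an embedding model
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

semantic_cache: list[tuple[Counter, str]] = []   # (prompt embedding, cached response)

def lookup(prompt: str, threshold: float = 0.6) -> str | None:
    vec = embed(prompt)
    # return the cached response of the most similar prompt, if it is close enough
    best = max(semantic_cache, key=lambda entry: cosine(vec, entry[0]), default=None)
    if best and cosine(vec, best[0]) >= threshold:
        return best[1]
    return None

semantic_cache.append((embed("What are the benefits of prompt caching?"),
                       "Lower cost and faster responses."))
print(lookup("What benefits does prompt caching have?"))    # similar meaning -> reuse
print(lookup("How do I fine-tune a model?"))                # unrelated -> None
```

This reuses a whole cached response for a "close enough" question, which is a different trade-off from the exact prefix matching above: it can save even more work, but it no longer guarantees the model sees an identical input.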

Prompt Caching in Different Platforms:
