Prompt Caching - Simple Explanation

I came across the Prompt Caching technique while learning about RAG architecture.

Overview:

Prompt Caching is a technique where the system stores the repetitive sentences/words from our prompts to save time and cost.

  1. Prompts often contain largely the same words from request to request - so prompt caching stores the repetitive parts of the prompt.

  2. When we send the same prompt again - the system recognizes the repeated words and, instead of reprocessing them, uses the stored version.

Benefits of Prompt Caching:

  1. Faster API responses, since the LLM doesn't have to process all the words again.

  2. Lower cost, since the repeated words are not processed again and therefore need less processing power.

It's like giving the model a "memory" of previous computations, allowing it to quickly retrieve and use that information.
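To make this concrete, here's a tiny toy simulation in Python. It is not how any provider actually implements caching (real systems reuse the model's internal attention state, the KV cache, not strings), and the function and variable names are made up for illustration - but it shows where the time and cost savings come from: the long, repeated prefix is processed once, stored, and reused.

```python
import time

# Toy stand-in for the model processing tokens; deliberately slow so the
# savings are visible. Real providers reuse internal attention (KV) state.
def expensive_process(text: str) -> str:
    tokens = text.split()
    time.sleep(0.001 * len(tokens))            # simulate per-token work
    return f"<state for {len(tokens)} tokens>"

prefix_cache: dict[str, str] = {}              # stored results for prefixes we've seen

def run_prompt(shared_prefix: str, new_suffix: str) -> str:
    if shared_prefix in prefix_cache:          # cache hit: reuse the stored computation
        state = prefix_cache[shared_prefix]
    else:                                      # cache miss: pay the full cost once
        state = expensive_process(shared_prefix)
        prefix_cache[shared_prefix] = state
    # only the new part of the prompt still has to be processed
    return state + " + " + expensive_process(new_suffix)

doc = "System: You are a helpful assistant. User: Analyze this document: " + "word " * 500
run_prompt(doc, "User: Now summarize the key points.")   # slow: nothing cached yet
run_prompt(doc, "User: List the action items.")          # fast: the long prefix is reused
```

The second call returns almost immediately because the expensive work on the shared prefix was done only once - that's the whole benefit in miniature.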

I was curious about the "system recognizes" part: how is the system able to recognize these repetitive words without processing them again? Below are a few methods used for prompt caching, with small illustrative sketches after the list.

  • Exact Prefix Matching:

    • How it works: The system looks at the beginning (the prefix) of the incoming prompt. It compares this prefix, token by token, with prefixes it has already processed and stored in its cache from recent requests.

    • Recognition: If the entire sequence of tokens at the beginning of the new prompt exactly matches a sequence stored in the cache, it's a "cache hit." The system recognizes that this part doesn't need reprocessing.

    • This is used by providers like OpenAI and Anthropic.

    • Example:

      • Cached Prompt Prefix: "System: You are a helpful assistant. User: Analyze this document: [long document text]"

      • New Prompt: "System: You are a helpful assistant. User: Analyze this document: [long document text] User: Now summarize the key points."

      • Recognition: The system sees that the beginning of the new prompt is identical to the cached prefix up to the end of the document text. It recognizes this match.

    • Why it works: This method is simple and efficient. It guarantees that the reused part is identical, ensuring the model's state is consistent. However, even a tiny change (like a single different character or extra space) in the prefix will break the match ("cache miss").

  • Cryptographic Hashing (Often used internally for cache keys):

    • How it works: While not the primary recognition method for the content itself during comparison, hashing is often used to create unique identifiers (cache keys) for the stored prefixes. The system calculates a unique digital fingerprint (a hash) for the sequence of tokens in the cacheable prefix.

    • Recognition: When a new request comes in, the system calculates the hash for its prefix. It then quickly checks if a cache entry with that exact same hash exists. If it does, it means the content is identical.

    • Benefit: Hashing is very fast for looking up whether an exact match exists in a large cache. It can also help with privacy, since the lookup works on hashes rather than the raw text.

  • Semantic Similarity (Less common for direct prefix caching, more for related concepts):

    • How it works: This is more advanced and less commonly used for the strict "prefix caching" we've discussed, but related concepts exist (for example, GPTCache). Instead of looking for an exact match, the system might use techniques like vector embeddings to determine if a new prompt is semantically similar (i.e., has a similar meaning) to a cached one.

    • Recognition: If the meaning is close enough (above a certain threshold), it might retrieve a related cached response or state.

    • Caveat: This is generally not how standard prompt prefix caching works in major APIs like OpenAI or Anthropic for cost/latency reduction, as they rely on exact matches to guarantee identical processing states. Semantic matching is more complex and might be used in different types of caching systems.
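Here's a rough sketch of the first two ideas together - exact prefix matching, with cryptographic hashes as the cache keys. The whitespace "tokenizer", the block size, and the class name are assumptions I made up for the example; real systems match on the model's own tokens and cache attention state, but the matching logic is the same in spirit: identical prefix, token for token, or no hit at all.

```python
import hashlib

def tokenize(text: str) -> list[str]:
    # stand-in tokenizer; real systems use the model's own tokenizer
    return text.split()

def prefix_key(tokens: list[str]) -> str:
    # cryptographic hash of the token sequence, used as the cache key
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()

class PrefixCache:
    def __init__(self, block_size: int = 16):
        self.block_size = block_size           # prefixes are cached in fixed-size blocks
        self.entries: dict[str, int] = {}      # prefix hash -> number of tokens covered

    def store(self, prompt: str) -> None:
        tokens = tokenize(prompt)
        # remember every block-aligned prefix of this prompt
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.entries[prefix_key(tokens[:end])] = end

    def longest_cached_prefix(self, prompt: str) -> int:
        tokens = tokenize(prompt)
        best = 0
        # walk block by block; stop at the first block whose hash is unknown
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if prefix_key(tokens[:end]) in self.entries:
                best = end                     # exact match so far -> cache hit
            else:
                break                          # any difference breaks the match
        return best

cache = PrefixCache()
old = "System: You are a helpful assistant. User: Analyze this document: " + "word " * 200
cache.store(old)
new = old + " User: Now summarize the key points."
hit = cache.longest_cached_prefix(new)
print(f"{hit} leading tokens can be reused; only the rest needs processing")
```

Notice that changing even one early token would change the hashes of every later block, so the match would break right there - exactly the "cache miss" behaviour described above.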
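And for contrast, a toy version of the semantic-similarity idea (in the spirit of tools like GPTCache, but not its actual API). Prompts are turned into simple bag-of-words vectors and a cached response is reused when the cosine similarity crosses a threshold; a real system would use a proper embedding model, and the 0.6 threshold here is an arbitrary choice for the demo.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # toy "embedding": bag-of-words counts; real systems use an embedding model
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

semantic_cache: list[tuple[Counter, str]] = []   # (prompt embedding, cached response)

def lookup(prompt: str, threshold: float = 0.6) -> str | None:
    vec = embed(prompt)
    # return the cached response of the most similar prompt, if it is close enough
    best = max(semantic_cache, key=lambda entry: cosine(vec, entry[0]), default=None)
    if best and cosine(vec, best[0]) >= threshold:
        return best[1]
    return None

semantic_cache.append((embed("What are the benefits of prompt caching?"),
                       "Lower cost and faster responses."))
print(lookup("What benefits does prompt caching have?"))    # similar meaning -> reuse
print(lookup("How do I fine-tune a model?"))                # unrelated -> None
```

This reuses a whole cached response for a "close enough" question, which is a different trade-off from the exact prefix matching above: it can save even more work, but it no longer guarantees the model sees an identical input.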

Prompt Caching in Different Platforms:
