Implementing A CAG Pipeline With Python


Yesterday, we looked at the introduction to Cache-Augmented Generation (CAG), and today, we’ll be implementing it in our PDF RAG pipeline. Instead of making retrieval calls to an external vector database on every request, the model will load the knowledge into its cache up front and answer from there, making responses faster and more efficient.
We're still using Google Colab, but we’re changing the model and runtime! We’ll be working with Microsoft’s open-source LLM Phi-4 from Hugging Face and Google’s Gemini. It’s going to be exciting and shorter than you thought! 😁
We’re building a CAG pipeline that enables us to chat with PDFs. Instead of retrieving information dynamically every time, the model will keep the document preloaded in its cache and answer from it directly, which makes inference faster.
(Don’t mind my bias for Python 🤣)
We are going to explore two ways to build a CAG pipeline: the easy way and the hard way.
The Easy Way: Using CAG on Gemini
When I was writing this blog initially, I only implemented the hard way but then I thought, why not make this technical mumbo jumbo less complex and more efficient? That’s when it hit me: Why not use CAG on Gemini? 🤔
I went through the Gemini API docs and found a caching feature we could use to implement CAG with Gemini.
Step 1: Install Dependencies
Before diving into the implementation, we need to install the google-genai library, which allows us to interact with Google's Gemini API.
!pip install google-genai
Step 2: Set Up Google Gemini Client
Now that we have the required library, let's set up the Gemini API client using our API key. This client allows us to communicate with the Gemini model and send requests.
from google import genai
from google.genai import types
client = genai.Client(api_key="Your_API_Key")
Step 3: Loading the PDF Knowledge Base
Before we can use CAG, we need to convert our PDF into text so the model can process it. We’ll use fitz from PyMuPDF to extract the text content from our uploaded PDF.
from google.colab import files
import fitz
# Upload a PDF
pdf = files.upload()
pdf_path = list(pdf.keys())[0]
# Open the PDF and extract text
doc = fitz.open(pdf_path)
full_text = "\n".join([page.get_text("text") for page in doc])
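A quick, optional sanity check confirms the extraction worked before we move on:
# Optional: confirm how much text we actually pulled out of the PDF
print(f"Extracted {len(full_text)} characters from {len(doc)} pages")
print(full_text[:300])  # preview the first few hundred characters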
Step 4: Setting Up the Cache
Google Gemini provides a caching feature, allowing us to store large documents for faster and more cost-efficient queries. Instead of sending the entire document each time, we cache it once and reference it in subsequent queries.
One drawback: the caching API enforces a minimum document size, so small documents (fewer than roughly 32k tokens) can’t be cached.
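Before creating the cache, it’s worth checking whether the document clears that minimum. Here’s a minimal sketch using the SDK’s token counter (the 32,768 threshold below is the documented minimum for gemini-1.5-flash-001; adjust it for the model you use):
# Count the document's tokens before attempting to cache it
token_count = client.models.count_tokens(
    model="models/gemini-1.5-flash-001",
    contents=full_text,
)
print("Document tokens:", token_count.total_tokens)
if token_count.total_tokens < 32_768:
    print("Too small to cache; send the text directly with each request instead.")
With that out of the way, we can create the cache: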
cache = client.caches.create(
    model="models/gemini-1.5-flash-001",
    config=types.CreateCachedContentConfig(
        ttl="3600s",
        display_name="PDF_CAG",
        system_instruction=(
            "You are a highly knowledgeable and detail-oriented AI assistant. "
            "Your task is to carefully read and understand the document provided. "
            "You should accurately answer users' questions based solely on the information "
            "in the document without adding extra knowledge. "
            "Provide concise, clear, and contextually relevant responses while maintaining professionalism."
        ),
        contents=[full_text],
    ),
)
The role of each parameter:
ttl: how long the cache lives before it expires (here, 3600 seconds).
display_name: a human-readable name for the cache entry.
system_instruction: the instructions and behaviour the model will follow when answering.
contents: the document content, loaded as text.
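The cache is a managed resource, so you can also inspect it, extend its lifetime, or delete it when you're done. A small sketch, assuming the caches.get, caches.update, and caches.delete helpers in the current google-genai SDK:
# Inspect the cache metadata (name, expiry time, token count)
print(client.caches.get(name=cache.name))

# Extend the cache's lifetime to two hours
client.caches.update(
    name=cache.name,
    config=types.UpdateCachedContentConfig(ttl="7200s"),
)

# Delete the cache once you're finished so you stop paying for storage
# client.caches.delete(name=cache.name)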
Step 5: Querying the Cached Content
With the document stored in Gemini's cache, we can now interact with it as if it were a chatbot. Instead of retrieving external data, the model will query its cache, ensuring faster responses.
chat = client.chats.create(
    model="gemini-1.5-flash-001",
    config=types.GenerateContentConfig(cached_content=cache.name),
)

user_input = input("Talk To Your PDFCAGBot:")
while user_input.lower() != "exit":
    response = chat.send_message(user_input)
    print("🤖:", response.text)
    user_input = input("Say Something:")
Now, our PDFCAGBot efficiently queries its preloaded cache and generates responses much faster than standard retrieval-augmented generation! 🚀
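To confirm the cache is actually being used, you can also send a one-off request outside the chat session and look at the response's usage metadata; the cached token count should roughly match the document size (a small sketch, assuming the usage_metadata field exposed by the SDK):
# One-off query that references the cached document
response = client.models.generate_content(
    model="gemini-1.5-flash-001",
    contents="Summarize the document in two sentences.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
print(response.usage_metadata)  # includes cached_content_token_count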
🔥 The Hard Way: Using Phi-4 with Traditional KV Caching
If you prefer more manual control over caching, you can use Microsoft's Phi-4 model, or any other open model of your choice.
Step 1: Processing the Knowledge Base
This is the easy part of the pipeline. We are going to extract text from our PDFs using fitz from PyMuPDF. We will also use the files module from Google Colab to upload PDFs from our device. Then, we will open the PDF using fitz, iterate through its pages, extract the text, store it in a list, and finally join the list into a single string.
from google.colab import files
import fitz
# Upload a PDF
pdf = files.upload()
pdf_path = list(pdf.keys())[0]
# Extract text from PDF
doc = fitz.open(pdf_path)
full_text = "\n".join([page.get_text("text") for page in doc])
Step 2: Loading Phi-4 Model and Creating the KV Cache
Now, we will load the Microsoft Phi-4 model and its tokenizer using the transformers library from Hugging Face. The extracted PDF text will be tokenized and passed to the model for processing.
The key part here is storing the model’s key-value pairs in a variable called kv_cache, which will serve as our retrieval-free cache for future queries.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4")

# Tokenize the knowledge base and prefill the model's KV cache with it
knowledgebase = tokenizer(full_text, return_tensors="pt").input_ids
with torch.no_grad():
    output = model(knowledgebase, use_cache=True)
kv_cache = output.past_key_values
Note: We used torch.no_grad() because we don’t need gradients for CAG, making it memory-efficient.
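If you’re curious what was actually stored, you can peek at the cache. Depending on your transformers version, past_key_values is either a DynamicCache object or a plain tuple of per-layer (key, value) tensors; this sketch handles both:
# Peek at the prefilled cache
if hasattr(kv_cache, "get_seq_length"):
    # Newer transformers: DynamicCache object
    print("Cached sequence length:", kv_cache.get_seq_length())
else:
    # Older transformers: tuple of (key, value) tensors per layer
    keys, values = kv_cache[0]
    print(f"{len(kv_cache)} layers cached, key shape: {tuple(keys.shape)}")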
Step 3: Querying with Cache-Augmented Generation
Now, we will create a function that enables the model to use its cache to generate a faster response to user queries with context.
Note that recent versions of the transformers library return this cache as a DynamicCache object rather than a plain tuple of past_key_values; either way, we can hand it back to the model at generation time so it reuses the precomputed keys and values.
How It Works:
1. Tokenize the user query and append its tokens to the knowledge-base tokens.
2. Pass the combined input to the model with past_key_values set to kv_cache.
3. The model reuses the preloaded cache instead of re-reading the document and generates a response grounded in that knowledge.
4. The output contains the prompt tokens plus the newly generated tokens, so we decode only the new part to get the answer.
import copy

def answer_from_cache(kv_cache, query):
    # Tokenize the query and append it to the cached document tokens
    input_ids = tokenizer(query, return_tensors="pt", add_special_tokens=False).input_ids
    full_ids = torch.cat([knowledgebase, input_ids], dim=-1)
    # Work on a copy so repeated queries don't keep extending the original cache
    cache = copy.deepcopy(kv_cache)
    with torch.no_grad():
        output = model.generate(
            full_ids,
            past_key_values=cache,  # reuse the precomputed KV cache
            max_new_tokens=200,     # cap the response length
        )
    # Decode only the newly generated tokens
    return tokenizer.decode(output[0][full_ids.shape[-1]:], skip_special_tokens=True)
Now, instead of retrieving knowledge from an external database, our model simply looks up its preloaded cache and generates a response much faster.
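Before wiring this into a chat loop, a quick one-off call is an easy way to check that the cached context is being picked up (the question below is just a placeholder; use anything your document can answer):
# Single test query against the cached knowledge base
print(answer_from_cache(kv_cache, "\nQuestion: What is this document about?\nAnswer:"))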
Handling Out-of-Context Queries
CAG is extremely efficient when dealing with static and small datasets, such as FAQs, customer support logs, or internal knowledge bases. However, what happens if a query is not found in the cache?
The Fallback Mechanism
Before returning an empty response, we can check if the query is similar to existing cached knowledge. If it is, we can guide the user to the closest available information while maintaining context.
def get_answer(query, kv_cache, knowledgebase):
    # First, try answering straight from the cached knowledge
    try:
        return answer_from_cache(kv_cache, query)
    except Exception:
        pass

    # Fallback: look for a sentence in the raw text that mentions the query
    knowledge_sentences = knowledgebase.split("\n")
    for sentence in knowledge_sentences:
        if query.lower() in sentence.lower():
            new_query = f"query: {query}\ncontext: {sentence}"
            return answer_from_cache(kv_cache, new_query)

    return "I'm sorry, but this query is out of scope for the current knowledge base. Try rephrasing or uploading additional documents."
Now, if the query isn't explicitly covered in the cached knowledge, we attempt to find the most relevant match, reformat the query with additional context, and then process it through the cache. If no match is found, we gracefully handle the out-of-context scenario.
Chatting with Your PDF
Finally, let’s integrate this into a simple chatbot loop where the user can interact with the PDF-based CAG system.
chat = input("Talk To Your PDFCAGBot:")
while chat.lower() != "exit":
    response = get_answer(chat, kv_cache, full_text)
    print("🤖:", response)
    chat = input("Say Something:")
Now, you can ask your PDFCAGBot anything related to the uploaded document, and it will efficiently respond using its cache! 🚀
Here’s the link to the Google Colab Notebook
This implementation significantly reduces retrieval time and makes the model more efficient in knowledge-intensive tasks. CAG is particularly well-suited for static, structured knowledge bases like FAQs and company handbooks, but when dynamic updates are needed, integrating a hybrid CAG + RAG system ensures we get the best of both worlds. 🚀
Thanks for reading
I hope you put this into practice
See you on the next one!