RAG, CAG or Both?

Leo Bcheche
13 min read

As impressive as they are, large language models (LLMs) have a critical limitation: they can’t know what they haven’t learned. This means that recent or domain-specific information — like the latest monetary policy decisions from the Fed — simply doesn’t exist for the model unless it was included in its training data. To overcome this limitation and expand the usefulness of LLMs, two powerful strategies have emerged: RAG (Retrieval-Augmented Generation) and CAG (Cache-Augmented Generation).


1 - Understanding RAG

RAG works like a dynamic duo: it retrieves first, then generates. It’s structured in two phases: offline and online.

1.1 - Offline Phase

In the offline phase, documents are broken down into small chunks and converted into vector embeddings using embedding models. These vectors are stored in a vector database, creating an index that can be searched quickly and intelligently.
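The example in section 1.3 skips the chunking step for brevity, so here is a minimal sketch of the full offline phase, assuming LangChain's text splitter package (langchain-text-splitters) plus the same FAISS and Hugging Face components used later in this article. The document text, chunk sizes, and index path are illustrative only.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Placeholder document to make searchable (a real document would be much longer)
long_document = (
    "Device X User Manual. "
    "To restart Device X, press and hold the power button for 5 seconds. "
    "To reset it to factory settings, press the reset pinhole for 10 seconds. "
    "Battery life typically lasts 8 hours on a full charge."
)

# 1. Chunking: break the document into small, overlapping pieces
#    (tiny values for this toy text; real pipelines often use 500-1000 characters)
splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=20)
chunks = splitter.split_text(long_document)

# 2. Embedding: load a model that converts each chunk into a vector
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 3. Indexing: store the chunk vectors in a FAISS index for fast similarity search
vectorstore = FAISS.from_texts(chunks, embedding_model)
vectorstore.save_local("offline_index")  # persist the index for the online phase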

But what exactly are vector embeddings?

Imagine you have three sentences:

  • “The dog is playing in the park.”

  • “A pup ran across the lawn.”

  • “The cake recipe uses flour and eggs.”

Even though they use different words, the first two sentences talk about the same thing: a dog playing outdoors. The third one, however, is about something completely different: cooking.

An embedding model converts each of these sentences into a numerical vector, for example:

  • Vector A (sentence 1): [0.21, -0.11, 0.98, ...]

  • Vector B (sentence 2): [0.23, -0.13, 0.96, ...]

  • Vector C (sentence 3): [0.80, 0.45, -0.02, ...]

In the vector space, vectors A and B will be very close to each other because they represent similar ideas. Vector C, on the other hand, will be much farther away, since its content is unrelated to the others.
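To make “very close” and “much farther away” concrete, the small sketch below computes cosine similarity between toy 3-dimensional versions of the vectors above. Real embeddings have hundreds of dimensions; these truncated values are purely illustrative.

import numpy as np

# Truncated, illustrative versions of the vectors above
vector_a = np.array([0.21, -0.11, 0.98])  # "The dog is playing in the park."
vector_b = np.array([0.23, -0.13, 0.96])  # "A pup ran across the lawn."
vector_c = np.array([0.80, 0.45, -0.02])  # "The cake recipe uses flour and eggs."

def cosine_similarity(u, v):
    # 1.0 means same direction (similar meaning); values near 0 mean unrelated content
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print("A vs B:", round(cosine_similarity(vector_a, vector_b), 3))  # ~1.0 -> similar ideas
print("A vs C:", round(cosine_similarity(vector_a, vector_c), 3))  # ~0.1 -> unrelated topic

This similarity score is exactly what the vector database computes in the online phase to decide which chunks are handed to the LLM.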

1.2 - Online Phase

In the online phase, when a user asks a question like “What animal runs in the park?”, that question is also converted into a vector. The system then compares this vector to all the vectors in the database, finds the most similar ones — in this case, the sentences about the dog and the pup — and sends that information as context to the language model. The LLM uses this data to generate a more accurate and well-informed response.

This technique of converting text into numbers and searching based on meaning — not just exact words — is what makes RAG so powerful for systems that need to answer questions across diverse and complex content.

1.3 - Example

# Import FAISS, a vector database for storing document embeddings
from langchain_community.vectorstores import FAISS
# Import a free sentence embedding model from Hugging Face
from langchain_huggingface import HuggingFaceEmbeddings

# Import a language model interface from Groq via LangChain
from langchain_groq import ChatGroq
# Import the RetrievalQA chain that handles RAG logic
from langchain.chains import RetrievalQA

from dotenv import load_dotenv
import os
load_dotenv()

API_KEY = os.getenv("GROQ_API_KEY")
MODEL = os.getenv("GROQ_MODEL")

# =========================
# 1. KNOWLEDGE BASE
# =========================

# A small list of factual documents for the system to retrieve from
documents = [
    "In June 2025, the Federal Reserve raised the interest rate to 5.25%.",
    "ChatGPT is a language model developed by OpenAI.",
    "Python is a popular programming language for data science and AI."
]

# ===============================
# 2. EMBEDDING GENERATION
# ===============================

# Load a lightweight and fast embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Convert all documents to vectors and store them in a FAISS index
vectorstore = FAISS.from_texts(documents, embedding_model)

# ================================
# 3. RAG PIPELINE SETUP
# ================================

# Create a retriever that performs semantic similarity searches
retriever = vectorstore.as_retriever()

# Initialize the language model (LLM) with the Groq credentials loaded from .env
llm = ChatGroq(
    api_key=API_KEY,     # Groq API key loaded from the .env file
    model=MODEL          # Model name loaded from the .env file (e.g., a Llama 3 model hosted on Groq)
)
# Connect the retriever and the LLM in a RAG pipeline
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# ================================
# 4. USER QUERY
# ================================

# Define the user's question
question = "What is the current interest rate set by the Federal Reserve?"
# Run the RAG chain: retrieve relevant info and generate the answer
answer = qa_chain.invoke({"query": question})["result"]
# Output the final response
print(f"❓ Question: {question}\n💡 Answer: {answer}")

# ================================
# 5. OUTPUT
# ================================
#❓ Question: What is the current interest rate set by the Federal Reserve?
#💡 Answer: The current interest rate set by the Federal Reserve is 5.25%, as of June 2025.

2 - Understanding CAG

CAG takes the opposite approach: instead of retrieving only what’s relevant, it loads everything that might be needed all at once. The entire knowledge base is structured into a single prompt, which is fully processed by the LLM. This processing generates an internal cache (known as the KV cache), which acts as a temporary memory where the model has already “read” and stored all the content.

With that in place, when the model receives a question, it can respond quickly and efficiently using the preloaded information.

CAG is ideal when working with small, fixed knowledge sets — especially when low latency is essential, such as in technical support bots with limited manuals or embedded systems that require instant responses.

2.1 - Example

from langchain_groq import ChatGroq
from langchain.schema import HumanMessage

from dotenv import load_dotenv
import os
load_dotenv()

API_KEY = os.getenv("GROQ_API_KEY")
MODEL = os.getenv("GROQ_MODEL")


# 1. Simulated knowledge base (will be stored in the model’s KV cache)
knowledge_base = """
Here is the product manual for Device X:
- To restart Device X, press and hold the power button for 5 seconds.
- To reset it to factory settings, press the reset pinhole for 10 seconds.
- Battery life typically lasts 8 hours on full charge.
"""

# 2. User’s question — added after the cache is built
user_question = "How do I restart Device X?"

# 3. Full context prompt — knowledge base + query
# The LLM will process the entire prompt, store the knowledge in its KV cache, then generate a response
full_prompt = f"{knowledge_base}\n\nQuestion: {user_question}"

# 4. Initialize the LLM (Groq model) — will process and cache the input
llm = ChatGroq(
    api_key=API_KEY,  # Groq API key loaded from the .env file
    model=MODEL       # Model name loaded from the .env file
)

# 5. Send the full prompt — the model internally builds a KV cache for efficient generation
response = llm.invoke([
    HumanMessage(content=full_prompt)
])
# 6. Output the user question
print("❓ Question:", user_question )
# 7. Output the response (generated using the cached knowledge)
print("💡 Answer:", response.content)

# ================================
# OUTPUT
# ================================
#❓ Question: How do I restart Device X?
#💡 Answer: To restart Device X, simply press and hold the power button for 5 seconds.
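Because the knowledge base never changes, the same preloaded context can serve many questions. The sketch below reuses the knowledge_base, llm, and HumanMessage objects from the example above; note that with a hosted API the KV cache may be rebuilt on each call, so true cache reuse depends on the inference stack (for example, prefix caching on a locally hosted model).

# Reuse the same preloaded knowledge base for several questions
questions = [
    "How do I restart Device X?",
    "How long does the battery last on a full charge?",
    "How do I reset Device X to factory settings?",
]

for q in questions:
    # The static knowledge always comes first, so the prompt prefix stays identical across calls
    prompt = f"{knowledge_base}\n\nQuestion: {q}"
    reply = llm.invoke([HumanMessage(content=prompt)])
    print("❓ Question:", q)
    print("💡 Answer:", reply.content, "\n")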

3 - RAG vs. CAG

While both RAG and CAG aim to enhance the reasoning capabilities of language models by providing external knowledge, they follow very different approaches — and each comes with trade-offs. To better understand when to use one over the other, let’s break down how they compare across four key dimensions: accuracy, latency, scalability, and data updates.

3.1 - Accuracy (CAG Wins)

In RAG, accuracy depends on the strength of the retrieval mechanism: if it fails to fetch the right information, the model may generate incorrect answers. In CAG, all the information is already in the prompt, so nothing can be lost to a faulty retrieval step; the trade-off is that the model has to sift through everything on its own and may get distracted by irrelevant content.

3.2 - Latency (CAG Wins)

RAG introduces a search step, which slightly increases latency. CAG, on the other hand, responds almost instantly after the initial load, since no additional retrieval is needed.

3.3 - Scalability (RAG Wins)

This is where RAG truly shines. It can index millions of documents and retrieve only what’s relevant per query. CAG, by contrast, is limited by the model’s context window — typically between 32,000 and 100,000 tokens — which restricts how much content can be preloaded at once.

3.4 - Data Updates (RAG Wins)

RAG supports fast, incremental updates. CAG, however, relies on a full cache and needs to be completely reloaded when information changes, which can be costly in dynamic systems.
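To make the difference concrete, here is a minimal sketch of an incremental RAG update, reusing the vectorstore and retriever built in section 1.3; the new document text is illustrative only. With CAG, the equivalent change would mean editing the knowledge text and re-processing the entire prompt, which rebuilds the whole cache.

# RAG: incremental update. Add a new (illustrative) document to the existing FAISS index
vectorstore.add_texts([
    "In September 2025, the Federal Reserve adjusted the interest rate again."
])

# The retriever sees the new document immediately; nothing else has to be rebuilt.
# The top hit should now be the newly added text.
print(retriever.invoke("What did the Federal Reserve do in September 2025?")[0].page_content)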


4 - When Should You Use Each One?

Use RAG when your knowledge base is large, frequently updated, or when you need high precision with traceability — such as:

  • Legal research assistants that search case law and regulations

  • Academic research tools that access vast libraries or papers

  • Enterprise chatbots accessing dynamic internal documentation

  • News summarizers or fact-checkers with real-time information

  • Medical assistants that need to cite updated clinical guidelines

Use CAG when your knowledge is static, compact, and you need fast, uninterrupted responses — such as:

  • Technical support bots with compact product manuals

  • Offline chatbots embedded in mobile apps or IoT devices

  • Interactive voice assistants for predefined domains (e.g., hotel check-in instructions)

  • User guides and FAQs that rarely change

  • Educational tutors teaching fixed curricula (e.g., math formulas, historical facts)

Use RAG+CAG in critical systems that demand efficient retrieval and seamless, context-rich follow-up — such as:

  • Clinical decision support systems where a patient’s history is preloaded (CAG), and recent medical literature is retrieved (RAG)

  • Legal assistants that preload the client’s case file (CAG) and search for matching precedents (RAG)

  • Enterprise copilots that cache session-specific data (like an ongoing project) while retrieving new info from company databases

  • Multistep reasoning agents that combine fixed prompts with dynamic lookups to maintain coherence across dialogue turns


5 - CAG with RAG

In this hybrid model, the process begins just like standard RAG: the system receives a user question, converts it into a vector, and performs a semantic search in a vector database. From this search, it retrieves only the most relevant passages related to the query. Up to this point, it operates exactly like traditional RAG.

The difference comes in the next step. Instead of sending the retrieved chunks in a fragmented or incremental way, the system consolidates all the retrieved content, together with additional preloaded knowledge, into a single, structured, and cohesive prompt. This combined context, along with the original question, is then sent to the language model, which processes it all in a single forward pass. The LLM stores this information in its temporary memory — known as the KV cache — enabling it to generate a more coherent, connected, and context-aware response.

5.1 - Example

from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
from langchain.schema import HumanMessage


from dotenv import load_dotenv
import os
load_dotenv()

API_KEY = os.getenv("GROQ_API_KEY")
MODEL = os.getenv("GROQ_MODEL")

# 1. Patient history (static context) — CAG
patient_history = """
Patient: Jane Smith
Age: 72
Condition: Chronic atrial fibrillation
History:
- Currently on Warfarin for anticoagulation
- Diagnosed with mild renal impairment
- Has experienced occasional dizziness and one minor fall in the past year
"""

# 2. Simulated medical literature — RAG corpus
medical_literature = [
    "A 2023 study in the Journal of Cardiology suggests DOACs are safer than Warfarin in elderly patients with atrial fibrillation.",
    "Renal function should be monitored when prescribing DOACs in patients over 70 years old.",
    "Anticoagulant-related falls are more common in patients over 65, especially with Warfarin.",
    "Aspirin is no longer recommended for stroke prevention in patients with atrial fibrillation due to limited efficacy."
]

# 3. Create vector store from literature (RAG)
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_texts(medical_literature, embedding_model)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # top 2 relevant

# 4. User query
question = "Should Jane Smith continue using Warfarin or consider switching to a DOAC?"

# 5. Retrieve relevant medical literature
retrieved_docs = retriever.invoke(question)
rag_context = "\n".join(f"[Study {i+1}]\n{doc.page_content.strip()}" for i, doc in enumerate(retrieved_docs))

# 6. Build full prompt: CAG (static history) + RAG (retrieved evidence)
full_prompt = f"""
You are a clinical decision support assistant.

==== Patient History ====
{patient_history}

==== Retrieved Literature ====
{rag_context}

==== Clinical Question ====
{question}

Please provide a well-justified recommendation, 
considering both the patient's history and the latest medical evidence.
"""

print("=======================\n FULL PROMPT\n=======================\n", full_prompt)

# 7. Call the LLM
llm = ChatGroq(api_key=API_KEY, model=MODEL)
response = llm.invoke([HumanMessage(content=full_prompt)])

# 8. Output the answer
print("=======================\n ANSWER\n=======================\n", response.content)

The resulting output would be something like this:

"""
=======================
 FULL PROMPT
=======================

You are a clinical decision support assistant.

==== Patient History ====

Patient: Jane Smith
Age: 72
Condition: Chronic atrial fibrillation
History:
- Currently on Warfarin for anticoagulation
- Diagnosed with mild renal impairment
- Has experienced occasional dizziness and one minor fall in the past year


==== Retrieved Literature ====
[Study 1]
A 2023 study in the Journal of Cardiology suggests DOACs are safer than Warfarin in elderly patients 
with atrial fibrillation.
[Study 2]
Renal function should be monitored when prescribing DOACs in patients over 70 years old.

==== Clinical Question ====
Should Jane Smith continue using Warfarin or consider switching to a DOAC?

Please provide a well-justified recommendation, considering both the patient's history and the 
latest medical evidence.       

=======================
 ANSWER
=======================
 Based on the patient's history and the latest medical evidence, I recommend considering switching 
Jane Smith from Warfarin to a Direct Oral Anticoagulant (DOAC). Here's a well-justified reasoning 
for this recommendation:

1. **Safety in Elderly Patients**: The 2023 study in the Journal of Cardiology suggests that DOACs 
are safer than Warfarin in elderly patients with atrial fibrillation. Given Jane's age (72), this 
study's findings are particularly relevant, indicating a potential reduction in risk when using DOACs
compared to Warfarin.

2. **Renal Impairment Consideration**: Jane has been diagnosed with mild renal impairment. While 
Study 2 advises monitoring renal function when prescribing DOACs to patients over 70, this does 
not necessarily contraindicate their use. In fact, some DOACs have been shown to be effective and
safe in patients with mild renal impairment, provided that renal function is regularly monitored. 
This suggests the need for careful dose selection and monitoring rather than a blanket avoidance of 
DOACs.

3. **History of Dizziness and Falls**: Jane's history of occasional dizziness and a minor fall could 
be related to various factors, including her anticoagulation therapy. Switching to a DOAC might offer 
a more stable and predictable anticoagulant effect, potentially reducing the risk of falls due to less 
variability in anticoagulation levels compared to Warfarin.

4. **Anticoagulation Management**: DOACs are known for their ease of use, with fewer dietary 
restrictions and less need for regular blood monitoring compared to Warfarin. This could improve 
Jane's quality of life and adherence to her anticoagulation regimen, which is crucial for effective 
stroke prevention in atrial fibrillation.

5. **Individualized Care**: The decision to switch from Warfarin to a DOAC should be made on a 
case-by-case basis, considering Jane's overall health status, the severity of her renal impairment, 
and her preferences. It's essential to weigh the benefits of reduced monitoring and potentially lower 
risk of bleeding against the need for careful renal function monitoring and the costs associated with DOACs. 

In conclusion, considering Jane Smith's age, medical history, and the latest evidence suggesting the 
safety and efficacy of DOACs in elderly patients, switching from Warfarin to a DOAC could be a 
reasonable option. However, this decision should be accompanied by close monitoring of her renal 
function and adjustment of the DOAC dose as necessary, along with education on the signs of bleeding 
or other adverse effects. A thorough discussion with Jane about the potential benefits and risks of 
this switch is also crucial to ensure her informed involvement in the decision-making process.

"""

The code above showcases a minimal yet powerful implementation of a hybrid RAG + CAG system tailored for clinical decision support. It combines two distinct knowledge sources: a static patient history (CAG) and dynamically retrieved medical literature (RAG).

The patient’s details are hardcoded to simulate a cache of known information that remains unchanged across questions. Meanwhile, the vector database built with FAISS and Hugging Face embeddings allows the system to semantically search and retrieve the most relevant studies to support evidence-based recommendations.

This organization not only improves interpretability but also primes the LLM to reason using both prior knowledge and the latest research. By sending the full prompt in one go, the model leverages its KV cache to hold all relevant context in memory, enabling a more accurate, justifiable, and personalized response.

6 - Final Thoughts

RAG and CAG are like two different lenses on the same telescope: each one opens a different path toward integrating external knowledge into LLMs.

Choosing one over the other (or combining both) depends on your specific use case, the nature of your information, and performance requirements.

In a world where knowledge is constantly evolving, knowing when to use RAG or CAG can be the key difference between a helpful assistant and a truly intelligent system.
