RAG: Unlocking the Power of Retrieval-Augmented Generation 🚀

In the AI world, Retrieval-Augmented Generation (RAG) has emerged as a groundbreaking approach that combines the strengths of retrieval systems with generative models. This hybrid technique enhances the capabilities of AI systems, allowing them to deliver more accurate and contextually relevant responses.
What is RAG?
RAG is a framework that combines information retrieval with generative AI models. By retrieving relevant documents or snippets from large document collections, web pages, or data extracted from images, RAG provides additional context to the generative model, enabling it to produce more informed and accurate outputs.
Why Do We Need RAG?
1. Enhanced Contextual Understanding
Traditional generative models, like GPT, can generate coherent and fluent text but often lack specific knowledge or context. RAG addresses this by retrieving relevant information that the model can use to generate responses grounded in factual data.
2. Improved Accuracy
RAG enhances the accuracy of AI-generated responses by giving the model access to a dynamic knowledge base, which is especially useful for users who mainly care about their project-specific data. This is particularly beneficial in domains where up-to-date information is critical.
3. Scalability
As the volume of available information grows, RAG enables systems to scale effectively by leveraging retrieval mechanisms to filter relevant data, ensuring that the generative model processes only the most pertinent information.
🏗️ RAG Architecture Overview
1. User Interaction (Prompt + Query)
The user inputs a question through an interface (CLI, chatbot, API).
The system captures this intent to trigger the retrieval and generation process.
2. Semantic Search via Vector Database
The input query is converted into a vector embedding.
This vector is matched against stored vectors in a database like Qdrant to retrieve semantically similar chunks—not just keyword matches.
3. Prompt Construction with Retrieved Context
Retrieved content is structured into a single prompt along with the user’s query.
This structured prompt ensures the LLM has the context it needs to answer correctly.
4. Response Generation by the LLM
The LLM processes the composite prompt and generates a relevant, grounded answer.
The quality and correctness of this answer heavily depend on the quality of the retrieved chunks.
5. Post-Processing
The raw output is polished (e.g., grammar-checked, filtered, or restructured) before it is shown to the user. A minimal code sketch of these five stages follows below.
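To make the five stages concrete, here is a minimal, self-contained sketch. The word-count "embedding", the two hard-coded chunks, and the final print are illustrative assumptions only; a real pipeline would use a learned embedding model, a vector database such as Qdrant, and an actual LLM call:
import math
from collections import Counter

# Toy "embedding": a bag-of-words vector (illustration only; real
# systems use a learned embedding model).
def embed(text):
    return Counter(text.lower().split())

# Cosine similarity between two bag-of-words vectors.
def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stage 2 stand-in for a vector database: pre-embedded chunks.
chunks = [
    "Comic-based math book that teaches fractions through stories.",
    "Advanced calculus reference covering limits and derivatives.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Stage 1: the user's query.
query = "comic based math learning"

# Stage 2: semantic search, ranking chunks by similarity to the query.
best_chunk = max(index, key=lambda item: cosine(embed(query), item[1]))[0]

# Stage 3: prompt construction with the retrieved context.
prompt = f"Context: {best_chunk}\n\nQuestion: {query}"

# Stages 4 and 5 would send this prompt to an LLM and post-process the answer.
print(prompt)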
How does RAG create the vector embeddings?
Now that we know what RAG is and why we need it, let's understand how RAG helps the LLM, and on what basis it helps the user receive the correct information.
Before a user can ask a query about some custom data, someone has to upload that data to the system so that the LLM knows it should respond based on it.
So, in most cases, we see an option to upload files or images to the chatbot. This is nothing but feeding data to the behind-the-scenes GPT model, where the input (e.g., a PDF file) is vectorised using a vector embedding technique that breaks it down into small chunks and creates tokens in a vector space.
Let's understand the process of chunking.
When we upload a file to GPT, it is very hard to read the whole file and serve data based on the user's input; instead, we can split the PDF file based on page numbers or paragraphs.
🧮 Indexing/Ingestion:
Data Organization: Imagine a little girl surrounded by textbooks. She might get irritated trying to find the relevant information in one of the books; e.g., she wants to learn advanced mathematics concepts, or she might be interested in comics-based math learning. Let's break the problem down: take each book and split it into a few bite-sized pieces, e.g., one might be comics-based math, one might be scalar math, and so on.
Each of these bites is a vector, which is like an address or reference for that book; subsequently, each bite or page is another vector that serves as a reference for the particular topic.
Vector Creation: Each of these chunks is then passed through an embedding model, a type of model that creates a vector representation of hundreds or thousands of numbers encapsulating the meaning of the information, and this gets stored in the vector database.
The model assigns a unique vector to each chunk—sort of like creating a unique index that a computer can understand. This is known as the indexing stage.
🔎 Retrieval:
Now suppose we have a user query; for example, the little girl asks for comics-based math. The query itself is also turned into a vector, and this query vector is searched against the vector space so that the query and the pre-ingested data can be matched; the user then gets a response back based on the matching chunks of data. This whole flow is called the RAG pipeline.
<———RAG Pipeline———>
In the case above, focus on the retrieval flow, which is connected to the ingestion flow; the retrieval process has the steps below:
User Query: e.g., the little girl wants to know which books are comics-based mathematics, and she passes the query to GPT.
GPT breaks the user query down into a vector embedding, so that relevant tokens for specific keywords like "comic based mathematics" are generated.
GPT then searches this query embedding against the indexed vector database.
The vector database returns a response to the RAG pipeline based on the stored vectors that are nearest to the query.
Finally, in the fifth stage, the RAG pipeline combines the user query with the response it received from the vector database and feeds that to the LLM, which refines the response and provides an appropriate result.
Let's code:
For example, let's try to search something in a Node.js PDF file, which can be downloaded from the Node.js website.
Ingestion:
We will be using LangChain here to work with files. LangChain is a powerful open-source framework designed to help developers build applications powered by large language models (LLMs). It simplifies the process of connecting LLMs to various data sources, creating complex workflows, and ultimately deploying them in real-world applications. It is a kind of utility that provides numerous operations through predefined modules.
Now, to load a PDF file using LangChain, let's search for "LangChain Document Loader":
https://python.langchain.com/docs/integrations/document_loaders/pypdfloader/
Install LangChain with pip as per the documentation; the LangChain PyPDF integration will load the PDF file and convert it into text format.
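The install looks roughly like this (package names per the linked documentation; pypdf is the underlying parser PyPDFLoader uses):
pip install -qU langchain-community pypdf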
from langchain_community.document_loaders import PyPDFLoader
from pathlib import Path

# Path to the PDF placed next to this script
pdf_file = Path(__file__).parent / "nodejs.pdf"

# Load the PDF into LangChain Documents (one per page)
loader = PyPDFLoader(file_path=pdf_file)
docs = loader.load()
print("Docs", docs[0])
If we run this, we get the entire file, so let's print just the first element.
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG % /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/.venv/bin/python /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/my_rag.py
Ignoring wrong pointing object 268 0 (offset 0)
Ignoring wrong pointing object 309 0 (offset 0)
Docs page_content='A PDF Reference for The Complete Node.js Dev Course Version 3.0' metadata={'producer': 'macOS Version 10.14.1 (Build 18B75) Quartz PDFContext', 'creator': 'Acrobat PDFMaker 17 for Word', 'creationdate': "D:20190227140340Z00'00'", 'author': 'Andrew Mead', 'moddate': "D:20190227140340Z00'00'", 'source': '/Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/nodejs.pdf', 'total_pages': 125, 'page': 0, 'page_label': '1'}
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG %
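For orientation, loader.load() returns one Document per PDF page, so the length of docs should match the total_pages value in the metadata above:
print(len(docs))  # 125 for this PDF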
Now that we have a page-by-page array, let's work on the next stage:
Chunking:
Let's use the LangChain library again for the chunking process. For this, we can use RecursiveCharacterTextSplitter.
https://python.langchain.com/docs/concepts/text_splitters/
# Chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=400
)
splitted_docs = text_splitter.split_documents(documents=docs)
print("texts", splitted_docs[1])
The text splitter splits the docs that were ingested, based on chunk_size; here we have taken chunk_size=1000 and chunk_overlap=400.
chunk_overlap makes sure there is always shared context between the first chunk and the subsequent chunk, so there is little or no possibility of missing relevant context during data search.
Output of the above code snippet with the element at index 1:
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG % /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/.venv/bin/python /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/my_rag.py
Ignoring wrong pointing object 268 0 (offset 0)
Ignoring wrong pointing object 309 0 (offset 0)
texts page_content='Version 1.0 2
Section 1: Welcome ................................................................................................................... 8
Section 2: Installing and Exploring Node.js .......................................................................... 8
Lesson 1: Section Intro ....................................................................................................................... 8
Lesson 2: Installing Node.js and Visual Studio Code ............................................................... 8
Lesson 3: What is Node.js? .............................................................................................................. 8
Lesson 4: Why Should I Use Node.js? ........................................................................................... 9
Lesson 5: Your First Node.js Script ................................................................................................ 9' metadata={'producer': 'macOS Version 10.14.1 (Build 18B75) Quartz PDFContext', 'creator': 'Acrobat PDFMaker 17 for Word', 'creationdate': "D:20190227140340Z00'00'", 'author': 'Andrew Mead', 'moddate': "D:20190227140340Z00'00'", 'source': '/Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/nodejs.pdf', 'total_pages': 125, 'page': 1, 'page_label': '2'}
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG %
Output of the above code snippet with the element at index 2:
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG % /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/.venv/bin/python /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/my_rag.py
Ignoring wrong pointing object 268 0 (offset 0)
Ignoring wrong pointing object 309 0 (offset 0)
texts page_content='Lesson 4: Why Should I Use Node.js? ........................................................................................... 9
Lesson 5: Your First Node.js Script ................................................................................................ 9
Section 3: Node.js Module System........................................................................................ 10
Lesson 1: Section Intro ...................................................................................................................... 10
Lesson 2: Importing Node.js Core Modules ............................................................................... 10
Lesson 3: Importing Your Own Files .............................................................................................. 11
Lesson 4: Importing npm Modules ................................................................................................ 12' metadata={'producer': 'macOS Version 10.14.1 (Build 18B75) Quartz PDFContext', 'creator': 'Acrobat PDFMaker 17 for Word', 'creationdate': "D:20190227140340Z00'00'", 'author': 'Andrew Mead', 'moddate': "D:20190227140340Z00'00'", 'source': '/Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/nodejs.pdf', 'total_pages': 125, 'page': 1, 'page_label': '2'}
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG %
Notice that Lesson 5 is repeated; in short, we have shared context between the chunk at index 1 and the chunk at index 2. This is why chunk_overlap is important.
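To see the overlap in isolation, here is a small sketch on a toy string (the sizes are arbitrary values chosen for illustration):
from langchain_text_splitters import RecursiveCharacterTextSplitter

toy_text = "one two three four five six seven eight nine ten " * 5
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=40)
for chunk in splitter.split_text(toy_text):
    print(repr(chunk))
# Each chunk repeats roughly the last 40 characters of the previous one,
# which is exactly the shared context we saw between index 1 and index 2.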
Vector Embedding
Now, we have to create vector embeddings. Vector embeddings convert the text chunks into numeric vectors that can be stored in a vector database. To create vector embeddings, we can again use LangChain:
https://python.langchain.com/docs/integrations/text_embedding/openai/
pip install -qU langchain-openai
from langchain_openai import OpenAIEmbeddings
# Vector embedding
# At this stage you will need an OpenAI secret key to use the OpenAI embedding model.
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large"
)
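As a quick sanity check (this assumes a valid OPENAI_API_KEY in your environment), you can embed a single string and inspect the vector length:
vector = embeddings.embed_query("What is Node.js?")
print(len(vector))  # text-embedding-3-large returns 3072-dimensional vectors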
So far we have only created the embedding client; what we are really interested in is creating vector embeddings of the splits produced in the chunking stage and loading them into a vector database.
To achieve this, we will be using Qdrant, which is open source and has a UI where we can see the collections of embeddings.
You can have Qdrant running locally, or you can use Docker Desktop and run the Qdrant image, which is easily available on Docker Hub.
We can create the docker-compose file below and run the Qdrant image with the service name vector-db:
services:
  vector-db:
    image: qdrant/qdrant
    ports:
      - 6333:6333
Start the service in the background; the dashboard is then accessible at http://localhost:6333/dashboard#/welcome:
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG % docker-compose -f docker-compose.yaml up -d
[+] Running 1/1
✔ Container 05-rag-vector-db-1 Started 0.1s
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG %
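You can also verify the server from the terminal (assuming curl is available; GET /collections is part of Qdrant's REST API):
curl http://localhost:6333/collections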
Now we have Qdrant running; to store data in it, we can again use LangChain:
https://qdrant.tech/documentation/frameworks/langchain/
Install the LangChain Qdrant integration and import QdrantVectorStore to store vector embeddings in the database.
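Per the linked documentation, the install and import look like this:
pip install langchain-qdrant

from langchain_qdrant import QdrantVectorStore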
As this is a learning blog, we will simply use the local Qdrant instance we just started as our data store.
# Using the embedding model, create embeddings of the splits and store them in the vector database.
vector_store = QdrantVectorStore.from_documents(
    documents=splitted_docs,
    url="http://localhost:6333",
    collection_name="vector_learning",
    embedding=embeddings
)
print("Indexing of splited document is completed")
Output:
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG % /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/.venv/bin/python /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/my_rag.py
Ignoring wrong pointing object 268 0 (offset 0)
Ignoring wrong pointing object 309 0 (offset 0)
texts page_content='Lesson 4: Why Should I Use Node.js? ........................................................................................... 9
Lesson 5: Your First Node.js Script ................................................................................................ 9
Section 3: Node.js Module System........................................................................................ 10
Lesson 1: Section Intro ...................................................................................................................... 10
Lesson 2: Importing Node.js Core Modules ............................................................................... 10
Lesson 3: Importing Your Own Files .............................................................................................. 11
Lesson 4: Importing npm Modules ................................................................................................ 12' metadata={'producer': 'macOS Version 10.14.1 (Build 18B75) Quartz PDFContext', 'creator': 'Acrobat PDFMaker 17 for Word', 'creationdate': "D:20190227140340Z00'00'", 'author': 'Andrew Mead', 'moddate': "D:20190227140340Z00'00'", 'source': '/Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/nodejs.pdf', 'total_pages': 125, 'page': 1, 'page_label': '2'}
Indexing of splited document is completed
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG %
We can also see the collection in the vector DB, where each vector is an array of 3072 dimensions.
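As a quick check from Python (a sketch using the qdrant-client package; the exact attribute path can differ slightly between client versions):
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")
info = qdrant.get_collection("vector_learning")
print(info.points_count)                 # number of stored chunks
print(info.config.params.vectors.size)   # 3072 for text-embedding-3-large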
Now we are done with creating embeddings and storing data in the vector DB. Next, let's call the OpenAI API to interact via the command line.
What we are doing below:
1. Embedding model: uses OpenAI's "text-embedding-3-large" model to convert text into vector embeddings.
2. Qdrant vector store connection: connects to the existing Qdrant collection "vector_learning" on the local server (http://localhost:6333).
3. User input: prompts the user to enter a query.
4. Vector search: performs a similarity search in the vector database using the embedded version of the user's query.
from dotenv import load_dotenv
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings
from openai import OpenAI
import json

load_dotenv()
client = OpenAI()

# Vector embeddings
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

vector_db = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="vector_learning",
    embedding=embedding_model
)

# Take user query
query = input("> ")

# Vector similarity search in the existing vector DB
search_results = vector_db.similarity_search(
    query=query
)
# print(search_results)

# Convert results to a JSON-serializable format
formatted_results = []
for result in search_results:
    formatted_results.append({
        "content": result.page_content,
        "metadata": result.metadata
    })

# Print results in JSON format
print(json.dumps(formatted_results, indent=2))
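By default, similarity_search returns the top 4 matching chunks; you can pass k to change that, e.g. vector_db.similarity_search(query=query, k=2).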
If we execute this code, we can see results pertinent to the user query, with page details.
Example: "What is a callback function in NodeJS?"
venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG % /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/.venv/bin/python /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/chat.py
> what is promise callback function in NodeJS?
[
{
"content": "Version 1.0 31 \nCallback functions are at the core of asynchronous development. When you perform an \nasynchronous operation, you\u2019ll provide Node with a callback function. Node will then call \nthe callback when the async operation is complete. This is how you get access to the \nresults of the async operation, whether it\u2019s an HTTP request for JSON data or a query to a \ndatabase for a user\u2019s profile. \nThe example below shows how you can use the callback pattern in your own code. The \ngeocode function is set up to take in two arguments. The first is the address to geocode. \nThe second is the callback function to run when the geocoding process is complete. The \nexample below simulates this request by using setTimeout to make the process \nasynchronous. \nconst geocode = (address, callback) => { \n setTimeout(() => { \n const data = { \n latitude: 0, \n longitude: 0 \n } \n \n callback(data) \n }, 2000) \n} \n \ngeocode('Philadelphia', (data) => {",
"metadata": {
"producer": "macOS Version 10.14.1 (Build 18B75) Quartz PDFContext",
"creator": "Acrobat PDFMaker 17 for Word",
"creationdate": "D:20190227140340Z00'00'",
"author": "Andrew Mead",
"moddate": "D:20190227140340Z00'00'",
"source": "/Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/nodejs.pdf",
"total_pages": 125,
"page": 30,
"page_label": "31",
"_id": "200a978f-c542-4ae4-972c-448e4739bb6c",
"_collection_name": "vector_learning"
}
},
{
"content": "Version 1.0 32 \nCallback Abstraction \nImagine you want to geocode an address from multiple places in your application. You \nhave two options. Option one, you can duplicate the code responsible for making the \nrequest. This includes the call to request along with all the code responsible for handling \nerrors. However, this isn\u2019t ideal. Duplicating code makes your application unnecessarily \ncomplex and difficult to maintain. The solution is to create a single reusable function that \ncan be called whenever you need to geocode an address. \nYou can see an example of this below. The function geocode was created to serve as a \nreusable way to geocode an address. It contains all the logic necessary to make the \nrequest and process the response. geocode accepts two arguments. The first is the \naddress to geocode. The second is a callback function which will run once the geocoding \noperation is complete. \nconst request = require('request') \n \nconst geocode = (address, callback) => {",
"metadata": {
"producer": "macOS Version 10.14.1 (Build 18B75) Quartz PDFContext",
"creator": "Acrobat PDFMaker 17 for Word",
"creationdate": "D:20190227140340Z00'00'",
"author": "Andrew Mead",
"moddate": "D:20190227140340Z00'00'",
"source": "/Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/nodejs.pdf",
"total_pages": 125,
"page": 31,
"page_label": "32",
"_id": "3d26dafa-6f2d-4c1e-99af-e807193fe153",
"_collection_name": "vector_learning"
}
},
{
"content": "for working with promises. You\u2019ll be able to write complex asynchronous code that looks \nlike normal synchronous code. This makes it much easier to write and maintain \nasynchronous code. \nExploring Async/Await \nThe example below uses the add function that was created two lessons ago. \nThe first step to using async and await is to create an asynchronous function. This is done \nusing the async keyword before the function definition. This can be seen in the definition \nof doWork below. Any function can be defined as an asynchronous function, not just arrow \nfunctions. \nWith an async function in place, you can now use the await operator. The await operator \ncan only be used inside of asynchronous functions. This removes the need for excess \ncallbacks and makes code much easier to read. \nThe await operator is used with promises in asynchronous functions. You can see this \nused three times in doWork. The await operator allows you to work with promises in a way",
"metadata": {
"producer": "macOS Version 10.14.1 (Build 18B75) Quartz PDFContext",
"creator": "Acrobat PDFMaker 17 for Word",
"creationdate": "D:20190227140340Z00'00'",
"author": "Andrew Mead",
"moddate": "D:20190227140340Z00'00'",
"source": "/Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/nodejs.pdf",
"total_pages": 125,
"page": 72,
"page_label": "73",
"_id": "32dcd297-b620-43a7-9c7c-092786bcf48a",
"_collection_name": "vector_learning"
}
},
{
"content": "} else if (response.body.features.length === 0) { \n console.log('Unable to find location. Try another search.') \n } else { \n const latitude = response.body.features[0].center[0] \n const longitude = response.body.features[0].center[1] \n console.log(latitude, longitude) \n } \n}) \nLesson 8: The Callback Function \nA callback function is a function that\u2019s passed as an argument to another function. That\u2019s it. \nThis is something you\u2019ve used before, and in this lesson, you\u2019ll dive a bit deeper into how \nthey work. \nThe Callback Function \nA callback function is a function that\u2019s passed as an argument to another function. Imagine \nyou have FunctionA which gets passed as an argument to FunctionB. FunctionB will do \nsome work and then call FunctionA at some point in the future.",
"metadata": {
"producer": "macOS Version 10.14.1 (Build 18B75) Quartz PDFContext",
"creator": "Acrobat PDFMaker 17 for Word",
"creationdate": "D:20190227140340Z00'00'",
"author": "Andrew Mead",
"moddate": "D:20190227140340Z00'00'",
"source": "/Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/nodejs.pdf",
"total_pages": 125,
"page": 29,
"page_label": "30",
"_id": "22efa58f-dd58-483d-a650-dce0f8a212f9",
"_collection_name": "vector_learning"
}
}
]
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG %
If you open the PDF file manually, you can see similar data, e.g., "page": 30.
Now, let's create a SYSTEM_PROMPT so the LLM answers based on our prompt, where the context (page content, page number, and file location) is assembled from the search results using a list comprehension.
context = "\n\n\n".join([
    f"Page Content: {result.page_content}\nPage Number: {result.metadata['page_label']}\nFile Location: {result.metadata['source']}"
    for result in search_results
])

SYSTEM_PROMPT = f"""
You are a helpful AI assistant who answers the user's query based on the available context
retrieved from a PDF file, along with page contents and page numbers.
You should only answer the user based on the following context, and navigate the user
to open the right page number to know more.

Context:
{context}
"""

chat_completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]
)
print(f"🤖: {chat_completion.choices[0].message.content}")
Now, if we run chat.py, we get a crisp and clear answer about the callback function in Node.js, in a consolidated format.
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG % /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/.venv/bin/python /Users/kimu-duru/Documents/python_scripting_udemy_narendra/genai-cohort-2/05-RAG/chat.py
> what is callback function in NodeJS?
🤖: A callback function in NodeJS is a function that’s passed as an argument to another function. Node uses callback functions heavily, especially for handling asynchronous operations. For example, when you perform an asynchronous operation like an HTTP request or a database query, you provide a callback function that Node will call once the operation is complete. This allows you to work with the results of the async operation.
For more details and examples, please refer to page 30 of the document.
.venv(base) kimu-duru@Dhandes-MacBook-Pro 05-RAG %
Conclusion
In this blog, we explored the foundational concepts and practical implementation of Retrieval-Augmented Generation (RAG)—a powerful technique that bridges information retrieval with the generative capabilities of large language models. RAG empowers AI to provide more accurate, context-aware, and project-specific responses by retrieving relevant data and feeding it directly into the generation pipeline.
We walked through:
What RAG is and why it's essential.
The complete RAG architecture and workflow—from user query to vector search and LLM output.
A hands-on example using Node.js documentation, LangChain, Qdrant vector database, and OpenAI embeddings to build a working RAG pipeline.
By combining the strengths of semantic search with generative AI, RAG allows developers to customize knowledge domains, scale access to large document sets, and deliver answers grounded in real data. Whether you're building internal knowledge tools or domain-specific assistants, RAG is a cornerstone for building intelligent, context-rich systems.
#chaicode #RAG #genai