RAG: A Beginner's Guide to Retrieval-Augmented Generation


Have you ever asked a chatbot a question, only to get an outdated or made-up answer? Large Language Models (LLMs) like ChatGPT, Gemini, and Claude are very smart, but sometimes they're not up to date.
Retrieval-Augmented Generation (RAG) solves this problem by combining LLMs with real-time data retrieval, helping AI give more accurate and current responses.
What is RAG?
RAG is a powerful technique that improves the capabilities of LLMs by fetching relevant information from an external knowledge source (like a database or the web) before generating a response. This integration allows the model to provide answers grounded in real-time data, enhancing the reliability and accuracy of its outputs.
Think of it like a student (the LLM) who doesn’t just rely on memory but also looks up facts in a textbook (the external database) before answering a question.
How Does RAG Work?
RAG follows a simple 3-step process:
Retrieval: Find Relevant Information. When a user asks a question, RAG searches a knowledge base (such as a vector database) for relevant documents. It uses semantic search to find the best content, not just keyword matching.
Augmentation: Add the Retrieved Data to the Question. The retrieved documents are combined with the user's original question, giving the LLM more context.
Generation: Create a Better Answer. The LLM then generates a response using both its pre-trained knowledge and the new data, leading to more accurate, fact-based answers.
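Putting the three steps together, here is a minimal, purely illustrative sketch of the flow in plain Python. The hard-coded knowledge base, the retrieve function, and the final string are placeholders, not real library calls; only the overall retrieve-augment-generate shape is the point.

knowledge_base = [
    "RAG combines retrieval with generation.",
    "Vector databases store embeddings for fast similarity search.",
]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    # Placeholder retrieval: a real system ranks chunks by semantic similarity.
    return knowledge_base[:top_k]

def answer_with_rag(question: str) -> str:
    # 1. Retrieval: look up relevant chunks for the question
    context_chunks = retrieve(question)
    # 2. Augmentation: merge the retrieved text with the original question
    prompt = "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {question}"
    # 3. Generation: a real system would send this prompt to an LLM
    return f"(LLM would answer using this prompt)\n{prompt}"

print(answer_with_rag("What does RAG do?"))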
Working of RAG
Document: The process starts with a document or a group of documents that hold the information the system will use.
Chunking Process: These documents are divided into smaller parts called chunks to make them easier to manage and search through efficiently.
Embedding Model: Each chunk is passed through an embedding model, which converts the text into high-dimensional vectors that represent semantic meaning.
Vector Store (Database): The embeddings are stored in a vector database (like Pinecone, Chroma, Weaviate, Qdrant), which allows fast and accurate similarity-based retrieval.
User Prompt: A user submits a query or prompt. This input is also embedded using the same embedding model.
Retriever: The system compares the embedded user query with the stored vectors and retrieves the most relevant chunks from the database (a small similarity sketch follows these steps).
LLM (Large Language Model): The retrieved chunks are sent to the LLM along with the original prompt. This is the prompt augmentation phase, where external knowledge is combined with the question.
Response Output: The LLM generates an answer using both the query and the retrieved context. This output is then presented to the user.
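To make the retrieval step concrete, here is a toy sketch of the comparison a retriever performs: cosine similarity between a query vector and stored chunk vectors. The numbers and chunk names below are made up; in practice the vectors come from the embedding model and have hundreds or thousands of dimensions.

import numpy as np

# Toy vectors standing in for real embeddings
query_vec = np.array([0.1, 0.7, 0.2])
chunk_vecs = {
    "chunk about refund policy": np.array([0.2, 0.6, 0.2]),
    "chunk about shipping times": np.array([0.9, 0.1, 0.0]),
}

def cosine_similarity(a, b):
    # Higher value = more semantically similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank chunks by similarity to the query; the top results become the
# context that gets passed to the LLM.
ranked = sorted(
    chunk_vecs.items(),
    key=lambda item: cosine_similarity(query_vec, item[1]),
    reverse=True,
)
print(ranked[0][0])  # most relevant chunk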
Here is a simple RAG pipeline built with LangChain:
import os
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the source document
pdf_path = Path(__file__).parent / "file_name.pdf"
loader = PyPDFLoader(file_path=str(pdf_path))
docs = loader.load()

# 2. Chunk the document into smaller, overlapping pieces
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
split_docs = text_splitter.split_documents(documents=docs)

# 3. Embed the chunks and store them in a Qdrant collection
embedder = OpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key=os.getenv("OPENAI_API_KEY"),
)
vector_store = QdrantVectorStore.from_documents(
    documents=split_docs,
    url="http://localhost:6333",
    collection_name="collection_name",
    embedding=embedder,
)
print("Ingestion done")

# 4. Retrieve the chunks most similar to the user's query
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="collection_name",
    embedding=embedder,
)
relevant_chunks = retriever.similarity_search(
    query="user_query",  # placeholder for the actual user question
)

# 5. Augment the prompt with the retrieved context
context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
SYSTEM_PROMPT = f"""
You are a helpful assistant that responds based on the given context.
Context:
{context}
"""
Conclusion
RAG is a breakthrough for AI applications, making LLMs smarter by combining their reasoning with real-world data. Whether you're creating a chatbot, research tool, or business assistant, RAG helps provide accurate, current, and reliable answers.
Next Steps
Experiment with LangChain (a popular RAG framework).
Try different embedding models (OpenAI, Hugging Face).
Explore vector databases (Pinecone, Qdrant, Weaviate, FAISS).