Introduction to Retrieval-Augmented Generation (RAG)


Overview
Retrieval‑Augmented Generation (RAG) is a way to combine the strengths of information retrieval (search) with large language models (LLMs) so that your model can answer questions grounded in external documents, even if it hasn’t “seen” them during training.
With RAG, you can upload documents, which are then processed into searchable chunks. When you ask a question, the system retrieves the most relevant chunks from your uploaded documents and combines them with your query before generating an answer.
Why RAG?
Limitation of pure LLMs: LLMs are trained on vast text corpora, but they can't store every fact in their parameters. They may hallucinate (make up) details or be out of date.
Limitation of classic search: Traditional search engines can retrieve relevant documents, but they don't generate fluent, conversational answers.
RAG = Retrieval + Generation
Core Components
Document Collection: A set of texts you want to answer questions over, e.g. technical manuals, news articles, product FAQs.
Embedding Model: A neural network that turns any text (queries and documents) into a fixed-length vector in a "semantic" space.
Vector Store (Index): A database of document embeddings that supports fast similarity search (e.g. FAISS, Pinecone, Weaviate, Qdrant).
Retriever: Given a user query, embeds it, then finds the top-K most similar document chunks from the vector store.
Prompt Template / Combiner: A prompt that combines the user question and the retrieved snippets into a single context for the LLM.
Generator (LLM): The language model (e.g. GPT-4, Llama) that ingests the prompt and produces the final answer.
+------------------+
|   User's Query   |
+------------------+
         |
         v
+-------------------------------+       +---------------------------+
|  Retrieval                    |<----->|  External Knowledge Base  |
|  (search over external data)  |       |  (indexed documents)      |
+-------------------------------+       +---------------------------+
         |
         |  relevant documents
         v
+------------------------+
|  Retrieved Documents   |
+------------------------+
         |
         |  combined query + context
         v
+------------------------+
|   Language Model       |
|   Generation           |-----> Output Answer
+------------------------+
Workflow
User's Query: You ask a question.
Retrieval (Search over External Data): Your query is used to search through an external knowledge base (like a collection of documents).
External Knowledge Base: This is where your documents or other external information are stored and indexed for searching.
Retrieved Documents: The retrieval process identifies and pulls out the most relevant documents or chunks of information related to your query.
Combined Query + Context: The original query is combined with the content from the retrieved documents. This provides the language model with the necessary context.
Language Model Generation: The combined information is fed into a large language model (a minimal code sketch of these last steps follows this list).
Output Answer: The language model generates an answer that is informed by both its pre-existing knowledge and the information retrieved from the external knowledge base.
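Put together, the workflow is only a few lines of code. The sketch below is a minimal, framework-free illustration of the loop; retrieve and llm are hypothetical placeholders standing in for your vector-store search and your model call, not real library APIs.

# Minimal sketch of the RAG loop. `retrieve` and `llm` are hypothetical
# placeholders for your own vector-store search and model call.

def retrieve(question: str, top_k: int = 4) -> list[str]:
    # Placeholder: a real system queries the vector store here.
    return ["<relevant chunk 1>", "<relevant chunk 2>"]

def llm(prompt: str) -> str:
    # Placeholder: a real system calls the language model here.
    return "<answer grounded in the supplied context>"

def answer(question: str) -> str:
    # 1. Retrieval: find the chunks most similar to the question.
    chunks = retrieve(question, top_k=4)

    # 2. Combine query + context into a single prompt.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generation: the LLM produces the final, grounded answer.
    return llm(prompt)

print(answer("What is the FS module?"))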
Benefits of RAG
Up-to-Date: You can swap in fresh documents at any time; no retraining required.
Fact-Grounded: Reduces hallucinations by forcing the LLM to cite real snippets.
Scalable: Handles huge corpora, because retrieval narrows the context down to a handful of relevant chunks before generation.
From PDF to Answer
Loading the PDF
from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF that sits next to this script.
pdf_path = Path(__file__).parent / "nodejs.pdf"
loader = PyPDFLoader(file_path=str(pdf_path))  # str() keeps older loader versions happy
docs = loader.load()
PyPDFLoader reads your PDF and represents it as one or more Document objects (with .page_content and metadata such as the source file and page number).
After loader.load(), docs is a list of Document objects, one per page of the PDF.
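A quick sanity check shows what the loader produced; the snippet below only touches the standard Document attributes (.page_content and .metadata) on the docs list from above.

# Inspect what the loader produced.
print(len(docs))                     # number of pages loaded
print(docs[0].metadata)              # e.g. source path and page number
print(docs[0].page_content[:200])    # first 200 characters of page 1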
Splitting the Chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=200,  # characters shared between adjacent chunks
)
split_docs = text_splitter.split_documents(documents=docs)
Why chunk? Embedding models and LLMs have limited input sizes, and retrieval works better over short, focused passages than over whole pages.
chunk_size=1000 is measured in characters (roughly 150-250 words), so each piece stays well within the embedding model's limits.
chunk_overlap=200 means adjacent chunks share 200 characters, so you don't accidentally cut a definition in half.
split_docs is now a list of many small Document objects, each at most ~1,000 characters long.
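The overlap is easy to see by printing the boundary between two neighbouring chunks; this snippet only inspects the split_docs list built above.

# How many chunks did we get, and how long are they?
print(len(docs), "pages ->", len(split_docs), "chunks")
print(max(len(d.page_content) for d in split_docs))  # usually <= chunk_size

# The tail of one chunk reappears at the head of the next (the 200-character overlap).
print(split_docs[0].page_content[-100:])
print(split_docs[1].page_content[:100])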
Creating Embeddings
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(
    model="text-embedding-3-large",
    # The API key is read from the OPENAI_API_KEY environment variable
    # (or a secret manager); never hard-code it here.
)
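To confirm the embedder is wired up, you can embed a single string; embed_query is part of LangChain's standard embeddings interface, and text-embedding-3-large produces 3072-dimensional vectors by default.

# Embed one string and check the vector it produces.
vector = embedder.embed_query("What is the FS module in Node.js?")
print(len(vector))   # 3072 for text-embedding-3-large
print(vector[:5])    # first few components of the embedding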
Pushing into Qdrant
from langchain_qdrant import QdrantVectorStore

# Create (or connect to) the collection, then push the chunks into it.
# Assumes a Qdrant server running locally on port 6333.
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    collection_name="learning_langchain",
    embedding=embedder,
)
vector_store.add_documents(documents=split_docs)
print("Ingestion Done")
from_documents creates (and indexes) a new Qdrant collection called learning_langchain.
add_documents pushes your split_docs into Qdrant, computing embeddings under the hood.
Once this has run once, your PDF's text is stored in Qdrant for fast retrieval, and the print("Ingestion Done") line simply tells you indexing has finished.
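If you want to double-check the ingestion outside LangChain, the qdrant-client library can report how many points the collection holds; this assumes the same local Qdrant instance on port 6333.

from qdrant_client import QdrantClient

# Ask Qdrant directly how many vectors landed in the collection.
client = QdrantClient(url="http://localhost:6333")
info = client.get_collection("learning_langchain")
print(info.points_count, "chunks stored")  # should match len(split_docs)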
Loading the Vector Store & Searching
# Reconnect to the existing collection instead of re-indexing.
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="learning_langchain",
    embedding=embedder,
)

# Embed the query and pull back the most similar chunks.
search_result = retriever.similarity_search(
    query="What is FS Module?"
)
print("Relevant Chunks", search_result)
from_existing_collection re-uses the "learning_langchain" collection you built earlier instead of creating a new one.
similarity_search embeds your query "What is FS Module?" and finds the top-K (default K=4) document chunks whose embeddings are closest to it in vector space.
search_result is a list of Document objects (with .page_content), ready to feed into your LLM.
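The walkthrough stops at retrieval, but closing the loop only takes a prompt and a chat model. The sketch below combines the retrieved chunks with the question and calls ChatOpenAI from langchain_openai; the prompt wording and the gpt-4o model name are illustrative choices, not something the steps above prescribe.

from langchain_openai import ChatOpenAI

# Stitch the retrieved chunks into a single context block.
context = "\n\n".join(doc.page_content for doc in search_result)

prompt = (
    "Answer the question using only the context below. "
    "If the context is not enough, say so.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What is FS Module?"
)

# Any chat model works here; gpt-4o is just an example.
llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke(prompt)
print(response.content)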
Summary
RAG is important because it grounds LLM outputs in real-world data, reduces hallucinations, and enables applications that require up-to-date or domain-specific knowledge, significantly expanding the utility and reliability of these models.