🚀 Introduction to Retrieval-Augmented Generation (RAG) with LangChain and Qdrant

Sudeep Varma
4 min read

In the world of generative AI, especially LLMs like GPT, one key limitation is their reliance on pre-trained knowledge. When you need real-time, domain-specific, or up-to-date information, Retrieval-Augmented Generation (RAG) provides a powerful architecture that bridges this gap.

This blog explores:

  • What is RAG?

  • Why RAG is needed

  • A step-by-step implementation using LangChain

  • PDF file ingestion and Qdrant vector storage


📘 What is RAG?

RAG stands for Retrieval-Augmented Generation. It’s a hybrid AI approach that combines information retrieval with text generation, and a typical RAG system is built in two stages:

  1. Indexing: a pipeline for ingesting data from a source and indexing it. This usually happens offline.

  2. Retrieval and generation: the actual RAG chain, which takes the user query at run time, retrieves the relevant data from the index, and passes it to the model along with the query so it can generate an answer.


❓ Why Do We Need RAG?

While LLMs are excellent at language generation, they do not have access to new or private information unless it was part of their training data.

Here’s why RAG is crucial:

  • Up-to-date answers: It pulls current data from your documents or DB.

  • Private data access: Enables responses based on internal PDFs, reports, or notes.

  • Reduced hallucinations: The LLM bases its answers on real content, not guesses.

  • Domain adaptation: Perfect for specialized industries—law, healthcare, education, etc.


🧭 Phases of RAG: Indexing and Retrieval

RAG operates in two main phases: Indexing and Retrieval.

📦 Phase 1: Indexing (Preprocessing)

Indexing happens before the user makes a query.

🔍 What happens during indexing?

  • Documents (e.g., PDFs, web pages, etc.) are loaded and split into chunks.

  • Each chunk is converted into embeddings using an embedding model.

  • These embeddings are stored in a vector database (like Qdrant).

🧠 Why Indexing is Important

Indexing is the foundation of RAG. Here's why it matters:

  • 🚀 Fast Retrieval: Without indexed data, real-time semantic search isn't feasible.

  • 🎯 Accurate Context: Clean, chunked, and embedded content ensures better query matching.

  • 🔁 Reusability: Once indexed, the data can be queried unlimited times without reprocessing.

  • 📊 Scalability: Enables handling large corpora of documents efficiently.

🔎 Phase 2: Retrieval (At Query Time)

Retrieval kicks in when a user enters a question.

⚙️ Steps in Retrieval:

  • The query is converted into an embedding vector.

  • The system searches the vector store (e.g., Qdrant) for semantically similar chunks.

  • These chunks are then passed to the LLM, which generates a final answer.



⚙️ Step-by-Step: RAG with LangChain, PDF Loader & Qdrant

Let’s break down the implementation.


🧱 1. Prerequisites

Install the necessary packages:

pip install langchain langchain-community langchain-openai langchain-qdrant qdrant-client pypdf python-dotenv

📄 2. Load the PDF with LangChain’s PyPDFLoader

Use LangChain’s built-in PyPDFLoader to extract the content:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example.pdf")
documents = loader.load()

This returns a list of Document objects (one per page), each containing the page text plus metadata such as the source file and page number.
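
To get a feel for what the loader produced, it helps to inspect the first Document before going further (a quick sanity check, assuming example.pdf exists in the working directory):

# Quick sanity check on the loaded documents
print(len(documents))                    # number of pages loaded
print(documents[0].metadata)             # e.g. {'source': 'example.pdf', 'page': 0, ...}
print(documents[0].page_content[:200])   # first 200 characters of the first page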

✂️ 3. Split Text into Chunks

LangChain recommends splitting documents into manageable chunks for embedding:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
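
Chunk size and overlap are worth tuning per document set; before paying for embeddings, a quick look at the result can confirm the settings are reasonable:

# Inspect the chunking result before embedding
print(f"Split {len(documents)} pages into {len(chunks)} chunks")
print(chunks[0].page_content[:300])   # preview the first chunk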

🔗 4. Embed the Chunks

Next we define the embedding model that will turn each chunk into a vector. This example uses OpenAI’s text-embedding-3-large model (make sure OPENAI_API_KEY is set in your environment):

from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-large"
)
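
If you want to see what the embedding model actually produces, you can embed a sample string directly (this assumes OPENAI_API_KEY is set; text-embedding-3-large returns 3072-dimensional vectors):

# Each chunk (and later, each query) becomes a dense vector of floats
sample_vector = embedding_model.embed_query("What is RAG?")
print(len(sample_vector))   # 3072 for text-embedding-3-large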

🗃️ 5. Store Embeddings in Qdrant

Qdrant is a high-performance vector database for storing embeddings.
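
The code below assumes a Qdrant instance is already running at http://localhost:6333. If you don’t have one, the quickest way to start it locally is the official Docker image, e.g. docker run -p 6333:6333 qdrant/qdrant.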

from langchain_qdrant import QdrantVectorStore

# Store vectors in Qdrant
vector_store = QdrantVectorStore.from_documents(
    documents=chunks,
    url="http://localhost:6333",
    collection_name="learning_vectors",
    embedding=embedding_model
)

print("Indexing of Documents Done...")

🤖 6. Query with RAG Pipeline

Now we retrieve from Qdrant and generate answers using an LLM:

from dotenv import load_dotenv
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings
from openai import OpenAI

load_dotenv()

client = OpenAI()

# Vector Embeddings
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-large"
)

vector_db = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="learning_vectors",
    embedding=embedding_model
)

# Take User Query
query = input("> ")

# Vector Similarity Search [query] in DB
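# similarity_search embeds the query and returns the top-k most similar chunks (k defaults to 4)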
search_results = vector_db.similarity_search(
    query=query
)

context = "\n\n\n".join(
    [
        f"Page Content: {result.page_content}\n"
        f"Page Number: {result.metadata['page_label']}\n"
        f"File Location: {result.metadata['source']}"
        for result in search_results
    ]
)

SYSTEM_PROMPT = f"""
    You are a helpful AI Assistant who answers the user's query based on the available context
    retrieved from a PDF file, along with the page contents and page numbers.

    You should only answer the user based on the following context, and point the user
    to the right page number to read more.

    Context:
    {context}
"""

chat_completion = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        { "role": "system", "content": SYSTEM_PROMPT },
        { "role": "user", "content": query },
    ]
)

print(f"🤖: {chat_completion.choices[0].message.content}")