Introduction to Retrieval-Augmented Generation (RAG)

Ashish Raut

Overview

Retrieval‑Augmented Generation (RAG) is a way to combine the strengths of information retrieval (search) with large language models (LLMs) so that your model can answer questions grounded in external documents, even if it hasn’t “seen” them during training.

With Retrieval-Augmented Generation (RAG), you can upload documents, which are then processed into searchable chunks. When you ask a question, the system retrieves the most relevant chunks from your uploaded documents and combines them with your query.

Why RAG?

Limitations of pure LLMs -

LLMs are trained on vast text corpora, but they can’t store every fact in their parameters. They may hallucinate (make up) details or be out of date.

Limitations of classic search -

Traditional search engines can retrieve relevant documents, but don’t generate fluent, conversational answers.

RAG = Retrieval + Generation

💡
First retrieve relevant bits of text, then let the LLM generate an answer grounded in those bits.

Core Components

  • Document Collection
    A set of texts you want to answer questions over: e.g. technical manuals, news articles, product FAQs.

  • Embedding Model
    A neural network that turns any text (queries and documents) into a fixed‑length vector in a “semantic” space.

  • Vector Store (Index)
    A database of document embeddings that supports fast similarity search (e.g. FAISS, Pinecone, Weaviate).

  • Retriever
    Given a user query, embeds it, then finds the top‑K most similar document chunks from the vector store.

  • Prompt Template / Combiner
    A prompt that combines user question + retrieved snippets into a single context for the LLM.

  • Generator (LLM)
    The language model (e.g. GPT‑4, Llama) that ingests the prompt and produces the final answer.

+-----------------+      +-----------------+
| User's Query    |----->| Retrieval       |
+-----------------+      | (Search over    |
                          | External Data)  |
                          +-----------------+
                                 ^ |
                                 | | (Relevant
                                 | | Documents)
                                 | v
+-----------------+      +-----------------+
| Retrieved       |<-----| External        |
| Documents       |      | Knowledge Base  |
+-----------------+      +-----------------+
          ^ |
          | | (Combined Query + Context)
          | v
+-----------------+
| Language Model  |
| Generation      |-----> Output Answer
+-----------------+

Workflow

  • User's Query: You ask a question.

  • Retrieval (Search over External Data): Your query is used to search through an external knowledge base (like a collection of documents).

  • External Knowledge Base: This is where your documents or other external information are stored and indexed for searching.

  • Retrieved Documents: The retrieval process identifies and pulls out the most relevant documents or chunks of information related to your query.

  • Combined Query + Context: The original query is combined with the content from the retrieved documents. This provides the language model with the necessary context.

  • Language Model Generation: The combined information is fed into a large language model.

  • Output Answer: The language model generates an answer that is informed by both its pre-existing knowledge and the information retrieved from the external knowledge base.
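
The workflow above can be condensed into a few lines of Python. This is a minimal, framework‑agnostic sketch: vector_store and llm stand in for whatever vector database and LLM client you use, and the similarity_search / invoke calls assume a LangChain‑style interface.

def answer_with_rag(question, vector_store, llm):
    # 1. Retrieval: fetch the chunks most similar to the question
    chunks = vector_store.similarity_search(question, k=4)

    # 2. Combine: build a single prompt from the question and retrieved context
    context = "\n\n".join(chunk.page_content for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generation: let the LLM produce an answer grounded in that context
    return llm.invoke(prompt)

Everything that follows in this post is a concrete version of exactly this loop, built with LangChain, OpenAI embeddings, and Qdrant.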

Benefits of RAG

  • Up‑to‑Date
    You can swap in fresh documents at any time—no retraining required.

  • Fact‑Grounded
    Reduces hallucinations by grounding the LLM's answers in real retrieved snippets.

  • Scalable
    Handles huge corpora because retrieval narrows the context down to a handful of relevant chunks before generation.

From PDF to Answer

💡
Ask Your PDF Anything: A RAG-Based QA System with LangChain.

Loading the PDF

from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader

# Path to the PDF sitting next to this script
pdf_path = Path(__file__).parent / "nodejs.pdf"

# Load the PDF into LangChain Document objects
loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()

PyPDFLoader reads your PDF and represents it as one or more Document objects (with .page_content and metadata such as the source file and page number).

After loader.load(), docs is a list of Document objects, one per page of the PDF.
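
A quick sanity check (assuming the PDF loaded correctly and docs is non‑empty) is to print how many pages were loaded and peek at the first one:

# Inspect the loaded documents (assumes docs is non-empty)
print(len(docs), "pages loaded")
print(docs[0].metadata)             # source file and page number
print(docs[0].page_content[:200])   # first 200 characters of the first page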

Splitting into Chunks

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split each page into overlapping ~1,000-character chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
split_docs = text_splitter.split_documents(documents=docs)

Why chunk? Embedding models and LLMs can only handle a limited amount of text at once, and smaller chunks make retrieval more precise.

chunk_size=1000 characters (roughly 150–200 words) ensures each piece isn’t too long for your embedding model.

chunk_overlap=200 means adjacent chunks share 200 characters, so you don’t accidentally cut a definition in half.

split_docs is now a list of many small Document objects, each ~1k characters long.
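
You can verify the splitter did what you expect, for example by checking the number of chunks and the length of the longest one (remember that chunk_size here is measured in characters):

# Sanity-check the split
print(len(split_docs), "chunks created")
print(max(len(d.page_content) for d in split_docs), "characters in the longest chunk")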

Creating Embeddings

from langchain_openai import OpenAIEmbeddings

# OpenAIEmbeddings reads the OPENAI_API_KEY environment variable by default;
# never hard-code your API key in source code.
embedder = OpenAIEmbeddings(
    model="text-embedding-3-large",
)
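
To see what the embedder actually produces, you can embed a short string yourself; embed_query returns a plain list of floats whose length depends on the chosen model. A minimal check, assuming OPENAI_API_KEY is set in your environment:

# Embed a sample query and inspect the resulting vector
vec = embedder.embed_query("What is the FS module in Node.js?")
print(len(vec), "dimensions")   # vector length depends on the embedding model
print(vec[:5])                  # first few components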

Pushing into Qdrant

from langchain_qdrant import QdrantVectorStore

# Assumes a Qdrant instance is running locally on port 6333
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    collection_name="learning_langchain",
    embedding=embedder
)

# Embed and index the chunks
vector_store.add_documents(documents=split_docs)
print("Ingestion Done")

from_documents - creates (and indexes) a new Qdrant collection called learning_langchain; passing an empty documents list just creates the collection without adding anything yet.

add_documents - pushes your split_docs into Qdrant, computing embeddings under the hood.

After this script runs once, your PDF’s text is stored in Qdrant, ready for fast retrieval.

The print("Injection Done") lets you know indexing finished.

Loading the Vector Store & Searching

# Reconnect to the existing collection (no re-indexing needed)
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="learning_langchain",
    embedding=embedder
)

# Embed the query and fetch the most similar chunks
search_result = retriever.similarity_search(
    query="What is FS Module?"
)
print("Relevant Chunks:", search_result)

from_existing_collection - re‑uses the "learning_langchain" collection you built earlier.

similarity_search -

  • Embeds your query "What is FS Module?".

  • Finds the top‑K (default K=4) document chunks whose embeddings are closest in vector space.

search_result is a list of Document objects (with .page_content), ready to feed into your LLM.
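
The walkthrough above stops at retrieval. To complete the RAG loop, combine the retrieved chunks with the question in a prompt and hand it to a chat model. Here is a minimal sketch using ChatOpenAI; the model name and prompt wording are only examples, and it assumes OPENAI_API_KEY is set in your environment:

from langchain_openai import ChatOpenAI

# Build a context block from the retrieved chunks
context = "\n\n".join(doc.page_content for doc in search_result)

prompt = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say so.\n\n"
    f"Context:\n{context}\n\n"
    "Question: What is FS Module?"
)

llm = ChatOpenAI(model="gpt-4o-mini")   # example model; any chat model works
response = llm.invoke(prompt)
print(response.content)

This is exactly the "Combined Query + Context → Language Model Generation → Output Answer" step from the workflow diagram, now grounded in the chunks pulled from Qdrant.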

Summary

RAG is important because it grounds LLM outputs in real-world data, reduces hallucinations, and enables applications that require up-to-date or domain-specific knowledge, significantly expanding the utility and reliability of these models.
