RAG for Devs

Ravi Mistry
3 min read

What is RAG?

  • RAG stands for Retrieval Augmented Generation.

  • RAG helps developers combine two components: retrieval (fetching relevant external information) and generation (using an LLM like GPT to produce answers). A minimal sketch of this flow follows this list.

  • Example: Instead of relying only on a language model’s training data, a RAG system retrieves real-time or domain-specific data to improve relevance and accuracy.
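
Here is a tiny, self-contained sketch of that retrieve-then-generate idea. The two helper functions and the two-line knowledge base are hypothetical stand-ins for illustration only; in a real pipeline they would be a vector-store search and an LLM call, and the matching would use vector similarity rather than word overlap.

    # Conceptual sketch of RAG: retrieve relevant text first, then generate with it.
    # retrieve_documents() and generate_answer() are hypothetical stand-ins, not a
    # real library API.

    def retrieve_documents(query: str, top_k: int = 1) -> list[str]:
        knowledge_base = [
            "New Delhi is the capital of India.",
            "The Taj Mahal is located in Agra.",
        ]
        query_words = set(query.lower().replace("?", "").split())
        ranked = sorted(
            knowledge_base,
            key=lambda doc: len(query_words & set(doc.lower().rstrip(".").split())),
            reverse=True,
        )
        return ranked[:top_k]

    def generate_answer(prompt: str) -> str:
        return f"(an LLM would answer here, given)\n{prompt}"

    query = "What is the capital of India?"
    context = "\n\n".join(retrieve_documents(query))
    print(generate_answer(f"Context:\n{context}\n\nQuestion: {query}"))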

Why RAG?

  • There are context limitations in LLMs (they can’t store all knowledge).

  • RAG keeps the models lightweight and updatable: you refresh the knowledge base instead of retraining the model.


Let’s understand the Context Window

  • The context window in an LLM is the amount of text (input) it can see and consider when generating a response.

  • Older LLMs had small context windows (2k-4k tokens), while newer ones can handle up to 128k tokens (roughly a 300-page book).

  • But you can’t keep feeding all your data into the context every time - it’s inefficient and costly.

  • That’s why modern LLM systems focus both on expanding context windows and on smarter techniques like RAG, which avoids stuffing in all the data and instead retrieves only the most relevant info, keeping things fast, scalable, and accurate.
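
To get a feel for what a token actually is, here is a small sketch using the tiktoken library (assuming it is installed, e.g. with pip install tiktoken); cl100k_base is the encoding used by many recent OpenAI models.

    import tiktoken

    # A context window is measured in tokens, not characters or words.
    encoding = tiktoken.get_encoding("cl100k_base")
    text = "RAG retrieves only the most relevant chunks instead of the whole knowledge base."
    tokens = encoding.encode(text)
    print(len(tokens), "tokens for", len(text), "characters")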


Where are Vector Embeddings used in RAG?

  • Vector embeddings convert text into numerical vectors that capture meaning. Similar texts have similar vectors (semantic meanings).

  • Example: “Capital of India” and “New Delhi” will have close embeddings.

  • In RAG, we embed both the user query and the documents, then search for the documents whose vectors are closest to the query to retrieve relevant info.
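
As a rough sketch of what “closest vectors” means, the snippet below embeds a query and two candidate texts with the same OpenAIEmbeddings model used later in this post and compares them with cosine similarity. It assumes numpy is installed and OPENAI_API_KEY is set in your environment; the texts are just illustrative.

    import numpy as np
    from langchain_openai import OpenAIEmbeddings

    embedder = OpenAIEmbeddings(model="text-embedding-3-large")  # picks up OPENAI_API_KEY from env

    query_vec = np.array(embedder.embed_query("Capital of India"))
    candidates = ["New Delhi", "Python lists"]
    doc_vecs = [np.array(v) for v in embedder.embed_documents(candidates)]

    # Cosine similarity: closer to 1 means more semantically similar.
    for text, vec in zip(candidates, doc_vecs):
        score = float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        print(text, round(score, 3))  # "New Delhi" should score noticeably higher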


Let’s understand the architecture of RAG

Preprocessing Phase (done by AI engineers):

  1. Data Collection → Gather data (PDFs, docs, CSVs, etc.) relevant to your domain.

  2. Chunking → Break large texts into smaller, meaningful pieces (e.g., paragraphs); a small chunking sketch follows this list.

  3. Embedding Documents → Convert each chunk into a vector using an embedding model (e.g., OpenAI).

  4. Store in Vector DB → Save those vectors in a vector database like Pinecone or Qdrant.
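
To make step 2 (chunking) concrete, here is a small sketch using LangChain’s RecursiveCharacterTextSplitter on a plain string; the sizes here are toy values, while the full example later in this post uses 1000-character chunks with a 200-character overlap.

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # The overlap repeats a little text between neighbouring chunks so that
    # sentences cut at a boundary still appear whole in at least one chunk.
    splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
    text = "RAG stands for Retrieval Augmented Generation. It retrieves relevant chunks at query time. " * 5
    chunks = splitter.split_text(text)
    print(len(chunks), "chunks")
    print(chunks[0])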

Query-Time Phase (when the user interacts):

  1. User Query → User asks a question or gives a prompt.

  2. Query Embedding → The query is embedded into a vector.

  3. Retrieval → Search the vector DB for the most relevant document chunks.

  4. Context Assembly → Retrieved chunks are appended to the prompt.

  5. Generation → An LLM generates the answer using the enriched context.


What is LangChain and why is it used?

  • LangChain is a Python/JS framework to build LLM applications, especially RAG pipelines.

  • It gives ready-made tools for:

    • Loading documents

    • Creating embeddings

    • Connecting to vector stores

    • Chaining prompts and responses

  • Example: Load a PDF → Split it → Embed it → Store in Qdrant DB → Retrieve on query → Send to GPT → Done!

    import os
    from pathlib import Path

    from langchain_community.document_loaders import PyPDFLoader
    from langchain_openai import OpenAIEmbeddings
    from langchain_qdrant import QdrantVectorStore
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 1. Data collection: load the source document (use your own file name).
    pdf_path = Path(__file__).parent / "file_name.pdf"
    loader = PyPDFLoader(file_path=str(pdf_path))
    docs = loader.load()

    # 2. Chunking: break the text into overlapping pieces.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    split_docs = text_splitter.split_documents(documents=docs)

    # 3. Embedding model (read the real key from the environment, not a literal string).
    embedder = OpenAIEmbeddings(
        model="text-embedding-3-large",
        api_key=os.getenv("OPENAI_API_KEY"),
    )

    # 4. Embed the chunks and store them in a local Qdrant collection.
    vector_store = QdrantVectorStore.from_documents(
        documents=split_docs,
        url="http://localhost:6333",
        collection_name="collection_name",
        embedding=embedder,
    )
    print("Ingestion done")

    # Query time: connect to the existing collection and retrieve similar chunks.
    retriever = QdrantVectorStore.from_existing_collection(
        url="http://localhost:6333",
        collection_name="collection_name",
        embedding=embedder,
    )

    user_query = "user_query"  # the user's actual question goes here
    relevant_chunks = retriever.similarity_search(query=user_query)

    # Context assembly: put only the retrieved text into the prompt.
    context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)

    SYSTEM_PROMPT = f"""
    You are a helpful assistant that responds based on the given context.

    Context:
    {context}
    """
    

You can now pass this system prompt, along with the user’s query, to the LLM of your choice to generate an enriched answer for the user.
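
One possible way to do that final generation step, continuing from the SYSTEM_PROMPT and user_query variables above, is sketched below with ChatOpenAI from the same langchain_openai package; the model name gpt-4o is only an example and any chat model can be swapped in.

    from langchain_core.messages import HumanMessage, SystemMessage
    from langchain_openai import ChatOpenAI

    # Generation: the LLM answers using only the retrieved context in SYSTEM_PROMPT.
    llm = ChatOpenAI(model="gpt-4o")
    response = llm.invoke([
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=user_query),
    ])
    print(response.content)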


Connect:

  • Let’s connect and explore Generative AI together - LinkedIn