RAG for Devs

Ravi Mistry
3 min read

What is RAG?

  • RAG stands for Retrieval Augmented Generation.

  • RAG helps developers combine two components: retrieval (fetching relevant external information) and generation (using an LLM like GPT to produce answers). A minimal sketch of this flow follows this list.

  • Example: Instead of relying only on a language model’s training data, a RAG system retrieves real-time or domain-specific data to improve relevance and accuracy.
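
Here is a tiny, self-contained sketch of that retrieve-then-generate idea. The two helper functions and the two-line knowledge base are hypothetical stand-ins for illustration only; in a real pipeline they would be a vector-store search and an LLM call, and the matching would use vector similarity rather than word overlap.

    # Conceptual sketch of RAG: retrieve relevant text first, then generate with it.
    # retrieve_documents() and generate_answer() are hypothetical stand-ins, not a
    # real library API.

    def retrieve_documents(query: str, top_k: int = 1) -> list[str]:
        knowledge_base = [
            "New Delhi is the capital of India.",
            "The Taj Mahal is located in Agra.",
        ]
        query_words = set(query.lower().replace("?", "").split())
        ranked = sorted(
            knowledge_base,
            key=lambda doc: len(query_words & set(doc.lower().rstrip(".").split())),
            reverse=True,
        )
        return ranked[:top_k]

    def generate_answer(prompt: str) -> str:
        return f"(an LLM would answer here, given)\n{prompt}"

    query = "What is the capital of India?"
    context = "\n\n".join(retrieve_documents(query))
    print(generate_answer(f"Context:\n{context}\n\nQuestion: {query}"))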

Why RAG?

  • There are context limitations in LLMs (they can’t store all knowledge).

  • RAG keeps the models lightweight and updatable: you refresh the knowledge base instead of retraining the model.


Let’s understand the Context Window

  • The context window in an LLM is the amount of text (input) it can see and consider when generating a response.

  • Older LLMs had small context windows (2k-4k tokens), while newer ones can handle up to 128k tokens (roughly a 300-page book).

  • But you can’t keep feeding all your data into the context every time - it’s inefficient and costly.

  • That’s why modern LLM systems focus both on expanding context windows and on smarter techniques like RAG, which avoids stuffing in all the data and instead retrieves only the most relevant info, keeping things fast, scalable, and accurate.
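
To get a feel for what a token actually is, here is a small sketch using the tiktoken library (assuming it is installed, e.g. with pip install tiktoken); cl100k_base is the encoding used by many recent OpenAI models.

    import tiktoken

    # A context window is measured in tokens, not characters or words.
    encoding = tiktoken.get_encoding("cl100k_base")
    text = "RAG retrieves only the most relevant chunks instead of the whole knowledge base."
    tokens = encoding.encode(text)
    print(len(tokens), "tokens for", len(text), "characters")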


Where are Vector Embeddings used in RAG?

  • Vector embeddings convert text into numerical vectors that capture meaning. Similar texts have similar vectors (semantic meanings).

  • Example: “Capital of India” and “New Delhi” will have close embeddings.

  • In RAG, we embed both the user query and the documents, then search for the documents whose vectors are closest to the query to retrieve relevant info.
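
As a rough sketch of what “closest vectors” means, the snippet below embeds a query and two candidate texts with the same OpenAIEmbeddings model used later in this post and compares them with cosine similarity. It assumes numpy is installed and OPENAI_API_KEY is set in your environment; the texts are just illustrative.

    import numpy as np
    from langchain_openai import OpenAIEmbeddings

    embedder = OpenAIEmbeddings(model="text-embedding-3-large")  # picks up OPENAI_API_KEY from env

    query_vec = np.array(embedder.embed_query("Capital of India"))
    candidates = ["New Delhi", "Python lists"]
    doc_vecs = [np.array(v) for v in embedder.embed_documents(candidates)]

    # Cosine similarity: closer to 1 means more semantically similar.
    for text, vec in zip(candidates, doc_vecs):
        score = float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        print(text, round(score, 3))  # "New Delhi" should score noticeably higher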


Let’s understand the architecture of RAG

Preprocessing Phase (done by AI engineers):

  1. Data Collection → Gather data (PDFs, docs, CSVs, etc.) relevant to your domain.

  2. Chunking → Break large texts into smaller, meaningful pieces (e.g., paragraphs); a small chunking sketch follows this list.

  3. Embedding Documents → Convert each chunk into a vector using an embedding model (e.g., OpenAI).

  4. Store in Vector DB → Save those vectors in a vector database like Pinecone or Qdrant.
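
To make step 2 (chunking) concrete, here is a small sketch using LangChain’s RecursiveCharacterTextSplitter on a plain string; the sizes here are toy values, while the full example later in this post uses 1000-character chunks with a 200-character overlap.

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # The overlap repeats a little text between neighbouring chunks so that
    # sentences cut at a boundary still appear whole in at least one chunk.
    splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
    text = "RAG stands for Retrieval Augmented Generation. It retrieves relevant chunks at query time. " * 5
    chunks = splitter.split_text(text)
    print(len(chunks), "chunks")
    print(chunks[0])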

Query-Time Phase (when the user interacts):

  1. User Query → User asks a question or gives a prompt.

  2. Query Embedding → The query is embedded into a vector.

  3. Retrieval → Search the vector DB for the most relevant document chunks.

  4. Context Assembly → Retrieved chunks are appended to the prompt.

  5. Generation → An LLM generates the answer using the enriched context.


What is LangChain and why is it used?

  • LangChain is a Python/JS framework to build LLM applications, especially RAG pipelines.

  • It gives ready-made tools for:

    • Loading documents

    • Creating embeddings

    • Connecting to vector stores

    • Chaining prompts and responses

  • Example: Load a PDF → Split it → Embed it → Store in Qdrant DB → Retrieve on query → Send to GPT → Done!

    import os
    from pathlib import Path

    from langchain_community.document_loaders import PyPDFLoader
    from langchain_openai import OpenAIEmbeddings
    from langchain_qdrant import QdrantVectorStore
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # 1. Data collection: load the source document (use your own file name).
    pdf_path = Path(__file__).parent / "file_name.pdf"
    loader = PyPDFLoader(file_path=str(pdf_path))
    docs = loader.load()

    # 2. Chunking: break the text into overlapping pieces.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )
    split_docs = text_splitter.split_documents(documents=docs)

    # 3. Embedding model (read the real key from the environment, not a literal string).
    embedder = OpenAIEmbeddings(
        model="text-embedding-3-large",
        api_key=os.getenv("OPENAI_API_KEY"),
    )

    # 4. Embed the chunks and store them in a local Qdrant collection.
    vector_store = QdrantVectorStore.from_documents(
        documents=split_docs,
        url="http://localhost:6333",
        collection_name="collection_name",
        embedding=embedder,
    )
    print("Ingestion done")

    # Query time: connect to the existing collection and retrieve similar chunks.
    retriever = QdrantVectorStore.from_existing_collection(
        url="http://localhost:6333",
        collection_name="collection_name",
        embedding=embedder,
    )

    user_query = "user_query"  # the user's actual question goes here
    relevant_chunks = retriever.similarity_search(query=user_query)

    # Context assembly: put only the retrieved text into the prompt.
    context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)

    SYSTEM_PROMPT = f"""
    You are a helpful assistant that responds based on the given context.

    Context:
    {context}
    """
    

You can now pass this system prompt, along with the user’s query, to the LLM of your choice to generate an enriched answer for the user.
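
One possible way to do that final generation step, continuing from the SYSTEM_PROMPT and user_query variables above, is sketched below with ChatOpenAI from the same langchain_openai package; the model name gpt-4o is only an example and any chat model can be swapped in.

    from langchain_core.messages import HumanMessage, SystemMessage
    from langchain_openai import ChatOpenAI

    # Generation: the LLM answers using only the retrieved context in SYSTEM_PROMPT.
    llm = ChatOpenAI(model="gpt-4o")
    response = llm.invoke([
        SystemMessage(content=SYSTEM_PROMPT),
        HumanMessage(content=user_query),
    ])
    print(response.content)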


Connect:

  • Let’s connect and explore Generative AI together - LinkedIn