Building a RAG System with LangChain: A Step-by-Step Guide

Ahmad Mayahi
8 min read

In the ever-evolving landscape of AI applications, Retrieval Augmented Generation (RAG) has emerged as a game-changer for creating more accurate, reliable, and contextually aware AI systems. If you've ever been frustrated by an LLM hallucinating facts or providing outdated information, RAG might be the solution you're looking for.

What We're Building

In this guide, we'll build a practical RAG system using LangChain that can answer questions based on your own documents. By the end, you'll have a working application that:

  1. Ingests documents from various sources

  2. Processes and indexes them efficiently

  3. Retrieves relevant information when queried

  4. Generates accurate, contextually-aware responses

This isn't just theoretical—we'll walk through actual Python code that you can adapt for your own projects, whether you're building a documentation assistant, a customer support bot, or a research tool.

Why RAG Matters

Large Language Models (LLMs) like GPT-4 are impressive, but they have limitations:

  • They can hallucinate or make up information

  • Their knowledge is limited to their training data

  • They can't access your private or specialized information

RAG addresses these issues by combining the generative capabilities of LLMs with a retrieval system that pulls relevant information from your own data sources. This creates a powerful synergy: the retrieval component provides factual grounding, while the LLM handles the natural language understanding and generation.

The Architecture of a RAG System

A typical RAG application consists of two main components:

1. Indexing Pipeline (Happens Offline)

This is where we prepare our data:

  • Load: Ingest documents from various sources

  • Split: Break documents into manageable chunks

  • Embed: Convert text chunks into vector embeddings

  • Store: Index these embeddings in a vector database

2. Retrieval and Generation Pipeline (Happens at Query Time)

This is the runtime flow:

  • Query Processing: Understand what the user is asking

  • Retrieval: Find the most relevant document chunks

  • Context Augmentation: Add retrieved information to the prompt

  • Generation: Use an LLM to create a response based on the augmented context
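
In plain Python, the query-time flow looks roughly like the sketch below. This is only a schematic: the retriever and llm arguments stand for the LangChain objects we build later in this guide, and the prompt wording is illustrative.

# Schematic sketch of the retrieval-and-generation flow.
# `retriever` and `llm` stand for the LangChain objects built later in this guide.
def answer_with_rag(question, retriever, llm):
    # Retrieval: find the most relevant document chunks
    docs = retriever.get_relevant_documents(question)

    # Context augmentation: add the retrieved text to the prompt
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

    # Generation: let the LLM produce the grounded answer
    return llm.invoke(prompt).content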

Now, let's implement this architecture using LangChain.

Setting Up Your Environment

First, let's install the necessary packages:

pip install langchain langchain-community langchain-openai chromadb pypdf python-dotenv

Next, we'll set up our environment with the necessary API keys:

import os
from dotenv import load_dotenv

load_dotenv()

# Make sure the OpenAI API key was loaded from the .env file
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")

Building the Indexing Pipeline

Step 1: Loading Documents

LangChain provides various document loaders for different data sources. Let's start with a simple example using PDF files:

from langchain_community.document_loaders import PyPDFLoader

# Load a PDF file
loader = PyPDFLoader("./data/company_handbook.pdf")
documents = loader.load()

print(f"Loaded {len(documents)} document pages")

You can also load from multiple sources:

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    UnstructuredURLLoader
)

# Load from multiple sources
loaders = [
    PyPDFLoader("./data/company_handbook.pdf"),
    TextLoader("./data/faq.txt"),
    UnstructuredURLLoader(urls=["https://www.example.com/about"])
]

documents = []
for loader in loaders:
    documents.extend(loader.load())

print(f"Loaded {len(documents)} documents from multiple sources")

Step 2: Splitting Documents

Large documents need to be split into smaller chunks for effective retrieval:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Number of characters per chunk
    chunk_overlap=200,  # Overlap between chunks to maintain context
    length_function=len,
)

# Split documents into chunks
document_chunks = text_splitter.split_documents(documents)

print(f"Split into {len(document_chunks)} chunks")

The chunk_size and chunk_overlap parameters are crucial (see the quick comparison after this list):

  • Chunks that are too large can drag irrelevant information into the prompt

  • Chunks that are too small can lose the surrounding context

  • Overlap helps maintain continuity between adjacent chunks
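
There is no universal best setting; it depends on your documents and queries. A quick, low-cost way to compare options is to split the same documents with a few candidate configurations (the values below are arbitrary examples) and look at the resulting chunk counts:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Compare a few candidate settings on the documents loaded earlier
for size, overlap in [(500, 50), (1000, 200), (2000, 200)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    chunks = splitter.split_documents(documents)
    print(f"chunk_size={size}, chunk_overlap={overlap} -> {len(chunks)} chunks")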

Step 3: Creating Embeddings and Vector Store

Now we'll convert our text chunks into vector embeddings and store them:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Initialize the embeddings model
embeddings = OpenAIEmbeddings()

# Create a vector store
vectorstore = Chroma.from_documents(
    documents=document_chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Persist the vector store to disk
vectorstore.persist()

This creates a searchable database of your document chunks. The OpenAIEmbeddings class uses OpenAI's embedding model to convert text into vector representations, which Chroma then indexes for efficient similarity search.
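
Because the index is persisted to ./chroma_db, you don't have to re-embed your documents every time the application starts. As a sketch (assuming the same directory and embedding model), you can reload the store and run a raw similarity search to sanity-check it:

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Reload the persisted index instead of rebuilding it
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

# Sanity check: raw similarity search against the index
for doc in vectorstore.similarity_search("vacation policy", k=2):
    print(doc.metadata.get("source"), "-", doc.page_content[:100])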

Building the Retrieval and Generation Pipeline

Step 1: Setting Up the Retriever

The retriever is responsible for finding relevant document chunks:

# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",  # Use similarity search
    search_kwargs={"k": 4}  # Return top 4 chunks
)
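
Plain similarity search is a sensible default, but as_retriever supports other search types as well. The two variants below are illustrative sketches (the parameter values are arbitrary, not recommendations):

# Maximal Marginal Relevance: trade a little similarity for more diverse chunks
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}  # fetch 20 candidates, keep 4 diverse ones
)

# Similarity with a score threshold: drop weakly related chunks entirely
threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5}
)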

Step 2: Creating the RAG Chain

Now we'll combine the retriever with an LLM to create our RAG system:

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

# Initialize the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Create a RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" means we stuff all retrieved docs into the prompt
    retriever=retriever,
    return_source_documents=True  # Include source documents in the response
)
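
The "stuff" chain type simply concatenates every retrieved chunk into one prompt, which works well for a handful of chunks. If you retrieve more context than fits comfortably into the model's window, RetrievalQA also accepts other chain types; the "map_reduce" variant below is an illustrative alternative, not something this guide requires:

# Alternative: process each retrieved chunk separately, then combine the answers
qa_chain_map_reduce = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",
    retriever=retriever,
    return_source_documents=True
)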

Step 3: Using the RAG System

With our RAG system set up, we can now ask questions:

# Ask a question
query = "What is our company's vacation policy?"
result = qa_chain({"query": query})

# Print the answer
print("Question:", query)
print("Answer:", result["result"])
print("\nSources:")
for i, doc in enumerate(result["source_documents"]):
    print(f"Source {i+1}:")
    print(doc.page_content[:200] + "...\n")

Advanced RAG Techniques

Using Custom Prompts

The default prompts in LangChain are good, but you can customize them for better results:

from langchain.prompts import PromptTemplate

# Define a custom prompt template
template = """
You are a helpful assistant that answers questions based on the provided context.

Context:
{context}

Question:
{question}

Answer the question based only on the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer this question."

Answer:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# Create a chain with the custom prompt
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

Implementing Metadata Filtering

You can add metadata to your documents and use it for filtering:

# When creating document chunks, add metadata
for i, chunk in enumerate(document_chunks):
    chunk.metadata = {
        "source": chunk.metadata.get("source", "unknown"),
        "page": chunk.metadata.get("page", 0),
        "category": "HR" if "vacation" in chunk.page_content.lower() else "General"
    }

# When creating the retriever, add filtering
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 4,
        "filter": {"category": "HR"}  # Only retrieve HR documents
    }
)

Building a Conversational RAG System

For a more interactive experience, you can create a conversational RAG system:

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# Create a memory object to store conversation history
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Create a conversational RAG chain
conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)

# Have a conversation
response = conversational_chain({"question": "What is our vacation policy?"})
print("Answer:", response["answer"])

# Ask a follow-up question
response = conversational_chain({"question": "How many days do I get?"})
print("Answer:", response["answer"])
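
The follow-up question works because the chain condenses it together with the stored history into a standalone query before retrieval. You can peek at what the memory has accumulated (ConversationBufferMemory keeps the exchange as message objects):

# Inspect the conversation history stored in memory
for message in memory.chat_memory.messages:
    print(f"{message.type}: {message.content}")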

Evaluating Your RAG System

It's important to evaluate your RAG system to ensure it's providing accurate and helpful responses:

from langchain.evaluation.qa import QAEvalChain

# Define evaluation examples
examples = [
    {
        "query": "What is our company's vacation policy?",
        "answer": "Employees receive 20 days of paid vacation per year."
    },
    {
        "query": "Who is eligible for health insurance?",
        "answer": "All full-time employees are eligible for health insurance after 30 days."
    }
]

# Generate predictions
predictions = []
for example in examples:
    result = qa_chain({"query": example["query"]})
    predictions.append({"query": example["query"], "result": result["result"]})

# Create an evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Evaluate the predictions
graded_outputs = eval_chain.evaluate(examples, predictions)

# Print the evaluation results
for i, (example, prediction) in enumerate(zip(examples, predictions)):
    print(f"Example {i+1}:")
    print("Query:", example["query"])
    print("Expected:", example["answer"])
    print("Predicted:", prediction["result"])
    print("Grade:", graded_outputs[i]["text"])
    print()
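
To turn the grades into a single number, a small summary helper is enough. This assumes each graded output's "text" field is a short verdict string such as "CORRECT" or "INCORRECT", matching how it is read in the loop above:

# Summarize the per-example grades into an overall accuracy
grades = [g["text"].strip().upper() for g in graded_outputs]
print(f"Accuracy: {grades.count('CORRECT')}/{len(grades)}")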

Putting It All Together: A Complete RAG System

Here's a complete example that ties everything together:

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Load environment variables
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")

# 1. Load documents
loader = PyPDFLoader("./data/company_handbook.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} document pages")

# 2. Split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
document_chunks = text_splitter.split_documents(documents)
print(f"Split into {len(document_chunks)} chunks")

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=document_chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
vectorstore.persist()

# 4. Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# 5. Define custom prompt
template = """
You are a helpful assistant that answers questions based on the provided context.

Context:
{context}

Question:
{question}

Answer the question based only on the provided context. If the context doesn't contain the answer, say "I don't have enough information to answer this question."

Answer:
"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# 6. Create LLM and QA chain
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

# 7. Interactive question answering
while True:
    query = input("\nEnter your question (or 'quit' to exit): ")
    if query.lower() == 'quit':
        break

    result = qa_chain({"query": query})

    print("\nAnswer:", result["result"])
    print("\nSources:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"Source {i+1}:")
        print(doc.page_content[:200] + "...\n")

Conclusion

RAG systems represent a powerful approach to enhancing LLMs with external knowledge. By following this guide, you've learned how to:

  1. Build an indexing pipeline to process and store your documents

  2. Create a retrieval system that finds relevant information

  3. Combine retrieval with an LLM to generate accurate, contextual responses

  4. Implement advanced techniques like custom prompts and metadata filtering

  5. Evaluate your RAG system's performance

The applications for RAG are vast and growing. From customer support and documentation assistants to research tools and knowledge management systems, RAG enables you to create AI applications that are both powerful and trustworthy.

By implementing these techniques, you're well on your way to creating more accurate, reliable, and contextually aware AI applications that can leverage your own data sources.
