Building a Complete RAG Pipeline from Scratch: Data Ingestion to Retrieval

UJJWAL BALAJI

Introduction to RAG and Why It Matters

Retrieval-Augmented Generation (RAG) has revolutionised how we interact with large language models by combining the power of information retrieval with generative AI. In this comprehensive guide, we'll build a complete RAG pipeline from the ground up, covering everything from document structure understanding to vector database implementation and query retrieval.

RAG addresses critical limitations of traditional LLMs:

  • Knowledge cutoffs: LLMs lack real-time information

  • Domain specificity: Generic models struggle with specialized knowledge

  • Accuracy concerns: Grounding responses in verified sources reduces hallucinations

The Two Core Pipelines of RAG

1. Data Ingestion Pipeline

This is where we process and prepare our knowledge base:

  • Document loading and parsing

  • Chunking strategy implementation

  • Embedding generation

  • Vector database storage

2. Query Retrieval Pipeline

This handles user interactions:

  • Query processing and embedding

  • Similarity search in vector database

  • Context retrieval

  • Response generation (covered in the next part)

Understanding Document Structure: The Foundation

At the heart of any RAG system lies the document structure. LangChain provides a standardised way to handle documents through the Document class, which combines two core components: the text itself (page_content) and its metadata:

from langchain_core.documents import Document

# Creating a document manually
doc = Document(
    page_content="This is the main text content I'm using to create RAG",
    metadata={
        "source": "example.txt",
        "pages": 1,
        "author": "Ujjwal",
        "date_created": "2024-01-01"
    }
)

Why metadata matters:

  • Enables filtered searching (e.g., "find documents by author X"; see the sketch after this list)

  • Provides context about the source material

  • Enhances retrieval accuracy through additional filtering dimensions
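
For example, ChromaDB (the vector store we set up later in this article) supports a "where" filter on queries. Below is a minimal, self-contained sketch of metadata-filtered search; the sample documents, IDs, and the "author" value are purely illustrative and not part of the pipeline we build below:

import chromadb

# Minimal sketch of metadata filtering in ChromaDB.
# The documents, ids, and metadata values here are illustrative only.
client = chromadb.Client()  # in-memory client, just for this example
collection = client.get_or_create_collection("metadata_demo")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "RAG combines retrieval with generation",
        "Transformers use self-attention",
    ],
    metadatas=[{"author": "Ujjwal"}, {"author": "Someone Else"}],
)

# The where clause restricts the similarity search to matching metadata
results = collection.query(
    query_texts=["What is RAG?"],
    n_results=1,
    where={"author": "Ujjwal"},
)
print(results["documents"])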

Practical Implementation: Building the Pipeline

Setting Up the Environment:

# Using UV for package management
uv init rag-project
cd rag-project
uv venv --python 3.13.2
uv add langchain langchain-core langchain-community
uv add pymupdf sentence-transformers chromadb

Document Loading Strategies

Text File Loading:

from langchain_community.document_loaders import TextLoader

loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")
documents = loader.load()

Directory Loading (Multiple Files):

from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    "data/text_files/",
    glob="*.txt",
    show_progress=False
)
all_documents = loader.load()

PDF Processing with PyMuPDF:

from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

pdf_loader = DirectoryLoader(
    "data/pdf_files/",
    glob="*.pdf",
    loader_cls=PyMuPDFLoader
)
pdf_documents = pdf_loader.load()
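
A quick sanity check after loading helps confirm what each loader produced. Note that PyMuPDF returns one Document per PDF page, and the exact metadata fields (shown in the comments) may vary by loader version:

# Quick sanity check: how many Documents did we get, and what do they contain?
print(f"Loaded {len(all_documents)} text documents and {len(pdf_documents)} PDF pages")

sample = pdf_documents[0]
print(sample.metadata)            # e.g. source, file_path, page, total_pages
print(sample.page_content[:200])  # first 200 characters of the page text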

The Power of Chunking

Why chunking is essential:

  • Embedding models have fixed context windows

  • LLMs have token limits

  • Improves retrieval precision by creating focused content segments

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
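
It is worth inspecting a few chunks to confirm that sizes and overlap behave as expected, for example:

# Each chunk is itself a Document and inherits its parent's metadata
print(f"Split {len(documents)} documents into {len(chunks)} chunks")

for chunk in chunks[:2]:
    print(f"Source: {chunk.metadata.get('source')}, length: {len(chunk.page_content)} chars")
    print(chunk.page_content[:150])
    print("---")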

Embedding Management Class:

from sentence_transformers import SentenceTransformer
import numpy as np

class EmbeddingManager:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load the sentence transformer model"""
        self.model = SentenceTransformer(self.model_name)
        print(f"Model {self.model_name} loaded successfully")
        print(f"Embedding dimension: {self.model.get_sentence_embedding_dimension()}")

    def generate_embeddings(self, texts):
        """Generate embeddings for list of texts"""
        return self.model.encode(texts, show_progress_bar=True)

# Initialize embedding manager
embedding_manager = EmbeddingManager()
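
As a quick sanity check (the sentences below are arbitrary examples), we can embed a few texts and compare them with cosine similarity using the numpy import above; semantically related sentences should score noticeably higher:

# Quick check that the embeddings capture semantic similarity
sentences = [
    "The attention mechanism weighs the importance of each token",
    "Transformers rely on self-attention to relate tokens",
    "I had pasta for dinner last night",
]
vectors = embedding_manager.generate_embeddings(sentences)
print(vectors.shape)  # (3, 384) for all-MiniLM-L6-v2

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # related sentences: higher score
print(cosine(vectors[0], vectors[2]))  # unrelated sentences: lower score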

Vector Store Implementation

import chromadb
import os
from uuid import uuid4

class VectorStore:
    def __init__(self, collection_name="documents", persistent_dir="data/vector_store"):
        self.collection_name = collection_name
        self.persistent_dir = persistent_dir
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """Initialize ChromaDB client and collection"""
        os.makedirs(self.persistent_dir, exist_ok=True)
        self.client = chromadb.PersistentClient(path=self.persistent_dir)

        self.collection = self.client.get_or_create_collection(
            name=self.collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        print(f"Collection '{self.collection_name}' initialized")
        print(f"Existing documents in collection: {self.collection.count()}")

    def add_documents(self, documents, embeddings):
        """Add documents and embeddings to vector store"""
        if len(documents) != len(embeddings):
            raise ValueError("Documents and embeddings must have same length")

        # Prepare data for ChromaDB
        ids = [str(uuid4()) for _ in range(len(documents))]
        metadatas = [doc.metadata for doc in documents]
        documents_text = [doc.page_content for doc in documents]
        embedding_list = [embedding.tolist() for embedding in embeddings]

        # Add to collection
        self.collection.add(
            ids=ids,
            embeddings=embedding_list,
            metadatas=metadatas,
            documents=documents_text
        )
        print(f"Added {len(documents)} documents to collection")
        print(f"Total documents in collection: {self.collection.count()}")

# Initialize vector store
vector_store = VectorStore()

Storing Documents in Vector Database

# Extract text from chunks
texts = [chunk.page_content for chunk in chunks]

# Generate embeddings
embeddings = embedding_manager.generate_embeddings(texts)

# Store in vector database
vector_store.add_documents(chunks, embeddings)
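
Since we used PersistentClient, the collection survives process restarts. A small check like the following (just a sketch to illustrate the behaviour) confirms that a fresh VectorStore instance reconnects to the existing data on disk:

# Because ChromaDB persists to disk, re-creating the store in a later session
# reconnects to the same collection instead of starting from scratch
reloaded_store = VectorStore()
print(f"Documents available after reload: {reloaded_store.collection.count()}")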

Building the RAG Retriever

The retriever handles query processing and similarity search:

class RAGRetriever:
    def __init__(self, vector_store, embedding_manager):
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query, top_k=5, score_threshold=0.0):
        """Retrieve relevant documents for a query"""
        # Convert query to embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])

        # Query the vector store
        results = self.vector_store.collection.query(
            query_embeddings=query_embedding.tolist(),
            n_results=top_k
        )

        # Process results
        retrieved_docs = []
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0], 
            results['metadatas'][0], 
            results['distances'][0]
        )):
            similarity_score = 1 - distance  # Convert distance to similarity

            if similarity_score >= score_threshold:
                retrieved_docs.append({
                    'content': doc,
                    'metadata': metadata,
                    'similarity_score': similarity_score,
                    'id': results['ids'][0][i]
                })

        return retrieved_docs

# Initialize retriever
rag_retriever = RAGRetriever(vector_store, embedding_manager)

Testing the Retrieval System

# Test query
query = "What is attention mechanism in transformers?"
results = rag_retriever.retrieve(query, top_k=3)

for i, result in enumerate(results):
    print(f"Result {i+1} (Score: {result['similarity_score']:.3f}):")
    print(f"Content: {result['content'][:200]}...")
    print("---")

Key Insights and Best Practices

  1. Document Structure is Fundamental: Proper metadata management enables powerful filtering capabilities

  2. Chunking Strategy Matters:

    • Smaller chunks for precise retrieval

    • Overlap maintains context continuity

    • Size should match your embedding model's capabilities

  3. Embedding Model Selection:

    • all-MiniLM-L6-v2 produces 384-dimensional embeddings with a good speed/quality trade-off

    • Consider larger models for complex semantic tasks

  4. Vector Database Choices:

    • ChromaDB is excellent for prototyping and small-to-medium production workloads

    • Consider Pinecone or Weaviate for scale

  5. Similarity Thresholds: Implement score filtering to ensure only relevant context is retrieved
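
For point 5, a stricter retrieval call with the retriever built above might look like this; the 0.3 cutoff is only an illustrative starting point to tune against your own data:

# A stricter retrieval call: drop weakly related chunks before they reach the LLM.
# The 0.3 cutoff is an illustrative starting point, not a recommended value.
filtered_results = rag_retriever.retrieve(
    "What is attention mechanism in transformers?",
    top_k=5,
    score_threshold=0.3,
)
print(f"{len(filtered_results)} chunks passed the threshold")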

Next Steps and Assignment

Your Assignment:

  1. Extend this pipeline to handle Excel/CSV files (a starter sketch follows this list)

  2. Implement support for web content scraping

  3. Experiment with different chunking strategies

  4. Try alternative embedding models from Hugging Face
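
As a nudge for the first item, LangChain's CSVLoader is one way to start; the file path below is hypothetical, and Excel files would need a different loader:

from langchain_community.document_loaders import CSVLoader

# Sketch for assignment item 1: CSVLoader turns each row into a Document,
# which can then flow through the same chunking/embedding/storage steps.
csv_loader = CSVLoader(
    file_path="data/csv_files/knowledge_base.csv",  # hypothetical path
    encoding="utf-8",
)
csv_documents = csv_loader.load()
print(f"Loaded {len(csv_documents)} rows as documents")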

What's Coming Next:
In the next instalment, we'll complete the RAG pipeline by:

  • Integrating LLMs for response generation

  • Creating a seamless query-response interface

  • Implementing response evaluation metrics

  • Packaging everything into a production-ready application

Conclusion

Building a RAG system from scratch provides deep insights into how retrieval-augmented generation actually works under the hood. By understanding each component—from document processing to vector storage and retrieval—you gain the ability to customize and optimize each stage for your specific use case.

The modular approach shown here allows for easy experimentation and scaling. Whether you're building a research paper assistant, customer support bot, or internal knowledge management system, these foundations will serve you well.

Remember: The quality of your RAG system depends on the quality of your data processing. Invest time in proper document structure, thoughtful chunking, and appropriate metadata management—it will pay dividends in retrieval accuracy and user experience.
