Building a Complete RAG Pipeline from Scratch: Data Ingestion to Retrieval


Introduction to RAG and Why It Matters
Retrieval-Augmented Generation (RAG) has revolutionised how we interact with large language models by combining the power of information retrieval with generative AI. In this comprehensive guide, we'll build a complete RAG pipeline from the ground up, covering everything from document structure understanding to vector database implementation and query retrieval.
RAG addresses critical limitations of traditional LLMs:
Knowledge cutoffs: LLMs lack real-time information
Domain specificity: Generic models struggle with specialized knowledge
Accuracy concerns: Hallucinations can be reduced with verified sources
The Two Core Pipelines of RAG
1. Data Ingestion Pipeline
This is where we process and prepare our knowledge base:
Document loading and parsing
Chunking strategy implementation
Embedding generation
Vector database storage
2. Query Retrieval Pipeline
This handles user interactions:
Query processing and embedding
Similarity search in vector database
Context retrieval
Response generation (covered in the next part)
Understanding Document Structure: The Foundation
At the heart of any RAG system lies the document structure. LangChain provides a standardised way to handle documents through its Document class, which has two core components: page_content (the text itself) and metadata (structured information about that text):
from langchain_core.documents import Document

# Creating a document manually
doc = Document(
    page_content="This is the main text content I'm using to create RAG",
    metadata={
        "source": "example.txt",
        "pages": 1,
        "author": "Ujjwal",
        "date_created": "2024-01-01"
    }
)
Why metadata matters:
Enables filtered searching (e.g., "find documents by author X")
Provides context about the source material
Enhances retrieval accuracy through additional filtering dimensions
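For example, with consistent metadata in place, filtering documents is straightforward. Here is a minimal sketch using two made-up documents (the author names and sources are illustrative):
from langchain_core.documents import Document
# Two toy documents with author metadata (values are illustrative)
docs = [
    Document(page_content="RAG pipelines explained", metadata={"author": "Ujjwal", "source": "rag.txt"}),
    Document(page_content="CNNs explained", metadata={"author": "Someone Else", "source": "cnn.txt"}),
]
# Keep only documents written by a specific author
by_author = [d for d in docs if d.metadata.get("author") == "Ujjwal"]
print(len(by_author))  # 1
Vector databases such as ChromaDB support the same idea at query time through metadata filters, so the metadata you attach during ingestion pays off later in retrieval.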
Practical Implementation: Building the Pipeline
Setting Up the Environment:
# Using UV for package management
uv init rag-project
cd rag-project
uv venv --python 3.13.2
uv add langchain langchain-core langchain-community
uv add pymupdf sentence-transformers chromadb
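Before loading any documents, a quick import check confirms the environment is wired up correctly (this is only a sanity check; the script name you save it under is up to you):
# Verify that the core packages are importable
import langchain
import langchain_community
import chromadb
import sentence_transformers
print("Environment ready")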
Document Loading Strategies
Text File Loading:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("data/text_files/python_intro.txt", encoding="utf-8")
documents = loader.load()
Directory Loading (Multiple Files):
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader(
    "data/text_files/",
    glob="*.txt",
    show_progress=False
)
all_documents = loader.load()
PDF Processing with PyMuPDF:
from langchain_community.document_loaders import PyMuPDFLoader
pdf_loader = DirectoryLoader(
    "data/pdf_files/",
    glob="*.pdf",
    loader_cls=PyMuPDFLoader
)
pdf_documents = pdf_loader.load()
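PyMuPDFLoader produces one Document per page and attaches loader metadata to each. Inspecting the first result is a quick way to see what is available (the exact metadata keys can vary with the loader version):
# Peek at the first extracted page (assumes at least one PDF was loaded)
first_page = pdf_documents[0]
print(first_page.metadata)             # typically includes the source path and page number
print(first_page.page_content[:300])   # preview of the extracted text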
The Power of Chunking
Why chunking is essential:
Embedding models have fixed context windows
LLMs have token limits
Improves retrieval precision by creating focused content segments
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
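RecursiveCharacterTextSplitter tries paragraph breaks first, then single newlines, then spaces, so chunks tend to end at natural boundaries. A quick check over the chunks produced above helps confirm the settings are reasonable:
# Sanity-check the chunking output
print(f"Split {len(documents)} documents into {len(chunks)} chunks")
print(f"Longest chunk: {max(len(c.page_content) for c in chunks)} characters")
print(f"First chunk metadata: {chunks[0].metadata}")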
Embedding Management Class:
from sentence_transformers import SentenceTransformer
import numpy as np
class EmbeddingManager:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load the sentence transformer model"""
        self.model = SentenceTransformer(self.model_name)
        print(f"Model {self.model_name} loaded successfully")
        print(f"Embedding dimension: {self.model.get_sentence_embedding_dimension()}")

    def generate_embeddings(self, texts):
        """Generate embeddings for list of texts"""
        return self.model.encode(texts, show_progress_bar=True)

# Initialize embedding manager
embedding_manager = EmbeddingManager()
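A quick smoke test with a couple of made-up sentences shows what the manager returns: a NumPy array with one row per input text and 384 columns for this model:
# Smoke test with illustrative sentences
sample_texts = [
    "RAG combines retrieval with generation.",
    "Transformers rely on self-attention.",
]
sample_embeddings = embedding_manager.generate_embeddings(sample_texts)
print(sample_embeddings.shape)  # (2, 384) for all-MiniLM-L6-v2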
Vector Store Implementation
import chromadb
import os
from uuid import uuid4

class VectorStore:
    def __init__(self, collection_name="documents", persistent_dir="data/vector_store"):
        self.collection_name = collection_name
        self.persistent_dir = persistent_dir
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """Initialize ChromaDB client and collection"""
        os.makedirs(self.persistent_dir, exist_ok=True)
        self.client = chromadb.PersistentClient(path=self.persistent_dir)
        self.collection = self.client.get_or_create_collection(
            name=self.collection_name,
            metadata={"hnsw:space": "cosine"}
        )
        print(f"Collection '{self.collection_name}' initialized")
        print(f"Existing documents in collection: {self.collection.count()}")

    def add_documents(self, documents, embeddings):
        """Add documents and embeddings to vector store"""
        if len(documents) != len(embeddings):
            raise ValueError("Documents and embeddings must have same length")

        # Prepare data for ChromaDB
        ids = [str(uuid4()) for _ in range(len(documents))]
        metadatas = [doc.metadata for doc in documents]
        documents_text = [doc.page_content for doc in documents]
        embedding_list = [embedding.tolist() for embedding in embeddings]

        # Add to collection
        self.collection.add(
            ids=ids,
            embeddings=embedding_list,
            metadatas=metadatas,
            documents=documents_text
        )
        print(f"Added {len(documents)} documents to collection")
        print(f"Total documents in collection: {self.collection.count()}")

# Initialize vector store
vector_store = VectorStore()
Storing Documents in Vector Database
# Extract text from chunks
texts = [chunk.page_content for chunk in chunks]
# Generate embeddings
embeddings = embedding_manager.generate_embeddings(texts)
# Store in vector database
vector_store.add_documents(chunks, embeddings)
Building the RAG Retriever
The retriever handles query processing and similarity search:
class RAGRetriever:
    def __init__(self, vector_store, embedding_manager):
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query, top_k=5, score_threshold=0.0):
        """Retrieve relevant documents for a query"""
        # Convert query to embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])

        # Query the vector store
        results = self.vector_store.collection.query(
            query_embeddings=query_embedding.tolist(),
            n_results=top_k
        )

        # Process results
        retrieved_docs = []
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        )):
            similarity_score = 1 - distance  # Convert cosine distance to similarity
            if similarity_score >= score_threshold:
                retrieved_docs.append({
                    'content': doc,
                    'metadata': metadata,
                    'similarity_score': similarity_score,
                    'id': results['ids'][0][i]
                })
        return retrieved_docs

# Initialize retriever
rag_retriever = RAGRetriever(vector_store, embedding_manager)
Testing the Retrieval System
# Test query
query = "What is attention mechanism in transformers?"
results = rag_retriever.retrieve(query, top_k=3)
for i, result in enumerate(results):
    print(f"Result {i+1} (Score: {result['similarity_score']:.3f}):")
    print(f"Content: {result['content'][:200]}...")
    print("---")
Key Insights and Best Practices
Document Structure is Fundamental: Proper metadata management enables powerful filtering capabilities
Chunking Strategy Matters:
Smaller chunks for precise retrieval
Overlap maintains context continuity
Size should match your embedding model's capabilities
Embedding Model Selection:
all-MiniLM-L6-v2 provides 384-dimensional embeddings with good performance
Consider larger models for complex semantic tasks (see the sketch after this list)
Vector Database Choices:
ChromaDB is excellent for prototyping and small-scale production
Consider Pinecone or Weaviate for scale
Similarity Thresholds: Implement score filtering to ensure only relevant context is retrieved
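As a concrete illustration of the last two points, both are one-line changes (the model name and threshold below are illustrative, not tuned recommendations):
# Swap in a larger sentence-transformers model for richer semantics
larger_embedding_manager = EmbeddingManager(model_name="all-mpnet-base-v2")
# Drop weakly related chunks by raising the similarity threshold
filtered_results = rag_retriever.retrieve(
    "What is attention mechanism in transformers?",
    top_k=5,
    score_threshold=0.3,  # illustrative cut-off; tune against your own data
)
print(f"{len(filtered_results)} chunks passed the threshold")
Note that switching embedding models means re-embedding and re-indexing your chunks, since vectors produced by different models are not comparable.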
Next Steps and Assignment
Your Assignment:
Extend this pipeline to handle Excel/CSV files
Implement support for web content scraping
Experiment with different chunking strategies
Try alternative embedding models from Hugging Face
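As a starting point for the first two items, langchain_community already ships loaders for both; the file path and URL below are placeholders, and WebBaseLoader also needs the beautifulsoup4 package installed:
from langchain_community.document_loaders import CSVLoader, WebBaseLoader
# CSV: each row becomes its own Document
csv_docs = CSVLoader("data/csv_files/sample.csv").load()
# Web: fetches the page and extracts its text
web_docs = WebBaseLoader("https://example.com/article").load()
# From here the pipeline is unchanged: chunk, embed, and store
extra_chunks = text_splitter.split_documents(csv_docs + web_docs)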
What's Coming Next:
In the next instalment, we'll complete the RAG pipeline by:
Integrating LLMs for response generation
Creating a seamless query-response interface
Implementing response evaluation metrics
Packaging everything into a production-ready application
Conclusion
Building a RAG system from scratch provides deep insights into how retrieval-augmented generation actually works under the hood. By understanding each component—from document processing to vector storage and retrieval—you gain the ability to customize and optimize each stage for your specific use case.
The modular approach shown here allows for easy experimentation and scaling. Whether you're building a research paper assistant, customer support bot, or internal knowledge management system, these foundations will serve you well.
Remember: The quality of your RAG system depends on the quality of your data processing. Invest time in proper document structure, thoughtful chunking, and appropriate metadata management—it will pay dividends in retrieval accuracy and user experience.