Building a Real-Time PDF QA System with RAG, LangChain, Pinecone, and AWS


In today’s information-heavy world, professionals like lawyers, researchers, and analysts often deal with hundreds of pages of PDFs. Finding precise answers manually is time-consuming, and asking a standard LLM to summarize or answer can result in generic or inaccurate responses.
To solve this, I built a Retrieval-Augmented Generation (RAG) app that lets users upload PDFs and get context-aware answers in real time. This post dives deep into the architecture, workflow, libraries, and cloud deployment of the project.
Problem Statement
Legal and technical documents are often massive. Traditionally:
Lawyers or researchers need to manually search PDFs or eBooks for specific clauses or information.
LLMs can generate answers but often lack context, referencing irrelevant or incorrect information.
Goal: Create a system where users can:
Upload PDFs of any size.
Ask natural language queries.
Receive accurate, context-aware answers, retrieved directly from the document.
Tech Stack
| Component | Purpose |
| --- | --- |
| Flask | Web framework for handling PDF uploads, API endpoints, and user queries |
| LangChain | Document parsing, chunking, embedding integration, and LLM orchestration |
| OpenAI Embeddings (`text-embedding-3-small`) | Converts text chunks into 1536-dimensional vectors for semantic search |
| ChatOpenAI | Generates context-aware answers from retrieved chunks |
| Pinecone (serverless) | Vector database for fast, scalable semantic search |
| AWS Elastic Beanstalk | Cloud deployment with WSGI in a scalable production environment |
| GitHub CI/CD | Automated deployment pipeline for seamless updates |
System Architecture
PDF Upload & Processing
Users upload PDFs via a Flask web interface.
PDFs are parsed using `PyPDFLoader` and split into chunks with `RecursiveCharacterTextSplitter`.
Chunking ensures large documents are broken into manageable pieces for embedding and retrieval.
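Below is a minimal sketch of this step, assuming LangChain's community loader and text-splitter packages; the chunk size and overlap are illustrative, not the project's exact values:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def load_and_chunk(pdf_path: str):
    # Parse the PDF into one Document per page.
    pages = PyPDFLoader(pdf_path).load()
    # Split pages into overlapping chunks so each piece stays within the
    # embedding model's limits while keeping some surrounding context.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    return splitter.split_documents(pages)
```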
Embedding & Vector Storage
Each chunk is converted into a 1536-dimensional embedding using OpenAI's `text-embedding-3-small`.
Vectors are stored in Pinecone (serverless) for fast semantic search, allowing real-time retrieval.
Pinecone provides scalable, persistent storage, unlike local vector stores such as Chroma.
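A rough sketch of the indexing step, assuming the `langchain-openai` and `langchain-pinecone` integration packages; the index name is a placeholder:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

def index_chunks(chunks, index_name: str = "pdf-qa-index"):
    # text-embedding-3-small produces 1536-dimensional vectors, so the Pinecone
    # index must be created with dimension=1536.
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    # Embeds each chunk and upserts the vectors into the serverless index.
    return PineconeVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        index_name=index_name,
    )
```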
Query & Retrieval
Users submit natural language queries via Flask API.
The system uses `RetrievalQA` from LangChain to fetch the most relevant chunks from Pinecone.
`ChatOpenAI` then synthesizes the retrieved information into a precise, context-aware answer, not just a generic LLM output.
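The query path can be sketched roughly as follows; the chat model name, chain type, and `k` value are assumptions rather than the project's exact configuration:

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

def build_qa_chain(vector_store, k: int = 5):
    # Model name and k are illustrative defaults.
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    return RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # stuff the retrieved chunks into a single prompt
        retriever=vector_store.as_retriever(search_kwargs={"k": k}),
    )

# Tying the sketches together:
# qa_chain = build_qa_chain(index_chunks(load_and_chunk("contract.pdf")))
# answer = qa_chain.invoke({"query": "What are the termination clauses?"})["result"]
```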
Deployment on AWS
The app is deployed on AWS Elastic Beanstalk using WSGI for production-ready hosting.
Environment variables securely store OpenAI and Pinecone API keys.
CI/CD via GitHub Actions allows automatic updates when code changes.
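A stripped-down sketch of what the production entry point might look like, assuming the Flask object is named `app` so Elastic Beanstalk's WSGI path can point at `app:app`; the `/ask` route and its stubbed body are purely illustrative, and the keys come from environment variables set in the EB console:

```python
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

# API keys are supplied as EB environment properties, never hard-coded.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY")

@app.route("/ask", methods=["POST"])
def ask():
    question = request.json.get("question", "")
    # In the real app this would invoke the RetrievalQA chain shown above;
    # a stub response keeps the sketch self-contained.
    return jsonify({"answer": f"received: {question}"})

if __name__ == "__main__":
    app.run(debug=True)
```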
Example Workflow
Upload a 200-page legal PDF.
Ask: “What are the termination clauses in this contract?”
The system:
Splits the document into chunks.
Embeds the chunks and stores the vectors in Pinecone.
Retrieves the top 4–5 relevant chunks.
Synthesizes an accurate, context-aware answer using ChatOpenAI.
Challenges & Solutions
Vector dimension mismatch: the Pinecone index was initially created with 512 dimensions while the embeddings were 1536-dimensional. Solution: recreate the index with 1536 dimensions to match the OpenAI embeddings (see the sketch after this list).
Real-time updates: local vector stores (like Chroma) couldn't handle dynamic uploads in production. Solution: switch to Pinecone serverless, allowing immediate ingestion and retrieval of new PDFs.
Deployment errors (502 on Elastic Beanstalk): resolved by configuring WSGI correctly, setting the app entry point to `app:app`, and ensuring all environment variables were loaded.
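Recreating the index with the right dimension looks roughly like this with the Pinecone client; the index name, cloud, and region are illustrative:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

pc.create_index(
    name="pdf-qa-index",          # placeholder index name
    dimension=1536,               # must match text-embedding-3-small
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```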
Future Enhancements
Return multiple relevant answers (5–6) instead of a single one for broader context.
Batch PDF uploads for faster ingestion.
Advanced ranking of retrieved answers based on relevance scores.
Optionally, integrate feedback loops to improve LLM responses over time.
Key Takeaways
Even a small RAG project demonstrates the power of combining LLMs, vector databases, and cloud infrastructure.
LangChain + Pinecone + OpenAI is a robust stack for building real-time knowledge retrieval systems.
Deploying on AWS Elastic Beanstalk makes the system scalable, secure, and production-ready.