Building a Real-Time PDF QA System with RAG, LangChain, Pinecone, and AWS


In today’s information-heavy world, professionals like lawyers, researchers, and analysts often deal with hundreds of pages of PDFs. Finding precise answers manually is time-consuming, and asking a standard LLM to summarize or answer can result in generic or inaccurate responses.

To solve this, I built a Retrieval-Augmented Generation (RAG) app that allows users to upload PDFs and get context-aware answers in real-time. This post dives deep into the architecture, workflow, libraries, and cloud deployment of the project.


Problem Statement

Legal and technical documents are often massive. Traditionally:

  • Lawyers or researchers need to manually search PDFs or eBooks for specific clauses or information.

  • LLMs can generate answers but often lack context, referencing irrelevant or incorrect information.

Goal: Create a system where users can:

  1. Upload PDFs of any size.

  2. Ask natural language queries.

  3. Receive accurate, context-aware answers, retrieved directly from the document.


Tech Stack

  • Flask: Web framework for handling PDF uploads, API endpoints, and user queries

  • LangChain: Document parsing, chunking, embedding integration, and LLM orchestration

  • OpenAI Embeddings (text-embedding-3-small): Convert text chunks into 1536-dimensional vectors for semantic search

  • ChatOpenAI: Generate context-aware answers from retrieved chunks

  • Pinecone (serverless): Vector database for fast, scalable semantic search

  • AWS Elastic Beanstalk: Cloud deployment with WSGI, scalable production environment

  • GitHub CI/CD: Automated deployment pipeline for seamless updates

System Architecture

  1. PDF Upload & Processing

    • Users upload PDFs via a Flask web interface.

    • PDFs are parsed using PyPDFLoader and split into chunks with RecursiveCharacterTextSplitter.

    • Chunking ensures large documents are broken into manageable pieces for embeddings and retrieval.

  2. Embedding & Vector Storage

    • Each chunk is converted into a 1536-dimensional embedding using OpenAI’s text-embedding-3-small.

    • Vectors are stored in Pinecone (serverless) for fast semantic search, allowing real-time retrieval.

    • Unlike a local vector store such as Chroma, Pinecone provides scalable, persistent storage (see the ingestion sketch after this list).

  3. Query & Retrieval

    • Users submit natural language queries via the Flask API.

    • The system uses RetrievalQA from LangChain to fetch the most relevant chunks from Pinecone.

    • ChatOpenAI synthesizes the retrieved information into a precise, context-aware answer, not just a generic LLM output (see the query sketch after this list).

  4. Deployment on AWS

    • The app is deployed on AWS Elastic Beanstalk using WSGI for production-ready hosting.

    • Environment variables securely store OpenAI and Pinecone API keys.

    • CI/CD via GitHub Actions allows automatic updates when code changes.
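
To make the ingestion flow from steps 1 and 2 concrete, here is a minimal sketch using the current LangChain and Pinecone integration packages. The index name, file path, and chunk sizes are illustrative assumptions, and import paths may differ slightly between LangChain versions.

```python
# Ingestion sketch: parse a PDF, chunk it, embed the chunks, and upsert them to Pinecone.
# Assumes OPENAI_API_KEY and PINECONE_API_KEY are set; "pdf-qa" is an illustrative index name.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

def ingest_pdf(path: str, index_name: str = "pdf-qa") -> PineconeVectorStore:
    # Parse the PDF into per-page Documents
    docs = PyPDFLoader(path).load()

    # Split into overlapping chunks so each piece stays small enough to embed and retrieve well
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)

    # Embed each chunk as a 1536-dimensional vector and upsert into the Pinecone index
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    return PineconeVectorStore.from_documents(chunks, embedding=embeddings, index_name=index_name)
```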
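
For step 3, a similar sketch wires RetrievalQA to the Pinecone-backed retriever. The chat model name and k value are assumptions, not necessarily the project's settings.

```python
# Query sketch: fetch the most relevant chunks from Pinecone and synthesize an answer.
# "gpt-4o-mini" and k=4 are illustrative choices.
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore(index_name="pdf-qa", embedding=embeddings)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    chain_type="stuff",  # concatenate the retrieved chunks into a single prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

result = qa_chain.invoke({"query": "What are the termination clauses in this contract?"})
print(result["result"])
```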


Example Workflow

  1. Upload a 200-page legal PDF.

  2. Ask: “What are the termination clauses in this contract?”

  3. The system:

    • Splits the document into chunks.

    • Embeds them and stores the vectors in Pinecone.

    • Retrieves the top 4–5 relevant chunks.

    • Synthesizes an accurate, context-aware answer using ChatOpenAI.
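
A minimal Flask sketch of how this workflow can be exposed as endpoints. The /upload and /ask routes are hypothetical names, and ingest_pdf and qa_chain refer to the sketches in the architecture section rather than the project's exact code.

```python
# Flask sketch tying the workflow together; route names are illustrative.
import os
from flask import Flask, request, jsonify

from rag_pipeline import ingest_pdf, qa_chain  # hypothetical module holding the sketches above

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload_pdf():
    pdf = request.files["file"]
    path = os.path.join("/tmp", pdf.filename)
    pdf.save(path)
    ingest_pdf(path)  # chunk, embed, and upsert into Pinecone
    return jsonify({"status": "indexed", "file": pdf.filename})

@app.route("/ask", methods=["POST"])
def ask():
    question = request.json["question"]
    result = qa_chain.invoke({"query": question})
    return jsonify({"answer": result["result"]})
```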


Challenges & Solutions

  • Vector dimension mismatch:
    Initially, the Pinecone index was created with 512 dimensions while the embeddings were 1536-dimensional. Solution: recreate the Pinecone index with 1536 dimensions to match the OpenAI embeddings (see the snippet after this list).

  • Real-time updates:
    Local vector stores (like Chroma) couldn’t handle dynamic uploads in production. Solution: switch to Pinecone serverless, allowing immediate ingestion and retrieval of new PDFs.

  • Deployment errors (502 on EB):
    Resolved by configuring WSGI correctly, setting the app entry as app:app, and ensuring all environment variables were loaded.
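
For reference, recreating a serverless index with the correct dimension looks roughly like this with the current Pinecone Python client; the index name, cloud, and region are assumptions.

```python
# Recreate the index so its dimension matches text-embedding-3-small (1536).
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

if "pdf-qa" in pc.list_indexes().names():
    pc.delete_index("pdf-qa")  # drop the old 512-dim index

pc.create_index(
    name="pdf-qa",
    dimension=1536,  # must match the embedding model's output size
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```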


Future Enhancements

  • Return multiple relevant answers (5–6) instead of a single one for broader context.

  • Batch PDF uploads for faster ingestion.

  • Advanced ranking of retrieved answers based on relevance scores.

  • Optionally, integrate feedback loops to improve LLM responses over time.


Key Takeaways

  • Even a small RAG project demonstrates the power of combining LLMs, vector databases, and cloud infrastructure.

  • LangChain + Pinecone + OpenAI is a robust stack for building real-time knowledge retrieval systems.

  • Deploying on AWS Elastic Beanstalk makes the system scalable, secure, and production-ready.

