The Unofficial Guide to Not Screwing Up Your RAG Ingestion

Subroto Kumar

Let’s talk about the real RAG problem.

No, not your vector DB. Not your fancy embeddings.
I’m talking about your ingestion pipeline — and yeah, it’s probably a hot mess.

You built a cool chatbot. It answers stuff. You feel proud.
Then someone uploads a 300-page scanned PDF...
Your Lambda chokes. Your auth service starts pretending it’s S3.
Your DB gets spammed with heartbeat updates like it thinks it's a pub-sub system.
Suddenly, your “smart assistant” is just... an assistant.

Here’s what I keep seeing in the wild:

  • Your Auth or main service is moonlighting as a file server like it’s 2009

  • Your Lambda is trying to OCR a 300-page scanned PDF — and quietly weeping in its 128MB coffin

  • You’ve got a status tracker spamming the DB every 2 seconds like it’s farming XP

  • And then you wonder why stuff breaks, times out, or goes missing in action

Let’s be real for a second.

Retrieval-Augmented Generation (RAG) is the secret sauce behind actually-useful AI assistants.
It combines two things:

  • a powerful language model

  • and your private knowledge base (PDFs, docs, chats, wikis — all the fun stuff)

But the real chaos?
It doesn’t come from the retrieval or the generation.
It comes from the ingestion — getting your raw files into a clean, chunked, embedded, indexed, and searchable state. And the truth is, most people are building this completely wrong.

Your ingestion pipeline is not “just glue code.” It’s the backbone of your RAG system. And if that backbone is spaghetti — guess what? Everything else collapses with it.

So I wrote this guide:
“How to Build a RAG Ingestion Pipeline — The Right Way”

Let’s fix that.


Common Mistakes: What People Do Wrong (and Why It Hurts)


1. Auth/Main Service Handles File Uploads

What Happens:
Client sends a file to the auth service. The auth service uploads it to storage and kicks off processing.

Why It's Bad:

  • Tight coupling of auth and upload = bigger blast radius

  • Scaling uploads means scaling the auth service (bad idea 😬)

  • Security issues: Auth services should never touch unvalidated file content

  • Increases attack surface + security debt

Do This Instead:

  • Implement a dedicated Media Ingestion Service with strict validation

  • Use pre-signed URLs to let the client upload directly to S3/Blob

  • Validate metadata and access control before issuing the URL

  • Scan files for malware, viruses, and PII leaks

  • Add rate limits and enforce tenant isolation
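
The gate above can be sketched in a few lines: validate metadata and enforce tenant isolation before ever handing out a URL. This is a simplified assumption-heavy sketch — `presign` is a hypothetical callable wrapping whatever SDK you use (e.g. S3's `generate_presigned_url`), injected so the gate itself stays storage-agnostic:

```python
import uuid

ALLOWED_TYPES = {"application/pdf", "text/plain"}
MAX_BYTES = 200 * 1024 * 1024  # 200 MB cap; tune per tenant tier

def issue_upload_url(tenant_id, filename, content_type, size_bytes, presign):
    """Validate the request *before* issuing a direct-to-storage URL.

    `presign` is a hypothetical stand-in for your cloud SDK's
    presigned-URL call.
    """
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
    if size_bytes > MAX_BYTES:
        raise ValueError("file exceeds upload limit")
    # Namespacing the key by tenant means one tenant can never
    # overwrite another tenant's objects.
    key = f"{tenant_id}/uploads/{uuid.uuid4()}/{filename}"
    return presign(key=key, content_type=content_type, expires_in=300)
```

Malware/PII scanning would hang off the upload-complete event, not this gate — the point here is that nothing unvalidated ever reaches storage under someone else's namespace.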


2. Lambda Does All the Processing

Using short-lived Lambda functions for the entire processing chain leads to timeouts and processing failures.

What Happens:
Your Lambda function:

  • Downloads file from S3

  • Extracts content

  • Generates embeddings

  • Writes to vector DB

  • Updates DB with status

Why It's Bad:

  • Timeout risks (Lambda has hard limits)

  • Memory constraints with large files or GPU embedding

  • Cold starts = latency

  • You’re forcing a stateless tool to do stateful, long-running tasks

  • Database connection in Lambda:

DB connections need careful handling because Lambda functions are ephemeral and highly concurrent: every cold start can open a fresh connection, and parallel invocations can exhaust the pool fast.

Do This Instead:

  • Lambda should only trigger events — push jobs to SQS / Kafka / PubSub

  • Offload heavy lifting to containerized workers (e.g., ECS, Azure Container Apps)

  • Scale processing independently from event handling
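
A minimal sketch of the “Lambda only triggers” idea, assuming the standard S3 event notification shape. `enqueue` is a hypothetical stand-in for your queue client's send call (e.g. SQS `send_message`), injected so the handler stays testable:

```python
import json

def handler(event, context, enqueue):
    """Thin event handler: parse the S3 notification, enqueue a job,
    return fast. No downloading, no OCR, no embedding here."""
    enqueued = 0
    for record in event.get("Records", []):
        job = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
            "size": record["s3"]["object"].get("size", 0),
        }
        enqueue(json.dumps(job))  # the heavy lifting happens in the workers
        enqueued += 1
    return {"enqueued": enqueued}
```

In a real Lambda you'd create the queue client once at module scope so warm invocations reuse it — which also sidesteps the connection-churn problem described above.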


3. Using Your Database as a Message Bus

What Happens:
Your ingestion writes job status to a SQL/NoSQL DB and also uses it to notify consumers of state changes.

Why It's Bad:

  • Databases aren't message buses

  • You get polling, race conditions, or worse — broken job chains

  • No retry/delivery semantics

Do This Instead:

  • Use Redis or PubSub for real-time status updates

  • Let workers emit events for each stage (extract, chunk, embed, store)

  • Persist final states to DB, but stream progress via events
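
Sketching that pattern: progress streams through `publish` (think Redis `PUBLISH`), and only the terminal state hits the database via `persist`. Both callables are hypothetical stand-ins for your actual clients:

```python
import json

def run_pipeline(job_id, stages, publish, persist):
    """Run named pipeline stages, emitting an event per transition.
    One DB write at the end instead of a heartbeat every 2 seconds."""
    for name, fn in stages:
        publish(json.dumps({"job": job_id, "stage": name, "status": "started"}))
        try:
            fn()
        except Exception:
            publish(json.dumps({"job": job_id, "stage": name, "status": "failed"}))
            persist(job_id, "failed")
            raise
        publish(json.dumps({"job": job_id, "stage": name, "status": "done"}))
    persist(job_id, "completed")
```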


4. Chunking Without Thinking

What Happens:
Developers naively split documents into fixed 512-token chunks and call it a day. That kind of blind token-based chunking destroys document semantics and reduces retrieval quality.

Why It's Bad:

  • Context can get lost mid-sentence

  • Embeddings become noisy

  • Retrieval quality tanks

Do This Instead:

  • Implement multi-level chunking strategy:

    • Structural chunking (sections, pages)

    • Semantic chunking (topics, concepts)

    • Hierarchical embedding (document → section → paragraph)

  • Preserve metadata relationships between chunks

  • Store original document structure for reconstruction

  • Implement embedding fusion techniques for improved retrieval
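
A rough sketch of the structural-first idea with overlap. `sections` is a simplified stand-in for whatever your extractor emits (here: `(title, text)` pairs); real semantic chunking goes further, but even this beats blind 512-token splits because chunks carry their parent section as metadata:

```python
def chunk_document(sections, max_words=120, overlap=20):
    """Split on structural boundaries first, then window within each
    section with overlap, preserving metadata for reconstruction."""
    chunks = []
    for sec_idx, (title, text) in enumerate(sections):
        words = text.split()
        start = 0
        while start < len(words):
            window = words[start:start + max_words]
            chunks.append({
                "text": " ".join(window),
                "section": title,
                "section_index": sec_idx,
                "chunk_index": len(chunks),
            })
            if start + max_words >= len(words):
                break
            start += max_words - overlap  # overlap keeps cross-boundary context
    return chunks
```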


5. Direct Embedding Calls for Every Chunk

What Happens:
Each chunk gets sent for embedding via an API call. One by one. Sequential embedding API calls waste time and money.

Why It's Bad:

  • You’ll hit API rate limits fast

  • You’re burning money

  • Latency becomes linear with chunk count

Do This Instead:

  • Use batch embedding APIs (OpenAI, Cohere support this)

  • Deploy self-hosted embedding models on GPU clusters

  • Use async task queues to parallelize work

  • Implement adaptive batching based on document size

  • Use embedding caching for duplicate content

  • Pipeline parallelism to overlap extraction and embedding

  • Configure auto-scaling based on queue depth
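
The batching-plus-caching combo, as a sketch: `embed_batch(texts) -> list_of_vectors` stands in for a batch embedding endpoint, and `cache` maps content hashes to vectors so duplicate chunks are embedded exactly once:

```python
import hashlib

def embed_all(chunks, embed_batch, cache=None, batch_size=64):
    """Embed text chunks in batches, skipping anything already cached."""
    cache = cache if cache is not None else {}
    keys = [hashlib.sha256(t.encode()).hexdigest() for t in chunks]
    # Collect unique uncached texts so duplicates cost one call, not two.
    missing = {}
    for text, key in zip(chunks, keys):
        if key not in cache and key not in missing:
            missing[key] = text
    items = list(missing.items())
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        for (key, _), vector in zip(batch, embed_batch([t for _, t in batch])):
            cache[key] = vector
    return [cache[k] for k in keys]
```

Swap the dict for Redis and you get the embedding cache across jobs, not just within one document.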


How to Do It Right: The Ideal RAG Ingestion Flow

[Client]
   |
   |-- (Auth Service issues presigned URL)
   |
[Storage: S3 / Blob]
   |
   |-- (Upload complete → Event Triggered)
   |
[Lambda] → [Queue: SQS/Kafka]
   |
[Worker Cluster: ECS/K8s/Container Apps]
   |
   |-- Extract Text (PDF, DOCX, etc.)
   |-- Chunk + Clean
   |-- Embed (Batch or Self-Hosted)
   |-- Save to Vector DB (Pinecone, Qdrant, Weaviate, etc.)
   |
[Emit Status Events]
   |
[Redis / WebSocket → Client Dashboard]
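
One piece the diagram implies but doesn't show: how workers pull from the queue, and what happens when a job keeps failing. A toy sketch, using plain lists as stand-ins for SQS/Kafka and a dead-letter queue:

```python
def consume(queue, process, dlq, max_attempts=3):
    """Drain the queue; retry failed jobs a bounded number of times,
    then dead-letter them so one poisoned document can't block the rest."""
    while queue:
        job = queue.pop(0)
        try:
            process(job)
        except Exception:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] >= max_attempts:
                dlq.append(job)    # park for manual inspection
            else:
                queue.append(job)  # requeue for another try
```

With real SQS you'd get most of this for free via a redrive policy; the sketch just makes the mechanics visible.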

Bonus: Real Optimizations That Matter

  • Deduplication:
    Avoid reprocessing identical documents or versions.

    • Fingerprint files using hash + metadata

    • Store ingestion history for diff-based reprocessing

  • Rate-limiting smartly:
    Don’t let a noisy tenant starve the queue.

    • Use per-tenant queues or priorities

    • Enforce org/user-level quotas at the ingestion gateway

  • Compression:
    Using API-based embedding services?

    • Compress large text chunks (e.g., whitespace stripping, minification)

    • Shrink tokens → lower cost, faster requests

  • Retry strategies:
    Not all retries are created equal.

    • Stage-aware retry logic (extraction ≠ embedding ≠ indexing)

    • Use DLQs (dead-letter queues) to catch poisoned jobs

  • Observability:
    You can’t improve what you can’t see.

    • Track per-job latency, chunk counts, and throughput

    • Instrument pipeline stages with spans + metrics

    • Surface bottlenecks early
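
The deduplication bullet above is cheap to implement. A sketch using a content hash plus canonicalized metadata as the fingerprint (the `history` set stands in for whatever ingestion-history store you keep):

```python
import hashlib
import json

def fingerprint(content: bytes, metadata: dict) -> str:
    """Stable document ID from file bytes plus canonicalized metadata."""
    digest = hashlib.sha256(content)
    digest.update(json.dumps(metadata, sort_keys=True).encode())
    return digest.hexdigest()

def should_ingest(content: bytes, metadata: dict, history: set) -> bool:
    """Skip documents we've already processed; record new ones."""
    fp = fingerprint(content, metadata)
    if fp in history:
        return False
    history.add(fp)
    return True
```

Including metadata in the hash means the same bytes uploaded by two tenants are (correctly) treated as two documents.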


Implementation Approach

  1. Start with Backbone: Build the core data flow with minimal features

  2. Add Resilience: Implement retry and recovery mechanisms

  3. Enhance Observability: Ensure visibility into every stage

  4. Optimize Performance: Identify and address bottlenecks

  5. Add Advanced Features: Layer in security, compliance, and analytics


Conclusion: Build It Like You Mean It

Building RAG ingestion at scale isn’t about hacks — it’s about separation of concerns, event-driven design, and treating your system like a real product.

A clean ingestion pipeline is the foundation for accurate, scalable RAG systems.

Build it right the first time, and you’ll spend your time scaling — not debugging.
