RAG: A Beginner's Guide to Retrieval-Augmented Generation


Have you ever asked a chatbot a question, only to get an outdated or made-up answer? Large Language Models (LLMs) like ChatGPT, Gemini, and Claude are very smart, but sometimes they're not up to date.
Retrieval-Augmented Generation (RAG) solves this problem by combining LLMs with real-time data retrieval, helping AI give more accurate and current responses.
What is RAG?
RAG is a powerful technique that improves the capabilities of LLMs by fetching relevant information from an external knowledge source (like a database or the web) before generating a response. This integration allows the model to provide answers grounded in real-time data, enhancing the reliability and accuracy of its outputs.
Think of it like a student (the LLM) who doesn’t just rely on memory but also looks up facts in a textbook (the external database) before answering a question.
How Does RAG Work?
RAG follows a simple 3-step process:
Retrieval: Find Relevant Information. When a user asks a question, RAG searches a knowledge base (such as a vector database) for relevant documents. It uses semantic search to find the best content, not just keyword matching.
Augmentation: Add the Retrieved Data to the Question. The retrieved documents are combined with the user's original question, giving the LLM more context.
Generation: Create a Better Answer. The LLM then generates a response using both its pre-trained knowledge and the new data, leading to more accurate, fact-based answers.
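Putting the three steps together, here is a minimal, purely illustrative sketch of the flow in plain Python. The hard-coded knowledge base, the retrieve function, and the final string are placeholders, not real library calls; only the overall retrieve-augment-generate shape is the point.

knowledge_base = [
    "RAG combines retrieval with generation.",
    "Vector databases store embeddings for fast similarity search.",
]

def retrieve(question: str, top_k: int = 1) -> list[str]:
    # Placeholder retrieval: a real system ranks chunks by semantic similarity.
    return knowledge_base[:top_k]

def answer_with_rag(question: str) -> str:
    # 1. Retrieval: look up relevant chunks for the question
    context_chunks = retrieve(question)
    # 2. Augmentation: merge the retrieved text with the original question
    prompt = "Context:\n" + "\n".join(context_chunks) + f"\n\nQuestion: {question}"
    # 3. Generation: a real system would send this prompt to an LLM
    return f"(LLM would answer using this prompt)\n{prompt}"

print(answer_with_rag("What does RAG do?"))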
Working of RAG
Document: The process starts with a document or a group of documents that hold the information the system will use.
Chunking Process: These documents are divided into smaller parts called chunks to make them easier to manage and search through efficiently.
Embedding Model: Each chunk is passed through an embedding model, which converts the text into high-dimensional vectors that represent semantic meaning.
Vector Store (Database): The embeddings are stored in a vector database (like Pinecone, Chroma, Weaviate, Qdrant), which allows fast and accurate similarity-based retrieval.
User Prompt: A user submits a query or prompt. This input is also embedded using the same embedding model.
Retriever: The system compares the embedded user query with the stored vectors and retrieves the most relevant chunks from the database (a small similarity sketch follows these steps).
LLM (Large Language Model): The retrieved chunks are sent to the LLM along with the original prompt. This is the prompt augmentation phase, where external knowledge is combined with the question.
Response Output: The LLM generates an answer using both the query and the retrieved context. This output is then presented to the user.
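To make the retrieval step concrete, here is a toy sketch of the comparison a retriever performs: cosine similarity between a query vector and stored chunk vectors. The numbers and chunk names below are made up; in practice the vectors come from the embedding model and have hundreds or thousands of dimensions.

import numpy as np

# Toy vectors standing in for real embeddings
query_vec = np.array([0.1, 0.7, 0.2])
chunk_vecs = {
    "chunk about refund policy": np.array([0.2, 0.6, 0.2]),
    "chunk about shipping times": np.array([0.9, 0.1, 0.0]),
}

def cosine_similarity(a, b):
    # Higher value = more semantically similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank chunks by similarity to the query; the top results become the
# context that gets passed to the LLM.
ranked = sorted(
    chunk_vecs.items(),
    key=lambda item: cosine_similarity(query_vec, item[1]),
    reverse=True,
)
print(ranked[0][0])  # most relevant chunk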
Here is a simple RAG pipeline built with LangChain:
import os
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load the source document
pdf_path = Path(__file__).parent / "file_name.pdf"
loader = PyPDFLoader(file_path=str(pdf_path))
docs = loader.load()

# 2. Chunk the document into smaller, overlapping pieces
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
split_docs = text_splitter.split_documents(documents=docs)

# 3. Embed the chunks and store them in a Qdrant collection
embedder = OpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key=os.getenv("OPENAI_API_KEY"),
)
vector_store = QdrantVectorStore.from_documents(
    documents=split_docs,
    url="http://localhost:6333",
    collection_name="collection_name",
    embedding=embedder,
)
print("Ingestion done")

# 4. Retrieve the chunks most similar to the user's query
retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="collection_name",
    embedding=embedder,
)
relevant_chunks = retriever.similarity_search(
    query="user_query",  # placeholder for the actual user question
)

# 5. Augment the prompt with the retrieved context
context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)
SYSTEM_PROMPT = f"""
You are a helpful assistant that responds based on the given context.
Context:
{context}
"""
Conclusion
RAG is a breakthrough for AI applications, making LLMs smarter by combining their reasoning with real-world data. Whether you're creating a chatbot, research tool, or business assistant, RAG helps provide accurate, current, and reliable answers.
Next Steps
Experiment with LangChain (a popular RAG framework).
Try different embedding models (OpenAI, Hugging Face).
Explore vector databases (Pinecone, Qdrant, Weaviate, FAISS).