RAG (Retrieval Augmented Generation) Basics

Ashutosh Gupta
8 min read

This blog starts with the basics of RAG and is the first in a series that also covers fine-tuning techniques.

RAG (Retrieval Augmented Generation)

Let’s first try to understand what RAG is and why it is required. LLMs (Large Language Models) like GPT and Gemini are trained on data that is available on the internet, i.e. data that is generally in the public domain. Because of this, they are good at providing general answers. Now consider an organization: its data is specific to that company or to the domain in which it operates. Since the data is specific to the organization, chances are the LLM does not have access to it and was not trained on it. Without knowledge of your organization’s data, the LLM may provide a general answer to a query, but it may not be specific or relevant to the organization. That is where RAG (Retrieval Augmented Generation) comes into the picture. Context from the organization-specific data is sent along with the query, so that the response is relevant and specific to that domain and organization.

This document explains the concepts and walks through the code side by side, using LangChain in Python with OpenAI. As an example, we will take a PDF document and use it with RAG to query details from it. I have used a PDF of an article I wrote on Architecture Assessment Frameworks.

This series comes with code that can be found on GitHub at https://github.com/ashutoshmca/RAG/

Prerequisite

  • Install the packages for OpenAI, LangChain, Qdrant (a vector store) and PyPDF (to parse PDF files):

      pip install langchain-openai langchain-community langchain-qdrant langchain-text-splitters qdrant-client pypdf openai python-dotenv
    
  • Create a .env file and place your OpenAI API key in it:

      OPENAI_API_KEY="Your key"
    
  • Install the vector store. We will be using the Qdrant vector store on Docker. To install it, create a docker-compose.yml file. Please note that if you are running Qdrant on Docker, you will also need Docker Desktop (or another Docker runtime) in your local environment.

      services:
        qdrant:
          image: qdrant/qdrant
          ports:
            - 6333:6333
    
  • Run the following command to start the Qdrant container:

       docker compose -f .\docker-compose.yml up
    

You can verify that Qdrant is deployed successfully by opening http://localhost:6333/dashboard in your browser.
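Alternatively, you can check programmatically with the qdrant-client package installed earlier (a minimal sketch, assuming Qdrant is running on the default port 6333; an empty list of collections still confirms the service is reachable):

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # prints the (initially empty) list of collections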

Search with RAG

With RAG, when the user’s query is sent, the most relevant context from the company’s data for that query is also sent to the LLM along with the query. This gives the LLM the context it needs to answer the user’s query. Company data may be stored in different formats (structured, like databases, and unstructured, like PDF, Docx etc.) across multiple systems and databases. It can be very costly to retrieve the context across all those systems and databases on the fly for every user query. Not only would it be expensive and time consuming, the retrieved context may not even be relevant to the query. RAG helps in solving this problem.

A typical RAG application consists of two components:

  • Indexing: a pipeline that ingests data from the source and indexes it.

  • Retrieval and Generation: the RAG chain, which takes the user query at run time, retrieves the relevant data from the index, passes it to the LLM and generates the response.

Indexing

Indexing is a pipeline that runs before query retrieval and response generation. It consists of:

  • Loading data

  • Chunking/Splitting

  • Generate Embedding

  • Store Vector Embedding

We will deep dive into each step one by one.

Loading data

The data is loaded from a data source for further processing. LangChain provides different document loaders that can be used to load documents from different sources (e.g. a directory, web pages, Google Drive etc.) in different formats (e.g. PDF, CSV, JSON, Markdown etc.).

E.g.

In this example, we will use the PyPDFLoader document loader, which integrates with the open-source pypdf Python library for reading and parsing PDF files. If you want to load a directory instead of a single PDF file, you can use PyPDFDirectoryLoader instead. For a list of other document loaders, please refer to the LangChain Document Loaders documentation.

Import PyPDFLoader and load the document with it:

from pathlib import Path
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

from dotenv import load_dotenv

load_dotenv()

file_path = Path().resolve() / "<yourfile>.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()
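Optionally, you can do a quick sanity check on what was loaded. PyPDFLoader returns one Document per page, each with page_content and metadata:

print(f"Loaded {len(docs)} pages")
print(docs[0].metadata)            # e.g. source file and page number
print(docs[0].page_content[:200])  # first 200 characters of the first page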

Chunking/Splitting

It is necessary to split large documents into chunks for multiple reasons:

  1. Context window limitation of the LLM: a large document may not fit within the LLM’s context window.

  2. Memory efficiency: splitting large documents into chunks makes processing memory efficient.

  3. Relevant context for the LLM: it helps provide more focused content that is relevant to the query, drawn from multiple sections of the document.

There can be multiple approaches to splitting a document, e.g. based on length. LangChain provides different splitters for this, based on the number of tokens or the number of characters. A document consists of multiple paragraphs, which in turn consist of sentences and words. We would like to keep paragraphs intact, so instead of a purely length-based splitter we will use RecursiveCharacterTextSplitter. If it cannot keep a paragraph intact, it moves to the next level (e.g. sentences) or even down to the word level if necessary.

Code snippet

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

split_docs = text_splitter.split_documents(documents=docs)

The above code splits the document into chunks of up to 1000 characters, with an overlap of 200 characters between consecutive chunks.
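You can optionally inspect the result of the split to confirm the chunk sizes:

print(f"Split {len(docs)} pages into {len(split_docs)} chunks")
print(len(split_docs[0].page_content))  # typically at most 1000 characters per chunk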

Generate embedding

The chunks are then passed to an embedding model to generate vector embeddings. A vector embedding represents the semantic meaning of the data, which later helps in finding the chunks relevant to a query.

embedder = OpenAIEmbeddings(
    model="text-embedding-3-large"
)
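To get a feel for what an embedding looks like, you can embed a piece of text directly with the embedder created above (the query text here is just an arbitrary example):

vector = embedder.embed_query("What is an architecture assessment framework?")
print(len(vector))  # dimensionality of the vector, 3072 for text-embedding-3-large
print(vector[:5])   # first few values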

Store vector embedding

The vector embeddings are then stored in a vector store, so that the information can be retrieved from it later.

E.g.

We will be using QdrantVectorStore, but any other vector store, such as Pinecone, can also be used.

vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    collection_name="learning_langchain",
    embedding=embedder
)

vector_store.add_documents(documents=split_docs)

Once the embeddings are stored, you can go to http://localhost:6333/dashboard in your browser and verify them. Navigate to Collections and open the collection name that you specified in the code, e.g. learning_langchain.

You can now see chunks and their embeddings:

Click on Default Vector to copy an embedding and paste it into an editor to view its values.

e.g.

[-0.026093854,0.0024256662,-0.026624918,-0.037662417,-0.010606907,-0.01878815,0.009774431 …]

Retrieval and Generation

Once the documents are stored as chunks, and their embeddings are created and stored in the vector store, they can be used to search for content relevant to a query. The process consists of retrieving the relevant chunks from the vector store and then generating the response using a prompt that includes this retrieved context together with the user’s query. The following steps describe the flow:

  1. The user asks a query through the application or bot.

  2. The application or bot converts the query to a vector embedding through the embedding model.

  3. It then searches the vector store for the most relevant content/chunks. Multiple chunks from different sections that match the semantic meaning of the query may be returned by the vector store.

  4. The relevant chunks, along with the system prompt and the query, are provided to the LLM.

  5. The LLM uses the relevant chunks, along with the system prompt and the query, to generate the output.

  6. The response is returned to the user.

Retrieval

As mentioned above, retrieval searches the vector store for the content/chunks most relevant to the query. The following code snippet retrieves the relevant chunks.

embedder = OpenAIEmbeddings(
    model="text-embedding-3-large"
)


retriever = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="learning_langchain",
    embedding=embedder
)

#query = "Who is Author? Provide details about the author"
query = "Provide summary of the document"
search_result = retriever.similarity_search(
    query=query
)
print("Relevant Chunks", search_result)

Generation

Create a system prompt with a placeholder for the context, where we will add the chunks retrieved from the vector store for the query.

SYSTEM_PROMPT = """You are a helpful assistant that helps the user to learn details only with in the provided context.
If the context does not contain the answer, say "I don't know".
You are not allowed to make any assumptions or guesses.

Ouutput the answer in a JSON format.

context:
{context}
"""

Now that we have the relevant chunks for the query from the retriever, we can provide them as context, along with the system prompt and the query, to the LLM to generate the response.

from openai import OpenAI

messages = [
    { "role": "system", "content": SYSTEM_PROMPT.format(context=search_result) },
    { "role": "user", "content": query }
]

client = OpenAI()
result = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=messages
)

print("Response:", result.choices[0].message.content)

The following is an example of the response received from the LLM for the document stored in our local file system.

Response: {
    "summary": "The document 'Architecture Assessment Frameworks Comparative Analysis' written by Ashutosh Gupta, discusses the evaluation of architecture against quality goals such as performance, scalability, security, reliability, modifiability, and usability. It compares various architecture assessment methods, detailing when each might be suitable. The document also covers ABAS descriptions and SAAM activities. ABAS includes problem description, stimulus/response measures, architectural style, and analysis. SAAM involves characterizing functional partitioning, mapping it onto architecture, selecting quality attributes and tasks, and evaluating architectural support for these tasks."
}

Summary

RAG can be used to generate responses from an LLM that are specific to an organization’s data. It helps in providing responses that are relevant to the organization; without RAG, the LLM would generate responses that are generic in nature. RAG uses an indexing pipeline to index data in a vector store: it first splits large documents into chunks, then creates their vector embeddings and stores them in the vector store.

When the user makes a query, its vector embedding is generated with the embedding model, and the chunks relevant to the query are found in the vector store. These chunks, along with the system prompt (optional) and the query, are passed to the LLM to generate the response. In the next blogs, we will discuss fine-tuning techniques.

References

Introduction | 🦜️🔗 LangChain

LangChain Python API Reference — 🦜🔗 LangChain documentation

https://github.com/ashutoshmca/RAG/
