Retrieval-Augmented Generation (RAG)

Introduction

Retrieval-Augmented Generation (RAG) is one of the most popular approaches in Generative AI for producing content. As the name suggests, it is made up of three parts: fetching the necessary information or content from external sources (Retrieval), enriching the prompt with that information so the final result is more accurate and relevant (Augmented), and then using all of it to generate content that answers the user's query (Generation).

Why go the RAG way?

Large Language Models (LLMs) are already pretrained on vast amounts of data and can generate highly relevant responses to a user’s query. So, it’s natural to wonder—why take the extra step and use Retrieval-Augmented Generation (RAG)? Why not just include all the context along with the query and let the LLM handle it?

Below are a few reasons that make a case for the RAG way:

  1. Limited Context Window: LLMs have something called a context window, a maximum limit on the amount of text they can process at once. When the information required to answer a query exceeds this limit, relevant data gets cut off, resulting in failures or low-quality generated content. RAG solves this by dynamically retrieving only small, relevant chunks of information before generating a response, keeping the prompt within the context window. A small token-counting sketch follows this list.

  2. Personalized and Domain-Specific Content: LLMs are pretrained on general data that is neither user-specific nor domain-specific, so the traditional way out is to fine-tune the model, which in most scenarios is inflexible and quite expensive. RAG instead pulls in tailored documents in real time, without any need to modify the underlying model.

  3. Knowledge Cut-Off: LLMs trained on human-generated content are frozen at a certain point in time, i.e. they do not know anything that happened after their training cut-off. RAG, on the other hand, allows models to retrieve the latest data from the sources provided by the user.

  4. Improved Transparency: Because responses are grounded in retrieved documents, it is easier to trace and verify the origin of generated content, which increases reliability and trust for businesses working in legal, healthcare, and similar domains.
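To make the context window point above more concrete, here is a small optional sketch that uses the tiktoken library to estimate how many tokens a piece of text would occupy; the sample text and the limit you compare against are placeholders, not part of the pipeline built later.

     import tiktoken

     # Rough token count for a (placeholder) large document
     encoding = tiktoken.get_encoding("cl100k_base")
     document_text = "some long document text " * 5000
     token_count = len(encoding.encode(document_text))

     # Compare the count against the model's context limit before deciding what to send
     print(token_count, "tokens")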

Basic Steps in RAG

The basic RAG pipeline mainly involves three steps:

  1. Indexing: In this step, data is collected from multiple sources that might be useful for answering future user queries. This data is then processed and stored in a way (often using embeddings and a vector database) that makes it easy to search and retrieve later.

  2. Retrieving: When a user submits a query, the system searches through the indexed data to find the most relevant pieces of information related to that query. These retrieved results are then passed along with the query to the language model to generate a more accurate and informed response.

  3. Generation: Once the relevant content is retrieved, it is fed, together with the user's query, to the LLM to generate the final response. A small dependency-free sketch of these three steps follows.
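Before diving into the actual pipeline, here is a toy, dependency-free sketch of the same flow. Everything in it is purely illustrative: naive keyword overlap stands in for the embedding-based search used later, and the final print stands in for the LLM call.

     # Toy illustration of the RAG flow (keyword overlap stands in for embeddings)

     documents = [
         "The fs module in Node.js provides an API for interacting with the file system.",
         "The http module in Node.js lets you create web servers.",
     ]

     def retrieve(query, docs, k=1):
         # Retrieval: rank documents by naive word overlap with the query
         query_words = set(query.lower().split())
         ranked = sorted(docs, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
         return ranked[:k]

     def build_prompt(query, context):
         # Augmentation: place the retrieved text into the prompt as context
         return f"Context:\n{context}\n\nQuestion: {query}"

     query = "What is the fs module?"
     context = "\n".join(retrieve(query, documents))
     print(build_prompt(query, context))  # Generation: an LLM would now answer from this prompt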

INDEXING

The very first step in creating a RAG pipeline is the indexing process. It has several sub-steps, as described below. (NOTE: we will be using LangChain + Qdrant in this pipeline.)

  1. Getting the Data Source

    The very first step in the indexing process is to get the data source. It can be in any form, e.g. PDF, audio, images, text files, etc., and will be used as reference material to answer future user queries. Here, we load a PDF with LangChain's PyPDFLoader.

     from pathlib import Path
     from langchain_community.document_loaders import PyPDFLoader

     pdf_path = Path(__file__).parent / "nodejs.pdf"
     loader = PyPDFLoader(pdf_path)   # load the PDF that will serve as the knowledge source
     docs = loader.load()
    
  2. Chunking

    This step involves breaking a large document into smaller, manageable pieces. This matters because large files like PDFs are often too big to search or process effectively all at once. Splitting the content into smaller chunks makes it easier to find and retrieve the most relevant parts when answering a user's query, and it keeps the information within the size limits that language models can handle. In our case, we are using LangChain's RecursiveCharacterTextSplitter, which breaks the PDF into overlapping chunks based on character count.

     from langchain_text_splitters import RecursiveCharacterTextSplitter

     # Split the loaded pages into ~1000-character chunks with 200 characters of overlap
     text_splitter = RecursiveCharacterTextSplitter(
         chunk_size=1000,
         chunk_overlap=200
     )

     split_docs = text_splitter.split_documents(docs)
    
  3. Create Embedding & Storing:

    Once the chunking process is complete, the next step is to store these chunks in a vector database. To do that, we first need to convert each chunk into a vector embedding—a numerical representation that captures the meaning of the text. After generating these embeddings, we create a new collection in the vector database and store the embeddings there. This allows us to later search and retrieve the most relevant chunks based on a user’s query.

     from langchain_openai import OpenAIEmbeddings
     from langchain_qdrant import QdrantVectorStore

     embedder = OpenAIEmbeddings(
         model="text-embedding-3-large",
         api_key=""
     )

     # The URL points to a locally deployed Qdrant DB running via Docker
     # (see the optional check after this section)
     vector_store = QdrantVectorStore.from_documents(
         documents=[],
         url="http://localhost:6333",
         collection_name="learning_langchain",
         embedding=embedder
     )

     # Embed the chunks and store them in the collection
     vector_store.add_documents(documents=split_docs)
    

This marks the completion of the first part of the basic RAG pipeline, i.e. the indexing process.
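As an optional sanity check, not part of the pipeline itself, you can query the collection directly with the Qdrant client to confirm the chunks were stored; this assumes Qdrant is running locally on the default port, for example via docker run -p 6333:6333 qdrant/qdrant.

     from qdrant_client import QdrantClient

     # Optional check: confirm the collection exists and count the stored vectors
     client = QdrantClient(url="http://localhost:6333")
     print(client.count(collection_name="learning_langchain"))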

RETRIEVING

This is the second part of the RAG pipeline, which finds the data relevant to the user's query among the sources stored in the vector DB during the first half. The sub-steps are as follows:

  1. User Query:

    The first step is to get the user query. It could come through any means; for simplicity, we will read it from the terminal for now.

     query = input("What do you want to know? ")
    
     #example user query below
     #-> What is FS Module
    
  2. Initialise the Retriever DB: To search for relevant content, we need to initialise the DB in which we will look for content matching the user query.

     # Connect to the existing collection created during indexing
     retriever = QdrantVectorStore.from_existing_collection(
         url="http://localhost:6333",
         collection_name="learning_langchain",
         embedding=embedder
     )
    
  3. Create Embedding

    Now take the user query, create a vector embedding for it, and do a similarity search in the vector DB created earlier. This finds the stored chunks that are semantically similar to the query, i.e. the most relevant pieces of data (a short aside on limiting and inspecting the retrieved chunks follows this section).

     # Embed the query and fetch the most similar chunks from the collection
     search_result = retriever.similarity_search(
         query=query
     )
    

This marks the end of the second step in the RAG pipeline i.e. the retrieval phase.
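A short aside, not in the original snippet: similarity_search returns the top matches (four by default in LangChain), so you can pass k to cap how many chunks come back, and each returned document keeps the metadata added by PyPDFLoader, such as the page number.

     # Optional variation: retrieve only the top 3 chunks and inspect where each came from
     search_result = retriever.similarity_search(query=query, k=3)

     for doc in search_result:
         # PyPDFLoader records the source file and page number in doc.metadata
         print(doc.metadata.get("page"), doc.page_content[:100])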

GENERATION

This step takes the chunks retrieved from the DB for the user query and feeds them to an LLM to generate a personalized response. The sub-steps are as follows:

  1. Combine the chunks: The search may return several chunks; we can limit how many are retrieved, keep the most relevant ones, and then join them together into a single context string (a variation that also keeps the source page numbers is sketched at the end of this section).

     # Combine retrieved content
     retrieved_context = "\n".join([doc.page_content for doc in search_result])
    
  2. System Prompt:

    Create an appropriate system prompt for the LLM and feed the chunks into it.

     final_prompt = f"""You are a pdf reader who is given context of the user Query
     RULES: 
     1.You need to use the context and using the context you need to answer the user query
     2.In case you do not have enough context to answer the user Query ask the user to rephrase the query
     3.Generate the content in a professional way
    
     Context:
     {retrieved_context}
    
     Question:
     What is FS Module?
    
     Answer: The answer needs to be generated from the context itself"""
    
  3. Initialize LLM model

    Initialize the GPT model and use it to generate the final response.

     from langchain_openai import ChatOpenAI
     from langchain_core.messages import HumanMessage

     # Initialize the LLM
     llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)

     # Get the response
     response = llm.invoke([HumanMessage(content=final_prompt)])
     print(response.content)
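One possible refinement, not shown in the snippets above, is to keep each chunk's page number when building the context so that the generated answer can point back to its source. A minimal sketch, reusing the search_result and query variables from the earlier steps:

     # Variation: keep track of which page each retrieved chunk came from
     context_with_sources = "\n\n".join(
         f"[page {doc.metadata.get('page')}] {doc.page_content}"
         for doc in search_result
     )

     final_prompt = f"""Answer the question using only the context below and mention the page numbers you relied on.

     Context:
     {context_with_sources}

     Question:
     {query}"""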
    

Conclusion

This concludes the basic overview of the RAG pipeline. This was a high-level understanding of how a RAG application works, but there are many other important factors to consider in the process. In later blogs, we will dive deeper into these aspects, such as chunking and other key processes, to explore them in more detail.

To sum up, RAG is a powerful approach that enhances content generation by combining retrieval with generation, providing more accurate, personalized, and relevant responses.


Written by

Saurav Pratap Singh