From RAGS to riches

Introduction

Why RAG?

Fine-tuning is a great way to adapt an LLM's weights to our needs, but it has some disadvantages too. Fine-tuning is expensive (especially full-parameter fine-tuning) because it requires a lot of GPU compute, and the process is time-consuming: the model's weights are updated by measuring the error between the model's output and the expected output and correcting it.
Most importantly, a fine-tuned model still has a knowledge cutoff, so we cannot get all the information we require from it.

RAG addresses all these issues. Broadly, RAG serves two functions: retrieving the data relevant to a query, and generating an answer grounded in that data.



BASIC RAG:

  1. Indexing

  2. Retrieval

  3. Generation

RAG (Retrieval-Augmented Generation) brings relevant data into your context/prompt. Before diving deeper into this, let us first understand the concept of a context window.

Context Window: the number of tokens the model can process at a given point in time; as the chat grows, older context gets dropped. The context window is limited, so your RAG pipeline must be well optimized. For example, a business may have over 1 lakh (100,000) rows in its database, but a user's question may only need 40-50 of them, so you should pass only those 40-50 rows to the AI model. Two cases are possible here:

  1. PDF file is small: convert the PDF directly into text and put that text in the system prompt; since it is small, the context window limit will not be reached.

  2. PDF file is big: in this case, we need to do indexing. We index the input source so that when the user asks something, we can point directly to the relevant part of the database. First break the PDF into chunks, then create vector embeddings for each chunk, then store them in a vector DB (a minimal sketch follows below).
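
Below is a minimal sketch of this indexing step. The post does not prescribe any particular tools, so the library choices (LangChain, OpenAI embeddings, a local Qdrant instance) and the file name are assumptions for illustration:

```python
# Minimal indexing sketch: load a PDF, chunk it, embed the chunks, store them.
# Libraries and names below are assumptions, not the author's actual stack.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore

# 1. Load the PDF and break it into overlapping chunks.
pages = PyPDFLoader("my_document.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(pages)

# 2. Create vector embeddings for each chunk and store them, together with
#    the chunk text, in a vector DB.
vector_store = QdrantVectorStore.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    url="http://localhost:6333",  # assumes a local Qdrant instance
    collection_name="pdf_chunks",
)
```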


RAG has two parts:

One is pre-processing and the other happens at query time. That is, you can compute and store the vector embeddings of the PDF before the user ever sends a query, and when the user does send a query, you convert that query into embeddings as well.

Once the embeddings of the user query are made, you match them against the embeddings already present in the database. This gives you the most relevant matches; use them to fetch the corresponding chunks of the documents from the data source, then collect those chunks and pass them to the LLM so it can merge their information (if needed) into an answer.
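
A matching query-time sketch, under the same assumptions as the indexing example above (the vector_store object from that snippet, an OpenAI chat model, and a made-up question):

```python
# Query-time sketch: embed the user's query, fetch the nearest stored chunks,
# and let the LLM answer from them. Model name and question are placeholders.
from langchain_openai import ChatOpenAI

question = "What does the document say about refunds?"  # hypothetical query

# similarity_search() embeds the query and returns the closest chunks.
relevant_chunks = vector_store.similarity_search(question, k=4)
context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)

llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```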

Another approach is to store the chunk's data alongside its embedding as metadata/payload in the vector DB itself, so there is no need to go back to the original source once a match is found.
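
As a rough sketch of this idea (same assumed stack as above, with invented field values): each stored chunk carries its text and metadata as the payload, so a retrieval hit already contains everything the LLM needs.

```python
# Each stored chunk keeps its text and metadata next to the vector, so no
# second lookup of the source document is required after retrieval.
from langchain_core.documents import Document

chunk = Document(
    page_content="Refunds are processed within 7 business days.",
    metadata={"source": "my_document.pdf", "page": 12},  # made-up values
)
vector_store.add_documents([chunk])

hit = vector_store.similarity_search("refund timeline", k=1)[0]
print(hit.page_content, hit.metadata)  # text and metadata come back with the match
```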



The ART of Chunking

Chunking is an art: if you cannot chunk the data properly, your vector embeddings will not be good. So, is it good to overlap chunks?

Yes. If we overlap chunks, some data will be repeated across chunks, but relevant information will not be lost. For example, if a chunk boundary falls in the middle of a sentence in the PDF, the first half and the second half of that sentence end up in different chunks, and the embedding of either chunk alone cannot capture the full information of that sentence.

Also, chunking is usually only required for large PDFs, where the number of tokens in the PDF exceeds the context window size.
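
A small illustration of overlapping chunks, with arbitrary sizes and a hypothetical input file:

```python
# With chunk_overlap > 0, the end of one chunk is repeated at the start of the
# next, so a sentence that falls on a chunk boundary still appears whole in at
# least one chunk. Sizes here are arbitrary examples.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = open("extracted_pdf_text.txt").read()  # hypothetical extracted PDF text

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # maximum characters per chunk
    chunk_overlap=100,  # characters shared between consecutive chunks
)
for i, chunk in enumerate(splitter.split_text(text)):
    print(i, chunk[:60])
```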



Link for code of RAG: Github_repo_for_RAG

Link for Advanced RAG introduction:


Written by Shrey C paunwala