Creating your own RAG


Retrieval-Augmented Generation (RAG) is a new and effective method in the world of AI and large language models. It works by first searching for useful information from outside sources and then using that information to create better, more accurate responses. This method helps language models perform better, especially in tasks like answering questions, chatting with users, or generating written content.

In this blog post, we’ll learn about Retrieval-Augmented Generation (RAG) and build a basic RAG system step-by-step using Python and Ollama. This project will give you a clear idea of how RAG works and how to create one using simple coding techniques.

What is RAG

Let's first imagine a simple chatbot without any RAG. It will look something like the diagram below.

While a chatbot can answer general questions based on what it has learned during training, it doesn’t always have the latest or specific information about certain topics.

For example, if you ask ChatGPT, “What is my father’s name?”, it won’t be able to answer because it doesn’t know personal details like your family members’ names — that information is not part of its training or memory.

To address this limitation, we need to provide external knowledge to the model (in this example, a list of family members' names):
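For illustration, here is a minimal sketch of how that external knowledge could be supplied to the model through the prompt (the family names below are invented for this example):

# Hypothetical example: the names are made up for illustration.
family_facts = "My father's name is Rahul. My mother's name is Priya."
prompt = f"Use only this context to answer:\n{family_facts}\n\nQuestion: What is my father's name?"
# A chat model given this prompt can now answer "Rahul", because the fact was supplied as context.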

A RAG system has two main parts:

  1. A retriever, which looks for useful information from outside sources like a database, search engine, or document collection.

  2. A language model, which uses that information to create a meaningful response.

There are also different styles of building RAG systems, such as Graph RAG, Hybrid RAG, and Hierarchical RAG.

A simple RAG system

Let's build a simple RAG that comprises the following components:

1- An embedding model

2- A vector database

3- A chat model

4- A knowledge base (an external dataset)

The architecture will look something like the diagram below.

To build a RAG system, we need three important pieces:

  1. Embedding Model – This is a pre-trained language model that turns text into numbers (called embeddings) that represent the meaning of the text. These embeddings help in finding related information from a dataset.

  2. Vector Database – This is where we store information along with its embeddings. There are tools like Qdrant, Pinecone, or pgvector for this, but in this project, we’ll build a basic in-memory version ourselves.

  3. Chatbot – This is the language model that reads the retrieved information and creates answers. It can be any model like Llama, Gemma, or OpenAI GPT models.

The RAG development process will have three phases

1- Indexing Phase

2- Retrieval Phase

3- Generation Phase

Let's understand and code each phase.

Indexing Phase

The indexing phase is the first step in building a RAG system. In this step, we split the documents into smaller parts (called chunks) and then convert each chunk into a vector using the embedding model. These vectors help the system quickly find relevant information later when generating answers.

The size of each chunk can vary depending on the dataset and the application. For example, in a document retrieval system, each chunk can be a paragraph or a sentence. In a dialogue system, each chunk can be a conversation turn.
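For instance, a minimal paragraph-based chunker might look like the sketch below (the 500-character cap is an arbitrary choice for illustration; the project in this post simply treats each line of the dataset as a chunk):

def split_into_chunks(text, max_chars=500):
  # Naive chunker: split on blank lines (paragraphs), then cap each chunk's length.
  chunks = []
  for paragraph in text.split('\n\n'):
    paragraph = paragraph.strip()
    if not paragraph:
      continue
    for start in range(0, len(paragraph), max_chars):
      chunks.append(paragraph[start:start + max_chars])
  return chunks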

After the indexing phase, each chunk with its corresponding embedding vector will be stored in the vector database. Here is an example of how the vector database might look after indexing:

| Chunk | Embedding Vector |
| --- | --- |
| Bananas are the most eaten fruit in the world. | [0.1, 0.04, -0.34, 0.21, ...] |
| Japan has more elderly people than children. | [-0.12, 0.03, 0.9, -0.1, ...] |
| Cows have best friends and get stressed when separated. | [-0.02, 0.6, -0.54, 0.03, ...] |
| ... | ... |

We can use embedding vectors later to find the right information based on a question or search. It’s similar to how a SQL WHERE clause works, but instead of looking for exact matching words, we search using vectors that understand the meaning of the text. This helps us find related content, even if it’s written differently.

To compare the similarity between two vectors, we can use cosine similarity, Euclidean distance, or other distance metrics. Here we will use cosine similarity. Here is the formula for cosine similarity between two vectors A and B:
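$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\,\sqrt{\sum_{i} B_i^{2}}}$$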

We will implement this formula in our Python code.

Retrieval Phase

In the diagram below, we’ll look at an example where a user asks a question (called the input query).

We then convert this question into a vector (called the query vector) and compare it with the vectors stored in the database.

This helps us find the most relevant pieces of information to answer the question.

The result returned by the vector database will contain the top N chunks most relevant to the query. These chunks will be used by the chatbot to generate a response.

Let's code it. We will build a simple RAG using Python.

To run the models, we’ll use Ollama, a command-line tool that lets you run AI models from Hugging Face directly on your computer — no need for cloud or server access.

We’ll use the following models:

  • Embedding model: bge-base-en-v1.5 (pulled from Hugging Face as hf.co/CompendiumLabs/bge-base-en-v1.5-gguf)

  • Chat model: Llama-3.2-1B-Instruct (pulled as hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF)

For our dataset, we’ll use a simple list of cat facts. Each fact will be treated as a chunk during the indexing phase.

Model Download

First, download and install the Ollama CLI on your machine. Once it's installed, open a terminal and run the following commands to download the required models:

ollama pull hf.co/CompendiumLabs/bge-base-en-v1.5-gguf

ollama pull hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF

The embedding model is around 50 MB, while the chat model is around 800 MB. Make sure you have enough RAM on your machine to run these models.
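The Python snippets below also use the ollama Python client; if it isn't already in your environment, it can be installed with pip install ollama. To confirm both models were pulled successfully, you can list the locally available models:

ollama list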

Load the Dataset

Next, create a Python script and load the dataset into memory. The dataset contains a list of cat facts that will be used as chunks in the indexing phase.

import ollama

# Model names as pulled via the Ollama CLI above
EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'

# Load the dataset that serves as our knowledge base
with open('datasets/cat-facts.txt', 'r') as file:
  dataset = file.readlines()
print(f'Loaded {len(dataset)} entries')

Vector Database

Now, let’s build our own basic vector database.

We’ll use the embedding model through Ollama to turn each text chunk into a vector (a set of numbers that represent the meaning of the text).

Then, we’ll store each chunk along with its vector in a list so we can search through them later.

Here’s a simple function to generate an embedding vector for any given text:

# Each element in VECTOR_DB will be a tuple (chunk, embedding)
# The embedding is a list of floats, for example: [0.1, 0.04, -0.34, 0.21, ...]
VECTOR_DB = []  # [("hello there", [0.1, 0.04, -0.34, 0.21, ...]), ...]

def add_chunk_to_database(chunk):
  # Embed the chunk and store it next to the original text
  embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
  VECTOR_DB.append((chunk, embedding))

# We will consider each line in the dataset as a chunk for simplicity.
for i, chunk in enumerate(dataset):
  add_chunk_to_database(chunk)
  print(f'Added chunk {i+1}/{len(dataset)} to the database')

Retrieval System

Next, let's implement the retrieval function that takes a query and returns the top N most relevant chunks based on cosine similarity. We can imagine that the higher the cosine similarity between the two vectors, the "closer" they are in the vector space. This means they are more similar in terms of meaning.

Here is a function to calculate the cosine similarity between two vectors:

def cosine_similarity(a, b):
  """Calculate the cosine similarity between two vectors.
  Formula: (a . b) / (||a|| * ||b||)
  Args:
    a (list): First vector.
    b (list): Second vector.
  Returns:
    float: Cosine similarity between the two vectors.
  """
  dot_product = sum(x * y for x, y in zip(a, b))
  norm_a = sum(x ** 2 for x in a) ** 0.5
  norm_b = sum(x ** 2 for x in b) ** 0.5
  return dot_product / (norm_a * norm_b)
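As a quick sanity check, identical vectors give a similarity of 1.0 and orthogonal vectors give 0.0:

print(cosine_similarity([1, 0], [1, 0]))  # 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (unrelated)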

Now, the retrieval function:

def retrieve(query, top_n=3):
  user_query_embedding = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
  # temporary list to store (chunk, similarity) pairs
  similarities = []
  for chunk, db_embedding in VECTOR_DB:
    similarity = cosine_similarity(user_query_embedding, db_embedding)
    similarities.append((chunk, similarity))
  # sort by similarity in descending order, because higher similarity means more relevant chunks
  similarities.sort(key=lambda x: x[1], reverse=True)
  # finally, return the top N most relevant chunks
  return similarities[:top_n]
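As a quick test, you can call the function directly (the query is just an example; the exact chunks returned depend on your dataset):

# Assumes VECTOR_DB has been populated as shown above.
for chunk, score in retrieve('How much do cats sleep?', top_n=3):
  print(f'{score:.2f} -> {chunk.strip()}')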

Generation Code

In this step, the chatbot will create a response using the relevant information retrieved earlier.

To do this, we simply add the retrieved chunks into the prompt, which is then passed to the chatbot as input. Here’s an example of how such a prompt might look:

input_query = input('Ask me a question: ')
retrieved_knowledge = retrieve(input_query)

print('Retrieved knowledge:')
for chunk, similarity in retrieved_knowledge:
  print(f' - (similarity: {similarity:.2f}) {chunk}')

# Join the retrieved chunks outside the f-string (a backslash inside an
# f-string expression is a syntax error before Python 3.12)
context = '\n'.join(f' - {chunk}' for chunk, similarity in retrieved_knowledge)

instruction_prompt = f'''You are a helpful chatbot.
Use only the following pieces of context to answer the question. Don't make up any new information:
{context}
'''

We then use Ollama to generate the response. In this example, we will use instruction_prompt as the system message:

stream = ollama.chat(
  model=LANGUAGE_MODEL,
  messages=[
    {'role': 'system', 'content': instruction_prompt},
    {'role': 'user', 'content': input_query},
  ],
  stream=True,
)

# print the response from the chatbot in real-time
print('Chatbot response:')
for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)

Putting it all together

You can find the final code in this file.

To run the code (example commands for steps 2 and 3 follow this list):

1- Clone the repo

2- Install the modules from the requirements file

3- Run rag-code.py
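Assuming the repository has been cloned and you are inside its folder, steps 2 and 3 might look like this:

pip install -r requirements.txt
python rag-code.py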

Other RAG Types

In practice, there are many ways to implement RAG systems. Here are some common types of RAG systems:

  • Graph RAG: In this type of RAG, the knowledge source is represented as a graph, where nodes are entities and edges are relationships between entities. The language model can traverse the graph to retrieve relevant information. There is a lot of active research on this type of RAG. Here is a collection of papers on Graph RAG.

  • Hybrid RAG: a type of RAG that combines Knowledge Graphs (KGs) and vector database techniques to improve question-answering systems. To know more, you can read the paper here.

  • Modular RAG: a type of RAG that goes beyond the basic "retrieve-then-generate" process, employing routing, scheduling, and fusion mechanisms to create a flexible and reconfigurable framework. This modular design allows for various RAG patterns (linear, conditional, branching, and looping), enabling more sophisticated and adaptable knowledge-intensive applications. To know more, you can read the paper here.

Conclusion

RAG is a big step forward in making language models smarter and more accurate. By building a simple RAG system from scratch, we’ve learned the basics of how embedding, retrieval, and response generation work together.

Even though our version is simple, it shows the core idea behind more advanced RAG systems used in real-world applications. There’s a lot of room to grow — like using faster vector databases or trying out advanced designs such as Graph RAG and Hybrid RAG.

As AI continues to improve, RAG will stay an important tool for giving language models access to outside knowledge while keeping their ability to generate natural responses.

References:

Ranking models: https://www.pinecone.io/learn/series/rag/rerankers/

For other types of RAG, you can refer to this post by Rajeev Sharma.

Hugging Face, build your own RAG: https://huggingface.co/blog/ngxson/make-your-own-rag
