RAG Basics & Environment Setup Guide

Before diving headfirst into building a Retrieval Augmented Generation (RAG) system with a specific framework, it's crucial to grasp the fundamental concepts. Understanding the "why" and "how" behind each component empowers you to build more robust, flexible, and efficient systems, regardless of the language or framework you choose.

This series will guide you through building a RAG system from the ground up, using open-source models and OpenAI-compatible APIs. Our goal is to avoid vendor lock-in and reliance on third-party SDKs or wrappers, giving you full control and understanding.

Key Concepts You Need to Know

What is RAG (Retrieval Augmented Generation)?

RAG is a technique that enhances the responses of Large Language Models (LLMs) by first retrieving relevant information from an external knowledge base. This retrieved context is then provided to the LLM along with the user's query, enabling it to generate more accurate, up-to-date, and contextually relevant answers.

What is a vector database?

A vector database is a specialised database designed to store, manage, and search through data represented as high-dimensional vectors, also known as embeddings. These databases excel at finding items with similar semantic meaning by comparing their vector representations.

What is an embedding?

An embedding is a numerical representation of data (like text, images, or audio) in a multi-dimensional space. For text, embeddings capture the semantic meaning, so words or sentences with similar meanings will have similar vector representations. These are generated by embedding models.

What is a prompt?

A prompt is the input text or instruction given to an LLM to guide its output. In a RAG system, the prompt typically includes the user's original question and the relevant context retrieved from the vector database.

What is a guard (guardrail)?

Guards, or guardrails, are mechanisms or checks implemented to ensure the safety, relevance, quality, or ethical alignment of the inputs to or outputs from an LLM. This can include filtering harmful content, preventing off-topic responses, or ensuring factual consistency.

What is a reranker?

A reranker is a model or process used to re-order a list of documents retrieved by an initial search (e.g., from a vector database). The goal is to improve the relevance of the top-ranked documents before they are passed to the LLM, leading to better quality answers.

Model Used in this Example:

Type	Model	Settings
Embeddings	bge-base-en-v1.5	hosted on cloudflare / Size = 768 / Distance = Cosine
Chat	llama-3-8b-instruct	hosted on cloudflare

Now you know you have all the information about the basics; let’s start with the setup.

Prerequisites & Setup

Let's get our environment ready. We'll need tools for LLM interaction, embedding generation, a vector database, and API testing.

LLM and Embedding Provider: We need a way to generate embeddings and interact with an LLM. You have a couple of great options:
- Locally Hosted (Self-Managed):
  - Ollama: Allows you to run open-source LLMs locally.
  - LMStudio: Another excellent tool for running LLMs on your machine. These provide OpenAI-compatible API endpoints, making them easy to integrate.
- Cloud-Based (Managed Service):
  - Cloudflare Workers AI: Cloudflare offers a generous free tier for their AI services, including access to various open-source LLMs and embedding models through OpenAI-compatible endpoints. This is a fantastic option for getting started quickly.
    - Sign up or log in at Cloudflare.com.
Vector Database: We'll use Qdrant, a powerful open-source vector database.
- Qdrant Cloud: Offers a managed service. You can set up an account here: Qdrant Cloud Account Setup.
- Self-Hosted Qdrant: You can also run Qdrant locally via Docker or other installation methods.
API Testing Tool: To interact with the APIs for embeddings, LLMs, and Qdrant, a tool like Insomnia or Postman will be very helpful.
- Download and install your preferred tool.
Gather Your Credentials: Once you've set up your Cloudflare/Ollama/LMStudio and Qdrant accounts/instances, make sure you have the following:
- API endpoint URLs for your chosen LLM and embedding model.
- Any necessary API keys.
- Your Qdrant instance URL and API key (if applicable).
Create Your First Qdrant Collection: A "collection" in Qdrant is like a table in a traditional database, but it stores your vectors. Let's create one for our RAG application. You can do this via the Qdrant dashboard or its API.

For example, usingcurl ⁣(replace placeholders with your actual Qdrant URL, API key, and desired collection name/vector parameters):
```
 curl --request PUT \
   --url http://localhost:6333/collections/replace_with_your_collection_name \
   --header 'Content-Type: application/json' \
   --header 'api-key: api-key' \
   --data '{
   "vectors": {
     "size": 768,
     "distance": "Cosine"            
   }
 }'
```
Note: The size of the vector must match the output dimension of the embedding model you choose (e.g., bge-base-en-v1.5 has 768 dimensions).

With these prerequisites in place, you're ready to move on to the next step: populating your vector database!

Part 1: Understanding RAG Fundamentals & Setting Up Your Environment

Table of contents

Key Concepts You Need to Know

Model Used in this Example:

Prerequisites & Setup

Subscribe to my newsletter

Debjit Biswas

Debjit Biswas