Implementing Advanced Retrieval-Augmented Generation (RAG) with Local Infrastructure Using Meta’s Llama-3


This guide explores setting up an Advanced Retrieval-Augmented Generation (RAG) system using the newly released Llama-3 model from Meta. This hands-on tutorial provides a step-by-step approach to building a RAG pipeline that processes research papers and answers user queries about their contents. The following technology stack will be used to construct the pipeline:

  • Ollama Embedding Model (mxbai-embed-large)

  • Ollama’s Quantized Llama-3 8B Model

  • Locally hosted Qdrant vector database

This setup runs entirely on local infrastructure, so deployment is free of API costs and your documents never leave your machine.
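Before running any of the code below, make sure Ollama is running locally with both models pulled (ollama pull mxbai-embed-large and ollama pull llama3) and that a Qdrant instance is listening on its default port, for example via the official qdrant/qdrant Docker image. A quick connectivity check, assuming the default ports 11434 and 6333, might look like this:

# Optional prerequisite check (assumes default local ports; adjust if yours differ)
import qdrant_client
from llama_index.embeddings.ollama import OllamaEmbedding

# Qdrant: this call fails if nothing is listening on localhost:6333
print(qdrant_client.QdrantClient(host="localhost", port=6333).get_collections())

# Ollama: this call fails if the server is down or the model has not been pulled
embedding = OllamaEmbedding(
    model_name="mxbai-embed-large", base_url="http://localhost:11434"
).get_text_embedding("ping")
print(f"Embedding dimension: {len(embedding)}")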

What is HyDE?

HyDE (Hypothetical Document Embeddings) is a retrieval methodology introduced in Gao et al.'s 2022 paper "Precise Zero-Shot Dense Retrieval without Relevance Labels." It enhances zero-shot dense retrieval through a two-step process:

  1. Hypothetical Document Generation: A language model (e.g., GPT-3) is instructed to generate a hypothetical document based on a given query.

  2. Document Embedding: The hypothetical document is converted into an embedding using a contrastive encoder (e.g., Contriever). This embedding is then used to perform similarity searches.

HyDE thus reduces dense retrieval to two simpler tasks: generating a hypothetical document with a language model and matching its embedding against the corpus. In the paper's experiments, this approach outperforms unsupervised dense retrievers and is competitive with fine-tuned models across various tasks and languages.
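To make these two steps concrete, here is a minimal, hand-rolled sketch of the idea using the same local Ollama models as the rest of this guide. The actual pipeline below relies on LlamaIndex's built-in HyDEQueryTransform instead, so treat this purely as an illustration:

# Illustrative HyDE sketch: (1) generate a hypothetical answer, (2) embed it for retrieval
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

llm = Ollama(model="llama3", base_url="http://localhost:11434")
embed_model = OllamaEmbedding(model_name="mxbai-embed-large", base_url="http://localhost:11434")

query = "What are the data sets used in the paper's experiments?"

# Step 1: ask the LLM to write a passage that hypothetically answers the query
hypothetical_doc = llm.complete(
    f"Write a short passage that answers the following question: {query}"
).text

# Step 2: embed the hypothetical document; this vector, rather than the raw query's,
# is what gets matched against the stored document embeddings
hyde_embedding = embed_model.get_text_embedding(hypothetical_doc)
print(f"Hypothetical document:\n{hypothetical_doc}\n")
print(f"Vector used for similarity search has {len(hyde_embedding)} dimensions")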


Implementation

Let’s walk through the code implementation for setting up the RAG pipeline.

Initializing the Environment:

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    Settings,
    get_response_synthesizer)
from llama_index.core.query_engine import RetrieverQueryEngine, TransformQueryEngine
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode, MetadataMode
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
import qdrant_client
import logging

This block imports everything the pipeline needs: the LlamaIndex core components, the Qdrant vector store integration, the Ollama LLM and embedding wrappers, and the HyDE query transform.

Setup Logging and Load Documents:

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load the research papers from a local directory
docs = SimpleDirectoryReader(input_dir="data", required_exts=[".pdf"]).load_data(show_progress=True)
text_parser = SentenceSplitter(chunk_size=512, chunk_overlap=100)

This part loads the research papers (all PDFs in the data directory) and sets up a sentence splitter that will break them into 512-token chunks with a 100-token overlap.
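If you want to see what the splitter actually produces, a quick preview on the first loaded document (each PDF page comes back as its own document from SimpleDirectoryReader) could look like this:

# Preview the chunking behaviour on the first loaded document (illustrative only)
sample_chunks = text_parser.split_text(docs[0].text)
print(f"First document was split into {len(sample_chunks)} chunk(s)")
print(sample_chunks[0][:300])  # first 300 characters of the first chunk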

Initialize Vector Store:

# Set up Qdrant vector store
logger.info("Initializing the vector store")
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="research_papers")

Here, we initialize Qdrant, a locally hosted vector database, to store and retrieve document embeddings.
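Note that no collection needs to exist in advance; in my experience, QdrantVectorStore creates the research_papers collection automatically when the first nodes are inserted. To confirm the connection before going further:

# Optional: list existing collections to confirm the Qdrant server is reachable
# ("research_papers" will only appear after the indexing step below)
print(client.get_collections())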

Configure Embedding Model:

logger.info("Initializing the embedding model")
embed_model = OllamaEmbedding(model_name='mxbai-embed-large', base_url='http://localhost:11434')
logger.info("Configuring global settings")
Settings.embed_model = embed_model
Settings.llm = Ollama(model="llama3", base_url='http://localhost:11434')
Settings.transformations = [text_parser]

This code registers the Ollama embedding model and the quantized Llama-3 model as LlamaIndex's global defaults, so the indexing and querying steps that follow use them automatically.

Create Nodes and Index:

# Prepare text chunks for embedding
logger.info("Processing text chunks")
text_chunks = []
doc_ids = []
for doc_idx, doc in enumerate(docs):
    curr_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(curr_text_chunks)
    doc_ids.extend([doc_idx] * len(curr_text_chunks))

logger.info("Creating nodes")
nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk)
    src_doc = docs[doc_ids[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)

logger.info("Generating embeddings for nodes")
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode=MetadataMode.ALL)
    )
    node.embedding = node_embedding

# Indexing nodes in vector store
logger.info("Setting up the storage context")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, transformations=Settings.transformations)

This block processes the text into chunks, generates embeddings for each, and indexes them in the Qdrant vector store.
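The manual chunk/node/embedding loop above is useful when you want full control, for example to attach custom metadata to each node. If you don't need that, a more compact variant that relies on the global Settings configured earlier might look like this:

# Compact alternative (sketch): let LlamaIndex chunk, embed, and index the documents
# in one call, using the embed_model and transformations registered in Settings
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)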

Querying with HyDE:

# Set up the query engine
logger.info("Setting up the query engine")
vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
response_synthesizer = get_response_synthesizer()
vector_query_engine = RetrieverQueryEngine(retriever=vector_retriever, response_synthesizer=response_synthesizer)

# Apply HyDE transformation
logger.info("Applying HyDE query transformation")
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(vector_query_engine, hyde)

# Query the system
logger.info("Retrieving response for the query")
response = hyde_query_engine.query(
    str_or_query_bundle="What are the data sets used in the paper's experiments?"
)
print(response)

client.close()

Finally, the system queries the indexed research papers using the HyDE method to retrieve relevant information based on the input query.
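If you're curious what HyDE actually generated, the transform can also be called on its own; as I understand LlamaIndex's API, it returns a QueryBundle whose embedding_strs contain the hypothetical document (plus the original query, since include_original=True):

# Optional: inspect the hypothetical document HyDE produced for the query
# (this only calls the LLM, so it works even after closing the Qdrant client)
query_str = "What are the data sets used in the paper's experiments?"
query_bundle = hyde(query_str)
print(query_bundle.embedding_strs[0])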


Conclusion

By utilizing Meta’s Llama-3 model, Ollama embeddings, and the HyDE method for document retrieval, we’ve created an efficient and private RAG system. With local infrastructure, zero cost, and enhanced query capabilities, this pipeline combines cutting-edge models and methodologies to process and retrieve information from large documents such as research papers. With the right tuning of parameters such as similarity_top_k and chunk_size, the system's accuracy and speed can be improved further, ensuring robust performance and data privacy in practical applications.
