A Basic Local Retrieval-Augmented Generation (RAG) Tool with LangChain

Gregor Soutar

LLMs Are Stuck in the Past

Unlike humans, who continuously seek, absorb, and apply new information, Large Language Models (LLMs) in their basic form are frozen in time. Their knowledge is limited to what they learned during their last pre-training. They can’t naturally update themselves with new events or discoveries.

Pre-training is an expensive, resource-heavy process. It involves collecting vast amounts of text data and refining it into a statistical model that recognises patterns. Rather than storing knowledge like a database, the model learns relationships between words, allowing it to predict the next token in a sequence. While this creates the illusion of understanding, LLMs don’t actually "reason" or "remember" as humans do - they recognise patterns based on probability.

Since pre-training happens infrequently, LLMs have a knowledge cutoff. For example, if a model was last trained in November 2024, it won’t “know” anything beyond that date. So, if an LLM relies only on pre-training, it won’t be able to answer questions about recent news, ongoing events, or the latest scientific discoveries.

LLMs are strongest when answering questions about frequently discussed topics. If something is widely covered on the internet (like general science concepts, historical events, or common knowledge) it’s likely the model will provide accurate answers. But for rarer, less-documented information, its knowledge is hazy. For example, an LLM will confidently tell you the capital of Scotland or the laws of physics. But if you ask about a little-known research paper, a local news story, or an obscure hobby, it may struggle. It may even make something up.

A Familiar Fix

As humans, when we encounter something we don’t know, we turn to external resources like Google, instruction manuals, or documentation. For complex tasks like advanced math, we rely on specialised tools such as calculators. We recognise our own limitations and use these tools to help us reach the right answer.

LLMs can follow the same approach. By integrating real-time search capabilities, accessing external databases, or working alongside other specialised tools, they can overcome the gaps in their knowledge. Instead of relying solely on what they already know, they can be designed to seek out the right information, just like we do. One way of doing this is called Retrieval-Augmented Generation (RAG), which I have explored for the first time.

Retrieval-Augmented Generation (RAG)

In the simplest interactions with an LLM, you provide text as input, which is tokenised and placed into the model's context window: the LLM's working memory. The LLM processes this input using its fixed model and generates a response based solely on learned patterns and probabilities. With RAG, your input (or prompt) is first processed to identify relevant external content from sources like databases, documents, or the internet. This retrieved information is then combined with your prompt and placed into the model's working memory. By incorporating this additional context, the LLM generates a more informed and accurate response, grounded in the relevant data.
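As a rough conceptual sketch of that flow (using hypothetical helper functions, not real library calls), the idea looks something like this:

# Conceptual sketch of the RAG flow described above.
# retrieve_relevant_text() and llm_generate() are hypothetical placeholders.
def rag_answer(question: str) -> str:
    # Find content related to the question in documents, databases, or the web
    context = retrieve_relevant_text(question)
    # Combine the retrieved context with the original question
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # The model answers grounded in the supplied context
    return llm_generate(prompt)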

I decided to explore the use-case where a set of documents serve as a knowledge base for a simple AI agent capable of answering questions about the documentation. My goal was to see if I could host an LLM locally on my laptop and have the agent function entirely offline, without relying on any paid models.

Creating the RAG Tool with LangChain

I did a bit of research and watched a couple of YouTube tutorials on the topic.

The first video provided a great overview of how to use your own documents effectively with an LLM. While a single document or a small set may sometimes be sufficient, handling larger information sets requires a more advanced approach like RAG.

The next video introduced Ollama, a tool that allows you to run and manage large language models (LLMs) locally on your device, making it easier to experiment without relying on cloud services. It also covered LangChain and walked through a tutorial on building a simple RAG-based LLM agent using PDFs as a knowledge base. The process seemed straightforward and showed how easy it would be to host my own LLM instance on my laptop.

With these videos, and by following the first part of a LangChain tutorial, I was able to implement a simple local agent that can answer questions about my own documents.

Choosing a Model

The first step in the build was to select an LLM and an embeddings model. The LLM is the part that takes our query and context and produces a human-like response. The embeddings model is responsible for taking chunks of our documentation and producing embeddings.

Embeddings are a powerful way to represent words as numerical vectors in a high-dimensional space, capturing their semantic relationships. As a simple example, consider the words "king", "queen", "man", and "woman". An embedding model might assign them vectors:

$$\text{king} \rightarrow \begin{bmatrix} 0.8 \\ 0.6 \\ 0.7 \end{bmatrix}, \quad \text{queen} \rightarrow \begin{bmatrix} 0.8 \\ 0.7 \\ 0.6 \end{bmatrix}, \quad \text{man} \rightarrow \begin{bmatrix} 0.9 \\ 0.5 \\ 0.8 \end{bmatrix}, \quad \text{woman} \rightarrow \begin{bmatrix} 0.9 \\ 0.6 \\ 0.7 \end{bmatrix}$$

These numerical representations allow the model to capture meaningful relationships. Vector arithmetic can then be used to work with our text. For example, by performing the operation,

$$\text{king} - \text{man} + \text{woman}$$

$$\begin{bmatrix} 0.8 \\ 0.6 \\ 0.7 \end{bmatrix} - \begin{bmatrix} 0.9 \\ 0.5 \\ 0.8 \end{bmatrix} + \begin{bmatrix} 0.9 \\ 0.6 \\ 0.7 \end{bmatrix} = \begin{bmatrix} 0.8 \\ 0.7 \\ 0.6 \end{bmatrix}$$

the result is our vector, our embedding, of the word queen.
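To make this concrete, here is a minimal sketch of that arithmetic in Python, using the toy vectors above (the numbers are illustrative, not from a real embedding model) and cosine similarity to find the closest word:

import numpy as np

# Toy 3-dimensional "embeddings" (illustrative values, not from a real model)
vectors = {
    "king":  np.array([0.8, 0.6, 0.7]),
    "queen": np.array([0.8, 0.7, 0.6]),
    "man":   np.array([0.9, 0.5, 0.8]),
    "woman": np.array([0.9, 0.6, 0.7]),
}

# king - man + woman
result = vectors["king"] - vectors["man"] + vectors["woman"]

# Cosine similarity between two vectors
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The word whose embedding is closest to the result is "queen"
closest = max(vectors, key=lambda word: cosine(vectors[word], result))
print(result, closest)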

For this simple local agent I chose llama3.2 as my LLM and nomic-embed-text as my embeddings model. Both were available through Ollama.

# Ollama integrations for LangChain
from langchain_ollama import OllamaEmbeddings, OllamaLLM

# Select chat model
llm = OllamaLLM(model="llama3.2")

# Select embeddings model
embeddings = OllamaEmbeddings(model="nomic-embed-text")
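As a quick sanity check (assuming both models have already been pulled with ollama pull llama3.2 and ollama pull nomic-embed-text), something like the following confirms that the local models respond:

# Confirm the chat model responds
print(llm.invoke("Reply with one short sentence."))

# Confirm the embeddings model produces a vector (768 dimensions for nomic-embed-text)
vector = embeddings.embed_query("What determines the order of play?")
print(len(vector))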

Selecting a Vector Store

I then chose Chroma as the vector store - a database that stores the embeddings. This is what was used in the second video. I didn’t like the idea of storing the embeddings in memory (as suggested in the LangChain tutorial), where they would have to be re-created each time the program was run. The database provides persistent storage of the embeddings. Other options are available.

# Chroma vector store integration for LangChain
from langchain_chroma import Chroma

# Select vector store where embeddings are stored
vector_store_dir = "chroma_db"
vector_store = Chroma(embedding_function=embeddings, persist_directory=vector_store_dir)

Loading Documents

Information may be loaded from many sources (websites, Word documents, PDFs) using the corresponding LangChain document loaders. I was interested in loading all of the PDFs in a directory, so I used the PyPDFDirectoryLoader.

Loaders typically have .load() and .lazy_load() functions. The lazy load function returns a generator, which I think can be helpful in applications where memory use and performance matter.
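As a rough illustration of the difference (I used .load() in the final script), iterating over the generator processes one page at a time instead of building the whole list up front:

from langchain_community.document_loaders import PyPDFDirectoryLoader

pdf_loader = PyPDFDirectoryLoader("pdf_docs")

# Eager: every page is loaded into a list up front
docs = pdf_loader.load()

# Lazy: pages are yielded one at a time as the generator is consumed
for doc in pdf_loader.lazy_load():
    print(doc.metadata["source"], len(doc.page_content))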

By default, the PDF document loader will not parse any images. I looked into enabling this feature with Tesseract as I liked the idea of the content of diagrams in my documentation being utilised. However, the specific PDFs I was interested in using had images that were unsupported by this optical character recognition (OCR) library. I proceeded without this potentially useful feature.

Once loaded, we get a list of documents. Each element in this list is a page from our PDFs. For example, if the folder contained a PDF with 2 pages, and another with 6, we would get a Python list with 8 elements: our 8 pages.

# PDF directory loader from the LangChain community integrations
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Meaningful name to describe the data being used in the vector store
# (also used here as the name of the directory holding the PDFs)
collection_name = "pdf_docs"

# Loader for all of the documents in the pdf_docs directory
pdf_loader = PyPDFDirectoryLoader(collection_name)
docs = pdf_loader.load()
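A quick inspection confirms the one-document-per-page behaviour and shows the metadata each page carries:

# One Document per PDF page
print(len(docs))

# Each page records the source file and page number, among other metadata
print(docs[0].metadata["source"], docs[0].metadata["page"])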

Splitting Documents

Inserting entire pages of documents into an LLM's context window would be inefficient. It is unlikely that an entire page would be relevant to your prompt, and models may struggle to find the relevant content. As a next step we split our documents (our pages) into smaller chunks.

In this case the RecursiveCharacterTextSplitter is used to split pages into 1000-character chunks, with an overlap of 200 characters. These values shouldn't be chosen arbitrarily, but I have not experimented with larger or smaller chunk sizes and simply took the values from the LangChain tutorial. It would be interesting to see what effect different chunk sizes and overlaps have.

As a concrete example, I gave my agent two PDFs totalling 231 pages, so 231 documents. These were split into 681 smaller chunks. Note that the chunks carry useful metadata, like the name of the original document they came from.

# Recursive character text splitter from LangChain
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split the documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # chunk size (characters)
    chunk_overlap=200,     # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(docs)
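Printing a couple of details of the result (the counts below are from my own run) shows the page-to-chunk expansion and the metadata each chunk keeps:

# 231 pages became 681 chunks in my case
print(f"{len(docs)} pages -> {len(all_splits)} chunks")

# Each chunk keeps the source PDF, page number and its start index within the page
print(all_splits[0].metadata)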

Adding New Splits to the Database

I was keen for the embeddings not to be regenerated with each run of my agent. If I were using a paid-for model to generate embeddings, I would be concerned about the needless expense of doing this repeatedly. With a free local model, my main concern is the time it takes to regenerate the embeddings on every run.

I was able to follow documentation on LangChain Indexing that outlined how to efficiently manage embeddings. I provide the indexing function with document chunks, a record_manager that allows inspection of the database contents, and a Chroma vector_store configured with our chosen embedding model. I opt for the 'full' cleanup mode, which ensures efficient management by:

  • De-duplicating content

  • Removing chunks whose source PDFs no longer exist

  • Updating chunks when their original PDFs have changed

  • Retaining chunks that remain unchanged in both the database and the PDF folder

The source_id_key is set to 'source', ensuring that each chunk is associated with its original PDF filename for tracking and management.

# Indexing API and SQL-backed record manager from LangChain
from langchain.indexes import SQLRecordManager, index

# Namespace identifying this collection of documents in the record manager
namespace = f"chroma/{collection_name}"

# Used to manage the contents of the Chroma vector store.
record_manager = SQLRecordManager(namespace, db_url=f"sqlite:///{vector_store_dir}/chroma.db")
record_manager.create_schema()

# Add new or updated chunk embeddings to the database
index_result = index(all_splits, record_manager, vector_store, cleanup='full', source_id_key='source')
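The indexing call returns a summary of what changed, which is handy for confirming that nothing is needlessly re-embedded; on a second run over an unchanged folder, everything should be reported as skipped:

# On the first run everything is added; on an unchanged re-run everything is skipped
print(index_result)
# e.g. {'num_added': 681, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}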

Creating the Prompt

I created a prompt that will be used whenever the agent is asked a question. Here we ‘configure’ the LLM to be an agent that answers a question based on the retrieved context, and to answer honestly if the answer cannot be derived from that context. The user’s question and the retrieved context are substituted at runtime into the {question} and {context} parts of the prompt.

# Prompt template from LangChain core
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template("You are an assistant for question-answering tasks. \
                                       Use the following pieces of retrieved context to answer the question. \
                                       If you don't know the answer, say that you don't know.\
                                       \nQuestion: {question} \
                                       \nContext: {context} \nAnswer:")
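Filling the placeholders with dummy values (a small illustration, not part of the final agent) shows the text the LLM will eventually receive:

# Substitute example values into the template to inspect the final prompt text
example = prompt.invoke({"question": "What determines the order of play?",
                         "context": "Players roll the D20; highest roll goes first."})
print(example.to_string())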

Managing State

From what I understand, LangGraph applications manage state through a typed dictionary, which keeps track of the input, intermediate and output data. For our application, the question (input data), context (intermediate data passed between steps), and answer (output data) are managed with this structure.

from typing import List, TypedDict

from langchain_core.documents import Document

# Keep track of the state of the input question, retrieved context and generated answer
class State(TypedDict):
    question: str
    context: List[Document]
    answer: str

Application Steps

Next we define the steps to be taken; LangGraph refers to these as nodes. In our case we have a retrieve node and a generate node. Both steps are provided with the state of our application, and so have access to the question, context and answer.

The retrieve step takes the user’s question and performs a similarity search of our vector store. The result of the search is the four most relevant chunks of our documents. Four just happens to be the default number of chunks to return and can be increased at will; I have not experimented with this (there is a small sketch of adjusting it after the code below).

The output of this node forms the context for our user’s question.

def retrieve(state: State):
    # Search the vector store for the most relevant documents to the input question
    retrieved_docs = vector_store.similarity_search(state["question"])
    # This forms the context for our search
    return {"context": retrieved_docs}

The generate step fills our prompt template with the user’s question and the context that was retrieved. This is then passed to the LLM, which provides a response to the user’s question.

def generate(state: State):
    # Join all the documents in the context into a single string
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    # Generate the answer to the input question
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    # Ask LLM to answer the question
    response = llm.invoke(messages)
    # LLM response forms our answer
    return {"answer": response}

Connecting the Steps

These individual steps (or nodes) are then compiled into a graph forming a sequence of steps. Here a StateGraph, “a graph whose nodes communicate by reading and writing to a shared state,” is used. Our application starts with the retrieval step and then performs the generation step.

# StateGraph and START node marker from LangGraph
from langgraph.graph import START, StateGraph

# Compile application into a graph object that connects the retrieval and generation steps into a single sequence.
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Asking the Question

We can then use our graph to form a very basic application that repeatedly asks the user to enter a question relating to the documents provided. LangGraph supports multiple invocation methods; in this case I chose the ‘stream’ method, which lets me get the output of each step as the graph is executed.

# Ask user to enter a question
question = input("Please enter a question relating to the documents you have provided:")

while question:
    # Crudely print the context and the answer to the question.
    for step in graph.stream({"question": question}, stream_mode="updates"):
        print(f"{step}\n\n----------------\n")

    # Ask user to enter another question
    question = input("Please enter a question relating to the documents you have provided:")
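For completeness, the same graph can also be run without streaming; invoke returns the final state in one go, so just the answer can be printed (a variation I didn’t use here):

# Run the whole graph and keep only the final answer
result = graph.invoke({"question": "In Realm of the Arcane Lords, what determines the order of play?"})
print(result["answer"])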

Example Output

As a test I created a knowledge base of two PDFs containing the instructions for two fictional board games, generated by ChatGPT. You can see, from the output below, the question that I asked, the four most relevant chunks of context that were retrieved (alongside their metadata), and the useful (correct) answer that the application was able to provide based on that context.

Fictional instructions from the Realm of the Arcane Lords board game.

{
  "question": "In realm of the arcane lords, what determines the order of play?",
  "context": [
    {
      "content": "Realm of the Arcane Lords: A Game of Magic, Strategy, and Diplomacy Objective of the Game: The ultimate goal of \"Realm of the Arcane Lords\" is to ascend to the title of Supreme Arcane Lord by accumulating Arcane Power, forming strategic alliances, and tactically outmaneuvering your opponents across the mystical lands of Eldorin. A player wins by either amassing 100 Arcane Points (AP) or by securing dominance over three of the four Ancient Elemental Obelisks scattered across the game board. Components: \u2022 1 Large Game Board depicting the world of Eldorin, divided into 48 Territories \u2022 5 Sets of 15 Territory Control Markers per player \u2022 200 Arcane Power Tokens (AP) \u2022 4 Elemental Obelisk Tokens (Earth, Fire, Water, Air) \u2022 120 Spell Cards (divided into Attack, Defense, Support, and Forbidden Magic categories) \u2022 1 20-sided die (D20) and 3 6-sided dice (D6) \u2022 100 Resource Cards (Gold, Crystals, Herbs, and Relics) \u2022 6 Player Faction Boards representing different mystical orders \u2022 50 Miniature",
      "metadata": {
        "creationdate": "D:20250322185747Z00'00'",
        "creator": "PyPDF",
        "moddate": "D:20250322185747Z00'00'",
        "page": 0,
        "page_label": "1",
        "producer": "macOS Version 14.6.1 (Build 23G93) Quartz PDFContext",
        "source": "pdf_docs/Realm of the Arcane Lords.pdf",
        "start_index": 0,
        "total_pages": 2
      }
    },
    {
      "content": "territories. \u2022 Forbidden Magic Spells can only be used once per game and carry significant risk (e.g., backfiring catastrophically). \u2022 Players may attempt to Bargain with the Elders\u2014a mysterious game mechanic where a player rolls three D6 dice and consults the Elder\u2019s Fate Table to receive a boon or curse. Final Thoughts: \"Realm of the Arcane Lords\" is a game that rewards strategy, diplomacy, and careful planning. Players must weigh the risks of confrontation against the benefits of alliance and resource gathering. Every choice carries weight, and only the most cunning and powerful will rise to claim the title of Supreme Arcane Lord!",
      "metadata": {
        "creationdate": "D:20250322185747Z00'00'",
        "creator": "PyPDF",
        "moddate": "D:20250322185747Z00'00'",
        "page": 1,
        "page_label": "2",
        "producer": "macOS Version 14.6.1 (Build 23G93) Quartz PDFContext",
        "source": "pdf_docs/Realm of the Arcane Lords.pdf",
        "start_index": 794,
        "total_pages": 2
      }
    },
    {
      "content": "8. Players take turns placing one Miniature Wizard onto an unoccupied territory until all starting positions are chosen. Gameplay Mechanics: Each turn consists of four phases: 1. The Arcane Planning Phase: o Players may exchange resources, form temporary alliances, or trade Spell Cards. o Players may activate passive abilities from their faction board. o Players may draft one new Spell Card from the deck. 2. The Tactical Movement Phase: o Players may move their wizards across the board, with a maximum movement of two spaces per turn. o Entering an opponent-controlled territory initiates a Duel of Arcane Might (combat sequence). 3. The Duel of Arcane Might: o The attacker rolls a D20 and adds any applicable bonuses from Spell Cards, resources, or faction abilities.",
      "metadata": {
        "creationdate": "D:20250322185747Z00'00'",
        "creator": "PyPDF",
        "moddate": "D:20250322185747Z00'00'",
        "page": 0,
        "page_label": "1",
        "producer": "macOS Version 14.6.1 (Build 23G93) Quartz PDFContext",
        "source": "pdf_docs/Realm of the Arcane Lords.pdf",
        "start_index": 1602,
        "total_pages": 2
      }
    },
    {
      "content": "categories) \u2022 1 20-sided die (D20) and 3 6-sided dice (D6) \u2022 100 Resource Cards (Gold, Crystals, Herbs, and Relics) \u2022 6 Player Faction Boards representing different mystical orders \u2022 50 Miniature Wizards representing each player\u2019s controlled sorcerers \u2022 1 Rulebook with extended lore Setup: 1. Each player selects a mystical order and receives the corresponding Faction Board. 2. Players each take 15 Territory Control Markers in their faction\u2019s color. 3. Shuffle and place the Spell Cards in their respective piles. 4. Shuffle the Resource Deck and place it near the board. 5. Distribute 10 Arcane Power Tokens to each player. 6. Randomly place the four Elemental Obelisk Tokens in different quadrants of the board. 7. Players roll the D20 to determine the order of play; highest roll goes first. 8. Players take turns placing one Miniature Wizard onto an unoccupied territory until all starting positions are chosen. Gameplay Mechanics: Each turn consists of four phases: 1. The Arcane Planning",
      "metadata": {
        "creationdate": "D:20250322185747Z00'00'",
        "creator": "PyPDF",
        "moddate": "D:20250322185747Z00'00'",
        "page": 0,
        "page_label": "1",
        "producer": "macOS Version 14.6.1 (Build 23G93) Quartz PDFContext",
        "source": "pdf_docs/Realm of the Arcane Lords.pdf",
        "start_index": 804,
        "total_pages": 2
      }
    }
  ],
  "answer": "According to the context, players roll a D20 to determine the order of play; the highest roll goes first."
}

Asking the LLM the same question without any context of our made-up game, of course, results in a made-up answer.

“In realm of the arcane lords, what determines the order of play? In the board game "Realm of the Arcane Lords", the order of play is determined by a unique mechanism called the "Arcane Order" track. Each player has three Arcane Points (AP) that they can use to determine their turn. The AP are distributed at the beginning of the game, and players take turns playing cards from their deck in a specific sequence based on the number of AP they have available.“

RAG Recap

I've really enjoyed learning about RAG, and I see it as a powerful tool that complements an LLM's built-in knowledge. For me, its biggest potential lies in navigating complex proprietary software documentation: things like company policies, dense technical manuals, and otherwise inaccessible stacks of information. With RAG, all of this could become easily searchable through a simple Q&A session in a more mature application.


Written by

Gregor Soutar

Software engineer at the UK Astronomy Technology Centre, currently developing instrument control software for MOONS, a next-generation spectrograph for the Very Large Telescope (VLT).