How to build a PDF chatbot with Langchain 🦜🔗 and FAISS

Kevin Naidoo

While ChatGPT and other similar models are great and can give you relatively good information on almost any topic, two common problems are hallucination and verifying the source of the model's response.

To improve the accuracy and limit the scope of these LLMs to a specific domain, we can use a process called RAG.

In this guide, we will build a small document GPT and go over all of the essential concepts you need to understand.

What is RAG?

When you ask ChatGPT a question, it draws knowledge from its entire training dataset, and therefore there is no way to scope that data to gain some control over its responses.

RAG, or "Retrieval-Augmented Generation", is a technique that allows you to feed large language models like ChatGPT or LLAMA2 with your own custom dataset.

When the user prompts the model, you can then instruct the model to retrieve the answer from your custom dataset.

This leads to better accuracy, and you can also pull in more up-to-date information, unlike ChatGPT (the free version, anyway), which only gives you responses based on training data that's a year or two old.

What are vector embeddings?

Generally in machine learning, we deal with vectors instead of actual text or words. Vector embeddings are a numerical representation of text.

Here is an example of how to generate vector embeddings:

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence embedding model.
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Encode a sentence into a vector embedding.
embeddings = model.encode(["Hello World!"])
print(embeddings)

If you run this code, you will get a large NumPy array of floats such as the following:

[[ 1.91737153e-02  2.87365261e-02 -1.23540871e-02  1.58221070e-02
   7.90899321e-02 -9.76034254e-03  7.49560725e-03  5.52258380e-02
   1.88754648e-02 -2.63798535e-02 -2.68068276e-02 -3.33473049e-02
  -3.00286859e-02  3.89383249e-02  7.69484490e-02 -7.68074691e-02....

What's the point of vector embeddings? Since vectors are numbers, we can perform various mathematical calculations on them to generate or search through text by meaning. This idea of representing meaning as numbers is also at the core of the transformer models that power ChatGPT and other similar LLMs.

One common use case of vector embeddings is for search. The idea is to vectorize data and store these in a vector database such as "FAISS" or "Qdrant".

Instead of using a normal full-text search with keywords, you can now perform a similarity search using the k-nearest neighbors (KNN) algorithm.

For example, in a regular keyword search using SOLR, if you searched for "Give me a list of iOS-based phones?", you probably wouldn't get any good results, because SOLR does not understand the meaning of words and is just looking for similarly spelled words and synonyms.

However, with a vector-based search, the embedding captures the meaning of each word in the context of the surrounding phrase or sentence.

This allows for complex searching capabilities using cosine similarity and other such mathematical measures, which in our example would associate "iOS-based" with the brand "Apple" and "phones" with "iPhone" or "smartphone".
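To make this concrete, here is a minimal sketch that reuses the sentence-transformers model from earlier and compares our example query against two candidate sentences using the library's built-in cosine similarity helper (the candidate sentences are made up for illustration):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Embed the query and two candidate sentences.
query = model.encode("Give me a list of iOS-based phones?", convert_to_tensor=True)
candidates = model.encode(
    ["The iPhone 15 is Apple's latest smartphone.",
     "The best pizza in New York is in Brooklyn."],
    convert_to_tensor=True,
)

# Cosine similarity: the higher the score, the closer the meaning.
print(util.cos_sim(query, candidates))

Even though the query never mentions "Apple" or "iPhone", the first candidate should score noticeably higher, because the embeddings capture meaning rather than spelling.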

What is FAISS?

FAISS, developed by Meta, is a library to store and search vector embeddings. Similar to how you would store documents in a keyword search engine like SOLR or Elasticsearch, FAISS allows you to store vector embeddings and provides neat Python bindings to perform similarity searches.

Why do we need FAISS? Most LLMs have a limitation on how many tokens or words your prompt can contain, thus if you have hundreds of large PDFs, it's almost impossible to give the LLM all this data as context in one go.

A better approach is to use FAISS to return only text that is relevant to the user's prompt. This reduces the amount of data you provide as context and will speed up your LLM queries in general.
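Before we let Langchain manage FAISS for us, here is a rough sketch of what FAISS does on its own: index a handful of vectors and fetch the nearest neighbours of a query vector (the random data is purely for illustration):

import faiss
import numpy as np

dimension = 768  # all-mpnet-base-v2 produces 768-dimensional vectors.
index = faiss.IndexFlatL2(dimension)  # Exact L2 (Euclidean) distance index.

# Pretend these are the embeddings of our PDF chunks.
chunk_vectors = np.random.random((100, dimension)).astype("float32")
index.add(chunk_vectors)

# Embed the user's question the same way, then fetch the 3 closest chunks.
query_vector = np.random.random((1, dimension)).astype("float32")
distances, ids = index.search(query_vector, 3)
print(ids)  # Positions of the most relevant chunks in our original list.

Only those top matches get passed to the LLM as context, which is exactly what keeps the prompt small.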

What is Langchain?

Langchain is a Python library that provides various utilities to help you build applications with LLMs. These include utilities for generating vector embeddings, building prompts, chunking text, formatting the LLM's response, and more.

Furthermore, Langchain provides standardization for 3rd party vendors such as OpenAI or Mistral, giving you an almost plug-and-play architecture that needs minimal code changes when switching between models.
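For example, switching providers is mostly a one-line change because the chat models share the same interface. A hypothetical sketch (assuming the packages we install in the next section, plus langchain-mistralai and the relevant API keys if you want to try Mistral):

from langchain_openai import ChatOpenAI
# from langchain_mistralai import ChatMistralAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0)
# llm = ChatMistralAI(model="mistral-small-latest", temperature=0)  # drop-in swap

# The rest of your code stays the same regardless of the provider.
print(llm.invoke("Summarize RAG in one sentence.").content)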

Getting our hands dirty with code

First off, you will need to install a few pip packages:

pip install openai
pip install langchain
pip install langchain-openai
pip install PyPDF2
pip install langchain-community

# Replace "gpu" with "cpu" if you don't have a GPU.
pip install faiss-gpu

โ„น๏ธ You will notice 3 pip packages are being installed for Langchain. The "langchain-community" package allows third-party services like OpenAI, Mistral, and community developers to build integrations based on interfaces from the core package.

Next, let's build out a couple of functions to help parse PDFs and convert them to raw text.

from PyPDF2 import PdfReader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Will house our FAISS vector store
store = None

# Will convert text into vector embeddings using OpenAI.
# Requires the OPENAI_API_KEY environment variable to be set.
embeddings = OpenAIEmbeddings()

def split_paragraphs(rawText):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )

    return text_splitter.split_text(rawText)

def load_pdfs(pdfs):
    text_chunks = []

    for pdf in pdfs:
        reader = PdfReader(pdf)
        for page in reader.pages:
            raw = page.extract_text()
            chunks = split_paragraphs(raw)
            text_chunks += chunks
    return text_chunks

Splitting the text into chunks is necessary because, when we do a similarity search, we want to match and return only small, relevant pieces of text rather than entire documents.

Furthermore, LLMs have a limit on the number of tokens you can send per request, so chunking helps mitigate the risk of overloading the prompt.
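As a quick sanity check, you can run the splitter on its own and inspect what it produces (the sample text below is made up):

# Illustrative only: feed some fake "PDF text" through our splitter.
sample_text = "\n".join(f"Paragraph {i}: " + "some PDF text " * 20 for i in range(10))

chunks = split_paragraphs(sample_text)
print(len(chunks), "chunks")
print(chunks[0][:120])  # Each chunk stays roughly under 1,000 characters.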

Creating a store

Now that we have raw text from our PDFs, we can convert this text into vector embeddings and store them in our FAISS store.

Continuing from the script above:

def main():
    list_of_pdfs = ["test1.pdf", "test2.pdf"]
    text_chunks = load_pdfs(list_of_pdfs)

    # Index the text chunks in our FAISS store.
    # OpenAIEmbeddings will automatically convert each chunk
    # of text into vector embeddings using the OpenAI API.
    store = FAISS.from_texts(text_chunks, embeddings)

    # Write our index to disk.
    store.save_local("./vectorstore")

if __name__ == "__main__":
    main()

Can you see the power of Langchain? Langchain automatically connects to OpenAI and does all the heavy lifting of encoding our text into vector embeddings and then storing them in the FAISS index.

Now for the fun part, chatting with our PDFs!

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Settings contains the env: OPENAI_API_KEY
import settings

# Load the saved FAISS store from disk.
store = FAISS.load_local("vectorstore", OpenAIEmbeddings(), allow_dangerous_deserialization=True)

# Create an instance of a ChatGPT turbo model.
llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0)

# Build our Langchain chain instance.
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=store.as_retriever()
)

# Ask the LLM a question.
result = chain({"query": "what are exchange control?"})
print(result)

The "RetrievalQA" instance performs a similarity search against our FAISS index and provides this as context to OpenAI.
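Under the hood, the retrieval step is roughly equivalent to the following (a sketch using the same store and example query):

# Roughly what the retriever does behind the scenes:
docs = store.similarity_search("what are exchange control?", k=4)
for doc in docs:
    print(doc.page_content[:100])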

To keep things simple, I am just using a basic prompt. You can, however, customize the prompt template. Langchain supports a wide variety of prompts; a common use case is to give the chatbot a persona, e.g. "You are an expert in Python programming".
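Here is a rough sketch of how that could look. The default "stuff" chain behind RetrievalQA expects a prompt with "context" and "question" variables, and the persona wording below is just an example:

from langchain.prompts import PromptTemplate

# Hypothetical persona prompt for the underlying question-answering chain.
template = """You are an expert in Python programming.
Use the following context to answer the question.
If the answer is not in the context, say you don't know.

{context}

Question: {question}
Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=store.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)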

Learn more about other prompt templates here.

โš ๏ธ Notice I am using "allow_dangerous_deserialization", this allows FAISS to load code from the vectorstore files on disk. Usually, you should not use this setting if data is coming from users. Instead, use a proper vector database to persist your document embeddings.

What about remembering previous conversations?

Users will often ask related follow-up questions, and it would be a pain to keep repeating the same messages. Luckily, Langchain makes LLM memory a breeze.

To store and provide the LLM with conversation history we can do the following:

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()

# "conversations" is your stored chat history, loaded from memory or Redis.
for msg in conversations:
    memory.save_context(
        {"input": msg['human_question']},
        {"output": msg['chatbot_answer']},
    )

chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=store.as_retriever(),
    memory=memory
)