Easiest Guide to build a ChatGPT for your PDF documents using GPT-3/3.5
In this guide, we will see how to build a ChatGPT for your PDF documents, i.e. an AI that will answer your questions based on a particular PDF document.
You could use this to ask questions about your textbooks, ebooks, or anything else, as long as it's in PDF format.
We will be using Python, PyPDF2, LangChain, FAISS, and the OpenAI GPT-3/3.5 API.
Let’s go.
Table of Contents
- The process to build a ChatGPT for your PDF documents
- Requirements to build a ChatGPT for your PDF documents
- Install Python packages
- Setup your working directory/folder
- Import the required Python packages
- Process the PDF
- Create embeddings
- Query the PDF document using the embeddings
- Conclusion
The process to build a ChatGPT for your PDF documents
These are the main steps we will follow to build a ChatGPT for your PDF documents:
- First, we will extract the text from a PDF document and process it to make it ready for the next step.
- Next, we will use an embedding AI model to create embeddings from this text.
- Finally, we will build the query part that takes the user's question, uses the embeddings created from the PDF document, and uses the GPT-3/3.5 API to answer that question.
Requirements to build a ChatGPT for your PDF documents
- We will be using the OpenAI GPT-3/3.5 API for this. Grab your API key from your OpenAI account.
- Python 3 installed on your computer.
Install Python packages
First, install the necessary Python packages. Depending on your Python installation, you can use pip install or python -m pip install. Run these from your command line program; an example command follows the list below.
The Python packages you need to install are:
- PyPDF2
- langchain
- openai
- faiss-cpu
- python-dotenv (used to load the .env file we create in the next step)
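For example, you can install them all in one go (swap in python -m pip install if pip is not on your path):

pip install PyPDF2 langchain openai faiss-cpu python-dotenv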
Setup your working directory/folder
Create a new directory or folder, create a .env file inside it, and write the text below into it:
OPENAI_API_KEY=your-openai-api-key
Make sure to replace the text your-openai-api-key with your actual OpenAI API key.
Import the required Python packages
You can do this in a Jupyter Notebook, a Google Colab notebook, or a Python .py file on your computer.
Make sure it’s in the same folder as the .env file you created above.
# import the modules
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS
import os

# load .env file
from dotenv import load_dotenv
load_dotenv()
Process the PDF
We start by reading in the PDF document (replace my_pdf_doc.pdf with the name of your own PDF file).
reader = PdfReader('my_pdf_doc.pdf')

raw_text = ''
for i, page in enumerate(reader.pages):
    text = page.extract_text()
    if text:
        raw_text += text
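If you want to confirm that the extraction worked, you can print the page count and a short preview of the text (an optional check, not required for the rest of the guide):

# optional sanity check: number of pages and a preview of the extracted text
print(len(reader.pages))
print(raw_text[:500])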
Next, we split the PDF contents into chunks so that each piece is small enough to embed and to fit into the model's prompt later.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.split_text(raw_text)
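You can check how many chunks were produced and peek at the first one (again, just an optional check):

# optional: inspect the chunking result
print(len(texts))
print(texts[0][:200])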
Create embeddings
Now, it's time to create embeddings from the text chunks we created above from the PDF document. We start by setting up the OpenAI embedding model:
embeddings = OpenAIEmbeddings()
We then save the embeddings so that we don't need to create them again and again. The code below saves them to disk. However, you could also save them to one of the various vector databases available.
import pickle

with open("foo.pkl", 'wb') as f:
    pickle.dump(embeddings, f)
Query the PDF document using the embeddings
First, we load the saved embeddings.
with open("foo.pkl", 'rb') as f:
    new_docsearch = pickle.load(f)
There are two ways to query the PDF document using the embeddings.
The first method lists the most similar chunks, which might contain the answer to the query.

docsearch = FAISS.from_texts(texts, new_docsearch)

query = "Your query here"
docs = docsearch.similarity_search(query)
print(docs[0].page_content)
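The code above prints only the top-ranked chunk; similarity_search also accepts a k argument if you want to retrieve and inspect more (or fewer) chunks:

# look at the top 4 matching chunks instead of only the first one
docs = docsearch.similarity_search(query, k=4)
for doc in docs:
    print(doc.page_content[:200])
    print('---')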
The other way is to use the retrieved chunks to build a prompt and have an LLM like GPT-3 answer the question directly.
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
answer = chain.run(input_documents=docs, question=query)
print(answer)
Conclusion
You can use this technique for all kinds of text data beyond just PDFs. You can also use the techniques explained here to turn this into a web-based knowledge retrieval system.
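For instance, here is a minimal sketch of such a web front end using Flask (any web framework would work; install it with pip install flask). It assumes docsearch and chain have already been built as shown above, and the /ask route and JSON shape are just example choices:

# minimal sketch of a web endpoint that reuses docsearch and chain from above
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    question = request.json["question"]           # e.g. {"question": "What is chapter 3 about?"}
    docs = docsearch.similarity_search(question)  # find the most relevant chunks
    answer = chain.run(input_documents=docs, question=question)
    return jsonify({"answer": answer})

if __name__ == "__main__":
    app.run()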
Also, here is the complete code used in this guide.