Retrieval Augmented Generation
This blog post guides you through creating a Jupyter Notebook that utilizes OpenAI's GPT-3.5 model to answer your questions about YouTube videos.
Why a Jupyter Notebook?
Jupyter Notebooks provide an interactive environment for Python coding, making it easier to experiment, visualize data, and explain your code.
Prerequisites
Basic Python knowledge
An OpenAI API key
Tools and Libraries
Python 3.x
- openai library: `pip install openai`
- pytube library (optional): `pip install pytube`
- whisper library (optional): `pip install openai-whisper` (the PyPI name for OpenAI's Whisper)
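The notebook below also imports several LangChain packages, python-dotenv, and docarray (required by the in-memory vector store used later). Assuming the current PyPI package names, the following should cover them:

```
pip install langchain langchain-openai langchain-community langchain-pinecone python-dotenv docarray
```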
The Notebook
- Set up environment variables:

Create a `.env` file to store your API keys:

```
OPENAI_API_KEY=Here-Goes-Your-API-KEY
PINECONE_API_KEY=Here-Goes-Your-API-KEY
```
- Import Libraries:

```python
import os
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

YOUTUBE_VIDEO = "https://youtu.be/BrsocJb-fAo?si=veWyKdjyngCVtDU7"
```
- Setting up the model:

```python
from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")
```
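Before going further, it can help to confirm the model responds; a minimal sanity check (the prompt text here is just an example):

```python
# The chat model returns an AIMessage; .content holds the reply text
print(model.invoke("Reply with 'ready' if you can read this.").content)
```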
- Generate the transcription:

```python
import tempfile
import whisper
from pytube import YouTube

# Only transcribe if the transcript file does not already exist
if not os.path.exists("video_transcript.txt"):
    video = YouTube(YOUTUBE_VIDEO)
    # Pick the first audio-only stream of the video
    audio = video.streams.filter(only_audio=True).first()
    whisper_model = whisper.load_model("base")
    with tempfile.TemporaryDirectory() as tmpdir:
        audio_file = audio.download(output_path=tmpdir)
        transcription = whisper_model.transcribe(audio_file, fp16=False)["text"].strip()
    with open("video_transcript.txt", "w") as file:
        file.write(transcription)
```
- Test Generated Transcription:

```python
with open("video_transcript.txt") as file:
    transcription = file.read()

transcription[:20]
```
- Load Transcription:

```python
from langchain_community.document_loaders import TextLoader

loader = TextLoader("video_transcript.txt")
text_transcription = loader.load()
```
- Split Transcription:

The document is generally too large to handle in one piece, so splitting is required. The RecursiveCharacterTextSplitter splits the document into chunks of a fixed size. Let's split the transcription into chunks of 1000 characters with an overlap of 20 characters:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
text_transcription_documents = text_splitter.split_documents(text_transcription)
```
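To see what the splitter produced, you can inspect the resulting chunks (a quick check, not part of the pipeline itself):

```python
# How many chunks were produced, and what the first one looks like
print(len(text_transcription_documents))
print(text_transcription_documents[0].page_content[:200])
```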
- Set up a Vector Store
We need an efficient way to store document chunks, their embeddings, and perform similarity searches at scale. To do this, we'll use a vector store.
A vector store is a database of embeddings that specializes in fast similarity searches.
```python
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

embeddings = OpenAIEmbeddings()
vectorstore = DocArrayInMemorySearch.from_documents(text_transcription_documents, embeddings)
```
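You can try the store directly before wiring it into a chain; a small sketch (the query text is illustrative):

```python
# Return the chunks most similar to the query
results = vectorstore.similarity_search("What is the video about?", k=3)
for doc in results:
    print(doc.page_content[:100])
```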
- Use of Pinecone:

DocArrayInMemorySearch is an in-memory vector store; for larger workloads we need a vector store that can handle large amounts of data and perform similarity searches at scale. For this example, we'll use Pinecone: create an account, set up an index, get an API key, and set it as the environment variable PINECONE_API_KEY. Then we can load the transcription documents into Pinecone:
```python
from langchain_pinecone import PineconeVectorStore

index_name = "FROM PINECONE CONFIGURATION"  # the name of the index you created in Pinecone

pinecone = PineconeVectorStore.from_documents(
    text_transcription_documents, embeddings, index_name=index_name
)
```
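As with the in-memory store, a quick similarity search confirms the index is populated (query text is illustrative):

```python
pinecone.similarity_search("What is the video about?", k=3)
```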
- Define the Chain:

This chain processes questions and answers using the language model and the retrieval system.
```python
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below.
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
parser = StrOutputParser()

chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
```
This chain retrieves information for the context via `pinecone.as_retriever()` and uses `RunnablePassthrough` to keep the question unchanged. The retrieved context (`context`) and the question (`question`) are combined as a dictionary in the first step. The dictionary is then piped (`|`) to the `prompt` object, which uses the context and question to generate the final prompt for the language model. The generated prompt is then piped to the language model (`model`) for processing. Finally, the model's output is piped to the output parser (`parser`) to be interpreted as a string.
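For intuition, the piped chain is roughly equivalent to running each step by hand; a sketch using the same variable names:

```python
# The same pipeline, written out step by step
question = "Ask your question?"
context_docs = pinecone.as_retriever().invoke(question)  # retrieve relevant chunks
messages = prompt.invoke({"context": context_docs, "question": question})
answer = parser.invoke(model.invoke(messages))
print(answer)
```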
- Ask:

```python
chain.invoke("Ask your question?")
```
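Because the chain is a LangChain runnable, you can also stream the answer token by token instead of waiting for the full response (an optional variation):

```python
for token in chain.stream("Ask your question?"):
    print(token, end="", flush=True)
```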
Conclusion
This blog post demonstrates a basic framework for building a question-answering system for YouTube videos using Python and GPT-3.5. Experiment with the prompt template and explore additional functionalities to enhance this system!