Transcribing YouTube Videos using OpenAI’s Whisper📽️🗣️
Although YouTube has become the standard for video sharing and information gathering, not everyone has the time or capacity to watch a video through to the end. A tool that transcribes these videos can be invaluable in those situations. Today, we’ll look at how to build your own YouTube transcriber with AI.
We’ll also look at how Replicate can be used to scale up and offload the transcription process, and how to use natural language processing to summarize the finished transcript.
What is OpenAI’s Whisper? 🗣️🤖
Whisper is “an automatic speech recognition system trained on multilingual and multitask supervised data” created by OpenAI. It transcribes audio and video with impressive accuracy using state-of-the-art deep learning models, making it easy to extract useful information from large volumes of spoken content.
Whisper has a wide range of potential uses, but here we’ll use it specifically to transcribe the audio from YouTube videos.
Getting started 👶🏻
We’ll use Python 3 for these examples, since Whisper is distributed as a Python package.
Virtual Environment Setup 🏞️
Generally speaking, it’s a good idea to isolate your package installations when starting a new Python project. We can do this by creating a virtual environment:
python3 -m venv venv
This will create your virtual environment in a folder called venv. From here, we can then activate it:
. venv/bin/activate
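Note that the command above is for macOS/Linux; on Windows, the activation script lives at a different path:

venv\Scripts\activate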
Installing Dependencies 📦
We’ll use pip to install the packages we need:
pip install openai-whisper openai yt-dlp
openai-whisper — Whisper model and API
openai — OpenAI API client, used here for natural language processing
yt-dlp — library for extracting YouTube data
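One prerequisite worth flagging: both Whisper and yt-dlp’s audio extraction rely on ffmpeg being installed on your system. If you don’t have it yet, you can install it through your platform’s package manager, for example:

brew install ffmpeg        # macOS (Homebrew)
sudo apt install ffmpeg    # Debian/Ubuntu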
Fetching the YouTube Audio Stream 📽️
To give us something to work with, I’ve chosen a short TED-Ed video as an example.
Using the video ID, we can then fetch the stream data and extract the audio from the video:
import yt_dlp

def download(video_id: str) -> str:
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        # Prefer an m4a audio stream, falling back to the best available.
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        'postprocessors': [{
            # Extract the audio track using ffmpeg.
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')
    return f'audio/{video_id}.m4a'

def main():
    # The video ID of the TED-Ed example video.
    file_path = download('bFIVYRfyb3E')
This will download the video’s audio track to audio/bFIVYRfyb3E.m4a.
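If you’re following along in a single script, you can also add the standard entry-point guard so that main() runs when the file is executed directly:

if __name__ == '__main__':
    main()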
Transcribing the Audio File 🤖
Now that we have the audio file on hand, we can simply feed it into Whisper:
import whisper

# You can adjust the model used here. Model choice is typically a tradeoff between accuracy and speed.
# All available models are located at https://github.com/openai/whisper/#available-models-and-languages.
whisper_model = whisper.load_model("base.en")

def transcribe(file_path: str) -> str:
    # `fp16` defaults to `True`, which tells the model to attempt to run on GPU.
    # For local demonstration purposes, we'll run this on the CPU by setting it to `False`.
    transcription = whisper_model.transcribe(file_path, fp16=False)
    return transcription['text']

def main():
    transcript = transcribe('audio/bFIVYRfyb3E.m4a')
    print(transcript)
This will generate the full transcript for the video:
[Image: Video to text]
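As a side note, Whisper returns more than just the raw text: the same result dictionary also carries a segments list with per-segment timestamps. Here’s a small sketch of how you might print them:

result = whisper_model.transcribe('audio/bFIVYRfyb3E.m4a', fp16=False)
for segment in result['segments']:
    # Each segment includes start/end offsets (in seconds) alongside its text.
    print(f"[{segment['start']:6.1f}s -> {segment['end']:6.1f}s] {segment['text']}")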
Generating a Transcript Summary 📃
Many YouTube creators incorporate sponsorships, ads, and filler content in their videos. With the help of natural language processing, we can condense the transcript into a more manageable summary. For this example, we’ll generate these summaries using the widely used gpt-3.5-turbo model.
You’ll need an OpenAI account to create an API key. As a new user, you’ll be given some free usage to try out the API.
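Hardcoding the key, as in the example below, is fine for a quick demo, but for anything you commit it’s safer to read it from an environment variable:

import os
import openai

# Assumes you've exported OPENAI_API_KEY in your shell beforehand.
openai.api_key = os.environ['OPENAI_API_KEY']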
import openai

openai.api_key = "<YOUR_OPENAI_API_KEY>"

def generate_summary(transcript: str) -> str:
    # Generate a summary of the transcript using OpenAI's gpt-3.5-turbo model.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f'Summarize this: {transcript}'},
        ]
    )
    return resp['choices'][0]['message']['content']

def main():
    transcript = transcribe('audio/bFIVYRfyb3E.m4a')
    summary = generate_summary(transcript)
    print(summary)
Results will vary from run to run, but here’s an example of what to expect:
[Image: Summary]
Have fun! You can adjust this in a variety of ways to tailor the output to your goals; one hypothetical tweak follows below.
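For instance, here’s a sketch that asks for a fixed number of bullet points instead of free-form prose (the function name and prompt wording are just illustrations, not part of the original example):

def generate_bullet_summary(transcript: str, num_bullets: int = 5) -> str:
    # Same gpt-3.5-turbo call as above, with a more prescriptive prompt.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You summarize transcripts into concise bullet points."},
            {"role": "user", "content": f'Summarize this in {num_bullets} bullet points: {transcript}'},
        ]
    )
    return resp['choices'][0]['message']['content']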
Optional: Scale with Replicate ✌🏻
Replicate lets us run open-source models in the cloud. Depending on your use case, this can be an invaluable tool for scaling your application.
If you’d like to use it, create a Replicate account to get an API token. Then install the Replicate client with pip so we can use it in our code:
pip install replicate
With Replicate set up, we can adjust the transcribe code above to optionally offload the work to Replicate instead of running only on the local CPU:
import replicate

def transcribe(file_path: str, use_replicate: bool = False) -> str:
    if use_replicate:
        client = replicate.Client(api_token='<YOUR_REPLICATE_API_TOKEN>')
        # Run Whisper remotely on Replicate; `language` and `model` are
        # passed as model inputs alongside the audio file.
        transcription = client.run(
            'openai/whisper:30414ee7c4fffc37e260fcab7842b5be470b9b840f2b608f5baa9bbef9a259ed',
            input={
                'audio': open(file_path, 'rb'),
                'language': 'en',
                'model': 'base',
            }
        )['transcription']
    else:
        transcription = whisper_model.transcribe(file_path, fp16=False)['text']
    return transcription
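A minimal usage sketch, assuming the token above is set: flipping the flag routes the transcription to Replicate rather than your own machine.

def main():
    file_path = download('bFIVYRfyb3E')
    transcript = transcribe(file_path, use_replicate=True)
    print(generate_summary(transcript))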
Conclusion 💭
Whisper is a powerful tool for building transcribers that can efficiently extract insights from audio and video sources. Whether you’re a content creator looking to repurpose your video content, a researcher analyzing data from video interviews, or anyone else who works with spoken data, Whisper can help you streamline your workflow and discover fresh insights in your content.
Tell me what you think! 🤔 Enjoy this post? 😃 Hungry for more? ✅
Don’t miss out: subscribe for more quality content delivered right to your inbox!
Connect with Me on Social Media
🐦 Follow me on Twitter: devangtomar7
🔗 Connect with me on LinkedIn: devangtomar
📷 Check out my Instagram: be_ayushmann
Ⓜ️ Checkout my blogs on Medium: Devang Tomar
#️⃣ Checkout my blogs on Hashnode: devangtomar
🧑💻 Checkout my blogs on Dev.to: devangtomar