PDF to Text Converter in Python


PDFs are one of the most widely used document formats, found everywhere from academic papers to invoices to manuals. But what if you want to extract text from a PDF for further analysis, search, or editing? Manually copying text is inefficient, especially for large documents.
In this article, you'll learn how to create a PDF to Text Converter using Python, complete with a breakdown of how it works. It’s perfect for learners looking to gain hands-on experience with real-world automation using Python.
What You Will Learn
How to read and open PDF files with Python.
How to extract text from each page.
How to save that text into a .txt file.
Common limitations and how to improve upon the basic version.
Tools You Need
All you need is Python and a single external library:
Step 0: Install Required Library
Open your terminal or command prompt and run:
pip install PyPDF2
This installs PyPDF2, a powerful and beginner-friendly library for working with PDF files.
How the Script Works (Explained Step-by-Step)
Let’s go through each part of the script so you understand exactly what’s happening and why.
Step 1: Import Required Modules
import PyPDF2
import os
PyPDF2 is used to read the PDF.
os helps us work with file paths, like changing a .pdf extension to a .txt extension.
Step 2: Define the Path to the PDF File
pdf_path = r"C:\Users\YourName\Documents\sample.pdf"
Replace this with the path to the PDF file you want to extract from. The r before the string tells Python to treat it as a raw string, so backslashes (\) in a Windows path are kept as literal characters instead of being read as the start of an escape sequence.
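To see what the raw-string prefix changes, here is a small sketch. The path is a made-up example (not a real file), chosen because \t and \n happen to be escape sequences in ordinary strings.

```python
from pathlib import PureWindowsPath

# Hypothetical example path, used only to show the difference.
plain = "C:\temp\notes.pdf"   # "\t" becomes a tab, "\n" becomes a newline
raw = r"C:\temp\notes.pdf"    # raw string: backslashes stay exactly as typed

print(repr(plain))
print(repr(raw))

# Forward slashes are another way to sidestep the problem on Windows.
same_path = PureWindowsPath("C:/temp/notes.pdf") == PureWindowsPath(raw)
print(same_path)
```

Printing with repr() makes the hidden tab and newline visible, which is usually the quickest way to spot a path that was mangled by escapes.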
Step 3: Open the PDF File
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)
We open the file in binary read mode ('rb'), which is necessary for reading non-text formats like PDFs. PdfReader then allows us to access the contents.
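As a rough illustration of what binary mode gives you, the sketch below reads the first few bytes of a file and compares them with the %PDF- marker that PDF files normally begin with. The helper name looks_like_pdf and the throwaway files are assumptions made for this example; this is only a quick sanity check, not full validation.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file begins with the usual PDF header bytes."""
    # In 'rb' mode read() returns bytes, so we compare with a bytes literal.
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstration with two throwaway files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    fake_pdf = os.path.join(tmp, "fake.pdf")
    with open(fake_pdf, "wb") as f:
        f.write(b"%PDF-1.7\n")

    notes = os.path.join(tmp, "notes.txt")
    with open(notes, "wb") as f:
        f.write(b"hello")

    pdf_check = looks_like_pdf(fake_pdf)
    txt_check = looks_like_pdf(notes)

print(pdf_check, txt_check)
```

A check like this can run before handing the file to PdfReader, so a wrongly named file produces a clear message rather than a parsing error.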
Step 4: Extract Text from Each Page
    # Still inside the with block from Step 3
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
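This loop goes through each page and uses extract_text() to pull out the readable text. Some PDFs may contain images or scanned content, in which case extract_text() might return None or an empty string. We handle that with or "" so we don’t get errors. Keep the loop inside the with block from Step 3: PyPDF2 reads page data from the file as it is needed, so the file must still be open.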
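The same idea can be sketched as a small helper that also puts a separator between pages, so the last line of one page does not run into the first line of the next. The names join_page_text and FakePage are hypothetical; FakePage only imitates the extract_text() method so the example runs without a real PDF.

```python
def join_page_text(pages, separator="\n"):
    """Collect text from page-like objects, treating missing text as empty."""
    chunks = []
    for page in pages:
        # extract_text() may give None or "" for image-only pages
        chunks.append(page.extract_text() or "")
    return separator.join(chunks)


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo = join_page_text([FakePage("Page one"), FakePage(None), FakePage("Page three")])
print(repr(demo))
```

With a real file you would pass reader.pages instead of the fake pages. Collecting chunks in a list and joining once is a common alternative to repeated += on a string.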
Step 5: Save the Extracted Text to a File
output_path = os.path.splitext(pdf_path)[0] + "_output.txt"
with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(text)
os.path.splitext(pdf_path)[0] gets the file path without the .pdf extension.
We add _output.txt to build the name of the new file.
The file is saved using UTF-8 encoding to support special characters.
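Here is a minimal sketch of how that path is built, using made-up file names. The helper name output_path_for is an assumption for illustration and is not part of the script.

```python
import os


def output_path_for(pdf_path, suffix="_output.txt"):
    """Build the output file path by swapping the extension for a suffix."""
    # splitext returns a (root, extension) pair; we keep only the root.
    root, _extension = os.path.splitext(pdf_path)
    return root + suffix


example = output_path_for("reports/sample.pdf")
no_extension = output_path_for("README")
double_dot = output_path_for("archive.2024.pdf")

print(example)
print(no_extension)
print(double_dot)
```

Only the last extension is removed, so a name such as archive.2024.pdf keeps the .2024 part, and the output file always lands in the same folder as the original PDF.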
Step 6: Notify the User
print("✅ PDF text extracted successfully and saved to:", output_path)
This gives confirmation and tells the user where to find the resulting file.
The Complete Script
Here’s the full code. You can copy, paste, and run it in any Python environment where PyPDF2 is installed:
import PyPDF2
import os

# === Step 1: Set the path to your PDF file ===
pdf_path = r"C:\Users\YourName\Documents\sample.pdf"  # Replace with your actual PDF file path

# === Step 2: Open the PDF file ===
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)

    # === Step 3: Extract text from each page (while the file is still open) ===
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""  # Extracts text or adds an empty string if not found

# === Step 4: Save the text to a .txt file ===
output_path = os.path.splitext(pdf_path)[0] + "_output.txt"
with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(text)

# === Step 5: Print success message ===
print("✅ PDF text extracted successfully and saved to:", output_path)
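One possible next step is to wrap the script in a function so it can be reused for several files and fails with a clear message when the path is wrong. The sketch below is one way to do that, not the only one. The function name convert_pdf is an assumption, and PyPDF2 is imported inside the function so that the path check works even on a machine where the library is not installed yet.

```python
import os


def convert_pdf(pdf_path):
    """Extract text from pdf_path, save it next to the PDF, return the new path."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"No such PDF file: {pdf_path}")

    # Imported here (an assumption of this sketch) so the check above
    # still runs when PyPDF2 is missing.
    import PyPDF2

    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # Read every page while the file is still open.
        text = "".join(page.extract_text() or "" for page in reader.pages)

    output_path = os.path.splitext(pdf_path)[0] + "_output.txt"
    with open(output_path, "w", encoding="utf-8") as output_file:
        output_file.write(text)
    return output_path
```

Calling convert_pdf(r"C:\Users\YourName\Documents\sample.pdf") would then return the path of the new .txt file, which you can print or pass on to other code.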
What You Should Know
PyPDF2 works best with text-based PDFs. If your PDF contains scanned images or handwriting, the extract_text() function won’t retrieve any content.
For scanned files, you’ll need an Optical Character Recognition (OCR) tool like pytesseract.
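Before reaching for OCR, it can help to find out which pages actually came back empty. The sketch below is one simple way to do that. The names pages_without_text and StubPage are hypothetical, and StubPage only imitates extract_text() so the example runs without a real PDF.

```python
def pages_without_text(pages):
    """Return 1-based numbers of pages whose extracted text is empty."""
    empty = []
    for number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""
        # Whitespace-only results are treated as "no text" as well.
        if not text.strip():
            empty.append(number)
    return empty


class StubPage:
    """Hypothetical stand-in for a PDF page, used only for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


flagged = pages_without_text([StubPage("Intro"), StubPage(None), StubPage("  \n")])
print(flagged)
```

With a real file you would pass reader.pages. Pages in the returned list are likely candidates for OCR, although a page can also be empty simply because it is blank.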
Try It Yourself
Download or choose a text-based PDF (e.g., a research paper or resume).
Update the pdf_path variable with the correct file path.
Run the script.
Open the .txt file and check the output!
Conclusion
This small yet powerful project is a great stepping stone into real-world Python scripting. It introduces you to file handling, working with external libraries, and automating everyday tasks. PDF processing is just one of many ways Python can simplify your digital workflow.
Happy Scripting!
Written by Michelle Mukai
