PDF to Text Converter in Python

Michelle Mukai

PDFs are one of the most widely used document formats — from academic papers to invoices to manuals. But what if you want to extract text from a PDF for further analysis, search, or editing? Manually copying text is inefficient, especially for large documents.

In this article, you'll learn how to create a PDF to Text Converter using Python, complete with a breakdown of how it works. It’s perfect for learners looking to gain hands-on experience with real-world automation using Python.

What You Will Learn

  • How to read and open PDF files with Python.

  • How to extract text from each page.

  • How to save that text into a .txt file.

  • Common limitations and how to improve upon the basic version.

Tools You Need

All you need is Python and a single external library, PyPDF2.

Step 0: Install Required Library

Open your terminal or command prompt and run:

pip install PyPDF2

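This installs PyPDF2, a beginner-friendly library for reading and manipulating PDF files. Note that PyPDF2 is no longer actively developed: its maintainers continue the project under the name pypdf, which provides the same PdfReader interface used in this article.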

How the Script Works (Explained Step-by-Step)

Let’s go through each part of the script so you understand exactly what’s happening and why.

Step 1: Import Required Modules

import PyPDF2
import os

  • PyPDF2 is used to read the PDF.

  • os helps us work with file paths, such as deriving the .txt output path from the .pdf path.

Step 2: Define the Path to the PDF File

pdf_path = r"C:\Users\YourName\Documents\sample.pdf"

Replace this with the path to the PDF file you want to extract from. The r before the string tells Python to treat it as a raw string, which helps when your path includes backslashes (\) on Windows.
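To see why the r prefix matters, here is a small sketch comparing three ways of writing a Windows path. The path C:\new\table.txt is a made-up example, chosen because \n and \t happen to be escape sequences.

```python
# Hypothetical path, picked so that \n and \t collide with escape sequences.
plain = "C:\new\table.txt"       # \n becomes a newline, \t becomes a tab
raw = r"C:\new\table.txt"        # raw string: backslashes are kept literally
escaped = "C:\\new\\table.txt"   # doubling each backslash also works

print(repr(plain))     # shows the newline and tab hidden inside the string
print(raw == escaped)  # the raw and escaped forms are the same string
```

With the article's own path, leaving out the r would fail even earlier, because \U in an ordinary string starts a Unicode escape. Forward slashes (C:/Users/...) are another option that Python accepts on Windows.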

Step 3: Open the PDF File

with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)

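We open the file in binary read mode ('rb'), which is necessary because PDFs are not plain text. PdfReader then parses the file and gives us access to its pages. The code in the next step stays indented inside this with block, since the reader pulls page content from the file while it is still open.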
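Binary mode also makes it easy to sanity-check a file before handing it to PyPDF2. The sketch below is an optional extra rather than part of the article's script: looks_like_pdf is a hypothetical helper that relies on the fact that PDF files normally begin with the bytes %PDF-.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature bytes.

    Hypothetical helper for illustration. It only inspects the header,
    so a True result does not prove the file is a valid, readable PDF.
    """
    with open(path, "rb") as file:   # 'rb' yields raw bytes, not decoded text
        header = file.read(5)
    return header == b"%PDF-"
```

Calling looks_like_pdf(pdf_path) before opening the reader lets the script stop with a clear message when someone points it at the wrong kind of file.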

Step 4: Extract Text from Each Page

    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""

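This loop goes through each page and uses extract_text() to pull out the readable text. Pages that consist only of images or scanned content have no text layer, so extract_text() can come back empty. The or "" fallback guarantees we always append a string, even if a page yields None. Note that the pages are joined back to back, with no separator between them.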
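One limitation of the loop above is that page boundaries disappear in the output. As a possible refinement, the sketch below joins the pages with a labeled divider. The name join_pages and the divider format are assumptions made for illustration; the function accepts any iterable of objects with an extract_text() method, such as reader.pages.

```python
def join_pages(pages):
    """Join per-page text, inserting a divider line that names each page.

    `pages` is any iterable of objects exposing extract_text(),
    for example reader.pages from PyPDF2. Hypothetical helper.
    """
    chunks = []
    for number, page in enumerate(pages, start=1):
        page_text = page.extract_text() or ""   # guard against empty or None results
        chunks.append(f"--- Page {number} ---\n{page_text.strip()}")
    return "\n\n".join(chunks)
```

Inside the with block, text = join_pages(reader.pages) could stand in for the loop if you want the page breaks preserved.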

Step 5: Save the Extracted Text to a File

output_path = os.path.splitext(pdf_path)[0] + "_output.txt"
with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(text)

  • os.path.splitext(pdf_path)[0] returns the full path without the .pdf extension.

  • We append _output.txt to it, so sample.pdf becomes sample_output.txt in the same folder.

  • The file is saved using UTF-8 encoding to support special characters.
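To make the path handling concrete, here is a small sketch of what os.path.splitext returns. text_output_path is a hypothetical wrapper around the same expression used in this step, and the file names are made up.

```python
import os

def text_output_path(pdf_path):
    """Build the output path: same folder and name, with _output.txt appended."""
    base, extension = os.path.splitext(pdf_path)  # splits at the last dot only
    return base + "_output.txt"

print(os.path.splitext("reports/summary.pdf"))    # ('reports/summary', '.pdf')
print(text_output_path("reports/summary.pdf"))    # reports/summary_output.txt
print(text_output_path("notes.v2.pdf"))           # notes.v2_output.txt
```

Because only the final extension is split off, dots earlier in the file name are left alone.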

Step 6: Notify the User

print("✅ PDF text extracted successfully and saved to:", output_path)

This gives confirmation and tells the user where to find the resulting file.

The Complete Script

Here’s the full code. You can copy, paste, and run it in any Python environment where PyPDF2 is installed:

# === Step 1: Import required modules ===
import PyPDF2
import os

# === Step 2: Set the path to your PDF file ===
pdf_path = r"C:\Users\YourName\Documents\sample.pdf"  # Replace with your actual PDF file path

# === Step 3: Open the PDF file in binary read mode ===
with open(pdf_path, 'rb') as file:
    reader = PyPDF2.PdfReader(file)

    # === Step 4: Extract text from each page (while the file is still open) ===
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""  # Falls back to an empty string if the page has no text

# === Step 5: Save the text to a .txt file next to the PDF ===
output_path = os.path.splitext(pdf_path)[0] + "_output.txt"
with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(text)

# === Step 6: Print success message ===
print("✅ PDF text extracted successfully and saved to:", output_path)

What You Should Know

  • PyPDF2 works best with text-based PDFs. If your PDF contains scanned pages or handwriting, those pages are stored as images, so the extract_text() method will return little or no text.

  • For scanned files, you’ll need an Optical Character Recognition (OCR) engine such as Tesseract, which you can call from Python through pytesseract.
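A script can at least detect which pages need OCR by checking for empty extraction results. The sketch below is one possible shape for that fallback, not a tested OCR pipeline: fill_with_ocr and its ocr_page callback are hypothetical names, and the callback marks the spot where a real OCR call would go.

```python
def fill_with_ocr(page_texts, ocr_page):
    """Replace empty page texts with OCR output.

    page_texts: list of strings (or None), one per page, from extract_text().
    ocr_page:   hypothetical callback that takes a zero-based page index
                and returns the recognized text for that page.
    Returns the completed texts and the indexes that were sent to OCR.
    """
    completed, ocr_indexes = [], []
    for index, page_text in enumerate(page_texts):
        if page_text and page_text.strip():
            completed.append(page_text)          # text layer present, keep it
        else:
            completed.append(ocr_page(index))    # no text layer, fall back to OCR
            ocr_indexes.append(index)
    return completed, ocr_indexes
```

In practice, ocr_page could render the page to an image with pdf2image.convert_from_path and pass that image to pytesseract.image_to_string. Both packages are separate installs, and they depend on the Poppler and Tesseract programs being available on your system.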

Try It Yourself

  1. Download or choose a text-based PDF (e.g., a research paper or resume).

  2. Update the pdf_path variable with the correct file path.

  3. Run the script.

  4. Open the .txt file and check the output!

Conclusion

This small yet powerful project is a great stepping stone into real-world Python scripting. It introduces you to file handling, working with external libraries, and automating everyday tasks. PDF processing is just one of many ways Python can simplify your digital workflow.

Happy scripting!
