5 Practical Ways Developers Can Automate PDF Tasks in Python

Casie Liu

Automating PDF work is a reliable way to cut repetitive effort and improve productivity. For developers handling reports, contracts, or invoices, manual operations are time-consuming and error-prone. Python's simple syntax and large ecosystem include several libraries that make common PDF tasks (splitting, merging, extracting, and generating) much easier. This article walks through five practical approaches, from calling external tools through the standard library to using a commercial library, to help you automate PDF tasks with Python.

Using Built-in Libraries (os, subprocess) to Call External Tools

Standard-library modules such as os and subprocess have no PDF support of their own. You can, however, use them to call external command-line tools, which is enough for basic PDF operations such as merging, splitting, or converting files.

The following example converts a PDF file into a .txt document by calling the pdftotext command-line tool:

import subprocess
import os

input_path = r" \input\Booklet.pdf"
output_path = r" \output\Booklet-1.txt"

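# Call the pdftotext command-line tool; check=True raises CalledProcessError if it fails
subprocess.run(["pdftotext", input_path, output_path], check=True)

# Confirm that the output file was written
if os.path.exists(output_path):
    print("Text extracted successfully:", output_path)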

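Note: This code relies on pdftotext, a command-line utility from the third-party Poppler package (Poppler for Windows in this example), which must be installed and available on your PATH.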
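
A script that depends on an external tool is easier to debug when it checks for the tool before running it. The sketch below is one possible way to do that and is not part of the original example: the helper names build_pdftotext_command and convert_pdf_to_text are hypothetical, and it assumes pdftotext from Poppler is the tool in use. The -layout flag asks pdftotext to keep the original physical layout of the text.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to maintain the original physical layout
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def convert_pdf_to_text(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext and return the output path (hypothetical helper)."""
    # Fail early with a clear message if the tool is not on PATH
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext not found; install Poppler and add it to PATH")
    command = build_pdftotext_command(pdf_path, txt_path, keep_layout)
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(command, check=True)
    return Path(txt_path)
```

Calling convert_pdf_to_text(input_path, output_path) would then take the place of the direct subprocess.run call in the example above.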

Using PyPDF2 for Basic PDF Manipulation

After exploring Python's built-in libraries, let's move on to one of the most popular open-source Python PDF libraries, PyPDF2. This lightweight library handles common PDF tasks such as merging multiple files, splitting documents, rotating pages, and extracting text. It's a good starting point for basic PDF manipulation without heavy external tools. Note that PyPDF2 is no longer actively developed: its maintainers continue the project under the name pypdf, which provides the same PdfReader and PdfWriter classes used here.

The code below demonstrates how to split a PDF using PyPDF2 and save the first two pages as a new PDF document:

from PyPDF2 import PdfReader, PdfWriter

input_path = r" \input\Booklet.pdf"
output_path = r" \output\Booklet_split.pdf"

# Create a PdfReader object and read the source PDF
reader = PdfReader(input_path)

# Create a PdfWriter object
writer = PdfWriter()

# Add the first two pages to the writer (fewer if the PDF is shorter)
for i in range(min(2, len(reader.pages))):
    writer.add_page(reader.pages[i])

# Write the split PDF to a new file
with open(output_path, "wb") as f:
    writer.write(f)

print("Split PDF created:", output_path)

Here’s the preview comparing the split PDF with the original document:

Split a PDF in Python with PyPDF2
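
The same PdfReader and PdfWriter calls extend naturally to splitting a whole document into fixed-size parts. The following is a minimal sketch under stated assumptions rather than a definitive implementation: chunk_ranges and split_into_chunks are hypothetical helper names, and the part_N.pdf naming scheme is invented for illustration.

```python
from pathlib import Path


def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) page index pairs covering total_pages (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_into_chunks(input_path, output_dir, chunk_size):
    """Write one PDF per chunk and return the list of created paths."""
    # Imported here so chunk_ranges can be used without PyPDF2 installed
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    created = []
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # The "part_N.pdf" naming scheme is an assumption for this sketch
        target = output_dir / f"part_{number}.pdf"
        with open(target, "wb") as f:
            writer.write(f)
        created.append(target)
    return created
```

For a five-page document and a chunk size of 2, chunk_ranges yields three ranges, so split_into_chunks would write three files, the last holding a single page.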

Using pdfplumber for Accurate Text Extraction

PDF files are difficult to edit directly, so it is often easier to extract their contents, such as text or tables, for reuse elsewhere. This is where pdfplumber comes in handy. It is an open-source Python library designed specifically for text and table extraction, and it aims to preserve the original structure of the content, which makes it a good choice for scenarios that require reliable data extraction.

The code below demonstrates how to extract text from a PDF using pdfplumber and print it to the console:

import pdfplumber

input_path = r" \input\Booklet.pdf"

with pdfplumber.open(input_path) as pdf:
    # pdf.pages is zero-indexed, so index 0 is the first page
    page = pdf.pages[0]
    text = page.extract_text()
    print("Extracted text from first page:\n", text)
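
Since pdfplumber also targets table extraction, a short sketch of that use case may help. It assumes the page contains at least one table that pdfplumber can detect with its default settings; clean_table and extract_page_tables are hypothetical helper names introduced here. Empty cells come back as None, so the helper normalizes them to empty strings.

```python
def clean_table(table):
    """Replace empty (None) cells with "" and strip whitespace (hypothetical helper)."""
    return [
        [(cell or "").strip() for cell in row]
        for row in table
    ]


def extract_page_tables(input_path, page_index=0):
    """Return the cleaned tables found on one page of a PDF."""
    # Imported here so clean_table can be used without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(input_path) as pdf:
        page = pdf.pages[page_index]
        # extract_tables() returns a list of tables; each table is a list of rows
        return [clean_table(table) for table in page.extract_tables()]
```

Each cleaned table is a plain list of rows, so it can be passed directly to csv.writer(...).writerows() for export.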

Using ReportLab for PDF Creation and Customization

So far, we have focused on extracting or converting content from existing PDF documents, or splitting pages. Many scenarios instead require generating PDF documents dynamically, for example creating reports automatically, adding watermarks, or inserting charts and images. ReportLab is a Python library that lets developers create PDF files from scratch and control their content and layout in detail, so you can generate personalized PDFs for a variety of business or presentation needs.

The Python code below shows how to create a PDF file using ReportLab:

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4

output_path = r" \output\Generated.pdf"

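# Create an A4 canvas; coordinates are in points, measured from the bottom-left corner
c = canvas.Canvas(output_path, pagesize=A4)
c.drawString(100, 750, "Hello, this is a PDF created using ReportLab!")
c.drawString(100, 730, "You can add text, images, and even charts.")
# Finish the current page, then write the file to disk
c.showPage()
c.save()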

print("New PDF generated:", output_path)

Here’s the preview of the output file:

Create a PDF Using ReportLab in Python
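
Automatically generated reports usually run longer than one page, and drawString does not wrap or paginate on its own. The sketch below shows one way to handle that with the same canvas calls; paginate and write_text_pdf are hypothetical helpers, and the margin and line-height values are assumptions chosen so that 40 lines fit on an A4 page.

```python
def paginate(lines, lines_per_page):
    """Split a list of text lines into pages (hypothetical helper)."""
    if lines_per_page < 1:
        raise ValueError("lines_per_page must be at least 1")
    return [
        lines[start:start + lines_per_page]
        for start in range(0, len(lines), lines_per_page)
    ]


def write_text_pdf(output_path, lines, lines_per_page=40):
    """Draw the lines onto A4 pages, starting a new page when one is full."""
    # Imported here so paginate can be used without ReportLab installed
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    _, page_height = A4
    top_margin, left_margin, line_height = 72, 72, 16  # points; assumed values
    c = canvas.Canvas(output_path, pagesize=A4)
    for page_lines in paginate(lines, lines_per_page):
        # Start each page near the top and move down one line at a time
        y = page_height - top_margin
        for line in page_lines:
            c.drawString(left_margin, y, line)
            y -= line_height
        # Finish this page before starting the next one
        c.showPage()
    c.save()
```

For richer layouts with automatic text wrapping, ReportLab's higher-level platypus module is usually a better fit than manual positioning.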

Using Professional Libraries like Spire.PDF for Advanced Scenarios

Beyond open-source libraries, there are also commercial options for Python, such as Spire.PDF for Python. It offers a broad feature set that covers many common Adobe Acrobat operations, including PDF encryption, digital signatures, and format conversion. It also targets automation-heavy tasks such as batch form field processing, generating dynamic PDFs with complex charts, and automating PDF report creation. For professional, automated, and highly customizable PDF processing, Spire.PDF offers both flexibility and efficiency.

To give you a clearer idea of how it works, let’s look at two code examples that demonstrate how Spire.PDF handles PDF tasks in Python.

The code example here demonstrates how to add a digital signature to a PDF file:

from spire.pdf.common import *
from spire.pdf import *

# Load a PDF
doc = PdfDocument()
doc.LoadFromFile("/input/Booklet.pdf")

# Create a signature maker from the document, the .pfx certificate file, and the certificate password
signatureMaker = PdfOrdinarySignatureMaker(doc, " /alice.pfx", "e-iceblue")

# Configure the signature properties like the signer's name, contact information, location and signature reason
signature = signatureMaker.Signature
signature.Name = "Alice"
signature.ContactInfo = "+86 12345678"
signature.Location = "China"
signature.Reason = "I am the author."

# Create a custom signature appearance
appearance = PdfSignatureAppearance(signature)
appearance.NameLabel = "Signer: "
appearance.ContactInfoLabel = "Phone: "
appearance.LocationLabel = "Location: "
appearance.ReasonLabel = "Reason: "
appearance.SignatureImage = PdfImage.FromFile("/signature2.png")
appearance.GraphicMode = GraphicMode.SignImageAndSignDetail
appearance.SignImageLayout = SignImageLayout.none

# Get the first page
page = doc.Pages[0]

# Add the signature to a specified location of the page
signatureMaker.MakeSignature("Signature by Alice", page, 90.0, 600.0, 260.0, 100.0, appearance)

# Save the signed document
doc.SaveToFile("/output/Signed.pdf")
doc.Close()

Here’s a result preview:

Add a Digital Signature in PDF Using Spire.PDF

The code example below shows how to encrypt a PDF with an open password and a permission password:

from spire.pdf.common import *
from spire.pdf import *

# Load a PDF file
doc = PdfDocument()
doc.LoadFromFile("/input/Booklet.pdf")

# Encrypt the PDF file with an open password and a permission password
doc.Security.Encrypt("openPsd", "permissionPsd", PdfPermissionsFlags.FillFields, PdfEncryptionKeySize.Key128Bit)

# Save the encrypted document and release its resources
doc.SaveToFile("/output/Encrypted.pdf", FileFormat.PDF)
doc.Close()

Here’s the result preview:

Protect a PDF by Encrypting It Using Spire.PDF

Conclusion

In summary, Python offers a wide range of solutions for working with PDFs, from open-source libraries that cover basic needs to commercial tools that provide advanced features for more complex workflows. Whether a project calls for simple text extraction, document conversion, or fully automated PDF workflows, developers can choose the option that best balances functionality, ease of use, and performance.
