How to Highlight Text in PDF with Python: Step-by-step Instructions

Casie LiuCasie Liu
5 min read

When researching literature or reviewing reports, marking up key information is essential. However, unlike Word documents, editing PDF files can be challenging. While many PDF tools offer built-in highlighting features, manually going through the text can still be time-consuming. If you're looking for a more efficient way to simplify this process, why not try automating it with Python? In this article, we'll show you how to highlight text in PDF files using Python, helping you save time and avoid missing important details.


Python Library to Highlight Text in PDF

Before diving into the main topic, it's important to complete some preparatory work. The code examples in this article will use Spire.PDF for Python to demonstrate how to highlight text in PDF files. It is one of the best Adobe Acrobat alternatives with powerful features. This tool allows users to create, edit, convert PDFs easily and quickly, and more.

You can install it using the pip command: pip install Spire.PDF

How to Highlight Text in PDF: Specific Text in the Entire PDF

Let's start by exploring the most common scenario when highlighting text in a PDF: marking specific text throughout the entire file. To achieve this, you need to iterate through each page to ensure no instances are missed, locate the target keywords or phrases, and then apply highlighting. Below are the detailed steps for this process.

Steps to highlight words in PDF documents:

  • Create an object of the PdfDocument class and load a PDF sample with the PdfDocument.LoadFromFile() method.

  • Loop through all pages in the PDF document, and get a certain page using the PdfDocument.Pages.get_Item() method.

  • Find all matching text on the page using the Pages.FindText() method, and retrieve the results from the Finds property.

  • Highlight text on the page by calling the ApplyHighLight() method.

  • Save the updated document as a new PDF file with the PdfDocument.SaveToFile() method, and release the resource.

Here is a code example of highlighting certain words in PDFs:

from spire.pdf import *
from spire.pdf.common import*


# Create an object of PdfDocument class and load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("/sample.pdf")

# Loop through the pages in the PDF document
for i in range(pdf.Pages.Count):
    # Get a page
    page = pdf.Pages.get_Item(i)
    # Find all occurrences of specific text on the page
    result = page.FindText("cloud server", TextFindParameter.IgnoreCase).Finds
    # Highlight all the occurrences
    for text in result:
        text.ApplyHighLight(Color.get_Cyan())

# Save the document
pdf.SaveToFile("/FindHighlight.pdf")

# Release the resource
pdf.Close()
💡
To change the highlight color in PDF documents, simply set the color parameter in the ApplyHighlight() method. In this example, it's set to cyan, but you can change it to red using Color.get_red().

Highlight Text in the Entire PDF

How to Highlight Words in PDF: in a Specified Area

In some cases, highlighting text throughout the entire PDF may not be necessary. Instead, you might only need to focus on a specific section, such as when reviewing a particular page or paragraph. In this section, we will show you how to highlight text in specified areas, providing detailed steps and a code example.

Steps to highlight text in a specified area on a PDF page:

  • Instantiate a PdfDocument class and read the source PDF file from the disk using the PdfDocument.LoadFromFile() method.

  • Get a page with the PdfDocument.Pages.get_Item() method.

  • Define a rectangle area on the page by configuring parameters of the RectangleF class.

  • Find all occurrences in the area using the Pages.FindText() method.

  • Iterate through the collection of found text instances and apply highlight effect using the ApplyHighLight() method.

  • Write the PDF document to the disk and release resources.

Below is the code example of highlighting text in a rectangle on the first page with a green color:

from spire.pdf import *
from spire.pdf.common import*


# Create an object of PdfDocument class and load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("/sample.pdf")

# Get the first page
pdfPageBase = pdf.Pages.get_Item(0)

# Define a rectangular area
rctg = RectangleF(0.0, 0.0, pdfPageBase.ActualSize.Width, 300.0)

# Find all the occurrences of specified text in the rectangular area
findCollection = pdfPageBase.FindText(rctg,"cloud server",TextFindParameter.IgnoreCase)

# Find text in the rectangle
for find in findCollection.Finds:
        #Highlight searched text
        find.ApplyHighLight(Color.get_Green())

# Save the document
pdf.SaveToFile("/FindHighlightArea.pdf")

# Close the document
pdf.Close()

Highlight Text in a Specified Area

How to Highlight Text in PDF: Regular Expression

When working with PDF documents, you may face a specific challenge, such as highlighting all occurrences of times, phone numbers, or keywords that follow a pattern. The previous methods won’t suffice in these cases, as they only match the exact text. Instead, regular expressions can be used to quickly locate and highlight such patterns. In this chapter, we’ll show you how to apply regular expressions to match and highlight text in a PDF, providing a more flexible and efficient solution for complex searches. Let’s dive into the details.

Steps to highlight text in PDF files with regular expression:

  • Create a PdfDocument object and specify the file path to load source documents with the PdfDocument.LoadFromFile() method.

  • Set the regular expression as you need.

  • Locate all text that matches using the Pages.FindText() method.

  • Highlight the text in the PDF documents by calling the ApplyHighLight() method.

  • Save the modified file to the local storage and close the document.

Here is an example of highlighting two words that after the mark “*”:

from spire.pdf import *
from spire.pdf.common import*


# Create an object of PdfDocument class and load a PDF document
pdf = PdfDocument()
pdf.LoadFromFile("/sample.pdf")

# Specify the regular expression that matches two words after *
regex = "\\*(\\w+\\s+\\w+)"

# Get the second page
page = pdf.Pages.get_Item(1)

# Find matched text on the page using regular expression
result = page.FindText(regex, TextFindParameter.Regex)

# Highlight the matched text
for text in result.Finds:
    text.ApplyHighLight(Color.get_DeepPink())

# Save the document
pdf.SaveToFile("/FindHighlightRegex.pdf")

# Close the document
pdf.Close()

Highlight Text with Regular Expression

The Conclusion

This article demonstrates how to highlight text in PDF files using Python, covering three main scenarios: finding text throughout the entire document, locating text in a specific region, and using regular expressions to identify text. Each method includes detailed steps and code examples for your reference. After reading this guide, you'll be able to easily find and highlight multiple words in various situations with confidence and efficiency!

0
Subscribe to my newsletter

Read articles from Casie Liu directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Casie Liu
Casie Liu