Extracting PDF Text, Tables, Images in Python

In real-world development, working with PDF files often goes beyond reading full-page text. You may need to extract tables, images, or specific text regions, and these tasks demand both precision and good performance.

This article shows how to extract text, tables, and images from a PDF using Python. Whether you need to pull text and images out of PDF files or reuse tabular data for analysis, the hands-on examples below cover each task.

Install the Python Library

In this tutorial, we’ll use Spire.PDF for Python to demonstrate how to extract elements from PDF documents. As a standalone third-party library, Spire.PDF does not rely on Microsoft Office. Beyond basic format conversions like PDF to Excel, it also enables advanced operations such as extracting text, tables, and images — the key topics covered in this article.

You can install it via pip:

pip install Spire.PDF

Or install the free version:

pip install spire.pdf.free

The free version limits the number of pages it will process per document (check the product page for the current limit), but it is sufficient for small-scale tasks.

Extracting PDF Text: Full Pages and Specific Regions

When working with PDF documents, text extraction typically falls into two common scenarios:

Extracting all text from a page or an entire document — for example, retrieving the full content of a contract or processing files in bulk.
Extracting text from a specific region — such as capturing a particular field from a form or invoice.

In this section, we’ll use Spire.PDF for Python to handle both use cases.

Extract All Text from a Page

Text is the most common type of content in PDF files. Whether you're extracting the full content of a contract or handling large volumes of documents, accurate text extraction is essential. This part shows how to extract all text from a PDF page using Python.

Full Code Example – Extract All Text from the First Page:

from spire.pdf import *
from spire.pdf.common import *


# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile("/AI-Generated Art_images.pdf")

# Get the first page
page = pdf.Pages.get_Item(0)

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Extract text from the page
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a txt file (the with block closes the file automatically)
with open("/extracttext-firstpage.txt", "w", encoding="utf-8") as extractedText:
    extractedText.write(text)

# Release resources
pdf.Close()

Result Preview:

Python Extract Text from the First Page of PDF

Key Steps Explained:

Create a new PdfDocument object, then load the target PDF file.
Access the first page or iterate through all pages if you need to extract text from the entire document.
Create PdfTextExtractor and PdfTextExtractOptions objects to configure the text extraction process.
Use the PdfTextExtractor.ExtractText() method to extract text from the selected page(s).
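
The second step above mentions iterating through all pages, while the example only reads the first one. A minimal sketch of the whole-document case follows. It reuses only the Spire.PDF calls shown in the example above. The function names `join_page_texts` and `extract_document_text` are hypothetical helpers introduced here, and accessing the classes through `import spire.pdf as sp` is an assumption that they are exposed as attributes of the package.

```python
def join_page_texts(page_texts, separator="\n\n"):
    """Join per-page strings into one string, skipping empty or blank pages."""
    return separator.join(t.strip() for t in page_texts if t and t.strip())


def extract_document_text(pdf_path):
    """Extract text from every page, using the same calls as the example above."""
    # Imported lazily so join_page_texts can be used without Spire.PDF installed.
    # Assumption: the classes are reachable as attributes of the spire.pdf package.
    import spire.pdf as sp

    pdf = sp.PdfDocument()
    pdf.LoadFromFile(pdf_path)
    try:
        options = sp.PdfTextExtractOptions()
        texts = []
        for index in range(pdf.Pages.Count):
            page = pdf.Pages.get_Item(index)
            texts.append(sp.PdfTextExtractor(page).ExtractText(options))
    finally:
        # Release the document even if extraction fails part-way through
        pdf.Close()
    return join_page_texts(texts)
```

Calling `extract_document_text("/AI-Generated Art_images.pdf")` should then return one string with the pages separated by blank lines.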

Extract Text from a Specific Region in a PDF

Sometimes, you don’t need to extract an entire page — just a specific portion of it. For example, you may want to capture the amount on an invoice, a particular column in a table, or a signature in the corner of a document.

With Spire.PDF, you can easily extract text from a defined rectangular area by specifying its coordinates.

Full Code Example – Extract Text from a Rectangular Region on the First Page:

from spire.pdf import *
from spire.pdf.common import *


# Create a PdfDocument instance
pdf = PdfDocument()

# Load a PDF document
pdf.LoadFromFile("/AI-Generated Art_images.pdf")

# Get the first page
page = pdf.Pages.get_Item(0)

# Create a PdfTextExtractor object
textExtractor = PdfTextExtractor(page)

# Create a PdfTextExtractOptions object
extractOptions = PdfTextExtractOptions()

# Specify the rectangle area to extract
extractOptions.ExtractArea = RectangleF(80.0, 120.0, 450.0, 120.0)

# Extract text from the rectangle area
text = textExtractor.ExtractText(extractOptions)

# Write the extracted text to a txt file (the with block closes the file automatically)
with open("/extracttext-rectangle.txt", "w", encoding="utf-8") as extractedText:
    extractedText.write(text)

# Release resources
pdf.Close()

Result Preview:

Python Extract Text from the Rectangle Area in PDF

Key Steps Explained:

Create a new PdfDocument object and load a PDF file.
Access the first page of the document.
Create PdfTextExtractor and PdfTextExtractOptions objects to configure the extraction behavior.
Use the ExtractArea property of PdfTextExtractOptions to define the rectangular region for text extraction.
Call PdfTextExtractor.ExtractText() to extract text from the specified region on the selected page.

💡 In the line extractOptions.ExtractArea = RectangleF(80.0, 120.0, 450.0, 120.0), the four arguments are the X coordinate, Y coordinate, width, and height of the rectangle, in that order. Adjust these values to control the position and size of the extraction area so that only the content you want is retrieved.
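
Regions are often measured as two opposite corners, or in millimetres, rather than as X, Y, width, and height. The sketch below converts such measurements into the four values the rectangle expects. Both helpers (`rect_from_corners`, `mm_to_points`) are hypothetical names introduced here, and the code assumes coordinates are in points (1/72 inch), the default PDF unit. Check the Spire.PDF documentation for the unit and origin it actually uses.

```python
def rect_from_corners(x1, y1, x2, y2):
    """Turn two opposite corners into (x, y, width, height), in any corner order."""
    left, top = min(x1, x2), min(y1, y2)
    return (float(left), float(top), float(abs(x2 - x1)), float(abs(y2 - y1)))


def mm_to_points(mm):
    """Convert millimetres to points, assuming 72 points per inch (25.4 mm)."""
    return mm * 72.0 / 25.4


# The corners (80, 120) and (530, 240) describe the same area as the example above
x, y, width, height = rect_from_corners(80, 120, 530, 240)
```

The resulting values can be passed on as `RectangleF(x, y, width, height)`.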

Extract Tables from PDF and Export to CSV

Tables in PDF documents are often stored in an unstructured format, making direct extraction and reuse challenging—especially when dealing with borderless tables, merged cells, or tables spanning multiple pages. In this section, we will demonstrate how to accurately identify tables within PDFs using Spire.PDF and Spire.XLS, and export them into structured formats like Excel or CSV. This approach helps you efficiently process and reuse your data.

Full Code Example – Extract Tables from PDF and Save as CSV:

from spire.pdf import *
from spire.pdf.common import *
from spire.xls import *


# Create a PdfDocument instance
doc = PdfDocument()

# Load a PDF
doc.LoadFromFile("/Population.pdf")

# Create a Workbook object
workbook = Workbook()
# Clear all worksheets
workbook.Worksheets.Clear()

# Create a PdfTableExtractor instance
extractor = PdfTableExtractor(doc)

sheetNumber = 1

# Loop through pages of the PDF
for pageIndex in range(doc.Pages.Count):
    # Get tables from the current page
    tableList = extractor.ExtractTable(pageIndex)

    # Loop through tables
    if tableList is not None and len(tableList) > 0:
        for table in tableList:
            # Add a worksheet to the current workbook
            sheet = workbook.Worksheets.Add(f"Sheet{sheetNumber}")

            # Get the number of rows and columns
            row = table.GetRowCount()
            column = table.GetColumnCount()

            # Loop through rows and columns of the table
            for i in range(row):
                for j in range(column):
                    # Get the text in the current cell
                    text = table.GetText(i, j)

                    # Write the text to the corresponding cell
                    sheet.Range[i + 1, j + 1].Value = text

            sheetNumber += 1

# Save it as a CSV file
workbook.SaveToFile("/extracttable.csv", FileFormat.CSV)
workbook.Dispose()
doc.Close()

Result Preview:

Python Extract Tables from PDF

Key Steps Explained:

Create a PdfDocument instance and load the PDF file.
Create a Workbook instance.
Iterate through all pages in the PDF document.
Use the PdfTableExtractor.ExtractTable() method to extract tables from each page.
Iterate through the extracted tables and add worksheets to the workbook using Workbook.Worksheets.Add().
Retrieve text from PDF table cells via the PdfTable.GetText() method.
Write the extracted text into specific worksheet cells using the Worksheet.Range[].Value property.
Save the workbook as a CSV file with Workbook.SaveToFile().

💡 To save tables as CSV or Excel files, you need Spire.XLS. You can install it with the following command: pip install spire.xls.
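
A CSV file holds a single table, so when a PDF contains several tables you may prefer one CSV file per table. The following is a sketch of that alternative using Python's standard csv module instead of Spire.XLS; it is not the method used in the example above. `table_to_rows` and `write_tables_to_csv` are hypothetical helpers. The first assumes a table object with the GetRowCount(), GetColumnCount(), and GetText() methods used earlier.

```python
import csv
from pathlib import Path


def table_to_rows(table):
    """Copy a table exposing GetRowCount/GetColumnCount/GetText into a list of rows."""
    return [
        [table.GetText(r, c) for c in range(table.GetColumnCount())]
        for r in range(table.GetRowCount())
    ]


def write_tables_to_csv(tables, out_dir, prefix="table"):
    """Write each table (a list of rows) to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, rows in enumerate(tables, start=1):
        path = out_dir / f"{prefix}{number}.csv"
        # newline="" lets the csv module control line endings itself
        with open(path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths
```

Inside the page loop you would collect `table_to_rows(table)` for each extracted table, then call `write_tables_to_csv` once at the end.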

Quickly Extract Images from PDF Documents

Besides text and tables, images are another common element in PDF files, especially in promotional materials, reports, or scanned documents. These images can be either bitmaps or vector graphics, and their extraction methods differ accordingly. This section will guide you through how to quickly identify and extract embedded images from PDF pages using Spire.PDF, and save them in popular formats like PNG or JPEG for easy archiving, analysis, or content reuse.

Full Code Example – Extract All Images from a PDF Document:

from spire.pdf.common import *
from spire.pdf import *


# Create a PdfDocument instance
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("/AI-Generated Art_images.pdf")

# Create a PdfImageHelper object
image_helper = PdfImageHelper()

image_count = 1
# Loop through all pages
for i in range(doc.Pages.Count):
    # Get the image information of the current page
    images_info = image_helper.GetImagesInfo(doc.Pages[i])

    # Get the image and save it as a picture file
    for j in range(len(images_info)):
        image_info = images_info[j]
        output_file = f"/New folder/image{image_count}.png"
        image_info.Image.Save(output_file)
        image_count += 1

doc.Close()

Result Preview:

Python Extract Images from PDF

Key Steps Explained:

Create a PdfDocument instance and load a PDF file.
Create a PdfImageHelper object.
Iterate through all pages in the document.
Use PdfImageHelper.GetImagesInfo(page: PdfPageBase) to retrieve image information from each page.
Loop through the results and save each image using PdfImageInfo.Image.Save().
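
The example writes every image into a fixed folder, which presumably has to exist before saving. A small helper can create the folder on demand and include the page number in each file name, so images are easy to trace back to their source page. This is a sketch; `image_output_path` and its naming scheme are hypothetical and not part of Spire.PDF.

```python
from pathlib import Path


def image_output_path(out_dir, page_number, image_number, extension="png"):
    """Return a zero-padded output path, creating the folder if it is missing."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"page{page_number:03d}_image{image_number:02d}.{extension}"
```

In the loop above, the save call could then become `image_info.Image.Save(str(image_output_path("/New folder", i + 1, j + 1)))`.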

Conclusion

This article showed how to extract text, tables, and images from PDF documents using Python. Whether you are processing contracts and official documents in bulk or extracting data for system integration, Spire.PDF for Python handles each of these tasks in a few lines of code. As demand for document automation grows, tools like these can save considerable development time.

For more tutorials on working with PDF documents, please visit our homepage!

Casie Liu
