Extract Text from Word Documents in Python: A Complete Guide

Casie Liu

When working with Word documents, you may need to extract text for reports, analysis, or further processing. Manually copying and pasting can be tedious, especially for large files. Python offers an efficient way to extract text from a Word document and save it as a plain text file. This guide will show you how to extract text from a paragraph, section, page, or the entire document with ease.

How to Extract Text from Specified Paragraphs in Word Documents

Before diving into the main topic, let's first look at the library this guide uses to automate the task in Python: Spire.Doc for Python. It supports a wide range of operations, from finding and replacing text to tasks that Microsoft Office doesn't support directly, such as converting Word documents to Markdown. Its API is straightforward, so getting started is quick. You can install it with pip: pip install Spire.Doc.

In this guide, we'll explore how to use Spire.Doc for Python to efficiently extract text from Word documents.
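
Before running the examples, it can help to confirm that the library is importable from the current environment. The following is a minimal sketch using only the standard library. The helper name package_available is hypothetical, and the check assumes the top-level package is named spire, as the from spire.doc import * statements in this guide suggest.

```python
import importlib.util


def package_available(top_level_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(top_level_name) is not None


# Assumption: Spire.Doc installs under the top-level package name "spire".
if not package_available("spire"):
    print("Spire.Doc does not appear to be installed; run: pip install Spire.Doc")
```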

Steps to extract text from specified paragraphs:

  • Create an object of the Document class.

  • Read a source Word document from files using the Document.LoadFromFile() method.

  • Get a section through the Document.Sections[] property.

  • Access the specified paragraph with the Section.Paragraphs[] property.

  • Extract text from the paragraph using the Paragraph.Text property, then save the extracted text to a text file.

The following code example extracts the text of the fourth paragraph in a Word document:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Load a Word document
document.LoadFromFile("/AI-Generated Art.docx")

# Get the first section in the document
section = document.Sections[0]

# Get the fourth paragraph in the section
paragraph = section.Paragraphs[3]

# Extract the text of the paragraph
para_text = paragraph.Text

# Write the extracted text into a text file
with open("/ParagraphText.txt", "w", encoding="utf-8") as file:
    file.write(para_text)

document.Close()
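
The example above assumes the section contains at least four paragraphs. If the index could be out of range, a small guard helps. The sketch below uses a hypothetical helper, paragraph_text_or_none, and relies only on the Paragraphs.Count property and Paragraphs[] indexing used elsewhere in this guide.

```python
def paragraph_text_or_none(section, index):
    """Return the text of the paragraph at `index`, or None if out of range.

    `section` is expected to behave like the Section objects in this guide:
    it exposes `Paragraphs.Count` and supports `Paragraphs[index]`.
    """
    paragraphs = section.Paragraphs
    if 0 <= index < paragraphs.Count:
        return paragraphs[index].Text
    return None
```

With the section object from the example above, paragraph_text_or_none(section, 3) would be called before document.Close(), and the file write can be skipped when it returns None.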


Extract Text from Specified Sections from a Word File

Some Word documents are divided into multiple sections based on their content. When you need a larger block of closely related text, extracting an entire section is a good fit. The following steps walk through the process.

Steps to export text from sections in a Word file:

  • Create an instance of the Document class.

  • Load a Word file from the local storage through the Document.LoadFromFile() method.

  • Get a section with the Document.Sections[] property.

  • Create a list to store the extracted data.

  • Iterate through all paragraphs in the section.

  • Get the current paragraph using the Section.Paragraphs[] property.

  • Get the text of each paragraph with the Paragraph.Text property and append it to the list.

  • Join the list items with line breaks and write the result to a text file.

The Python example here shows how to extract text from the first section of a Word document:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Load a Word document
document.LoadFromFile("/AI-Generated Art.docx")

# Get the first section in the document
section = document.Sections[0]

# Create a list to store the extracted text
sectiontext = []

# Iterate through the paragraphs in the section
for i in range(section.Paragraphs.Count):
    paragraph = section.Paragraphs[i]

    # Extract the text of each paragraph and append it to the list
    sectiontext.append(paragraph.Text)

# Write the extracted text into a text file
with open("/SectionText.txt", "w", encoding="utf-8") as file:
    file.write("\n".join(sectiontext))

document.Close()
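
The same loop extends naturally to every section in a document. The sketch below is illustrative rather than definitive: collect_section_texts is a hypothetical helper, and it assumes Document.Sections exposes a Count property in the same way Section.Paragraphs does in the example above.

```python
def collect_section_texts(document, skip_empty=True):
    """Return one string per section, with its paragraphs joined by newlines.

    Assumption: `document.Sections.Count` exists, mirroring `Paragraphs.Count`.
    """
    results = []
    for s in range(document.Sections.Count):
        section = document.Sections[s]
        lines = []
        for p in range(section.Paragraphs.Count):
            text = section.Paragraphs[p].Text
            # Optionally drop paragraphs that contain only whitespace
            if skip_empty and not text.strip():
                continue
            lines.append(text)
        results.append("\n".join(lines))
    return results
```

Each item of the returned list can then be written to its own file, or the items can be joined with a separator of your choice.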


Export Text from a Page in a Word Document

When working with Word documents, you may need to extract text from a specific page. Unlike a PDF, however, a Word document does not store fixed pages: pagination is generated dynamically from the formatting, so content may shift between pages after edits or layout changes. To extract text page by page reliably, we use the FixedLayoutDocument class, which represents the document in its laid-out form so that each page can be accessed individually. The following steps and code example walk through the process.

Steps to extract text from a specified page in Word files:

  • Create a Document instance.

  • Load the source Word file using the Document.LoadFromFile() method.

  • Create a FixedLayoutDocument object from the document to obtain its fixed, page-by-page layout.

  • Access a certain page of the Word document with the FixedLayoutDocument.Pages[] property.

  • Extract text on the page using the FixedLayoutPage.Text property.

  • Save the extracted text to a text file.

Below is a code example that demonstrates how to export the text on the first page of a Word file:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Load a Word document
document.LoadFromFile("/AI-Generated Art.docx")

# Create an object of the FixedLayoutDocument class and pass the Document object to the class constructor as a parameter
layoutDoc = FixedLayoutDocument(document)

# Access the first page of the document
layoutPage = layoutDoc.Pages[0]

# Get the text of the first page
page_text = layoutPage.Text

# Write the extracted text into a text file
with open("/PageText.txt", "w", encoding="utf-8") as file:
    file.write(page_text)

document.Close()
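
Because page counts change with the layout, it is worth validating the requested page before indexing into Pages[]. The following sketch uses a hypothetical helper, page_text, that takes a 1-based page number. It assumes FixedLayoutDocument.Pages exposes a Count property, which is not shown in the example above.

```python
def page_text(layout_doc, page_number):
    """Return the text of a 1-based page number, or raise ValueError if invalid.

    Assumption: `layout_doc.Pages.Count` is available on FixedLayoutDocument.
    """
    page_count = layout_doc.Pages.Count
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page {page_number} is out of range (document has {page_count} pages)"
        )
    # Pages[] is zero-based, so shift the human-friendly page number by one
    return layout_doc.Pages[page_number - 1].Text
```

With the layoutDoc object from the example above, page_text(layoutDoc, 1) returns the same text as layoutDoc.Pages[0].Text.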


How to Extract Text from the Whole Word Document

Finally, let's extract text from the entire document. The simplest option is to save the Word document as a text file, but this removes all formatting and non-text elements, which may not be ideal if you need structured content. Using Python gives you more control over how the text is extracted and written out. The following steps show how to do this.

Steps to extract text from the whole Word document:

  • Create a Document instance.

  • Load a Word document using the Document.LoadFromFile() method.

  • Get all the text of the file through the Document.GetText() method.

  • Save the text as a new text file.

The following code example extracts all the text from a Word document:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
document = Document()

# Load a Word document
document.LoadFromFile("/AI-Generated Art.docx")

# Extract the text of the document
document_text = document.GetText()

# Write the extracted text into a text file
with open("/DocumentText.txt", "w", encoding="utf-8") as file:
    file.write(document_text)

document.Close()
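
Text extracted from a whole document often benefits from light cleanup before it is saved. The sketch below is plain Python with a hypothetical helper, normalize_extracted_text. It assumes the string returned by GetText() may contain Windows-style line endings, trailing spaces, or runs of blank lines. That is an assumption about the output, not documented behavior.

```python
def normalize_extracted_text(text: str) -> str:
    """Unify line endings, trim trailing spaces, and collapse blank-line runs."""
    # Convert Windows (\r\n) and bare carriage-return (\r) line endings to \n
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove trailing whitespace from every line
    lines = [line.rstrip() for line in unified.split("\n")]
    cleaned = []
    for line in lines:
        # Keep at most one consecutive blank line
        if line == "" and cleaned and cleaned[-1] == "":
            continue
        cleaned.append(line)
    # Drop leading and trailing blank lines
    return "\n".join(cleaned).strip("\n")
```

In the example above, the call would be file.write(normalize_extracted_text(document_text)) instead of writing document_text directly.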


Conclusion

This article provides a comprehensive guide to extracting text from Word documents using Python, covering four levels: paragraph, section, page, and the entire document. Whether you need to extract specific content or process a full document, you'll find the right solution here. To make it easier to follow, we've included step-by-step instructions and code examples. By the end of this article, you'll have the knowledge to efficiently extract text from Word documents without hassle!
