Extract Hyperlinks from Word Documents with Python [2024]

"How do I extract hyperlinks from a Word document?" is a common question on platforms like the Microsoft Community, GitHub, and Stack Overflow. Microsoft Word has no built-in feature for extracting hyperlinks directly, so even basic approaches tend to involve tedious copying and pasting. This article demonstrates how to use Python to extract hyperlinks from Word files efficiently.

Hyperlink Types in Word Documents

There are several forms of hyperlinks in Word documents, each serving a specific purpose. Understanding these types is essential for accurately extracting their text and URLs. Here are the most common types:

Links to External Websites

These are standard hyperlinks pointing to webpages, often used to reference online resources or provide further reading.

Example: https://www.microsoft.com

Links to Local or Network Files

These point to files on a local drive or shared network, such as documents, images, or PDFs.

Example: C:\Documents\sample.docx

Links Within the Document

These navigate to specific sections, headings, or bookmarks within the same document, commonly used in tables of contents or cross-references.

Example: #Chapter3

Email Links

Using the mailto: protocol, these open the default email client with pre-filled recipient addresses or subjects.

Example: mailto:support@example.com

FTP or Protocol-Specific Links

These connect to FTP servers or use other protocols like file:// to access remote files or resources.

Example: ftp://ftp.example.com/file.zip

Hyperlinks on Images or Shapes

Hyperlinks can also be embedded in non-text elements like images, shapes, or charts, offering additional navigation options.
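
Once a link target has been extracted as a string, it can be sorted into the target-based categories above (the last category describes where a link is attached rather than where it points, so it is not covered here). The following is a minimal sketch that uses only the Python standard library; the function name classify_link_target and the category labels are illustrative assumptions, not part of any Word library.

```python
import re
from urllib.parse import urlparse


def classify_link_target(target: str) -> str:
    """Return a rough category label for a hyperlink target string."""
    target = target.strip()
    # Internal links are written with a leading '#', e.g. #Chapter3
    if target.startswith("#"):
        return "internal"
    # Windows drive paths (C:\...) and UNC paths (\\server\share) are files.
    # This check must come before urlparse, which would read "C" as a scheme.
    if re.match(r"^[A-Za-z]:[\\/]", target) or target.startswith("\\\\"):
        return "file"
    scheme = urlparse(target).scheme.lower()
    if scheme in ("http", "https"):
        return "web"
    if scheme == "mailto":
        return "email"
    if scheme == "file":
        return "file"
    if scheme:
        return "other-protocol"
    # No scheme at all: assume a relative file path
    return "file"


for example in [
    "https://www.microsoft.com",
    r"C:\Documents\sample.docx",
    "#Chapter3",
    "mailto:support@example.com",
    "ftp://ftp.example.com/file.zip",
]:
    print(example, "->", classify_link_target(example))
```

The drive-letter check runs first because a path such as C:\Documents\sample.docx would otherwise be parsed as a URL with the scheme "c".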

Python Library to Extract Hyperlinks from Word Files

The examples in this article use Spire.Doc for Python, a Python library for working with Word documents that lets developers read the hyperlink fields in a file with a few lines of code.

You can install this powerful Word library using the pip command: pip install Spire.Doc.

Extract the Specified Hyperlinks from Word Documents

With the hyperlink types covered, let’s move on to the main topic: extracting hyperlinks from Word files. Spire.Doc for Python (Spire.Doc for short) offers the Field.FieldText property to get the link text and the Field.Code property to retrieve the field code that contains the hyperlink URL. The steps and the code example below show how they work together.

Steps to extract specified hyperlinks from Word documents:

Create an instance of the Document class, and load a Word document from files using the Document.LoadFromFile() method.
Find all hyperlinks by iterating through all objects in the Word document and append them to a list.
Get a hyperlink from the list (the example below takes the first one).
Extract the hyperlink text with the Field.FieldText property.
Get the hyperlink URL with the Field.Code property.

Below is a code example that extracts the first hyperlink of a Word document and saves its text and URL to a text file:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()

# Load a Word file
doc.LoadFromFile("/sample.docx")

# Find all hyperlink fields in the body paragraphs of the Word document
hyperlinks = []
for i in range(doc.Sections.Count):
    section = doc.Sections.get_Item(i)
    for j in range(section.Body.ChildObjects.Count):
        body_obj = section.Body.ChildObjects.get_Item(j)
        # Only top-level paragraphs are inspected here
        if body_obj.DocumentObjectType != DocumentObjectType.Paragraph:
            continue
        for k in range(body_obj.ChildObjects.Count):
            child = body_obj.ChildObjects.get_Item(k)
            # Keep only fields whose type is a hyperlink; skip anything else
            # instead of failing on a None value
            if child.DocumentObjectType == DocumentObjectType.Field and isinstance(child, Field):
                if child.Type == FieldType.FieldHyperlink:
                    hyperlinks.append(child)

# Get the first hyperlink text and URL
if hyperlinks:
    first_hyperlink = hyperlinks[0]
    hyperlink_text = first_hyperlink.FieldText
    # Assumes a simple field code of the form: HYPERLINK "url"
    hyperlink_url = first_hyperlink.Code.split('HYPERLINK ')[1].strip().strip('"')

    # Save to a text file (UTF-8 so that non-ASCII link text is preserved)
    with open("/FirstHyperlink.txt", "w", encoding="utf-8") as file:
        file.write(f"Text: {hyperlink_text}\nURL: {hyperlink_url}\n")

# Close the document
doc.Close()
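
The example above pulls the URL out of the field code by splitting on 'HYPERLINK ' and trimming quotes, which works for a plain code such as HYPERLINK "https://www.microsoft.com". Word field codes can also carry switches, for example \l for a location inside the document and \o for a ScreenTip. The sketch below is a more defensive parser for such strings. It assumes that Field.Code returns the raw field instruction text; the function name parse_hyperlink_code and the returned dictionary keys are hypothetical and not part of Spire.Doc.

```python
import re

# A token is either a double-quoted string or a run of non-space characters
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')

# Switches that are followed by an argument (location, ScreenTip, target frame)
_SWITCHES_WITH_ARG = {r"\l", r"\o", r"\t"}


def parse_hyperlink_code(code: str) -> dict:
    """Split a HYPERLINK field code into its URL and in-document anchor."""
    tokens = []
    for match in _TOKEN.finditer(code):
        quoted = match.group(1) is not None
        tokens.append((quoted, match.group(1) if quoted else match.group(2)))

    result = {"url": "", "anchor": ""}
    seen_keyword = False
    i = 0
    while i < len(tokens):
        quoted, value = tokens[i]
        if not quoted and value.upper() == "HYPERLINK":
            seen_keyword = True
        elif not quoted and value.startswith("\\"):
            # A switch: consume its argument if it takes one
            if value.lower() in _SWITCHES_WITH_ARG and i + 1 < len(tokens):
                if value.lower() == r"\l":
                    result["anchor"] = tokens[i + 1][1]
                i += 1
        elif seen_keyword and not result["url"]:
            # The first plain argument after the keyword is the link target
            result["url"] = value
        i += 1
    return result


print(parse_hyperlink_code(r' HYPERLINK "https://www.microsoft.com" \o "Microsoft" '))
print(parse_hyperlink_code(r'HYPERLINK \l "Chapter3"'))
```

With a helper like this, a link that points inside the document yields an empty url and a filled anchor instead of a switch fragment mistaken for a URL.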

Extract All Hyperlinks from a Word Document

Extracting all the hyperlinks in a Word document is useful for tasks such as verifying that links are still valid or collecting them for archiving, and automating it with Python saves both time and effort. Unlike extracting a single link, this approach iterates through the entire list of collected hyperlinks so that none is missed. This section outlines the steps and provides a code example.

Steps to extract all hyperlinks from Word documents:

Create a Document instance and read a Word file from the local storage using the Document.LoadFromFile() method.
Loop through elements in the Word document to get all hyperlinks, and save them in a list.
Iterate through the list of the hyperlink collection.
- Extract the hyperlink text with the Field.FieldText property.
- Get the hyperlink URL with the Field.Code property.

Here is an example that extracts all hyperlinks from a Word file and saves their text and URLs to a text file:

from spire.doc import *
from spire.doc.common import *

# Create a Document object
doc = Document()

# Load a Word file
doc.LoadFromFile("/sample.docx")

# Find all hyperlink fields in the body paragraphs of the Word document
hyperlinks = []
for i in range(doc.Sections.Count):
    section = doc.Sections.get_Item(i)
    for j in range(section.Body.ChildObjects.Count):
        body_obj = section.Body.ChildObjects.get_Item(j)
        # Only top-level paragraphs are inspected here
        if body_obj.DocumentObjectType != DocumentObjectType.Paragraph:
            continue
        for k in range(body_obj.ChildObjects.Count):
            child = body_obj.ChildObjects.get_Item(k)
            # Keep only fields whose type is a hyperlink; skip anything else
            # instead of failing on a None value
            if child.DocumentObjectType == DocumentObjectType.Field and isinstance(child, Field):
                if child.Type == FieldType.FieldHyperlink:
                    hyperlinks.append(child)

# Save the text and URL of every hyperlink to a text file
# (UTF-8 so that non-ASCII link text is preserved)
with open("/AllHyperlinks.txt", "w", encoding="utf-8") as file:
    for i, hyperlink in enumerate(hyperlinks):
        hyperlink_text = hyperlink.FieldText
        # Assumes a simple field code of the form: HYPERLINK "url"
        hyperlink_url = hyperlink.Code.split('HYPERLINK ')[1].strip().strip('"')
        file.write(f"Hyperlink {i+1}:\nText: {hyperlink_text}\nURL: {hyperlink_url}\n\n")

# Close the document
doc.Close()
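
The text file written above is easy to read, but for archiving or later checking it can be handier to keep the links in a structured format. The following sketch writes (text, URL) pairs to CSV with the standard csv module and skips exact duplicates. The function name write_links_csv and the sample data are illustrative assumptions; in practice the pairs would come from the FieldText and Code values collected in the loop above.

```python
import csv
import io


def write_links_csv(links, stream):
    """Write unique (text, url) pairs to an open text stream as CSV.

    Returns the number of unique links written.
    """
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["text", "url"])
    seen = set()
    for text, url in links:
        key = (text, url)
        if key in seen:
            continue  # skip a link that was already written
        seen.add(key)
        writer.writerow([text, url])
    return len(seen)


# Hypothetical sample data standing in for extracted hyperlinks
sample_links = [
    ("Microsoft", "https://www.microsoft.com"),
    ("Support", "mailto:support@example.com"),
    ("Microsoft", "https://www.microsoft.com"),
]

buffer = io.StringIO()
count = write_links_csv(sample_links, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline="", encoding="utf-8") as the stream.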

The Bottom Line

This article covered how to extract hyperlinks from Word documents with Python, both a single specified hyperlink and every hyperlink in a file. Each section pairs step-by-step instructions with a code example that you can adapt to your own documents.

Casie Liu