Extract Hyperlinks from Word Documents with Python [Latest Guide]
"How do I extract hyperlinks from a Word document?" is a common question found across platforms like the Microsoft Community, GitHub, and Stack Overflow. Extracting hyperlinks from Word documents can be a challenging task for many. Since Microsoft Word lacks a built-in feature for direct hyperlink extraction, even basic methods often involve tedious copying and pasting. To streamline this process, this article demonstrates how to use Python to extract hyperlinks from Word files efficiently, boosting your productivity.
Hyperlink Types in Word Documents
There are several forms of hyperlinks in Word documents, each serving a specific purpose. Understanding these types is essential for accurately extracting their text and URLs. Here are the most common types:
- Links to External Websites
These are standard hyperlinks pointing to webpages, often used to reference online resources or provide further reading.
Example: https://www.microsoft.com
- Links to Local or Network Files
These point to files on a local drive or shared network, such as documents, images, or PDFs.
Example: C:\Documents\sample.docx
- Links Within the Document
These navigate to specific sections, headings, or bookmarks within the same document, commonly used in tables of contents or cross-references.
Example: #Chapter3
- Email Links
Using the mailto: protocol, these open the default email client with pre-filled recipient addresses or subjects.
Example: mailto:support@example.com
- FTP or Protocol-Specific Links
These connect to FTP servers or use other protocols like file:// to access remote files or resources.
Example: ftp://ftp.example.com/file.zip
- Hyperlinks on Images or Shapes
Hyperlinks can also be embedded in non-text elements like images, shapes, or charts, offering additional navigation options.
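Once links have been extracted, it can be useful to sort them into the categories above. The following is a minimal sketch of one way to do that with the standard library only. The classify_link helper and its category names are hypothetical and are not part of Spire.Doc or Word.

```python
import re
from urllib.parse import urlparse


def classify_link(target: str) -> str:
    """Return a rough category name for a hyperlink target string."""
    target = target.strip()
    # Targets such as "#Chapter3" point to a place inside the same document.
    if target.startswith("#"):
        return "internal"
    # Windows drive paths (C:\...) and UNC paths (\\server\share) are files.
    if re.match(r"^[A-Za-z]:[\\/]", target) or target.startswith("\\\\"):
        return "file"
    scheme = urlparse(target).scheme.lower()
    if scheme in ("http", "https"):
        return "web"
    if scheme == "mailto":
        return "email"
    if scheme == "ftp":
        return "ftp"
    if scheme == "file":
        return "file"
    # Anything else (other protocols, relative paths) is left uncategorized.
    return "other"
```

Hyperlinks attached to images or shapes are not a separate category here, because their targets are ordinary addresses of one of the types above.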
Python Library to Extract Hyperlinks from Word Files
For this task, we recommend Spire.Doc for Python, a Python library that lets developers read and process Word documents, including extracting the links they contain.
You can install the library with pip:
pip install Spire.Doc
Extract the Specified Hyperlinks from Word Documents
Now that we have covered the types of hyperlinks, let’s move on to the main topic: how to extract hyperlinks from Word files. Spire.Doc for Python (Spire.Doc for short) offers the Field.FieldText property to get a hyperlink's display text and the Field.Code property to retrieve the field code that contains its URL. The steps and the code example below show how they work.
Steps to extract a specified hyperlink from a Word document:
1. Create an instance of the Document class, and load a Word document from a file using the Document.LoadFromFile() method.
2. Find all hyperlinks by iterating through the objects in the Word document, and append each hyperlink field to a list.
3. Get the hyperlink you want from the list by its index (the first one in this example).
4. Extract the hyperlink text with the Field.FieldText property.
5. Get the hyperlink URL from the Field.Code property.
Below is a code example that extracts the first hyperlink of a Word document and saves its text and URL to a text file:
from spire.doc import *
from spire.doc.common import *
# Create a Document object
doc = Document()
# Load a Word file
doc.LoadFromFile("/sample.docx")
# Find all hyperlinks in the Word document
hyperlinks = []
for i in range(doc.Sections.Count):
    section = doc.Sections.get_Item(i)
    for j in range(section.Body.ChildObjects.Count):
        sec = section.Body.ChildObjects.get_Item(j)
        if sec.DocumentObjectType == DocumentObjectType.Paragraph:
            for k in range((sec if isinstance(sec, Paragraph) else None).ChildObjects.Count):
                para = (sec if isinstance(sec, Paragraph) else None).ChildObjects.get_Item(k)
                if para.DocumentObjectType == DocumentObjectType.Field:
                    field = para if isinstance(para, Field) else None
                    if field.Type == FieldType.FieldHyperlink:
                        hyperlinks.append(field)
# Get the first hyperlink text and URL (skipped if the document has no hyperlinks)
if hyperlinks:
    first_hyperlink = hyperlinks[0]
    hyperlink_text = first_hyperlink.FieldText
    # The field code looks like: HYPERLINK "https://..."
    # Strip surrounding whitespace first so the closing quote is removed too
    hyperlink_url = first_hyperlink.Code.split('HYPERLINK ')[1].strip().strip('"')
    # Save to a text file (UTF-8 so non-ASCII link text is written safely)
    with open("/FirstHyperlink.txt", "w", encoding="utf-8") as file:
        file.write(f"Text: {hyperlink_text}\nURL: {hyperlink_url}\n")
# Close the document
doc.Close()
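The example above pulls the URL out of Field.Code by splitting on the HYPERLINK keyword and trimming the quotes. Word hyperlink field codes can also carry switches, for instance \l for a location inside the document or \o for ScreenTip text, and a simple split keeps those in the result. The following is a minimal sketch of a more tolerant parser that works on the field code string alone. The parse_hyperlink_code helper and its key names are hypothetical, and it is an assumption that Spire.Doc returns the field code in this textual form.

```python
import re

# A token is either a double-quoted string or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')

# Switches that take an argument, mapped to illustrative key names.
_ARG_SWITCHES = {"\\l": "anchor", "\\o": "tooltip", "\\t": "target"}


def parse_hyperlink_code(code: str) -> dict:
    """Split a HYPERLINK field code into its address and switch arguments."""
    result = {"address": "", "anchor": "", "tooltip": "", "target": ""}
    tokens = [
        (m.group(1), True) if m.group(1) is not None else (m.group(2), False)
        for m in _TOKEN.finditer(code)
    ]
    # Anything that does not start with the HYPERLINK keyword is ignored.
    if not tokens or tokens[0][0].upper() != "HYPERLINK":
        return result
    i = 1
    while i < len(tokens):
        value, quoted = tokens[i]
        key = None if quoted else _ARG_SWITCHES.get(value.lower())
        if key is not None:
            # Switch with an argument: store the next token under its key.
            if i + 1 < len(tokens):
                result[key] = tokens[i + 1][0]
            i += 2
        elif not quoted and len(value) == 2 and value.startswith("\\"):
            # Switch without an argument: skip it.
            i += 1
        else:
            # The first remaining token is taken as the link address.
            if not result["address"]:
                result["address"] = value
            i += 1
    return result
```

With a helper like this, a line such as hyperlink_url = parse_hyperlink_code(first_hyperlink.Code)["address"] could stand in for the split-and-strip expression. For links within the document, the address is empty and the bookmark name appears under "anchor".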
Extract All Hyperlinks from a Word Document
Extracting all the hyperlinks in a Word document can serve various purposes, such as verifying their validity or organizing them for archiving. Automating this task with Python saves both time and effort. Unlike extracting individual links, retrieving all links involves iterating through the entire hyperlink collection to ensure no links are missed. This guide outlines the steps and provides a code example for a comprehensive approach.
Steps to extract all hyperlinks from a Word document:
1. Create a Document instance and read a Word file from local storage using the Document.LoadFromFile() method.
2. Loop through the elements in the Word document to find all hyperlink fields, and save them in a list.
3. Iterate through the list of hyperlinks.
4. Extract each hyperlink's text with the Field.FieldText property.
5. Get each hyperlink's URL from the Field.Code property.
Here is an example that extracts all hyperlinks from a Word file and saves their text and URLs to a text file:
from spire.doc import *
from spire.doc.common import *
# Create a Document object
doc = Document()
# Load a Word file
doc.LoadFromFile("/sample.docx")
# Find all hyperlinks in the Word document
hyperlinks = []
for i in range(doc.Sections.Count):
    section = doc.Sections.get_Item(i)
    for j in range(section.Body.ChildObjects.Count):
        sec = section.Body.ChildObjects.get_Item(j)
        if sec.DocumentObjectType == DocumentObjectType.Paragraph:
            for k in range((sec if isinstance(sec, Paragraph) else None).ChildObjects.Count):
                para = (sec if isinstance(sec, Paragraph) else None).ChildObjects.get_Item(k)
                if para.DocumentObjectType == DocumentObjectType.Field:
                    field = para if isinstance(para, Field) else None
                    if field.Type == FieldType.FieldHyperlink:
                        hyperlinks.append(field)
# Save all hyperlinks text and URL to a text file (UTF-8 so non-ASCII link text is written safely)
with open("/AllHyperlinks.txt", "w", encoding="utf-8") as file:
    for i, hyperlink in enumerate(hyperlinks):
        hyperlink_text = hyperlink.FieldText
        # Strip surrounding whitespace first so the closing quote is removed too
        hyperlink_url = hyperlink.Code.split('HYPERLINK ')[1].strip().strip('"')
        file.write(f"Hyperlink {i+1}:\nText: {hyperlink_text}\nURL: {hyperlink_url}\n\n")
# Close the document
doc.Close()
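If the goal is to verify or archive the links, a de-duplicated CSV file may be easier to work with than free-form text. The following is a minimal sketch that assumes the text and URL of each hyperlink have already been collected as (text, url) pairs, for example inside the loop above. The unique_links and write_links_csv helpers are hypothetical names and use only the standard library.

```python
import csv
import io


def unique_links(pairs):
    """Keep the first (text, url) pair for each URL, preserving order."""
    seen = set()
    result = []
    for text, url in pairs:
        key = url.strip()
        if key in seen:
            continue
        seen.add(key)
        result.append((text, key))
    return result


def write_links_csv(pairs, stream):
    """Write (text, url) pairs to an open text stream as CSV with a header."""
    writer = csv.writer(stream)
    writer.writerow(["text", "url"])
    for text, url in pairs:
        writer.writerow([text, url])


# Usage with an in-memory buffer; a real script would pass a file opened
# with open(path, "w", newline="", encoding="utf-8") instead.
sample = [
    ("Microsoft", "https://www.microsoft.com"),
    ("Microsoft again", "https://www.microsoft.com"),
    ("Support", "mailto:support@example.com"),
]
buffer = io.StringIO()
write_links_csv(unique_links(sample), buffer)
```

De-duplicating by URL rather than by link text means that the same address linked from several places in the document is only checked once.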
The Bottom Line
This guide covered how to extract hyperlinks from Word documents with Python, both a single specified hyperlink and every hyperlink in a file. Each section includes step-by-step instructions and a code example that you can adapt to your own documents.
Written by Casie Liu