How to Quickly Convert a Word Document to HTML in Python


Sometimes you need to publish a local Word document to a website, or format a marketing email in HTML. In both cases, the Word document has to be converted to HTML. Fortunately, you no longer have to do it manually: Python offers a fast and simple way to automate the process. In this post, I'll show you how to turn your Word files into clean, web-ready HTML with just a few lines of code.
3 Steps to Convert Word Files to HTML in Python
To preserve formatting such as bold text, font colors, images, and tables, it's better to convert the Word document to HTML directly than to copy its content by hand. You could save each file as a web page from MS Word, but a Python script is faster and repeatable. In this guide, I'll use Spire.Doc, a Python library for working with Word documents, to automate the Word-to-HTML conversion. Install it first with pip install Spire.Doc.
Steps to convert a Word document to HTML:
Create an object of the Document class.
Load a Word document from disk using the Document.LoadFromFile() method.
Convert the Word document to HTML by saving it as a new HTML file with the Document.SaveToFile() method.
from spire.doc import *
from spire.doc.common import *
# Create a Document instance
document = Document()
# Load a doc or docx document
document.LoadFromFile("/sample.docx")
# Save to HTML
document.SaveToFile("/WordToHtml.html", FileFormat.Html)
document.Close()
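If you have more than one file, the same three steps can be wrapped in a loop. The sketch below is one possible way to do that, not part of Spire.Doc itself: plan_conversions, convert_one, and convert_folder are hypothetical helper names, and the named import from spire.doc is assumed to be equivalent to the star import used above.

```python
from pathlib import Path


def plan_conversions(src_dir, out_dir):
    """Pair every .doc/.docx file in src_dir with an .html path in out_dir."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    pairs = []
    for src in sorted(src_dir.iterdir()):
        # Skip the "~$..." lock files Word leaves next to open documents.
        if src.name.startswith("~$"):
            continue
        if src.is_file() and src.suffix.lower() in {".doc", ".docx"}:
            pairs.append((src, out_dir / (src.stem + ".html")))
    return pairs


def convert_one(src, dst):
    """Convert one file with the same Spire.Doc calls used in the article."""
    # Imported here so the planning helper works without Spire.Doc installed.
    # Assumption: these names are importable directly from spire.doc.
    from spire.doc import Document, FileFormat

    document = Document()
    try:
        document.LoadFromFile(str(src))
        document.SaveToFile(str(dst), FileFormat.Html)
    finally:
        # Release the document even if loading or saving fails.
        document.Close()


def convert_folder(src_dir, out_dir, convert=convert_one):
    """Convert every Word file in src_dir and return the output paths."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    outputs = []
    for src, dst in plan_conversions(src_dir, out_dir):
        convert(src, dst)
        outputs.append(dst)
    return outputs
```

Calling convert_folder("docs", "html") would then write one HTML file per Word file. Note that report.doc and report.docx in the same folder would both map to report.html, so rename one of them first if that can happen in your data.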
Convert Word Documents to HTML with Custom Settings
Sometimes, you don’t need the entire document in HTML. For example, you might only want plain text without any formatting, or formatted text without images, or even tables with specific customizations. To address these needs, Spire.Doc offers the HtmlExportOptions class. Let's explore how to use it for customizing Word to HTML conversion.
Steps to convert a Word file to HTML and customize conversion options:
Create an instance of the Document class.
Read a source Word file through the Document.LoadFromFile() method.
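Choose how CSS styles are exported with the Document.HtmlExportOptions.CssStyleSheetType property. The example uses CssStyleSheetType.External and names the stylesheet file with CssStyleSheetFileName.
Choose whether to embed images in the HTML with the Document.HtmlExportOptions.ImageEmbedded property. When images are not embedded, set the folder they are saved to with ImagesPath.
Choose whether text input form fields are exported as plain text with the Document.HtmlExportOptions.IsTextInputFormFieldAsText property.
Save the result as an HTML file with the Document.SaveToFile() method.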
from spire.doc import *
from spire.doc.common import *
# Create a Document instance
document = Document()
# Load a doc or docx document
document.LoadFromFile("/sample.docx")
# Write CSS styles to an external stylesheet file
document.HtmlExportOptions.CssStyleSheetFileName = "/sample.css"
document.HtmlExportOptions.CssStyleSheetType = CssStyleSheetType.External
# Do not embed images; save them to a separate folder instead
document.HtmlExportOptions.ImageEmbedded = False
document.HtmlExportOptions.ImagesPath = "/New folder/"
# Export text input form fields as plain text
document.HtmlExportOptions.IsTextInputFormFieldAsText = True
# Save the document as an html file
document.SaveToFile("/ToHtmlExportOption.html", FileFormat.Html)
document.Close()
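To see what these options actually changed, you can inspect the generated HTML. The sketch below uses only Python's standard html.parser and does not depend on Spire.Doc. AssetCollector and summarize_assets are hypothetical names, and the check assumes that embedded images appear as data: URIs in the img src attribute, which is worth confirming against your own output.

```python
from html.parser import HTMLParser


class AssetCollector(HTMLParser):
    """Collect stylesheet links, <style> blocks, and image sources."""

    def __init__(self):
        super().__init__()
        self.stylesheets = []      # href of each <link rel="stylesheet">
        self.style_blocks = 0      # number of <style> elements in the page
        self.linked_images = []    # src of images stored outside the HTML
        self.embedded_images = 0   # images carried as data: URIs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower():
            self.stylesheets.append(attrs.get("href"))
        elif tag == "style":
            self.style_blocks += 1
        elif tag == "img":
            src = attrs.get("src") or ""
            if src.startswith("data:"):
                self.embedded_images += 1
            else:
                self.linked_images.append(src)


def summarize_assets(html_text):
    """Return a small report on how CSS and images are referenced."""
    collector = AssetCollector()
    collector.feed(html_text)
    collector.close()
    return {
        "stylesheets": collector.stylesheets,
        "style_blocks": collector.style_blocks,
        "linked_images": collector.linked_images,
        "embedded_images": collector.embedded_images,
    }


# Made-up sample markup, standing in for a converted file.
sample = (
    '<html><head><link rel="stylesheet" href="sample.css"></head>'
    '<body><p>Hello</p><img src="images/pic1.png"></body></html>'
)
print(summarize_assets(sample))
```

For a real file, read the generated HTML into a string and pass it to summarize_assets. With an external stylesheet and ImageEmbedded set to False, you would expect at least one stylesheet link, linked images, and no embedded ones.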
The HtmlExportOptions class provides even more flexibility. For example, you can use the HasHeadersFooters property to decide whether to retain headers and footers, and the IsExportDocumentStyles property to specify whether to preserve the document's original styles.
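When you set several of these properties at once, a typo in a property name is easy to miss: on an ordinary Python object, assigning to a misspelled attribute creates a new attribute instead of raising an error. The sketch below is one way to guard against that. apply_export_options is a hypothetical helper, StandInOptions is a stand-in used only so the example runs without Spire.Doc, and the check assumes the real options class defines its settings as class-level properties.

```python
def apply_export_options(options, settings):
    """Set each named option, refusing names the options class does not define."""
    known = type(options)
    for name, value in settings.items():
        if not hasattr(known, name):
            raise AttributeError(f"unknown export option: {name}")
        setattr(options, name, value)
    return options


class StandInOptions:
    """Hypothetical stand-in for HtmlExportOptions, used only for this demo."""
    HasHeadersFooters = True
    IsExportDocumentStyles = False


demo = apply_export_options(
    StandInOptions(),
    {"HasHeadersFooters": False, "IsExportDocumentStyles": True},
)
print(demo.HasHeadersFooters, demo.IsExportDocumentStyles)  # prints: False True
```

With Spire.Doc you would pass document.HtmlExportOptions as the first argument, followed by a dictionary of the property names and values you want before calling SaveToFile().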
Conclusion
In this article, we explored how to convert Word files to HTML in Python, whether you need to convert an entire document or customize settings to meet different needs. With a simple and automated solution, Python makes Word to HTML conversion faster and easier than ever. Try it today and streamline your workflow!
Written by Casie Liu
