[Three Steps] Convert PDF to HTML in Python without Effort
While the PDF format offers clear benefits for file sharing and storage, there are times when you may need to convert a PDF to HTML. HTML is better suited to web pages, so converting PDF documents into a web-friendly format is often the first step in publishing them on a site. Converting by hand, however, is time-consuming and error-prone.
In this article, we'll show you how to use Python to convert a PDF to HTML in three simple steps, so you can quickly produce an HTML version of your document.
Python Library to Convert PDF to HTML
In this guide, we will use Spire.PDF for Python to demonstrate how to complete this task. This Python library is rich in features and easy to use. It lets developers, both novice and experienced, carry out a wide range of tasks, such as creating, editing, and converting PDFs.
You can install the library from PyPI with the following pip command:
pip install Spire.PDF
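To confirm that the install succeeded before running the examples, a quick check along these lines may help. This is a minimal sketch that uses only the standard library. The helper name installed_version is hypothetical, and it assumes the distribution is registered under the same name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name matches the pip command above.
    found = installed_version("Spire.PDF")
    print(found if found else "Spire.PDF is not installed")
```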
3 Steps to Convert PDF to HTML
Spire.PDF for Python handles the conversion with the PdfDocument.SaveToFile() method. The three main steps are creating a PdfDocument object, loading the file, and saving it in HTML format. The detailed steps are listed below.
Steps to transform a PDF file to HTML:
Import essential modules from Spire.PDF for Python.
Create an object of the PdfDocument class.
Specify the file path to load an original PDF document using the PdfDocument.LoadFromFile() method.
Convert the PDF to HTML with the PdfDocument.SaveToFile() method.
Release the resources with the PdfDocument.Close() method.
Here is a code example that converts a PDF to HTML:
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF document from the disk
doc.LoadFromFile("/test.pdf")
# Save the PDF document to HTML format
doc.SaveToFile("/PdfToHtml.html", FileFormat.HTML)
# Close the document
doc.Close()
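If LoadFromFile() or SaveToFile() raises an error, the Close() call in the example above is never reached. One way to guard against that is a small context manager. This is a sketch, not part of Spire.PDF: the names closing_document and convert_pdf_to_html are hypothetical, and the explicit import inside the function assumes that PdfDocument and FileFormat can be imported by name, as the star-imports above suggest.

```python
from contextlib import contextmanager


@contextmanager
def closing_document(doc):
    """Yield doc and call its Close() method on exit, even after an error."""
    try:
        yield doc
    finally:
        doc.Close()


def convert_pdf_to_html(pdf_path: str, html_path: str) -> None:
    """Convert one PDF to HTML using the same calls as the example above."""
    # Imported here so this helper can be defined without Spire.PDF installed.
    # Assumption: both names are importable directly from spire.pdf.
    from spire.pdf import PdfDocument, FileFormat

    with closing_document(PdfDocument()) as doc:
        doc.LoadFromFile(pdf_path)
        doc.SaveToFile(html_path, FileFormat.HTML)
```

The wrapper works with any object that exposes a Close() method, so the same pattern could be reused for other Spire objects that need releasing.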
Customize Conversion Options When Converting PDF to HTML with Python
PDF files typically contain both text and images. During conversion, developers can choose whether images are embedded in the output and how they are rendered. Calling the SetPdfToHtmlOptions() method lets you customize the final HTML output.
Here are the parameters it accepts:
useEmbeddedSvg (bool). Whether to embed SVG in the output file.
useEmbeddedImg (bool). Whether to embed images in the resulting HTML file. It takes effect only when useEmbeddedSvg is set to False.
maxPageOneFile (int). The number of pages allowed per HTML file. It likewise applies when useEmbeddedSvg is set to False.
useHighQualityEmbeddedSvg (bool). Whether to use high-quality embedded SVG in the resulting HTML document. It takes effect only when useEmbeddedSvg is set to True.
Steps to convert PDF to HTML and customize conversion options:
Create an instance of the PdfDocument class.
Open a source PDF file from the disk using the PdfDocument.LoadFromFile() method.
Get PdfConvertOptions with the PdfDocument.ConvertOptions property.
Set conversion options using the PdfConvertOptions.SetPdfToHtmlOptions() method.
Save the output HTML document as a new file by calling the PdfDocument.SaveToFile() method.
Below is a code example that customizes the conversion settings when converting PDF to HTML:
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Read a PDF document from the file
doc.LoadFromFile("/test.pdf")
# Set the conversion options to embed images in the resulting HTML and limit one page per HTML file
pdfToHtmlOptions = doc.ConvertOptions
pdfToHtmlOptions.SetPdfToHtmlOptions(False, True, 1, False)
# Save the PDF document to HTML format
doc.SaveToFile("/PdfToHtmlWithCustomOptions.html", FileFormat.HTML)
# Close the document
doc.Close()
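Four positional arguments, three of them booleans, are easy to misorder. A thin wrapper that names each one can make a call like the one above more readable. This is only a sketch: the HtmlOptions class and its methods are hypothetical and not part of Spire.PDF, and the argument order and the "takes effect when" rules are taken from the parameter list earlier in this section.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HtmlOptions:
    """Named stand-ins for the four SetPdfToHtmlOptions() arguments."""

    use_embedded_svg: bool
    use_embedded_img: bool
    max_pages_per_file: int
    use_high_quality_svg: bool

    def as_args(self) -> Tuple[bool, bool, int, bool]:
        """Return the values in the positional order used in the example above."""
        return (
            self.use_embedded_svg,
            self.use_embedded_img,
            self.max_pages_per_file,
            self.use_high_quality_svg,
        )

    def ignored_settings(self) -> List[str]:
        """Name the settings that have no effect, per this article's parameter notes."""
        if self.use_embedded_svg:
            return ["use_embedded_img", "max_pages_per_file"]
        return ["use_high_quality_svg"]


# Same values as the example above: embed images, one page per HTML file.
options = HtmlOptions(
    use_embedded_svg=False,
    use_embedded_img=True,
    max_pages_per_file=1,
    use_high_quality_svg=False,
)
# With a loaded PdfDocument named doc, the call would then read:
# doc.ConvertOptions.SetPdfToHtmlOptions(*options.as_args())
```

Checking ignored_settings() before converting is a cheap way to notice a combination of options that cannot all take effect.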
Convert PDF to HTML Stream in Python
If you plan to process the HTML data further, consider saving the PDF document to an HTML stream. A stream can be handed directly to other code, which is convenient when the result is headed for a database or will be transmitted elsewhere.
Steps to convert a PDF to an HTML stream in Python:
Create an instance of the PdfDocument class.
Specify the file path to load a PDF document using the PdfDocument.LoadFromFile() method.
Create an object of the Stream class.
Save the PDF to HTML stream with the PdfDocument.SaveToStream() method and release the resources.
Here is a code example of converting a PDF to an HTML stream:
from spire.pdf.common import *
from spire.pdf import *
# Create an object of the PdfDocument class
doc = PdfDocument()
# Read a PDF document as the source file
doc.LoadFromFile("E/test.pdf")
# Create a stream and save the PDF document to it in HTML format
fileStream = Stream("/PdfToHtmlStream.html")
doc.SaveToStream(fileStream, FileFormat.HTML)
# Release resources
fileStream.Close()
# Close the document
doc.Close()
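As one illustration of further processing, the generated markup can be inspected with Python's standard html.parser module. The sketch below counts the tags in an HTML string. The TagCounter class, the count_tags function, and the sample markup are all hypothetical, and the real output of the conversion will have its own structure, so treat this as a starting point only.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag appears in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.counts[tag] += 1


def count_tags(html_text: str) -> Counter:
    """Parse html_text and return a Counter of start tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    # Hypothetical sample markup, not actual converter output.
    sample = "<html><body><img src='a.png'><p>Hi</p><p>there</p></body></html>"
    print(count_tags(sample))
```

Comparing the number of img and svg tags may give a rough indication of which embedding option was applied to a given output.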
Conclusion
In this article, you have learned how to convert PDF to HTML with Python in just three steps. You can also control how images are handled by setting conversion options, and save a PDF as an HTML stream if you need to process the output further. We hope you find it useful!
Written by Casie Liu