Efficiently Parse RPM's XML Metadata Files with SAX Parser and pulldom
Scope
Uyuni is an open source systems management solution that can be used to manage multiple Linux distributions.
As part of the Google Summer of Code 2024 program, we are developing a lazy reposync service, written in Python, to replace the old reposync.
The new lazy reposync service should be able to retrieve package metadata for both Red Hat and Debian distributions, using only the repository's metadata.
For more detailed information about the project, please refer to this GitHub issue.
RPM Metadata
Most of the RPM package metadata for a given Red Hat distribution can be found in a file called primary.xml on its website, usually under the ../repodata path.
(Note that other files in repodata, such as filelists.xml, may also contain useful package metadata.)
The new lazy reposync service is responsible for downloading and parsing this file, and for importing its metadata into the local database.
Here's an extract of a sample primary.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="590">
  <package type="rpm">
    <name>gstreamer-plugins-bad</name>
    <arch>aarch64</arch>
    <version epoch="0" ver="1.22.0" rel="lp155.3.4.1"/>
    <checksum type="sha256" pkgid="YES">5f32047b55c0ca2dcc00a00270cd0b10df4df40c6cd9355eeff9b6aa0997657b</checksum>
    <summary>GStreamer Streaming-Media Framework Plug-Ins</summary>
    <description>GStreamer is a streaming media framework based on graphs of filters
that operate on media data. Applications using this library can do
anything media-related,from real-time sound processing to playing
videos. Its plug-in-based architecture means that new data types or
processing capabilities can be added simply by installing new plug-ins.</description>
    <packager>http://bugs.opensuse.org</packager>
    <url>https://gstreamer.freedesktop.org</url>
    <time file="1700989109" build="1696852591"/>
    <size package="2197472" installed="11946385" archive="11965968"/>
    <location href="aarch64/gstreamer-plugins-bad-1.22.0-lp155.3.4.1.aarch64.rpm"/>
    ...
  </package>
Parsing with SAX Parser
SAX Parser
SAX (Simple API for XML) is an event-driven online algorithm for lexing and parsing XML documents, with an API developed by the XML-DEV mailing list.[1]
SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM). Where the DOM operates on the document as a whole, building the full abstract syntax tree of an XML document for the convenience of the user, SAX parsers operate on each piece of the XML document sequentially, issuing parsing events while making a single pass through the input stream.
This makes SAX memory-efficient: unlike DOM, it does not load the whole document into memory.
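To make the event-driven model concrete, here is a minimal sketch using Python's standard xml.sax module. It is not part of the lazy reposync code: the NameCollector class, the collect_names function and the two-package SAMPLE document are made up for illustration. The handler receives start, character and end events as the parser streams through the input and keeps only the text of each name element.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

COMMON_NS = "http://linux.duke.edu/metadata/common"

# Hypothetical, heavily shortened primary.xml content (illustration only)
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" packages="2">
  <package type="rpm"><name>foo</name><arch>x86_64</arch></package>
  <package type="rpm"><name>bar</name><arch>src</arch></package>
</metadata>"""


class NameCollector(ContentHandler):
    """Collects the text of every <name> element as events arrive."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._text = None  # None means "not inside a <name> element"

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, name is a (namespace URI, local name) tuple
        if name == (COMMON_NS, "name"):
            self._text = ""

    def characters(self, content):
        # May be called several times for a single text node
        if self._text is not None:
            self._text += content

    def endElementNS(self, name, qname):
        if name == (COMMON_NS, "name"):
            self.names.append(self._text)
            self._text = None


def collect_names(stream):
    parser = xml.sax.make_parser()
    # startElementNS/endElementNS are only called when namespaces are enabled
    parser.setFeature(feature_namespaces, True)
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(stream)
    return handler.names


print(collect_names(io.BytesIO(SAMPLE)))  # ['foo', 'bar']
```

At no point does the parser hold the whole document: each event is handled and then discarded, and only the collected names stay in memory.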
Implementation
The following is an implementation of a SAX handler that parses RPM's primary.xml metadata file.
Note: Some intermediate functions are hidden for code clarity.
import logging
import xml.sax
from typing import List
from lzreposync.importUtils import import_package_batch
from lzreposync.rpm_repo import RPMHeader
from spacewalk.server.importlib.importLib import Package, Checksum, Dependency
COMMON_NS = "http://linux.duke.edu/metadata/common"
RPM_NS = "http://linux.duke.edu/metadata/rpm"
...
class Handler(xml.sax.ContentHandler):
    """
    SAX parser handler for repository primary.xml files.
    """

    def __init__(self, batch_size=20):
        super().__init__()
        ...

    def startElementNS(self, name, qname, attrs):
        if name == (COMMON_NS, "package"):
            self.package = Package()
            self.package['header'] = RPMHeader()
        elif self.package is not None and name[0] == COMMON_NS and name[1] in self.searched_attrs:
            if name[1] == "checksum":
                self.set_checksum(attrs)
            else:
                # Dealing with elements with attributes. Eg: <version epoch="0" ver="1.22.0" rel="lp155.3.4.1"/>
                for attr_name in self.searched_attrs[name[1]]:
                    self.set_element_attribute(attr_name, name[1], attrs)
        elif self.package is not None and (name[0] == COMMON_NS or name[0] == RPM_NS) and name[1] in self.searched_chars:
            if is_complex(name[1]):
                # Dealing with list/nested attributes. Eg: ["provides", "requires", "enhances", "obsoletes"]
                self.currentParent = name[1]
            elif len(attrs) > 0:
                # Rpm element with attributes. Eg: <rpm:header-range start="6200" end="149568"/>
                for attr_name in self.searched_attrs[name[1]]:
                    self.set_element_attribute(attr_name, name[1], attrs)
            else:
                self.text = ""
        elif self.package is not None and name[0] == RPM_NS and name[1] == "entry":
            self.add_dependency(attrs)

    def characters(self, content):
        if self.text is not None:
            self.text += content

    def endElementNS(self, name, qname):
        if name == (COMMON_NS, "package"):
            self.count += 1
            self.batch.append(self.package)
            if self.count >= self.batch_size:
                # Import current batch
                self.batch_index += 1
                import_package_batch(self.batch, self.batch_index, self.batch_size)
                self.count = 0
                self.batch = []
        elif self.package is not None and (name[0] == COMMON_NS or name[0] == RPM_NS) and name[1] in self.searched_chars:
            if name[1] == "arch":
                # Tagging 'source' packages (arch == "src") with is_source = True
                if self.text == "src":
                    self.package['header'].is_source = True
            if name[1] == "checksum":
                self.currentElement['value'] = self.text
                self.package["checksum"] = self.currentElement
            elif is_complex(name[1]):
                self.package[name[1]] = self.attributes_stack  # eg: [Dependency] of 'provides' attribute
                self.currentParent = None
                self.attributes_stack = []
            else:
                self.package[name[1]] = self.text
            self.text = None
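The code that drives the handler is among the hidden parts. One plausible way to wire it up with the standard library is sketched below; this is an assumption, not necessarily how lazy reposync does it. The run_sax_parser function and the PackageCounter stand-in handler are hypothetical names. The important detail is that namespace processing has to be switched on, otherwise the parser calls startElement/endElement instead of the startElementNS/endElementNS methods used above.

```python
import gzip
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

COMMON_NS = "http://linux.duke.edu/metadata/common"


def run_sax_parser(handler, primary_gz):
    """Stream a gzip-compressed primary.xml (path or file object) through handler."""
    parser = xml.sax.make_parser()
    # Without this feature the *NS callbacks are never invoked
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    with gzip.open(primary_gz) as stream:
        parser.parse(stream)  # decompresses and parses chunk by chunk
    return handler


class PackageCounter(ContentHandler):
    """Hypothetical stand-in for the article's Handler: counts packages."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def endElementNS(self, name, qname):
        if name == (COMMON_NS, "package"):
            self.count += 1


# Made-up sample data, compressed in memory for the demonstration
SAMPLE_GZ = gzip.compress(
    b'<metadata xmlns="http://linux.duke.edu/metadata/common">'
    b'<package type="rpm"><name>foo</name></package>'
    b'<package type="rpm"><name>bar</name></package>'
    b"</metadata>"
)

counter = run_sax_parser(PackageCounter(), io.BytesIO(SAMPLE_GZ))
print(counter.count)  # 2
```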
Limitation
The problem with the SAX parser is that the parsing function cannot return a value after a given number of parsed elements; we have to wait until the whole file is parsed to get a return value.
Alternatively, if we want to apply a function to each portion of parsed packages, we have to call that function from within the handler, as the batch import in endElementNS above does. This couples the processing logic to the parser, which complicates the code and makes it harder to read and understand.
In this case, making the SAX parsing function return a generator of parsed packages is not straightforward and might be very complicated.
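The sketch below illustrates this inversion of control with a simplified handler. BatchingHandler, parse_in_batches and the sample data are hypothetical and only collect package names. The caller cannot loop over packages itself; it can only hand a callback to the parser and wait for parse() to finish.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Made-up sample data with three packages
SAMPLE = (
    b'<metadata xmlns="http://linux.duke.edu/metadata/common">'
    b"<package><name>a</name></package>"
    b"<package><name>b</name></package>"
    b"<package><name>c</name></package>"
    b"</metadata>"
)


class BatchingHandler(ContentHandler):
    """Passes package names to a callback in batches, from inside the parser."""

    def __init__(self, on_batch, batch_size=2):
        super().__init__()
        self.on_batch = on_batch
        self.batch_size = batch_size
        self.batch = []
        self._text = None

    def startElementNS(self, name, qname, attrs):
        if name[1] == "name":
            self._text = ""

    def characters(self, content):
        if self._text is not None:
            self._text += content

    def endElementNS(self, name, qname):
        if name[1] == "name":
            self.batch.append(self._text)
            self._text = None
            if len(self.batch) >= self.batch_size:
                self._flush()

    def endDocument(self):
        # Hand over the last, possibly incomplete, batch
        if self.batch:
            self._flush()

    def _flush(self):
        self.on_batch(self.batch)
        self.batch = []


def parse_in_batches(xml_bytes, on_batch, batch_size=2):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(BatchingHandler(on_batch, batch_size))
    parser.parse(io.BytesIO(xml_bytes))  # returns only after the whole input


received = []
parse_in_batches(SAMPLE, received.append)
print(received)  # [['a', 'b'], ['c']]
```

The processing step (here received.append) runs inside the parser's call stack, so the batching logic, the flushing of the last batch and the processing all end up tied to the handler.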
Parsing with Pulldom
Pulldom Parser
The xml.dom.pulldom module provides a “pull parser” which can also be asked to produce DOM-accessible fragments of the document where necessary.
The basic concept involves pulling “events” from a stream of incoming XML and processing them.
In contrast to SAX which also employs an event-driven processing model together with callbacks, the user of a pull parser is responsible for explicitly pulling events from the stream, looping over those events until either processing is finished or an error condition occurs. [1]
We'll be using the pulldom parser to overcome the limitation presented by the SAX parser, and to be able to make the parsing function return a generator of packages.
For more details about Python's generator functions, please see: https://realpython.com/introduction-to-python-generators/
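As a minimal illustration of the pull model, the following sketch uses only the standard library and a made-up two-package document; iter_package_names and SAMPLE are hypothetical names. It loops over the event stream, expands each package element into a small DOM fragment, and yields the package name, so the caller controls how far parsing advances.

```python
import io
from xml.dom import pulldom

COMMON_NS = "http://linux.duke.edu/metadata/common"

# Hypothetical, heavily shortened primary.xml content (illustration only)
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" packages="2">
  <package type="rpm"><name>foo</name><arch>x86_64</arch></package>
  <package type="rpm"><name>bar</name><arch>src</arch></package>
</metadata>"""


def iter_package_names(stream):
    """Yield the name of each package, one at a time."""
    doc = pulldom.parse(stream)
    for event, node in doc:
        if (
            event == pulldom.START_ELEMENT
            and node.namespaceURI == COMMON_NS
            and node.localName == "package"
        ):
            # Build a DOM fragment for this <package> element only
            doc.expandNode(node)
            name_el = node.getElementsByTagNameNS(COMMON_NS, "name")[0]
            # Join text nodes in case the text arrived in several pieces
            yield "".join(
                child.data
                for child in name_el.childNodes
                if child.nodeType == child.TEXT_NODE
            )


names = iter_package_names(io.BytesIO(SAMPLE))
print(next(names))  # foo
print(next(names))  # bar
```

Each call to next() advances the parser just far enough to produce one more package.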
Implementation
Here's a portion of the pulldom-based implementation that parses the primary.xml file.
Note: Some intermediate functions are hidden for code clarity.
def parse_primary(self):
    """
    Parse the given primary.xml file (gzip format) using xml.dom.pulldom.
    This is incremental parsing: the XML file is not loaded into memory
    all at once, but package by package.
    """
    if not self.primaryFile:
        print("Error: primary_file not defined!")
        raise ValueError("primary_file missing")

    with gzip.open(self.primaryFile) as gz_primary:
        doc = pulldom.parse(gz_primary)
        for event, node in doc:
            if event == pulldom.START_ELEMENT and node.namespaceURI == COMMON_NS and node.tagName == "package":
                # New package
                doc.expandNode(node)
                self.currentPackage = Package()

                # Tagging 'source' and 'binary' packages
                self.set_pacakge_header(node)

                # Parsing package's metadata
                for child_node in node.childNodes:
                    if child_node.nodeType == child_node.ELEMENT_NODE:
                        self.set_element_node(child_node)
                yield self.currentPackage
As you can see, the yield keyword at the bottom turns the function into a generator function: calling it returns a generator of packages, which means that each package is parsed and returned only when requested.
Here's an example of how we can consume the packages yielded by that parser in batches:
from itertools import islice


def batched(iterable, n):
    # see: https://docs.python.org/3/library/itertools.html#itertools.batched
    if n < 1:
        raise ValueError('n must be at least one')
    iterator = iter(iterable)
    # Take up to n items at a time until the iterator is exhausted
    while batch := tuple(islice(iterator, n)):
        yield batch


def main():
    ...
    packages = parser.parse_primary()  # packages is a generator
    for batch in batched(packages, args.batch_size):
        print(f"Importing a batch of {len(batch)} packages...")
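To see that batches really are produced on demand, here is a small self-contained sketch. The fake_packages generator is a hypothetical stand-in for parser.parse_primary(), and the produced list only records which items have been generated so far.

```python
from itertools import islice


def batched(iterable, n):
    """Yield tuples of up to n items (same recipe as above)."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch


produced = []  # records which packages have actually been generated


def fake_packages(total):
    """Hypothetical stand-in for parser.parse_primary()."""
    for i in range(total):
        produced.append(i)
        yield f"pkg-{i}"


batches = batched(fake_packages(5), 2)

first = next(batches)
print(first, produced)  # ('pkg-0', 'pkg-1') [0, 1]

rest = list(batches)
print(rest)  # [('pkg-2', 'pkg-3'), ('pkg-4',)]
```

After the first batch is requested, only two packages have been generated; the remaining ones are produced when the following batches are consumed.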
Conclusion
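We have demonstrated how to efficiently parse RPM's primary.xml metadata file using two methods: a SAX parser and pulldom.
Both methods are memory-efficient and well suited to large XML files. The SAX parser pushes events to handler callbacks, while pulldom lets the caller pull packages one at a time through a generator.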
Written by haroune hassine