Efficiently Parse RPM's XML Metadata Files with SAX Parser and pulldom

haroune hassine

Scope

Uyuni is an open source systems management solution that can be used to manage multiple Linux distributions.

As part of the Google Summer of Code 2024 program, we are developing a lazy reposync service, written in Python, to replace the old reposync.

The new lazy reposync service should be able to retrieve package metadata for both Red Hat and Debian distributions, using only the repository's metadata.

For more detailed information about the project, please refer to this GitHub issue.

RPM Metadata

Most of the metadata for the RPM packages of a given Red Hat distribution is found in a file called primary.xml, published in the distribution's repository, usually under the ../repodata path.

(Note that other files in repodata, such as filelists.xml, may also hold useful package metadata.)

The new lazy reposync service is responsible for downloading and parsing this file, and for importing its metadata into the local database.

Here's an extract of a sample primary.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="590">
<package type="rpm">
  <name>gstreamer-plugins-bad</name>
  <arch>aarch64</arch>
  <version epoch="0" ver="1.22.0" rel="lp155.3.4.1"/>
  <checksum type="sha256" pkgid="YES">5f32047b55c0ca2dcc00a00270cd0b10df4df40c6cd9355eeff9b6aa0997657b</checksum>
  <summary>GStreamer Streaming-Media Framework Plug-Ins</summary>
  <description>GStreamer is a streaming media framework based on graphs of filters
that operate on media data. Applications using this library can do
anything media-related,from real-time sound processing to playing
videos. Its plug-in-based architecture means that new data types or
processing capabilities can be added simply by installing new plug-ins.</description>
  <packager>http://bugs.opensuse.org</packager>
  <url>https://gstreamer.freedesktop.org</url>
  <time file="1700989109" build="1696852591"/>
  <size package="2197472" installed="11946385" archive="11965968"/>
  <location href="aarch64/gstreamer-plugins-bad-1.22.0-lp155.3.4.1.aarch64.rpm"/>
...
</package>

Parsing with SAX Parser

SAX Parser

SAX (Simple API for XML) is an event-driven online algorithm for lexing and parsing XML documents, with an API developed by the XML-DEV mailing list.[1]

SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM). Where the DOM operates on the document as a whole, building the full abstract syntax tree of an XML document for the convenience of the user, SAX parsers operate on each piece of the XML document sequentially, issuing parsing events while making a single pass through the input stream.

This makes SAX memory-efficient: unlike DOM, it does not load the whole document into memory.
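To make the event-driven model concrete before the full handler, here is a minimal sketch that collects only the package names from a tiny in-memory document. The sample document, the NameCollector class and the collect_names function are made up for illustration and are not part of the lzreposync code; only the xml.sax calls come from the standard library.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

COMMON_NS = "http://linux.duke.edu/metadata/common"

# Hypothetical two-package document, trimmed down to the <name> element.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" packages="2">
  <package type="rpm"><name>foo</name></package>
  <package type="rpm"><name>bar</name></package>
</metadata>"""


class NameCollector(ContentHandler):
    """Collects the text of every <name> element as events stream by."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._text = None  # None means "not inside a <name> element"

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, name is a (uri, localname) tuple.
        if name == (COMMON_NS, "name"):
            self._text = ""

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        if self._text is not None:
            self._text += content

    def endElementNS(self, name, qname):
        if name == (COMMON_NS, "name"):
            self.names.append(self._text)
            self._text = None


def collect_names(xml_bytes):
    parser = xml.sax.make_parser()
    # Without this feature the parser calls startElement/endElement
    # instead of the *NS variants used above.
    parser.setFeature(feature_namespaces, True)
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names
```

Calling collect_names(SAMPLE) returns the names in document order. Note that the result only becomes available once parse() has consumed the whole input.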

Implementation

The following is an implementation of a SAX handler that parses RPM's primary.xml metadata file.

Note: Some intermediate functions are hidden for code clarity.

import logging
import xml.sax
from typing import List

from lzreposync.importUtils import import_package_batch
from lzreposync.rpm_repo import RPMHeader
from spacewalk.server.importlib.importLib import Package, Checksum, Dependency

COMMON_NS = "http://linux.duke.edu/metadata/common"
RPM_NS = "http://linux.duke.edu/metadata/rpm"

...

class Handler(xml.sax.ContentHandler):
    """
    SAX parser handler for repository primary.xml files.
    """

    def __init__(self, batch_size=20):
        super().__init__()
        ...

    def startElementNS(self, name, qname, attrs):
        if name == (COMMON_NS, "package"):
            self.package = Package()
            self.package['header'] = RPMHeader() 
        elif self.package is not None and name[0] == COMMON_NS and name[1] in self.searched_attrs:
            if name[1] == "checksum":
                self.set_checksum(attrs)
            else:
                # Dealing with elements with attributes. Eg: <version epoch="0" ver="1.22.0" rel="lp155.3.4.1"/>
                for attr_name in self.searched_attrs[name[1]]:
                    self.set_element_attribute(attr_name, name[1], attrs)
        elif self.package is not None and (name[0] == COMMON_NS or name[0] == RPM_NS) and name[1] in self.searched_chars:
            if is_complex(name[1]):
                # Dealing with list/nested attributes. Eg: ["provides", "requires", "enhances", "obsoletes"]
                self.currentParent = name[1]
            elif len(attrs) > 0:
                # Rpm element with attributes. Eg: <rpm:header-range start="6200" end="149568"/>
                for attr_name in self.searched_attrs[name[1]]:
                    self.set_element_attribute(attr_name, name[1], attrs)
            else:
                self.text = ""
        elif self.package is not None and name[0] == RPM_NS and name[1] == "entry":
            self.add_dependency(attrs)

    def characters(self, content):
        if self.text is not None:
            self.text += content

    def endElementNS(self, name, qname):
        if name == (COMMON_NS, "package"):
            self.count += 1
            self.batch.append(self.package)
            if self.count >= self.batch_size:
                # Import current batch
                self.batch_index += 1
                import_package_batch(self.batch, self.batch_index, self.batch_size)
                self.count = 0
                self.batch = []
        elif self.package is not None and (name[0] == COMMON_NS or name[0] == RPM_NS) and name[1] in self.searched_chars:
            if name[1] == "arch":
                # Source packages have arch "src": flag them with is_source = True in the header
                if self.text == "src":
                    self.package['header'].is_source = True
            if name[1] == "checksum":
                self.currentElement['value'] = self.text
                self.package["checksum"] = self.currentElement
            elif is_complex(name[1]):
                self.package[name[1]] = self.attributes_stack  # eg: [Dependency] of 'provides' attribute
                self.currentParent = None
                self.attributes_stack = []
            else:
                self.package[name[1]] = self.text
            self.text = None

Limitation

The problem with the SAX parser is that it pushes events to our handler callbacks: the parsing function cannot hand back a value after a given number of parsed elements, and we only regain control once the whole file has been parsed.

If we want to apply a function to a portion of the parsed packages at a time, we have to call that function from inside the handler, as endElementNS does with import_package_batch above. This couples the import logic to the parser and makes the code harder to read and understand.

For the same reason, making the SAX parsing function return a generator of parsed packages is not straightforward and might be very complicated.
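The coupling described above can be seen in a stripped-down sketch. Here the per-batch work is passed in as a callback, because the handler has no way to return batches to its caller. BatchingHandler, parse_in_batches, on_batch and the sample document are hypothetical names used only for illustration; flushing the last partial batch in endDocument is a choice made for this sketch, not something taken from the handler shown earlier.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

COMMON_NS = "http://linux.duke.edu/metadata/common"

# Hypothetical three-package document, trimmed down to the <name> element.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" packages="3">
  <package type="rpm"><name>a</name></package>
  <package type="rpm"><name>b</name></package>
  <package type="rpm"><name>c</name></package>
</metadata>"""


class BatchingHandler(ContentHandler):
    """Pushes batches of package names to a caller-supplied callback."""

    def __init__(self, on_batch, batch_size):
        super().__init__()
        self.on_batch = on_batch
        self.batch_size = batch_size
        self.batch = []
        self._text = None

    def startElementNS(self, name, qname, attrs):
        if name == (COMMON_NS, "name"):
            self._text = ""

    def characters(self, content):
        if self._text is not None:
            self._text += content

    def endElementNS(self, name, qname):
        if name == (COMMON_NS, "name"):
            self.batch.append(self._text)
            self._text = None
            if len(self.batch) >= self.batch_size:
                self.flush()

    def endDocument(self):
        # Hand over the last, possibly partial, batch.
        self.flush()

    def flush(self):
        if self.batch:
            self.on_batch(self.batch)
            self.batch = []


def parse_in_batches(xml_bytes, on_batch, batch_size):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(BatchingHandler(on_batch, batch_size))
    # parse() drives the callbacks itself and returns nothing useful,
    # so the caller cannot iterate over batches at its own pace.
    parser.parse(io.BytesIO(xml_bytes))
```

The caller has to supply the processing step up front, for example parse_in_batches(SAMPLE, print, 2), instead of looping over the results itself.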

Parsing with Pulldom

Pulldom Parser

The xml.dom.pulldom module provides a “pull parser” which can also be asked to produce DOM-accessible fragments of the document where necessary.

The basic concept involves pulling “events” from a stream of incoming XML and processing them.

In contrast to SAX which also employs an event-driven processing model together with callbacks, the user of a pull parser is responsible for explicitly pulling events from the stream, looping over those events until either processing is finished or an error condition occurs. [1]

We'll be using the pulldom parser to overcome the limitation presented by the SAX parser, and to be able to make the parsing function return a generator of packages.

For more details about Python's generator functions, please see: https://realpython.com/introduction-to-python-generators/
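As a minimal sketch of the pull model, the following generator yields package names from a small in-memory document. The sample document and the iter_package_names function are hypothetical and much simpler than the real parser; the pulldom calls themselves (parse, START_ELEMENT, expandNode) are standard library API.

```python
import io
from xml.dom import pulldom

COMMON_NS = "http://linux.duke.edu/metadata/common"

# Hypothetical two-package document, trimmed down to the <name> element.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" packages="2">
  <package type="rpm"><name>foo</name></package>
  <package type="rpm"><name>bar</name></package>
</metadata>"""


def iter_package_names(stream):
    """Yield the <name> text of each <package>, one package at a time."""
    doc = pulldom.parse(stream)
    for event, node in doc:
        if (
            event == pulldom.START_ELEMENT
            and node.namespaceURI == COMMON_NS
            and node.localName == "package"
        ):
            # Build a DOM subtree for this package only.
            doc.expandNode(node)
            name_nodes = node.getElementsByTagNameNS(COMMON_NS, "name")
            # Join text children in case the text was split into several nodes.
            yield "".join(
                child.data
                for child in name_nodes[0].childNodes
                if child.nodeType == child.TEXT_NODE
            )
```

Because the loop is driven by the caller, each call to next() on iter_package_names(io.BytesIO(SAMPLE)) parses only as far as the next package.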

Implementation

Here's a portion of the pulldom-based implementation that parses the primary.xml file.
Note: Some intermediate functions are hidden for code clarity.

import gzip
from xml.dom import pulldom

...

def parse_primary(self):
        """
        Parse the given primary.xml file (gzip format) using xml.dom.pulldom. This is incremental parsing:
        the XML file is not loaded into memory all at once, but package by package.
        """
        if not self.primaryFile:
            print("Error: primary_file not defined!")
            raise ValueError("primary_file missing")

        with gzip.open(self.primaryFile) as gz_primary:
            doc = pulldom.parse(gz_primary)
            for event, node in doc:
                if event == pulldom.START_ELEMENT and node.namespaceURI == COMMON_NS and node.tagName == "package":
                    # New package
                    doc.expandNode(node)
                    self.currentPackage = Package()

                    # Tagging 'source' and 'binary' packages
                    self.set_pacakge_header(node)

                    # Parsing package's metadata
                    for child_node in node.childNodes:
                        if child_node.nodeType == child_node.ELEMENT_NODE:
                            self.set_element_node(child_node)

                    yield self.currentPackage

As you can see, the yield keyword at the bottom turns the function into a generator function: calling it returns a generator of packages, and each package is parsed and returned only when requested.

Here's an example of how we can consume the yielded packages of that parser in batches:

from itertools import islice

def batched(iterable, n):
    # see: https://docs.python.org/3/library/itertools.html#itertools.batched
    if n < 1:
        raise ValueError('n must be at least one')
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

def main():
    ...
    packages = parser.parse_primary()  # packages is a generator
    for batch in batched(packages, args.batch_size):
        print(f"Importing a batch of {len(batch)} packages...")

Conclusion

We have demonstrated how to efficiently parse RPM's primary.xml metadata file using two methods: the SAX parser and pulldom.

Both methods are memory-efficient and well suited to large XML files.
