XML and XML Schema : Documentation

Ronit Banerjee

XML (eXtensible Markup Language) and XML Schema are foundational technologies for the Semantic Web, and understanding them is crucial for working with more advanced Semantic Web technologies like RDF and OWL.

Introduction to XML

XML, or Extensible Markup Language, is a markup language similar to HTML. However, unlike HTML which has predefined tags, XML allows you to define your own tags tailored to your specific needs. This makes XML a powerful tool for storing data in a format that can be stored, searched, and shared. The fundamental format of XML is standardized, which means if you share or transmit XML across systems or platforms, the recipient can still parse the data due to the standardized XML syntax. This is explained in detail in this [source](https://developer.mozilla.org/en-US/docs/Web/XML/XML_introduction) from Mozilla Developer Network (MDN).

XML is the basis for many languages, including XHTML, MathML, SVG, RSS, and RDF. You can also define your own XML-based languages. The structure of an XML document is built on tags, with the XML declaration used for the transmission of the meta-data of a document.

For an XML document to be correct, it must be well-formed, meaning it conforms to all XML syntax rules. To also be valid, it must conform to semantic rules, which are usually set in an XML schema or a DTD (Document Type Definition).
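As a rough illustration of the well-formedness half of that rule, the sketch below uses Python's standard `xml.etree.ElementTree` parser, which raises `ParseError` on malformed input. The helper name `is_well_formed` is made up for this example, and the check says nothing about validity against a schema or DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = "<note><to>Tove</to></note>"
bad = "<note><to>Tove</note>"  # <to> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```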

XML also defines entities for referring to reserved characters that would otherwise be read as markup. Five of them are predefined:

- &lt; for the less than sign (`<`)

- &gt; for the greater than sign (`>`)

- &amp; for the ampersand (`&`)

- &quot; for one double-quotation mark (`"`)

- &apos; for one apostrophe (or single-quotation mark) (`'`)
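To see these entities in action, here is a small sketch using Python's standard library. `xml.sax.saxutils.escape` handles `&`, `<`, and `>` by default; the extra mapping for the two quote characters is passed explicitly. The sample string is invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample text containing reserved characters (made up for illustration).
raw = 'price < 10 & name = "tea"'

# escape() covers &, < and > by default; quotes need an explicit mapping.
escaped = escape(raw, {'"': "&quot;", "'": "&apos;"})
print(escaped)

# A parser turns the entities back into the original characters.
element = ET.fromstring(f"<msg>{escaped}</msg>")
print(element.text)
```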

XML is usually used for descriptive purposes, but there are ways to display XML data. If you don't define a specific way for the XML to be rendered, the raw XML is displayed in the browser. One way to style XML output is to specify CSS to apply to the document using the xml-stylesheet processing instruction. There is also another more powerful way to display XML: the Extensible Stylesheet Language Transformations (XSLT) which can be used to transform XML into other languages such as HTML. This makes XML incredibly versatile.

This is a high-level overview, and there's a lot more to XML than can be covered here. I recommend reading the MDN source linked above for a more in-depth understanding.

XML Syntax

XML is often used for data storage and transport. Here are some basic components of XML syntax:

Elements: XML documents contain elements, defined by a start tag and an end tag. For example, `<name>John Doe</name>`. Elements can contain text, other elements, or be empty.

Attributes: Elements can have attributes, which are name-value pairs. For example, `<person id="1">John Doe</person>`. Here, `id` is an attribute of the `person` element.

Namespaces: XML namespaces are used to avoid element name conflicts. If two different XML-based languages or vocabularies have elements or attributes with the same name, namespaces can be used to distinguish them.
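The element and attribute pieces described above map directly onto what a parser hands back. A minimal sketch with Python's standard `xml.etree.ElementTree`, reusing the `person` example:

```python
import xml.etree.ElementTree as ET

person = ET.fromstring('<person id="1">John Doe</person>')

print(person.tag)        # the element name
print(person.text)       # the text content
print(person.get("id"))  # the attribute value, always returned as a string
```

Note that attribute values come back as strings; converting `"1"` to a number is up to the caller.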

Here are a couple of XML examples from the [W3Schools XML Tutorial](https://www.w3schools.com/xml/):

Example 1:

    <?xml version="1.0" encoding="UTF-8"?>
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>

Example 2:

    <?xml version="1.0" encoding="UTF-8"?>
    <breakfast_menu>
      <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>
          Two of our famous Belgian Waffles with plenty of real maple syrup
        </description>
        <calories>650</calories>
      </food>
      <!-- More food items... -->
    </breakfast_menu>

In the first example, note is an element that contains other elements: to, from, heading, and body. In the second example, breakfast_menu is an element that contains food elements, each of which has name, price, description, and calories elements.
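To make the nesting concrete, here is a sketch that reads the breakfast menu with Python's standard `xml.etree.ElementTree`. The XML declaration is left out of the string for brevity, and only the one food item from the example is included.

```python
import xml.etree.ElementTree as ET

MENU = """<breakfast_menu>
  <food>
    <name>Belgian Waffles</name>
    <price>$5.95</price>
    <description>
      Two of our famous Belgian Waffles with plenty of real maple syrup
    </description>
    <calories>650</calories>
  </food>
</breakfast_menu>"""

root = ET.fromstring(MENU)

# Each <food> child carries its own name, price, description and calories.
for food in root.findall("food"):
    name = food.findtext("name")
    price = food.findtext("price")
    calories = int(food.findtext("calories"))
    print(f"{name}: {price}, {calories} calories")
```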

This tutorial also covers important XML standards such as XML AJAX, XML DOM, XML XPath, XML XSLT, XML XQuery, XML DTD, and XML Schema. It's a great resource for anyone looking to learn XML.

XML Trees and DOM

XML documents form a tree structure that starts at "the root" and branches to "the leaves". Each element in the XML document can have child elements, attributes, and text.

The tree structure of XML is intuitive and flexible, allowing complex data structures to be described in a consistent and easy-to-process way. In an XML tree, each internal node represents an XML element, and the leaf nodes represent the content of the elements. The root of the tree represents the XML document itself. This structure allows for easy navigation and manipulation of the XML data.
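One way to see the tree is to walk it recursively. The sketch below (the function name `outline` is made up for this example) uses Python's standard `xml.etree.ElementTree` and produces one indented line per element, depth-first.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one indented line per element, visiting children depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(
    "<breakfast_menu><food>"
    "<name>Belgian Waffles</name><price>$5.95</price>"
    "<description>Two of our famous Belgian Waffles</description>"
    "<calories>650</calories>"
    "</food></breakfast_menu>"
)

print("\n".join(outline(root)))
```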

Here is a resource that provides a detailed explanation of the tree structure of XML documents: [XML Tree Structure](https://www.tutorialspoint.com/xml/xml_tree_structure.htm)

The Document Object Model (DOM) is a programming interface for XML and HTML documents. It represents the structure of a document and allows programs and scripts to manipulate the document's structure, style, and content. The DOM represents a document as a tree structure, where each node in the tree corresponds to a part of the document (such as an XML or HTML tag).

The DOM provides a representation of the document as a structured group of nodes and objects, possessing various properties and methods. Nodes can also have event handlers attached to them. Once an event is triggered, the event handlers get executed.
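As a small sketch of the node-based view, Python's standard `xml.dom.minidom` module offers a minimal DOM implementation. It covers node creation and navigation, though not the event handling mentioned above, which is a browser-side feature.

```python
from xml.dom import minidom

doc = minidom.parseString("<note><to>Tove</to><from>Jani</from></note>")
root = doc.documentElement

# Navigate: the text inside <to> is itself a child node of the element.
to_node = doc.getElementsByTagName("to")[0]
print(to_node.firstChild.data)

# Manipulate: create a new element with a text node and attach it to the root.
heading = doc.createElement("heading")
heading.appendChild(doc.createTextNode("Reminder"))
root.appendChild(heading)

print(root.toxml())
```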

Here are a couple of resources that provide a detailed explanation of how to work with XML using the Document Object Model (DOM):

1. [XMLDocument - Web APIs | MDN](https://developer.mozilla.org/en-US/docs/Web/API/XMLDocument)

2. [XML DOM - XML Files](https://www.xmlfiles.com/dom/)

Research Paper:

The research paper titled "GPML: an XML-based standard for the interchange of genetic programming trees" discusses the use of XML for representing genetic programming trees. The paper proposes a Genetic Programming Markup Language (GPML), an XML-based standard for the interchange of genetic programming trees. The authors present a formal definition of this standard and describe details of an implementation. They also discuss the use of the Document Object Model (DOM) for GP evolution. You can read the full paper [here](http://dx.doi.org/10.1007/s10710-019-09370-4).

XML Namespaces

XML namespaces are used to avoid name conflicts in XML documents. They are essential for two primary reasons: to disambiguate between two elements that share the same name and to group elements relating to a common idea together. For example, in (X)HTML there is a `table` element, and there is also an element of the same name in XSL-FO. Similarly, `a`, `title`, and `style` are all elements in both (X)HTML and SVG. XML namespaces help distinguish between these elements that share the same name but belong to different XML vocabularies.

An XML namespace is identified by a unique URI (Uniform Resource Identifier). In an XML document, the URI is associated with a prefix, and this prefix is used with each element to indicate to which namespace the element belongs. For example, `rdf:Description`, `xsl:template`, and `zblsa:data` are all examples of elements with namespace prefixes. The part before the colon is the prefix, the part after the colon is the local part, any prefixed element is a qualified name, and any un-prefixed element is an unqualified name.

To use a namespace, you first associate the URI with a prefix. This is done using the `xmlns` attribute. For example, `<foo:tag xmlns:foo="http://me.com/namespaces/foofoo">` defines `foo` as the prefix for the namespace for that element tag. The attribute prefixed with `xmlns` works like a command to say "link the following letters to a URI". As no well-formed document can contain two identical attributes on the same element, the part that appears after the colon stops the same prefix being defined twice simultaneously.
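A parser resolves the prefix to its URI, so what matters is the URI and the local name, not the prefix itself. The sketch below shows this with Python's standard `xml.etree.ElementTree`, which reports names as `{uri}local`. The `foo:child` element and the `f` lookup prefix are invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<foo:tag xmlns:foo="http://me.com/namespaces/foofoo">'
    "<foo:child>hello</foo:child>"  # hypothetical child element
    "</foo:tag>"
)

# The prefix is gone; the tag carries the namespace URI instead.
print(doc.tag)

# Lookups can use any prefix, as long as it maps to the same URI.
ns = {"f": "http://me.com/namespaces/foofoo"}
child = doc.find("f:child", ns)
print(child.text)
```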

It's also possible to define multiple prefixes for the same namespace or the same prefix for multiple namespaces, depending on their context. However, it's not recommended to use the same prefix to refer to different namespaces, as it can lead to confusion.

Attributes can also be placed in a specific namespace or left unqualified. The normal "rule" for attributes is to place them within a namespace only if the attribute in question is defined by a particular namespace. Attributes that have no namespace prefix are not defined by a namespace. Note that this is not the same as being in the default namespace.
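The difference between an unprefixed attribute and the default namespace is easy to check in code. In this sketch, the namespace URI `http://example.com/ns` and the `item` element are placeholders; the same URI is bound both as the default namespace and to the prefix `x`.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"  # placeholder namespace URI for illustration

elem = ET.fromstring(
    f'<item xmlns="{NS}" xmlns:x="{NS}" id="1" x:id="2"/>'
)

# The unprefixed element picks up the default namespace...
print(elem.tag)

# ...but the unprefixed attribute does not; only x:id is namespaced.
print(elem.attrib)
```

Because `id` and `x:id` have different expanded names, they can coexist on the same element without violating the no-duplicate-attributes rule.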

For more in-depth information, you can refer to the article ["XML Namespaces Explained"](https://www.sitepoint.com/xml-namespaces-explained/) on SitePoint.

In the context of scientific research, XML namespaces are used in the Systems Biology Markup Language (SBML), an XML-based format for representing quantitative models of biological interest. SBML uses XML namespaces to ensure the consistent specification of models, facilitating both software development and model exchange. More information about this can be found in the research paper ["The Systems Biology Markup Language (SBML): Language Specification for Level 3 Version 1 Core"](http://dx.doi.org/10.1038/npre.2010.4959.1).

XML Processing

Here are some resources and research papers that provide in-depth information on parsing and generating XML in different programming languages:

Java:

Oracle provides a comprehensive tutorial on Java API for XML Processing (JAXP) which allows you to parse, transform, validate and query XML documents using Java. You can find the tutorial here: [Java API for XML Processing (JAXP) Tutorial](https://docs.oracle.com/javase/tutorial/jaxp/TOC.html).

Python:

Python's official documentation provides a detailed guide on how to use the xml.etree.ElementTree module for parsing and creating XML data. You can find the guide here: [Python XML with ElementTree: Beginner's Guide](https://docs.python.org/3/library/xml.etree.elementtree.html).

JavaScript:

W3Schools provides a tutorial on XML DOM (Document Object Model) which allows you to access and manipulate XML documents using JavaScript. You can find the tutorial here: [XML DOM Tutorial](https://www.w3schools.com/xml/dom_intro.asp).

In addition to these resources, the following research paper delves into the topic:

1. "A formal MIM specification and tools for the common exchange of MIM diagrams: an XML-Based format, an API, and a validation method" by Luna, A., et al. (2011). This paper discusses the development of a formal implementation of the Molecular Interaction Map (MIM) notation based on a core set of previously defined glyphs. This implementation provides a detailed specification of the properties of the elements of the MIM notation. Building upon this specification, a machine-readable format is provided as a standardized mechanism for the storage and exchange of MIM diagrams. This new format is accompanied by a Java-based application programming interface to help software developers to integrate MIM support into software projects. A validation mechanism is also provided to determine whether MIM datasets are in accordance with syntax rules provided by the new specification. [Link to the paper](http://dx.doi.org/10.1186/1471-2105-12-167).

Please note that these resources provide a starting point for learning XML processing in different programming languages. Depending on your specific use case and the complexity of the XML data you are working with, you might need to explore more advanced topics and tools.
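As a starting point on the generating side, here is a minimal sketch using Python's `xml.etree.ElementTree` module mentioned above. It builds the `note` document from earlier in this article, serializes it, and parses it back.

```python
import xml.etree.ElementTree as ET

# Build the tree element by element.
note = ET.Element("note")
for tag, text in [("to", "Tove"), ("from", "Jani"), ("heading", "Reminder")]:
    ET.SubElement(note, tag).text = text

# encoding="unicode" returns a str rather than bytes.
xml_text = ET.tostring(note, encoding="unicode")
print(xml_text)

# Parsing the generated text gives back the same structure.
parsed = ET.fromstring(xml_text)
print([child.tag for child in parsed])
```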

XML Schema

XML Schema, often referred to as XML Schema Definition (XSD), is a language used to describe the structure of an XML document. It defines the legal building blocks of an XML document, including the elements and attributes that can appear, the number and order of child elements, the data types for elements and attributes, and the default and fixed values for elements and attributes.

Here is an example of an XML Schema:

    <?xml version="1.0"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="note">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="to" type="xs:string"/>
            <xs:element name="from" type="xs:string"/>
            <xs:element name="heading" type="xs:string"/>
            <xs:element name="body" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

XML Schema is widely used because it supports data types, making it easier to describe allowable document content, validate the correctness of data, define data facets (restrictions on data), define data patterns (data formats), and convert data between different data types.

Moreover, XML Schemas are written in XML, which means you don't need to learn a new language to use them. You can use an XML editor to edit the Schema files, an XML parser to parse the Schema files, and the Schemas can be manipulated with the XML DOM. They can also be transformed with XSLT.
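Because a schema is itself an XML document, an ordinary parser can read it. The sketch below loads the `note` schema with Python's standard `xml.etree.ElementTree` and lists every element declaration it contains. This only inspects the schema; it does not validate anything.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="heading" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = ET.fromstring(XSD)

# Collect (name, type) for every xs:element declaration, in document order.
declared = [
    (decl.get("name"), decl.get("type"))
    for decl in schema.iter(f"{{{XS}}}element")
]
for name, type_name in declared:
    print(name, type_name)
```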

XML Schemas are extensible, meaning your Schema can be reused in other Schemas, you can create your own data types derived from the standard types, and multiple schemas can be referenced in the same document.

XML Schemas also make data communication more reliable. When data is sent from a sender to a receiver, both parties must have the same "expectations" about the content, and a schema lets the sender describe the data in a way the receiver will understand.

For example, the date "06-08-2020" could be interpreted either as 6 August or as 8 June. However, an XML element with a data type like this: `<date type="date">2020-08-06</date>` ensures a mutual understanding of the content because the XML data type "date" requires the format "YYYY-MM-DD".
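The effect of a fixed lexical format can be sketched in a few lines of Python. The function name `parse_xs_date` is made up, and the check is a simplification: it accepts only the plain "YYYY-MM-DD" form and ignores the optional time zone suffix that the XML Schema date type also allows.

```python
import re
from datetime import date

# Simplified lexical form: four-digit year, two-digit month, two-digit day.
XS_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_xs_date(text: str) -> date:
    """Parse a YYYY-MM-DD string, rejecting any other layout (hypothetical helper)."""
    if not XS_DATE.fullmatch(text):
        raise ValueError(f"not in YYYY-MM-DD format: {text!r}")
    return date.fromisoformat(text)


print(parse_xs_date("2020-08-06"))

try:
    parse_xs_date("06-08-2020")
except ValueError as error:
    print("rejected:", error)
```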

Finally, it's important to note that a well-formed XML document (one that conforms to the XML syntax rules) can still contain errors. XML Schemas can help catch most of these errors, ensuring that an XML document validated against an XML Schema is both "Well Formed" and "Valid" ([source](https://www.w3schools.blog/xsd-xml-schema-definition-tutorial/)).

XML Schema Definition (XSD):

XML Schema Definition, also known as XSD, is a language that describes the structure of an XML document. It defines the elements and attributes that can appear in a document, the number and order of child elements, data types for elements and attributes, and default and fixed values for elements and attributes. XML Schemas are written in XML, so you don't have to learn a new language to use them. They are also extensible, allowing you to reuse your Schema in other Schemas, create your own data types derived from the standard types, and reference multiple schemas in the same document. XML Schemas are particularly useful for making data communication reliable, as they allow the sender to describe the data in a way that the receiver will understand. This is especially important for values that can be interpreted differently in different contexts, such as dates. [Source: W3Schools](https://www.w3schools.com/xml/schema_intro.asp)

XML Schema Elements:

XML Schema elements are the building blocks of an XML document. They are declared in the XML Schema and can have complex types, which combine child elements and attributes. For example, in the provided XSD example, the "note" element has a complex type that includes a sequence of four elements: "to", "from", "heading", and "body". Each of these elements is defined as a string data type. [Source: W3Schools](https://www.w3schools.com/xml/schema_intro.asp)

XML Schema Attributes:

XML Schema attributes provide additional information about XML elements. They are declared in the XML Schema, where each declaration can specify the attribute's data type, a default value, or a fixed value. Attributes are typically used to provide information that is not part of the main data content, such as identifiers, names, or characteristics of elements. [Source: W3Schools](https://www.w3schools.com/xml/schema_intro.asp)

XML Schema Structures:

XML Schema structures define the organization of elements and attributes in an XML document. They include sequences, choices, and all groups. A sequence is a group of elements that must appear in a specific order. A choice is a group of elements where only one can appear. An all group is a group of elements where all elements must appear, but the order is not important. These structures allow for more complex and flexible XML document designs. [Source: W3Schools](https://www.w3schools.com/xml/schema_intro.asp)
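The three groupings can be pictured as three different checks on a list of child element names. The functions below are a hand-rolled illustration, not how a real validator is implemented, and they assume every element occurs exactly once (the default when no occurrence limits are given).

```python
def check_sequence(children, expected):
    """Sequence: the same elements in exactly the declared order."""
    return children == expected


def check_choice(children, options):
    """Choice: exactly one of the listed alternatives."""
    return len(children) == 1 and children[0] in options


def check_all(children, expected):
    """All: every listed element once, in any order."""
    return sorted(children) == sorted(expected)


declared = ["to", "from", "heading", "body"]

print(check_sequence(["to", "from", "heading", "body"], declared))
print(check_sequence(["from", "to", "heading", "body"], declared))
print(check_all(["from", "to", "body", "heading"], declared))
print(check_choice(["to"], declared))
```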

Built-in datatypes in XML Schema include string, decimal, integer, boolean, date, and time, among others. You can also define your own simple types. For example, you might define a simple type that restricts strings to a certain pattern or restricts integers to a certain range.
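The two restrictions mentioned above, a pattern on strings and a range on integers, can be imitated in plain Python to show what they accept and reject. The function names, the three-letter code pattern, and the 0 to 120 range are all invented for this sketch. Python's regular expression syntax also differs from XML Schema's in places, so this is an approximation of the idea rather than of the exact rules.

```python
import re


def matches_pattern(value: str, pattern: str) -> bool:
    """Pattern restriction: the whole value must match, not just a part of it."""
    return re.fullmatch(pattern, value) is not None


def in_range(value: str, minimum: int, maximum: int) -> bool:
    """Range restriction: value must be an integer within inclusive bounds."""
    try:
        number = int(value)
    except ValueError:
        return False
    return minimum <= number <= maximum


# Hypothetical custom types: a three-letter uppercase code and an age.
print(matches_pattern("ABC", r"[A-Z]{3}"))
print(matches_pattern("ABCD", r"[A-Z]{3}"))
print(in_range("42", 0, 120))
print(in_range("150", 0, 120))
```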

Namespaces play a crucial role in XML Schema Definition (XSD) schemas, helping to avoid conflicts between element names and attribute names in XML documents. Here's a more detailed explanation based on this [source](https://www.brainbell.com/tutorials/XML/Namespaces_And_XSD_Schemas.htm):

The xsd Prefix:

The prefix "xsd" is commonly used with the XSD schema to reference elements and attributes that are used to construct schemas for your custom markup languages. For example, the namespace declaration for a schema document might look like this:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

This code shows how the prefix "xsd" is used to declare the XSD schema explicitly. While "xsd" has become somewhat of a standard among XML developers, it's important to note that the prefix could be named anything you want.

Referencing Schema Documents:

Namespaces also play an important role in documents that rely on an XSD schema for validation. To identify the physical schema document for a document, you must use a special attribute and assign the location of the schema document to it. There are two attributes you can use to accomplish this task:

- schemaLocation: Locates a schema and its associated namespace

- noNamespaceSchemaLocation: Locates a schema with no namespace

These attributes are standard attributes that are located in a namespace named http://www.w3.org/2001/XMLSchema-instance. In order to properly reference either of these attributes, you must first explicitly declare the namespace in which they are located. It is standard to use the "xsi" prefix for this namespace, as the following attribute assignment shows:

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

With this namespace declared, you can now use one of the schema location attributes to reference the physical schema document. Here's an example of how this task is carried out:

    <trainlog
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="etml.xsd">

In this example, the noNamespaceSchemaLocation attribute is used because there's no need to associate the schema with a namespace. If you wanted to associate it with a namespace, you would use the schemaLocation attribute instead:

    <trainlog
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.xyz.com/ns/etml etml.xsd">

In the schemaLocation attribute, two pieces of information are provided: the namespace for the schema and the location of the schema document. The schemaLocation attribute is useful whenever you are working with a schema and you want to associate it with a namespace.

To establish a prefix for the tags and attributes, you must declare the namespace, as shown in this code:

    <trainlog
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:etml="http://www.xyz.com/ns/etml"
      xsi:schemaLocation="http://www.xyz.com/ns/etml etml.xsd">

Now the prefix "etml" can be used to reference tags and attributes as part of the namespace, as in <etml:distance>.
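To show how a program might read these declarations, the sketch below parses the `trainlog` start tag with Python's standard `xml.etree.ElementTree` and splits the `schemaLocation` value into its namespace and location pairs. The document is closed off with an empty `etml:distance` child so that it parses; nothing here fetches or applies the schema.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
ETML = "http://www.xyz.com/ns/etml"

doc = ET.fromstring(
    "<trainlog"
    ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
    ' xmlns:etml="http://www.xyz.com/ns/etml"'
    ' xsi:schemaLocation="http://www.xyz.com/ns/etml etml.xsd">'
    "<etml:distance/>"
    "</trainlog>"
)

# The attribute lives in the xsi namespace, so it is looked up by {uri}name.
tokens = doc.get(f"{{{XSI}}}schemaLocation").split()

# The value is a whitespace-separated list of (namespace, location) pairs.
locations = dict(zip(tokens[0::2], tokens[1::2]))
print(locations)

# The etml prefix resolves to the same namespace URI.
distance = doc.find("etml:distance", {"etml": ETML})
print(distance.tag)
```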

This is a high-level overview, and there's a lot more to XML Schema namespaces than can be covered here. I recommend reading the source linked above for a more in-depth understanding.

Validating XML with XML Schema

XML Schema Definition (XSD) is used to validate XML documents to ensure they adhere to a specific structure and content. This process involves checking data types, the relationships between elements and their attributes, and constraints on values. It is a more advanced form of validation than Document Type Definition (DTD) because it checks the data types and value restrictions of the content, not just the document structure. This [source](https://www.section.io/engineering-education/validating-xml-using-xsd/) provides a comprehensive guide on how to validate XML using XSD.

Here are the key steps involved in validating an XML document using XSD:

1. XSD Declaration: The schema document opens with the XML declaration statement followed by the root `schema` element and its namespace details. The XML Schema namespace declaration is mandatory for any XSD document. The namespace ensures that the elements and data types used in the schema are unique on the web to avoid conflicts.

2. Validating the Outermost Element: The outermost element in the XML document is validated first. This involves handling the child elements under a complex type with a specific name.

3. Validating Child Elements: Each child element under the outermost element is validated based on the type specified under the type of the element.

4. Validating Sub-Child Elements: Each sub-child element is validated individually. This can involve checking data types, matching patterns, and checking for specific values.

5. Using an XML Validator: After the validations are set up, an XML validator can be used to check the XML document against the XSD. This can be done using online validators or by downloading XML validators locally to your text editors.

Remember, the validation performed by XSD not only parses the value, but also validates based on restrictions. This makes it a powerful tool for ensuring the integrity and correctness of XML documents.
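The outermost-element and child-element steps can be sketched with the standard library alone. The toy function below (the name `validate` and its error messages are invented) reads the `note` schema from earlier in this article and checks only two things: the root element name and the order of its children. It ignores data types, occurrence limits, and every other XSD feature, so it is an illustration of the idea rather than a substitute for a real validator.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="from" type="xs:string"/>
        <xs:element name="heading" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def validate(doc_text, xsd_text=XSD):
    """Return a list of error messages; an empty list means the toy checks passed."""
    schema = ET.fromstring(xsd_text)
    doc = ET.fromstring(doc_text)  # raises ParseError if not well-formed

    # Outermost element: compare against the top-level declaration.
    decl = schema.find(f"{XS}element")
    if doc.tag != decl.get("name"):
        return [f"root is <{doc.tag}>, expected <{decl.get('name')}>"]

    # Child elements: compare against the declared sequence.
    path = f"{XS}complexType/{XS}sequence/{XS}element"
    expected = [child.get("name") for child in decl.findall(path)]
    actual = [child.tag for child in doc]
    if actual != expected:
        return [f"children {actual} do not match sequence {expected}"]
    return []


valid = (
    "<note><to>Tove</to><from>Jani</from>"
    "<heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body></note>"
)
swapped = (
    "<note><from>Jani</from><to>Tove</to>"
    "<heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body></note>"
)

print(validate(valid))
print(validate(swapped))
```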

Thanks for reading till the very end.

Follow me on Twitter, LinkedIn and GitHub for more amazing blogs about Tech and More!
