Sequential XML Processing With The XMLReader Object (part 1)

Learn to process an XML document with the XMLReader object.

Different Strokes

If you've been paying attention over the last few weeks, you already know a little bit about parsing XML on the .NET platform. And if you haven't been, it's about time you did, because XML is already starting to alter the way data is classified and used on the Internet.

Now, a few weeks ago, I introduced the XmlDocument object, one of the important classes available in the .NET Framework for XML manipulation. This object is designed to read an XML file, build a tree to represent the structures found within it, and offer object methods and properties to manipulate them.

But hey, there's always another way to skin a cat - and to prove it, this time around, I'll be examining a new offering for .NET/XML developers. It's called the XmlReader, and it provides an alternative pull model of dealing with XML data. So pay attention - this is cutting-edge stuff, and it's only gonna get more interesting...

Push And Pull

If you're at all familiar with XML programming, you'll be aware that there are two basic approaches to parsing an XML document. The Simple API for XML (SAX) is one; it parses an XML document in a sequential manner, generating and throwing events for the application layer to process as it encounters different XML elements. This sequential approach enables rapid parsing of XML data, especially in the case of long or complex XML documents; however, the downside is that a SAX parser cannot be used to access XML document nodes in a random or non-sequential manner.

Next, we have the Document Object Model (DOM). This alternative approach involves building a tree representation of the XML document in memory, and then using built-in methods to navigate through this tree. Once a particular node has been reached, built-in properties can be used to obtain the value of the node, and use it within the script. This tree-based paradigm does away with the problems inherent in SAX's sequential approach, allowing for immediate random access to any node or collection of nodes in the tree.
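As a quick refresher on the tree-based approach, here's a minimal console sketch (not one of the article's ASP.NET listings) that loads a document into an XmlDocument and jumps straight to a single node with an XPath query. The class name DomLookup, the method TitleOf() and the inlined sample XML are my own assumptions, used purely for illustration.

```csharp
using System;
using System.Xml;

public static class DomLookup
{
    // Trimmed sample data, inlined so the sketch runs without a web server (assumption).
    public const string SampleXml =
        "<library>" +
        "<book id=\"MFRE001\"><title>XML and PHP</title></book>" +
        "<book id=\"MFRE002\"><title>MySQL - The Complete Reference</title></book>" +
        "</library>";

    // Returns the title of the book with the given ID, or an empty string if there is none.
    public static string TitleOf(string xml, string bookId)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml); // the whole tree is built in memory at this point

        // random access: go directly to the node of interest with an XPath query
        var node = doc.SelectSingleNode("/library/book[@id='" + bookId + "']/title");
        return node == null ? "" : node.InnerText;
    }

    public static void Main()
    {
        Console.WriteLine(TitleOf(SampleXml, "MFRE002"));
    }
}
```

The point to notice is that the second book is reached without visiting the first one in code - but only after LoadXml() has parsed and stored the entire document.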

Now, I've already shown you how to use the DOM approach to parsing XML with .NET's XmlDocument object. However, while the DOM does offer seamless access to your XML data, it comes at the cost of performance, because the entire document must be read into memory before you can touch any of it. This is especially noticeable if your application has to deal with large XML files. This trade-off between performance and ease of use is one of the more knotty problems developers had to face when designing an XML application.

Notice I said "had". Microsoft has a possible solution, one that incorporates the best of both worlds. They call it the "pull model" and, according to their documentation at http://support.microsoft.com/default.aspx?scid=KB;EN-US;Q313816&, it's designed to provide "forward-only, read-only, noncached access to XML data". This means that you can now read an XML document in a sequential but selective manner and thereby control the process of parsing. This is an interesting variant of the SAX model, which is non-selective in nature - there, the parser will notify the client about each and every item that it encounters in the XML stream. This is analogous to a customer in a restaurant ordering from the menu, as opposed to the waiter stuffing every item on it down the customer's throat.

Class Act

The XmlReader abstract class plays a very important role in implementing the new "pull model". As part of the System.Xml namespace, the primary objective of this class is to provide developers with a framework to implement this new model. If you're an adventurous developer, you can use this abstract class as the basis for your very own, custom-crafted XmlReader object. Or you could do what I did: take the easy way out and utilize any one of the built-in classes that already do this for you.

The .NET framework provides three such built-in classes:

  1. The plain-vanilla XmlTextReader class behaves as a "forward-only, noncached reader" to read XML data. It's versatile enough to allow you to access XML from different input sources, including flat files, data streams, or URLs.

  2. The XmlTextReader has one little drawback: it doesn't allow you to validate the data present in the XML source. If you are looking for a foolproof way to maintain the sanctity of your data, you are better off using the XmlValidatingReader class. This is the only class in this category that comes with built-in features to validate your XML data against external DTDs, XDR or XSD schemas.

  3. In case you're looking to implement the "pull model" on a DOM tree that's already present in memory, you can consider using the XmlNodeReader class. Best-suited only for the very specialized application mentioned above, this class allows you to read the data from specific nodes of the tree and enjoy a double benefit: the speed associated with the XmlReader class and the ease of use of the DOM.
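To give the third option some shape, here's a small console sketch, under my own assumptions, of an XmlNodeReader pulling nodes out of an XmlDocument that is already in memory. The class name NodeReaderSketch, the method ElementNames() and the inlined sample XML are hypothetical; the XmlNodeReader calls themselves are the standard System.Xml ones.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class NodeReaderSketch
{
    // Trimmed sample data, inlined for illustration (assumption).
    public const string SampleXml =
        "<library>" +
        "<book id=\"MFRE001\"><title>XML and PHP</title></book>" +
        "<book id=\"MFRE002\"><title>MySQL - The Complete Reference</title></book>" +
        "</library>";

    // Walks an in-memory DOM tree with the pull model and
    // returns the element names in document order.
    public static List<string> ElementNames(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        List<string> names = new List<string>();

        // the reader is attached to the existing tree, not to a file or stream
        XmlNodeReader reader = new XmlNodeReader(doc);
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                names.Add(reader.Name);
            }
        }
        reader.Close();
        return names;
    }

    public static void Main()
    {
        // prints: library, book, title, book, title
        Console.WriteLine(string.Join(", ", ElementNames(SampleXml)));
    }
}
```

The Read() loop here is the same one you would write against an XmlTextReader; only the data source differs.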

Now that you know the theory, how about seeing it work in the real world?

Visiting The Library

I'll begin with a simple example - using an XmlTextReader to parse a static XML file. Here's the XML file, a list of books present in our technical library:

<?xml version='1.0'?>
<library>
    <book id="MFRE001">
        <title>XML and PHP</title>
        <author>Vikram Vaswani</author>
        <description>Learn to manage your XML data with PHP</description>
        <price currency="USD">24.95</price>
    </book>
    <book id="MFRE002">
        <title>MySQL - The Complete Reference</title>
        <author>Vikram Vaswani</author>
        <description>Learn everything about this open source database</description>
        <price currency="USD">45.95</price>
    </book>
</library>

And now for the ASP.NET code that will allow us to parse this XML file using the XmlTextReader object:

<%@ Page Language="C#"%>
<%@ import  namespace="System.Xml"%>
<html>
<head>
<script runat="server">
void Page_Load()
{
    // location of XML file
    string strXmlFile = "http://localhost/xmlpull/library.xml";

    // create an instance of the XmlTextReader object
    XmlTextReader objXmlRdr = new XmlTextReader(strXmlFile);

    // ignore whitespace in the XML file
    objXmlRdr.WhitespaceHandling=WhitespaceHandling.None;

    String strSpaces;

    while(objXmlRdr.Read()) {

        // only process the elements, ignore everything else
        if(objXmlRdr.NodeType==XmlNodeType.Element) {

            // reset the variable for a new node
            strSpaces = "";

            for(int count = 1; count <= objXmlRdr.Depth; count++) {
                strSpaces += "===";
            }

            output.Text += strSpaces + "=> " + objXmlRdr.Name + "<br/>";
        }
    }

    // close the object and free up memory
    objXmlRdr.Close();
}
</script>
</head>
<body>
<asp:label id="output" runat="server" />
</body>
</html>

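Before I get into the nitty-gritty of the code, here's what you should see when you run this script:

=> library
====> book
=======> title
=======> author
=======> description
=======> price
====> book
=======> title
=======> author
=======> description
=======> price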

Let's look at the code in detail:

  1. The first step is to import all the classes required to execute the application - the .NET libraries for the XML parser, which are part of the System.Xml namespace.
<%@ import  namespace="System.Xml"%>
  2. Next, within the Page_Load() function, I have defined some variables and objects. The first is a string variable to store the location of the XML file, and the second is a local instance of the XmlTextReader object. Finally, in order to tell the parser to ignore the whitespace present in the XML file, I set the "WhitespaceHandling" property of the XmlTextReader object to "None", as shown below:
<%
    // location of XML file
    string strXmlFile = "http://localhost/xmlpull/library.xml";

    // create an instance of the XmlTextReader object
    XmlTextReader objXmlRdr = new XmlTextReader(strXmlFile);

    // ignore whitespace in the XML file
    objXmlRdr.WhitespaceHandling=WhitespaceHandling.None;
%>
  3. The next step is to read the XML file - a simple matter, since the object provides a Read() method for just this purpose. This method returns true each time it successfully moves to the next node in the XML file, and false once there are no more nodes to read. This makes it easy to process an entire file, simply by wrapping the method call in a "while" loop.
<%
    while(objXmlRdr.Read()) {
        // process the XML data
    }
%>
  4. Of course, it doesn't make sense to read the entire file and not do anything with it. That's why, within the "while" loop, I've added the code to process element nodes and format them for display.
<%
    while(objXmlRdr.Read()) {

        // only process the elements
        if(objXmlRdr.NodeType==XmlNodeType.Element) {

            // reset the variable for a new node
            strSpaces = "";

            for(int count = 1; count <= objXmlRdr.Depth; count++) {
                strSpaces += "===";
            }

            output.Text += strSpaces + "=> " + objXmlRdr.Name + "<br/>";
        }
    }
%>

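The "NodeType" property of the current node can be used to filter out the elements for further processing. Note that if I hadn't included this condition at the beginning of the loop, the output would also contain entries for nodes that are not elements: the text inside each element, the closing tags, and the XML declaration (which looks like a processing instruction, but is reported by the reader with its own node type, XmlDeclaration):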

<?xml version='1.0'?>

Don't take my word for it - change the code and see for yourself!

The rest of the code in the "while" loop ensures that the output is formatted properly for display in the browser. Pay special attention to my use of the very cool "Depth" property, which holds an integer value specifying the depth of the current node in the tree hierarchy. Simply put, the element <library> is at depth 0, the element <book> is at depth 1, and so on.
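If you'd rather see the effect of that filter without setting up a Web page, the following console sketch lists every node the reader reports, together with its depth and type. It's an illustration under my own assumptions: the class name NodeDump, the method Dump() and the shortened inline XML are hypothetical, and the document is fed to the XmlTextReader through a StringReader instead of a URL.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class NodeDump
{
    // Shortened version of the library file, inlined for illustration (assumption).
    public const string SampleXml =
        "<?xml version='1.0'?>\n" +
        "<library>\n" +
        "    <book id=\"MFRE001\">\n" +
        "        <title>XML and PHP</title>\n" +
        "    </book>\n" +
        "</library>";

    // Returns one line per node, in the form "depth type [name]".
    public static List<string> Dump(string xml)
    {
        List<string> lines = new List<string>();

        XmlTextReader reader = new XmlTextReader(new StringReader(xml));

        // ignore whitespace, exactly as in the ASP.NET listing
        reader.WhitespaceHandling = WhitespaceHandling.None;

        // no NodeType filter here: every node the reader reports is recorded
        while (reader.Read())
        {
            lines.Add(reader.Depth + " " + reader.NodeType + " [" + reader.Name + "]");
        }

        reader.Close();
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Dump(SampleXml))
        {
            Console.WriteLine(line);
        }
    }
}
```

For this sample, the list starts with an XmlDeclaration node named "xml" at depth 0 and includes a Text node with an empty name at depth 3, one level below <title>. Those are exactly the entries that the "NodeType" check keeps out of the browser output.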

Digging Deeper

So that takes care of handling elements - but what about the attributes contained within each element? Take a look at this second example, which demonstrates how to process attributes using the XmlTextReader class:

<%@ Page Language="C#"%>
<%@ import  namespace="System.Xml"%>
<html>
<head>
<script runat="server">
void Page_Load()  {

    string strXmlFile = "http://localhost/xmlpull/library.xml";

    // create an instance of the XmlTextReader object
    XmlTextReader objXmlRdr = new XmlTextReader(strXmlFile);

    // ignore whitespace in the XML file
    objXmlRdr.WhitespaceHandling=WhitespaceHandling.None;

    String strSpaces;

    while(objXmlRdr.Read()) {

        // only process the elements
        if(objXmlRdr.NodeType==XmlNodeType.Element) {

            // reset the variable for a new node
            strSpaces = "";

            for(int count = 1; count <= objXmlRdr.Depth; count++) {
                strSpaces += "===";
            }

            output.Text += strSpaces + "=> " + objXmlRdr.Name;

            // check if the element has any attributes
            if(objXmlRdr.HasAttributes)  {

                output.Text += " [";
                for(int innercount = 0; innercount < objXmlRdr.AttributeCount; innercount++) {

                    // move to the attribute at this index and read its name
                    objXmlRdr.MoveToAttribute(innercount);
                    output.Text += objXmlRdr.Name;
                }

                output.Text += "]";

                // instruct the parser to go back to the element
                objXmlRdr.MoveToElement();
            }

            output.Text += "<br/>";
        }
    }

    // close the object and free up memory
    objXmlRdr.Close();
}
</script>
</head>
<body>
<asp:label id="output" runat="server" />
</body>
</html>

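Here's the output:

=> library
====> book [id]
=======> title
=======> author
=======> description
=======> price [currency]
====> book [id]
=======> title
=======> author
=======> description
=======> price [currency]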

As you can see, there is only one major change to the original code listing - handling attributes for each element that the reader encounters in the XML file:

<%
    // check if the element has any attributes
    if(objXmlRdr.HasAttributes)  {

        output.Text += " [";
        for(int innercount = 0; innercount < objXmlRdr.AttributeCount; innercount++) {

            // read the current attribute
            objXmlRdr.MoveToAttribute(innercount);
            output.Text += objXmlRdr.Name;
        }

        output.Text += "]";

        // instruct the parser to go back to the element
        objXmlRdr.MoveToElement();
    }

%>

The above code snippet makes for interesting reading. It begins with a check for attributes in the current node using the "HasAttributes" property (this property is true if the current node has at least one attribute). The XmlTextReader object's "AttributeCount" property holds the total number of attributes and is useful for looping through the collection of attributes. The MoveToAttribute() method, when passed an index, positions the reader at the attribute with that index, and the "Name" property is then used to get the name of the attribute. Once iteration through the attributes of the current node is complete, the MoveToElement() method moves the reader back to the element that owns the attributes, and the next call to Read() proceeds to the next node (if it exists).
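Index-based iteration isn't the only way to walk the attributes. As a variation (my own sketch, not the listing above), the reader's MoveToFirstAttribute() and MoveToNextAttribute() methods can drive the loop instead, and the "Value" property returns each attribute's value alongside its name. The class name AttributeWalk, the method Describe() and the sample XML are assumptions; note in particular that the "lang" attribute is hypothetical and does not exist in the library file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AttributeWalk
{
    // Sample data for illustration; the "lang" attribute is hypothetical (assumption).
    public const string SampleXml =
        "<library>" +
        "<book id=\"MFRE001\" lang=\"en\">" +
        "<price currency=\"USD\">24.95</price>" +
        "</book>" +
        "</library>";

    // Returns one line per element, in the form "name [attr=value, attr=value]".
    public static List<string> Describe(string xml)
    {
        List<string> lines = new List<string>();
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));

        while (reader.Read())
        {
            // only process the elements
            if (reader.NodeType != XmlNodeType.Element)
            {
                continue;
            }

            string elementName = reader.Name;
            List<string> pairs = new List<string>();

            // MoveToFirstAttribute() returns false if the element has no attributes
            if (reader.MoveToFirstAttribute())
            {
                do
                {
                    pairs.Add(reader.Name + "=" + reader.Value);
                }
                while (reader.MoveToNextAttribute());

                // go back to the element that owns the attributes
                reader.MoveToElement();
            }

            lines.Add(elementName + " [" + string.Join(", ", pairs) + "]");
        }

        reader.Close();
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe(SampleXml))
        {
            Console.WriteLine(line);
        }
    }
}
```

Which style you prefer is largely a matter of taste; the index-based loop is handy when you need the position of an attribute, while this version avoids the counter altogether.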

Into The Real World

Now, if you're a developer, I'm sure the previous two examples would have raised your eyebrows a bit. The reason is simple: the examples I've shown you thus far have only studied the information structures in the XML file, completely ignoring the data contained within each attribute and element. In the real world, you're usually as concerned about the data within each element as about the element and attribute names.

That's where this next example comes in - it completes the circle, showing you how to process the data stored within each attribute and element. Take a look:

<%@ Page Language="C#" Debug="true"%>
<%@ Import namespace="System.Xml"%>
<html>
<head>
<script runat="server">
void Page_Load()
{

    // variable to store Book ID
    string strBookId = "";

    // variable to store the location of the XML file
    string strXmlRdr = "http://localhost/xmlpull/library.xml";

    output.Text="<B>List of Books</B>";

    // create an instance of the XmlTextReader object
    XmlTextReader objXmlRdr = new XmlTextReader(strXmlRdr);

    objXmlRdr.WhitespaceHandling=WhitespaceHandling.None;
    output.Text += "<ul>";

    while(objXmlRdr.Read()) {

        if(objXmlRdr.NodeType==XmlNodeType.Element) {

            if(objXmlRdr.Name == "book") {
                strBookId = objXmlRdr.GetAttribute("id");
            }

            if(objXmlRdr.Name=="title") {
                output.Text += "<li>" + objXmlRdr.ReadString();
                output.Text += "<ul>";
                output.Text += "<li>ID - " + strBookId + "</li>";
            }

            if(objXmlRdr.Name=="author") {
                output.Text += "<li>Author - " + objXmlRdr.ReadString() + "</li>";
            }

            if(objXmlRdr.Name=="description") {
                output.Text += "<li>Description - " + objXmlRdr.ReadString() + "</li>";
            }

            if(objXmlRdr.Name=="price") {
                output.Text += "<li>Price - " + objXmlRdr.GetAttribute("currency") + " " + objXmlRdr.ReadString() + "</li>";
            }
        } else if(objXmlRdr.NodeType==XmlNodeType.EndElement) {

            if(objXmlRdr.Name == "book" ) {
                output.Text += "</ul>";
                output.Text += "</li>";
                strBookId = ""; // reset the Book Id variable
            }
        }
    }

    output.Text += "</ul>";

    // close the object and free up memory
    objXmlRdr.Close();
}
</script>
</head>
<body>
<asp:label id="output" runat="server"/>
</body>
</html>

Load this example in the browser to see the list of books on the shelves of the library:

I'll begin by drawing your attention to the definition of a variable right at the beginning of the script:

<%

// variable to store Book ID
string strBookId = "";

%>

This variable will be used further down in the script to store the ID of the book.

Now, the process of reading the XML file starts with the Read() method of the XmlTextReader object. This next code snippet does the dirty work of processing the data that is read by the object.

<%

if(objXmlRdr.NodeType==XmlNodeType.Element) {

    if(objXmlRdr.Name == "book") {
        strBookId = objXmlRdr.GetAttribute("id");
    }

    if(objXmlRdr.Name=="title") {
        output.Text += "<li>" + objXmlRdr.ReadString();
        output.Text += "<ul>";
        output.Text += "<li>ID - " + strBookId + "</li>";
    }

    if(objXmlRdr.Name=="author") {
        output.Text += "<li>Author - " + objXmlRdr.ReadString() + "</li>";
    }

    if(objXmlRdr.Name=="description") {
        output.Text += "<li>Description - " + objXmlRdr.ReadString() + "</li>";
    }

    if(objXmlRdr.Name=="price") {
        output.Text += "<li>Price - " + objXmlRdr.GetAttribute("currency") + " " + objXmlRdr.ReadString() + "</li>";
    }

} else if(objXmlRdr.NodeType==XmlNodeType.EndElement) {

    if(objXmlRdr.Name == "book" ) {
        output.Text += "</ul>";
        output.Text += "</li>";
        strBookId = ""; // reset the Book Id variable
    }
}

%>

It all starts with a check to see if the current node is an element. As seen in the first example, this test returns true when the reader encounters the starting tag of an element in the XML file. Once this is confirmed, the script checks the name of each element so that it can be processed appropriately. Note that you can also use the IsStartElement() method of the XmlTextReader object to check whether the current node is a start tag (or an empty element tag).
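Here's a short console sketch of that IsStartElement() alternative, assuming you only care about the book titles. The class name StartElementSketch, the method Titles() and the inlined sample XML are hypothetical. One behaviour worth knowing: IsStartElement() first calls MoveToContent(), so if the reader is sitting on a non-content node such as the XML declaration, it skips ahead to the next content node before testing it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StartElementSketch
{
    // Trimmed sample data, inlined for illustration (assumption).
    public const string SampleXml =
        "<?xml version='1.0'?>" +
        "<library>" +
        "<book id=\"MFRE001\"><title>XML and PHP</title><author>Vikram Vaswani</author></book>" +
        "<book id=\"MFRE002\"><title>MySQL - The Complete Reference</title></book>" +
        "</library>";

    // Collects the text of every <title> element in the document.
    public static List<string> Titles(string xml)
    {
        List<string> titles = new List<string>();

        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.WhitespaceHandling = WhitespaceHandling.None;

        while (reader.Read())
        {
            // true only when the current content node is a start tag named "title"
            if (reader.IsStartElement("title"))
            {
                // ReadString() returns the element's text and leaves
                // the reader on the closing </title> tag
                titles.Add(reader.ReadString());
            }
        }

        reader.Close();
        return titles;
    }

    public static void Main()
    {
        foreach (string title in Titles(SampleXml))
        {
            Console.WriteLine(title);
        }
    }
}
```

This folds the node-type test and the name comparison into a single call, which keeps the loop body short when you're after just one kind of element.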

Element processing starts with the <book> element. Since I need the book ID, I've used the shortcut GetAttribute() method of the XmlTextReader object to fetch the value stored in the "id" attribute. If you know which attribute you want, this is a convenient way to avoid having to unnecessarily iterate through the collection of attributes, as demonstrated earlier. The ID retrieved is stored in the "strBookId" variable created earlier.

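On subsequent passes through the loop, the script will encounter the other details associated with a particular book - its title, author, description and price. For each of these elements, the ReadString() method can be used to retrieve the text stored in the corresponding element. The currency, being an attribute of the <price> element, is fetched with GetAttribute() before ReadString() is called, because ReadString() moves the reader off the element's starting tag.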

Once a particular book has been dealt with, the "strBookId" variable must be reset for the next book in the library. A good place to do this is when the reader encounters the closing element. How do you know when this happens? It's simple - just check whether the "NodeType" property of the current node is XmlNodeType.EndElement and whether its name is "book", and Bob's your uncle!
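The same start-tag/end-tag bookkeeping works just as well when you want objects instead of HTML. The console sketch below is my own variation on the listing, not part of it: it starts a new record at each opening <book> tag and stores the finished record at the matching closing tag. The Book and LibraryLoader classes, the Load() method and the inlined XML are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Simple record for one book (hypothetical helper class).
public class Book
{
    public string Id = "";
    public string Title = "";
    public string Currency = "";
    public string Price = "";
}

public static class LibraryLoader
{
    // Trimmed copy of the library file, inlined for illustration (assumption).
    public const string SampleXml =
        "<?xml version='1.0'?>" +
        "<library>" +
        "<book id=\"MFRE001\"><title>XML and PHP</title>" +
        "<price currency=\"USD\">24.95</price></book>" +
        "<book id=\"MFRE002\"><title>MySQL - The Complete Reference</title>" +
        "<price currency=\"USD\">45.95</price></book>" +
        "</library>";

    public static List<Book> Load(string xml)
    {
        List<Book> books = new List<Book>();
        Book current = new Book();

        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.WhitespaceHandling = WhitespaceHandling.None;

        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                if (reader.Name == "book")
                {
                    // opening tag: start a fresh record and grab the ID
                    current = new Book();
                    current.Id = reader.GetAttribute("id") ?? "";
                }
                else if (reader.Name == "title")
                {
                    current.Title = reader.ReadString();
                }
                else if (reader.Name == "price")
                {
                    // read the attribute first; ReadString() moves past the start tag
                    current.Currency = reader.GetAttribute("currency") ?? "";
                    current.Price = reader.ReadString();
                }
            }
            else if (reader.NodeType == XmlNodeType.EndElement && reader.Name == "book")
            {
                // closing tag: the record is complete
                books.Add(current);
            }
        }

        reader.Close();
        return books;
    }

    public static void Main()
    {
        foreach (Book b in Load(SampleXml))
        {
            Console.WriteLine(b.Id + ": " + b.Title + " (" + b.Currency + " " + b.Price + ")");
        }
    }
}
```

Because a new Book is created at every opening tag, there is no separate reset step to forget; the closing tag only has to store the result.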

As you can see, once you know the basics of reading an XML file with the XmlReader, it's very easy to begin using its built-in constructs to extract and manipulate XML data to your precise needs. As an exercise to better understand how this works, I recommend taking your own XML markup and writing a similar script to extract element and attribute values from it. After all, practice makes perfect!

And that's about it for the first part of this tutorial. Over the last few pages, I introduced you to the XmlReader class, which offers developers an alternative way of processing an XML file or stream. Unlike the DOM, the XmlReader class offers developers a framework for sequential reading, making it possible to create faster, more streamlined XML applications.

At the beginning of this tutorial, I told you that the .NET Framework came with three important classes derived from the XmlReader abstract class. Over the course of the last few pages, I introduced you to the first and most-used of these, the XmlTextReader class, and showed you how to use it to process elements, attributes and the data within them. The second part of this article will deal more thoroughly with the remaining two classes, showing you how to validate an XML document against a DTD or XML Schema before processing it, and explaining how to handle errors in an XML document. Make sure you come back for that... and until then, be good!

Note: Examples are illustrative only, and are not meant for a production environment. Melonfire provides no warranties or support for the source code described in this article. YMMV!

This article was first published on 16 Jan 2004.