XML Parsing With SAX and Xerces (part 1)

Find out how to parse XML in Java.

Wedding Bells

If you've been paying attention over the last few weeks, you'll already know a little bit about XML, and how it hopes to alter the way data is classified and used on the Web. By marking up data fragments with HTML-like tags and attributes, XML provides the content author with an efficient and simple method of describing data...and the Web developer with a powerful new weapon to add to his or her arsenal.

Now, XML data is physically stored in text files, as plain text (UTF-8 by default, of which ASCII is a subset). As a format, this is as close to universal as you can get - every computer system on planet Earth can read and process plain text, making XML extremely portable between platforms and systems. Tie this in with that other platform-independent language, Java, and you have a marriage made in cross-platform heaven.

Over the course of this two-part article, I'll be examining the union of Java and XML, illustrating how the two technologies can be combined to easily parse XML data and convert it into browser-friendly HTML. My tool in this endeavour will be the Xerces XML parser, a validating Java-based parser which supports the XML 1.0, DOM Level 2, SAX 1.0 and 2.0 and XML Schema standards. Highly configurable, and with a rich feature set, Xerces is a part of the Apache XML Project, and is designed to meet the twin standards of performance and compatibility when parsing XML documents.

I'll try and keep it simple - I'm going to use very simple XML sources, so you don't have to worry about namespaces, DTDs and PIs - although I will assume that you know the basic rules of XML markup, and of Java programming. So let's get this show on the road.

Playing The SAX

Now, you may not know this, but there are two basic approaches to parsing an XML document. The first of these approaches is SAX, the Simple API for XML, which works by iterating through an XML document and calling specific functions every time it encounters an XML structure. The parser's responsibility here is limited to simply reading the document and transferring control to the specified functions whenever it hits an XML construct; the functions called are responsible for actually processing the XML construct found, and the information embedded within it.

In case this doesn't sound all that appealing, there's also an alternative approach: construct a tree structure representing the XML data in memory and then traverse the branches of the tree to get to the fruit - the data - hanging on to them. This approach involves using the Document Object Model, and will be discussed in a later segment of this tutorial.

There are a couple of obvious advantages to using a Java-based parser to parse an XML document. First, Java code is compiled into bytecode and stored on the server; this speeds up access time, since the code is only compiled once (the first time it is accessed) with subsequent accesses being much faster than the equivalent CGI or PHP code. Then there's the portability issue, already touched upon in the previous page - Java code is cross-platform, which means that you can write an application once, then move it to any platform for which a Java virtual machine exists, and it will run as expected, with no additional tweaks or modifications required.

The Xerces Java Parser (version 1.4.4 is what I'll be using) supports the latest version of SAX, SAX 2.0, in addition to the earlier SAX 1.0 standard. It also includes support for XML Schema and the DOM Level 2 standard. Note, however, that since XML standards are constantly evolving, using Xerces can sometimes produce unexpected results; take a look at the documentation provided with the parser, and at the information available on its official Web site, for errata and bugs.

With the introductions out of the way, let's put together the tools you'll need to get started with Xerces. Here's a quick list of the software you'll need:

  1. The Java Development Kit (JDK), available from the Sun Microsystems Web site (http://java.sun.com)

  2. The Apache Web server, available from the Apache Software Foundation's Web site (http://httpd.apache.org)

  3. The Tomcat Application Server, available from the Jakarta Project's Web site (http://jakarta.apache.org)

  4. The Xerces parser, available from the Apache XML Project's Web site (http://xml.apache.org)

  5. The mod_jk extension for Apache-Tomcat communication, available from the Jakarta Project's Web site (http://jakarta.apache.org)

Installation instructions for all these packages are available in their respective source archives. In case you get stuck, you might want to look at http://www.devshed.com/Server_Side/Java/JSPDev, or at the Tomcat User Guide at http://jakarta.apache.org/tomcat/tomcat-3.3-doc/tomcat-ug.html

Reaching For The Nailgun

Let's begin with a simple XML file, which displays the marked-up inventory statement for a business selling equipment for Quake enthusiasts:

<?xml version="1.0"?>
<inventory>
    <item>
        <id>758</id>
        <name>Rusty, jagged nails for nailgun</name>
        <supplier>NailBarn, Inc.</supplier>
        <cost>2.99</cost>
        <quantity>10000</quantity>
    </item>
    <item>
        <id>6273</id>
        <name>Power pack for death ray</name>
        <supplier>QuakePower.domain.com</supplier>
        <cost>9.99</cost>
        <quantity>10</quantity>
    </item>
</inventory>

Now, we need a simple Java application that will initialize the SAX parser, read and parse the XML file, and fire the callback functions as it encounters tags.

import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.*;
import java.io.*;

// the ContentHandler interface handles all the callbacks
public class MyFirstSaxApp implements ContentHandler {

    // constructor
    public MyFirstSaxApp (String xmlFile) {

        //  create a Xerces SAX parser
        SAXParser parser = new SAXParser();

        //  set the content handler
        parser.setContentHandler(this);

        //  parse the document
        try {
            parser.parse(xmlFile);
        } catch (SAXException e) {
            System.err.println (e);
        } catch (IOException e) {
            System.err.println (e);
        }
    }

    //  call this when a start tag is found
    public void startElement (String uri, String local, String qName, Attributes atts)  {
        System.out.println ("Found element: " + local);
    }

    // the remaining callback handlers
    // they don't do anything right now...but keep reading!
    public void setDocumentLocator(Locator locator) {}
    public void startDocument() {}
    public void endDocument() {}
    public void characters(char[] text, int start, int length){}
    public void startPrefixMapping(String prefix, String uri) {}
    public void endPrefixMapping(String prefix) {}
    public void endElement(String namespaceURI, String localName, String qualifiedName) {}
    public void ignorableWhitespace(char[] text, int start, int length) throws SAXException {}
    public void processingInstruction(String target, String data){}
    public void skippedEntity(String name) {}

    // everything starts here
    public static void main (String[] args) {
        MyFirstSaxApp myFirstExample = new MyFirstSaxApp(args[0]);
    }
}

Sure, it looks a little intimidating - but fear not, all will be explained shortly. Before I get to that, though, it's instructive to see what the output of this looks like. So, how about we compile it and run it?

$ javac MyFirstSaxApp.java

Assuming that all goes well, you should now have a class file named "MyFirstSaxApp.class". Copy this class file to a directory in your Java CLASSPATH (the Xerces JAR file needs to be in the CLASSPATH as well), and then execute it, with the name of the XML file as argument.

$ java MyFirstSaxApp /home/me/sax/inventory.xml

And here's what you should see:

Found element: inventory
Found element: item
Found element: id
Found element: name
Found element: supplier
Found element: cost
Found element: quantity
Found element: item
Found element: id
Found element: name
Found element: supplier
Found element: cost
Found element: quantity

What's happening here? Every time the parser encounters a start tag within the XML document, it calls the startElement() function, which prints the name of the tag to the standard output device. The parser then moves on to the next construct within the document, calling the appropriate callback function to handle it. This process continues until the entire XML document has been processed.

Under The Microscope

Let's take a closer look at the code from the previous example:

import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.*;
import java.io.*;

Here, I've imported all the classes required to execute the application. First come the classes for the Xerces SAX parser, followed by other classes related to and required for SAX processing and the core Java classes for file I/O and error handling.

Along with the set of classes that define the parser, the SAX API also comes equipped with a set of useful interfaces. The one used here is the ContentHandler interface, which declares the callback functions needed for SAX processing.

public class MyFirstSaxApp implements ContentHandler {

    // code here

}

Next, a constructor is defined for the class (in case you don't know, a constructor is a method that is invoked automatically when you create an instance of the class).

    // constructor
    public MyFirstSaxApp (String xmlFile) {

        //  create a Xerces SAX parser
        SAXParser parser = new SAXParser();

        //  set the content handler
        parser.setContentHandler(this);

        //  parse the document
        try {
            parser.parse(xmlFile);
        } catch (SAXException e) {
            System.err.println (e);
        } catch (IOException e) {
            System.err.println (e);
        }
    }

Once an instance of the parser has been created, the content handler for the parser needs to be defined with the setContentHandler() method. Since the MyFirstSaxApp class itself implements the ContentHandler interface, the current object ("this") can be passed in directly.

Finally, the parse() method handles the actual parsing of the XML document - it accepts the file name as method argument. This method call is enclosed within a "try-catch" error handling block, in order to gracefully recover from errors. In this example, two types of errors have been accounted for: the SAXException error, which is raised when the SAX parser encounters a discrepancy in the XML document (for example, badly-nested tags), and the IOException error, which is raised when a file I/O error occurs.
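To see the SAXException case in action without needing a file on disk, here's a minimal sketch. Note the assumptions: it uses the JAXP SAXParserFactory bundled with the JDK instead of instantiating the Xerces SAXParser class directly (so that it runs without the Xerces JAR), it feeds the parser a string rather than a file name, and the class name WellFormedCheck and the method isWellFormed() are made up for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// hypothetical helper class, not part of the article's examples
public class WellFormedCheck {

    // returns true if the XML text parses cleanly, false if the parser raises a SAXException
    public static boolean isWellFormed(String xml) {
        try {
            // assumption: the JDK's built-in JAXP parser stands in for Xerces here
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // raised for problems in the document itself, such as badly-nested tags
            return false;
        } catch (IOException | ParserConfigurationException e) {
            // I/O or parser set-up trouble, unrelated to the document's markup
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<item><id>758</id></item>"));  // prints "true"
        System.out.println(isWellFormed("<item><id>758</item></id>"));  // prints "false"
    }
}
```

The second document closes its tags in the wrong order, so the parser gives up with a SAXException; the first is well-formed and passes through without complaint.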

That takes care of the main infrastructure code - but what about the callback functions themselves?

    //  call this when a start tag is found
    public void startElement (String uri, String local, String qName, Attributes atts)  {
        System.out.println ("Found element: " + local);
    }

In this case, I've only defined a callback for opening tags. This callback function must be named startElement(); it's invoked whenever the parser encounters an opening tag, and automatically receives the namespace URI, local name, qualified name and attributes of the element that triggers it. This data can then be processed and used in whatever manner you desire - over here, I'm simply printing the local name to the standard output device.
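The difference between the first three arguments only shows up once a document uses namespaces, which the inventory file doesn't. Here's a rough sketch that makes it visible. It assumes the JDK's built-in JAXP SAXParserFactory (with namespace processing switched on) rather than the Xerces SAXParser class; the class name StartElementArgs, the method describe() and the namespace URI http://example.com/inv are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// hypothetical demo class, not part of the article's examples
public class StartElementArgs {

    // parse the XML text and record what startElement() receives for each opening tag
    public static List<String> describe(String xml) {
        final List<String> seen = new ArrayList<String>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // JAXP factories are not namespace-aware unless asked
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        seen.add("uri=" + uri + " local=" + local + " qName=" + qName);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // the namespace URI below is a placeholder
        String xml = "<inv:inventory xmlns:inv=\"http://example.com/inv\"><item/></inv:inventory>";
        for (String line : describe(xml)) {
            System.out.println(line);
        }
        // prints:
        // uri=http://example.com/inv local=inventory qName=inv:inventory
        // uri= local=item qName=item
    }
}
```

For an element without a prefix or namespace, the local name and the qualified name are identical and the URI is an empty string, which is why printing the local name alone works fine for the inventory file.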

A number of other callbacks are also available - however, I've left them to their own devices here. These callbacks handle all the events that the SAX parser generates, providing a wrapper for processing XML documents, elements, character data, PIs and entities.

    // the remaining callback handlers
    // they don't do anything right now...but keep reading!
    public void setDocumentLocator(Locator locator) {}
    public void startDocument() {}
    public void endDocument() {}
    public void characters(char[] text, int start, int length){}
    public void startPrefixMapping(String prefix, String uri) {}
    public void endPrefixMapping(String prefix) {}
    public void endElement(String namespaceURI, String localName, String qualifiedName) {}
    public void ignorableWhitespace(char[] text, int start, int length) throws SAXException {}
    public void processingInstruction(String target, String data){}
    public void skippedEntity(String name) {}

You may be wondering if you really need to define these, since their sum contribution to the functionality of this program is zero. The short answer is, yes, you do; since you're implementing an interface, you must include all the methods within it. If you don't, Java will barf all over your screen - try it and see for yourself!

Finally, the main() method sets the ball rolling, creating an instance of my user-defined class, with the argument entered by the user (the XML file location) as an input parameter.

    // everything starts here
    public static void main (String[] args) {
        MyFirstSaxApp myFirstExample = new MyFirstSaxApp(args[0]);
    }

Next, let's look at streamlining this a little, with a slightly different technique.

Sweeping Up The Mess

The example you just saw contains a whole bunch of callback method definitions that don't actually do anything. This merely clutters your code while adding very little by way of functionality. Which is why the SAX 2.0 API offers an alternative to implementing the ContentHandler interface directly: the DefaultHandler class.

This "helper" class implements the ContentHandler interface, along with all of its callbacks; however, these callbacks do nothing by default. The developer can then selectively override these empty callbacks as per the requirements of the application - this means less code to write, and cleaner code to read.

Take a look at this version of the previous example, this one built by extending the DefaultHandler class instead of implementing the ContentHandler interface:

import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import java.io.*;

public class MySecondSaxApp extends DefaultHandler {

  // constructor
  public MySecondSaxApp (String xmlFile) {

    //  create a Xerces SAX parser
    SAXParser parser = new SAXParser();

    //  set the content handler
    parser.setContentHandler(this);

    //  parse the document
    try {
        parser.parse(xmlFile);
    } catch (SAXException e) {
        System.err.println (e);
    } catch (IOException e) {
        System.err.println (e);
    }
  }

  // call when start elements are found
  public void startElement (String uri, String local, String qName, Attributes atts)  {
    System.out.println ("Found element: " + local);
  }

  // everything starts here
  public static void main (String[] args) {
    MySecondSaxApp mySecondExample = new MySecondSaxApp(args[0]);
  }
}

This example is much more readable than the previous one, primarily because the entire set of empty callbacks defined previously is conspicuously absent here. As stated above, these empty callbacks are already provided by the DefaultHandler class; a developer only needs to redefine those which are actually required by the application.

Note, though, that extending the DefaultHandler class requires you to add one more item to your list of imports at the top of the program:

import org.xml.sax.helpers.DefaultHandler;


Here's the output:

Found element: inventory
Found element: item
Found element: id
Found element: name
Found element: supplier
Found element: cost
Found element: quantity
Found element: item
Found element: id
Found element: name
Found element: supplier
Found element: cost
Found element: quantity


Diving Deeper

This next example goes beyond the simple applications you've just seen to provide a more comprehensive XML parsing and processing demonstration. Here's the XML file I plan to use:

<?xml version="1.0"?>
<inventory>
    <item>
        <id>758</id>
        <name>Rusty, jagged nails for nailgun</name>
        <supplier>NailBarn, Inc.</supplier>
        <cost currency="USD">2.99</cost>
        <quantity alert="500">10000</quantity>
    </item>
    <item>
        <id>6273</id>
        <name>Power pack for death ray</name>
        <supplier>QuakePower.domain.com</supplier>
        <cost currency="USD">9.99</cost>
        <quantity alert="20">10</quantity>
    </item>
</inventory>

Now, how about parsing this XML file and displaying a breakup of the data contained within it? With SAX, it's a snap!

import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import java.io.*;

public class MyThirdSaxApp extends DefaultHandler {

    // constructor
    public MyThirdSaxApp (String xmlFile) {

        //  create a Xerces SAX parser
        SAXParser parser = new SAXParser();

        //  set the content handler
        parser.setContentHandler(this);

        //  parse the document
        try {
            parser.parse(xmlFile);
        } catch (SAXException e) {
            System.err.println (e);
        } catch (IOException e) {
            System.err.println (e);
        }
    }

    // callback definitions start here
    // call this at document start
    public void startDocument() {
        System.out.println ("Document begins");
    }

    //  call this when start tag found
    public void startElement (String uri, String local, String qName, Attributes atts) {
        System.out.println ("Element begins: \"" + local + "\"");
        String AttributeName,AttributeType,AttributeValue = "";
        // look up each attribute by its index
        for (int i = 0; i < atts.getLength(); i++) {
            AttributeName = atts.getLocalName(i);
            AttributeType = atts.getType(i);
            AttributeValue = atts.getValue(i);
            System.out.println ("Attribute: \"" + AttributeName + "\"");
            System.out.println ("\tType: \"" + AttributeType + "\"");
            System.out.println ("\tValue: \"" + AttributeValue + "\"");
        }
    }

    // call this when character data found
    public void characters(char[] text, int start, int length) {
        String Content = new String(text, start, length);
        // skip runs that contain nothing but whitespace
        if (!Content.trim().equals("")) {
            System.out.println("Character data: \"" + Content + "\"");
        }
    }

    //  call this when end tag found
    public void endElement (String uri, String local, String qName) {
        System.out.println("Element ends: \"" + local + "\"");
    }

    // call this at document end
    public void endDocument() {
        System.out.println ("Document ends");
    }

    // the main method
    public static void main (String[] args) {
        MyThirdSaxApp myThirdExample = new MyThirdSaxApp(args[0]);
    }

}


Here's the output:

Document begins
Element begins: "inventory"
Element begins: "item"
Element begins: "id"
Character data: "758"
Element ends: "id"
Element begins: "name"
Character data: "Rusty, jagged nails for nailgun"
Element ends: "name"
Element begins: "supplier"
Character data: "NailBarn, Inc."
Element ends: "supplier"
Element begins: "cost"
Attribute: "currency"
    Type: "CDATA"
    Value: "USD"
Character data: "2.99"
Element ends: "cost"
Element begins: "quantity"
Attribute: "alert"
    Type: "CDATA"
    Value: "500"
Character data: "10000"
Element ends: "quantity"
Element ends: "item"
Element begins: "item"
Element begins: "id"
Character data: "6273"
Element ends: "id"
Element begins: "name"
Character data: "Power pack for death ray"
Element ends: "name"
Element begins: "supplier"
Character data: "QuakePower.domain.com"
Element ends: "supplier"
Element begins: "cost"
Attribute: "currency"
    Type: "CDATA"
    Value: "USD"
Character data: "9.99"
Element ends: "cost"
Element begins: "quantity"
Attribute: "alert"
    Type: "CDATA"
    Value: "20"
Character data: "10"
Element ends: "quantity"
Element ends: "item"
Element ends: "inventory"
Document ends


Most of this should be familiar to you by now, so I'm going to concentrate on the callback functions used in the example above:

First up, the startDocument() callback, invoked when the parser encounters the beginning of an XML document. Here, the function merely prints a string indicating the start of the document; you could also use it to print a header, or initialize document-specific variables.

    // call this at document start
    public void startDocument() {
        System.out.println ("Document begins");
    }

Next, it's the turn of the startElement() callback, discussed in detail a few pages back...although this one adds a new wrinkle by also accounting for element attributes.

    public void startElement (String uri, String local, String qName, Attributes atts) {
        System.out.println ("Element begins: \"" + local + "\"");
        String AttributeName,AttributeType,AttributeValue = "";
        // look up each attribute by its index
        for (int i = 0; i < atts.getLength(); i++) {
            AttributeName = atts.getLocalName(i);
            AttributeType = atts.getType(i);
            AttributeValue = atts.getValue(i);
            System.out.println ("Attribute: \"" + AttributeName + "\"");
            System.out.println ("\tType: \"" + AttributeType + "\"");
            System.out.println ("\tValue: \"" + AttributeValue + "\"");
        }
    }

Note that attributes attached to the element are automatically passed to the startElement() callback as an Attributes object. Detailed information on each attribute in this collection can be obtained via the methods getLocalName(), getType() and getValue(), each of which accepts the attribute's index.
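As a small illustration of walking through an element's attributes by index, here's a sketch that gathers every attribute in a document into a map. It is written under a few assumptions: the JDK's built-in JAXP SAXParserFactory is used in place of the Xerces SAXParser class, the XML comes from a string instead of a file, and the names AttributeLookup and attributesOf() are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// hypothetical demo class, not part of the article's examples
public class AttributeLookup {

    // collect every attribute in the document as "element@attribute" -> value
    public static Map<String, String> attributesOf(String xml) {
        final Map<String, String> found = new LinkedHashMap<String, String>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // so that local names are reported
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local, String qName, Attributes atts) {
                        // getLength() says how many attributes this element carries
                        for (int i = 0; i < atts.getLength(); i++) {
                            found.put(local + "@" + atts.getLocalName(i), atts.getValue(i));
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<item>"
            + "<cost currency=\"USD\">2.99</cost>"
            + "<quantity alert=\"500\">10000</quantity>"
            + "</item>";
        System.out.println(attributesOf(xml));
        // prints: {cost@currency=USD, quantity@alert=500}
    }
}
```

The Attributes object is only guaranteed to be valid for the duration of the startElement() call, which is why the values are copied into the map immediately instead of holding on to the object itself.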

The characters() callback handles character data, and receives the text found between tags as argument:

    public void characters(char[] text, int start, int length) {
        String Content = new String(text, start, length);
        if (!Content.trim().equals("")) {
            System.out.println("Character data: \"" + Content + "\"");
        }
    }

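Sadly, this information is passed as an array of individual characters, together with a start offset and a length, rather than as a single string. The first order of business, therefore, is to build a String from the relevant slice of the array, as the code above does. Keep in mind, too, that a SAX parser is free to deliver a single run of text in more than one call to characters().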

It's important to note that the parser will also invoke the characters() callback when it encounters whitespace within the XML document. As you might imagine, this can lead to strange results, especially if you're new to XML programming. I've used the trim() string function to spare myself the agony - you should do the same.
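If you need the complete text of an element in one piece, a common approach is to collect the pieces in a buffer and only look at the result when the end tag arrives. What follows is a rough sketch of that idea, not the code used elsewhere in this article: it assumes the JDK's built-in JAXP SAXParserFactory instead of the Xerces SAXParser class, and the names TextCollector and textOf() are invented for the purpose.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// hypothetical demo class, not part of the article's examples
public class TextCollector {

    // return "element=text" for every element that contains non-whitespace text
    public static List<String> textOf(String xml) {
        final List<String> result = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            // holds the text seen since the most recent start or end tag
            private final StringBuilder buffer = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                buffer.setLength(0);  // throw away whitespace seen before this tag
            }

            @Override
            public void characters(char[] text, int start, int length) {
                buffer.append(text, start, length);  // may be called several times per element
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String content = buffer.toString().trim();
                if (!content.isEmpty()) {
                    result.add(local + "=" + content);
                }
                buffer.setLength(0);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // so that local names are reported
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<item>\n  <id>758</id>\n  <name>Nails &amp; power packs</name>\n</item>";
        System.out.println(textOf(xml));
        // prints: [id=758, name=Nails & power packs]
    }
}
```

Because the text is trimmed only after all the pieces have been joined, the result is the same whether the parser hands over the name in one call or in several.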

The endElement() callback is invoked when the parser hits the end of an element - note that this callback receives the ending element name as argument.

    public void endElement (String uri, String local, String qName){
        System.out.println("Element ends: \"" + local + "\"");
    }

Finally, the endDocument() callback is triggered when the end of the document is reached.

    // call this at document end
    public void endDocument(){
        System.out.println ("Document ends");
    }


All these callbacks, acting in concert, result in the output described a few paragraphs back.
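The same callbacks can do more than print what they see. As one possible illustration, here's a sketch that adds up cost multiplied by quantity for every item in the inventory. The assumptions are spelled out here and in the comments: the JDK's built-in JAXP SAXParserFactory replaces the Xerces SAXParser class, the XML is supplied as a string, and the names InventoryValue and totalOf() are made up for this example.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// hypothetical demo class, not part of the article's examples
public class InventoryValue extends DefaultHandler {

    private final StringBuilder buffer = new StringBuilder();  // text of the current element
    private BigDecimal cost = BigDecimal.ZERO;                 // cost of the current item
    private BigDecimal quantity = BigDecimal.ZERO;             // quantity of the current item
    private BigDecimal total = BigDecimal.ZERO;                // running total for the document

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        buffer.setLength(0);
        if (local.equals("item")) {
            // a new item begins, so forget the previous item's figures
            cost = BigDecimal.ZERO;
            quantity = BigDecimal.ZERO;
        }
    }

    @Override
    public void characters(char[] text, int start, int length) {
        buffer.append(text, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String content = buffer.toString().trim();
        if (local.equals("cost")) {
            cost = new BigDecimal(content);
        } else if (local.equals("quantity")) {
            quantity = new BigDecimal(content);
        } else if (local.equals("item")) {
            total = total.add(cost.multiply(quantity));
        }
        buffer.setLength(0);
    }

    // parse the XML text and return the sum of cost * quantity over all items
    public static BigDecimal totalOf(String xml) {
        InventoryValue handler = new InventoryValue();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // so that local names are reported
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.total;
    }

    public static void main(String[] args) {
        String xml = "<inventory>"
            + "<item><cost currency=\"USD\">2.99</cost><quantity alert=\"500\">10000</quantity></item>"
            + "<item><cost currency=\"USD\">9.99</cost><quantity alert=\"20\">10</quantity></item>"
            + "</inventory>";
        System.out.println("Total inventory value: " + totalOf(xml));
        // prints: Total inventory value: 29999.90
    }
}
```

BigDecimal is used here in preference to double so that the cents come out exactly; the handler keeps only a few values in memory at any time, no matter how many items the document contains.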

Obviously, this is just one illustration of the applications of the Xerces SAX parser. You can do a lot more with it...and in the second part of this article, I'll build on everything you just learnt to demonstrate how the Xerces SAX parser can be combined with JSP to format XML documents for a Web browser. I'll also take a look at the error-handling functions built into the parser, demonstrating how they can be used to trap and catch errors in XML processing. Make sure you come back for that one!

Note: All examples in this article have been tested with JDK 1.3.0, Apache 1.3.11, mod_jk 1.1.0, Xerces 1.4.4 and Tomcat 3.3. Examples are illustrative only, and are not meant for a production environment. YMMV!
This article was first published on 18 Jan 2002.