WebReference.com - Chapter 3 from Perl & XML, from O'Reilly and Associates (8/12)
XML::LibXML
XML::LibXML, like XML::Parser, is an interface to a library written in C. Called libxml2, it's part of the GNOME project.[1] Unlike XML::Parser, this new parser supports a major standard for XML tree processing known as the Document Object Model (DOM).
DOM is another much-ballyhooed XML standard. It does for tree processing what SAX does for event streams. If you have your heart set on climbing trees in your program and you think there's a likelihood that it might be reused or applied to different data sources, you're better off using something standard and interchangeable. Again, we're happy to delve into DOM in a future chapter and get you thinking in standards-compliant ways. That topic is coming up in Chapter 7.
Now we want to show you an example of another parser in action. We'd be remiss if we focused on just one kind of parser when so many are out there. Again, we'll keep it basic, nothing fancy, just enough to show you how to invoke the parser and tame its power. Let's write another document analysis tool like we did in Example 3-5, this time printing a frequency distribution of elements in a document.
Example 3-6 shows the program. It's a vanilla parser run because we haven't set any options yet. Essentially, the parser parses the filehandle and returns a DOM object, which is nothing more than a tree structure of well-designed objects. Our program finds the document element, and then traverses the entire tree one element at a time, all the while updating the hash of frequency counters.
Example 3-6: A frequency distribution program
use XML::LibXML;
use IO::Handle;

# initialize the parser
my $parser = new XML::LibXML;

# open a filehandle and parse
my $fh = new IO::Handle;
if( $fh->fdopen( fileno( STDIN ), "r" )) {
    my $doc = $parser->parse_fh( $fh );
    my %dist;
    &proc_node( $doc->getDocumentElement, \%dist );
    foreach my $item ( sort keys %dist ) {
        print "$item: ", $dist{ $item }, "\n";
    }
    $fh->close;
}

# process an XML tree node: if it's an element, update the
# distribution list and process all its children
#
sub proc_node {
    my( $node, $dist ) = @_;
    return unless( $node->nodeType eq &XML_ELEMENT_NODE );
    $dist->{ $node->nodeName }++;
    foreach my $child ( $node->getChildnodes ) {
        &proc_node( $child, $dist );
    }
}
Note that instead of using a simple path to a file, we use a filehandle object of the IO::Handle class. Perl filehandles, as you probably know, are magic and subtle beasties, capable of passing into your code characters from a wide variety of sources, including files on disk, open network sockets, keyboard input, databases, and just about everything else capable of outputting data. Once you define a filehandle's source, it gives you the same interface for reading from it as does every other filehandle. This dovetails nicely with our XML-based ideology, where we want code to be as flexible and reusable as possible. After all, XML doesn't care where it comes from, so why should we pigeonhole it with one source type?
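To make this concrete, here's a quick sketch, separate from Example 3-6, showing one parser object reading from three different kinds of sources. The parse_file( ), parse_string( ), and parse_fh( ) calls are all standard XML::LibXML parsing methods, each handing back the same kind of DOM object; the variable names are ours and the ch03.xml filename is just a stand-in:

use XML::LibXML;
use IO::File;

my $parser = XML::LibXML->new;

# a path to a file on disk
my $doc_from_file   = $parser->parse_file( 'ch03.xml' );

# a string already sitting in memory
my $doc_from_string = $parser->parse_string( '<doc><para>hi</para></doc>' );

# any open filehandle, the approach Example 3-6 takes with STDIN
my $fh = IO::File->new( 'ch03.xml', 'r' ) or die "can't open file: $!";
my $doc_from_fh     = $parser->parse_fh( $fh );

Whichever call we use, the rest of the program never needs to know where the XML came from.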
The parser object returns a document object after parsing. This object has a method that returns a reference to the document element--the element at the very root of the whole tree. We take this reference and feed it to a recursive subroutine, proc_node( ), which happily munches on elements and scribbles into a hash variable every time it sees an element. Recursion is an efficient way to write programs that process XML because the structure of documents is somewhat fractal: the same rules for elements apply at any depth or position in the document, including the root element that represents the entire document (modulo its prologue). Note the "node type" check, which distinguishes between elements and other parts of a document (such as pieces of text or processing instructions).
For every element the routine looks at, it has to call the object's getChildnodes( ) method to continue processing on its children. This call is an essential difference between stream-based and tree-based methodologies. Instead of having an event stream take the steering wheel of our program and push data at it, thus calling subroutines and code blocks in a (somewhat) unpredictable order, our program now has the responsibility of navigating through the document under its own power. Traditionally, we start at the root element and go downward, processing children in order from first to last. However, because we, not the parser, are in control now, we can scan through the document in any way we want. We could go backwards, we could scan just a part of the document, we could jump around, making multiple passes through the tree--the sky's the limit. Here's the result from processing a small chapter coded in DocBook XML:
$ xfreq < ch03.xml
chapter: 1
citetitle: 2
firstterm: 16
footnote: 6
foreignphrase: 2
function: 10
itemizedlist: 2
listitem: 21
literal: 29
note: 1
orderedlist: 1
para: 77
programlisting: 9
replaceable: 1
screen: 1
section: 6
sgmltag: 8
simplesect: 1
systemitem: 2
term: 6
title: 7
variablelist: 1
varlistentry: 6
xref: 2
The program is only a few lines of code, but it sure does a lot of work. Again, thanks to the C library underneath, it's quite speedy.
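Because our program, rather than the parser, is in the driver's seat, the order of processing is ours to change. As a small sketch (not part of Example 3-6, but relying on the same XML::LibXML setup), here's a variant of proc_node( ) that visits each element's children from last to first simply by reversing the child list:

# same counting logic as proc_node( ), but children are visited in
# reverse document order
sub proc_node_reversed {
    my( $node, $dist ) = @_;
    return unless( $node->nodeType eq &XML_ELEMENT_NODE );
    $dist->{ $node->nodeName }++;
    # reverse( ) flips the list returned by getChildnodes( ), so the last
    # child is descended into first
    foreach my $child ( reverse $node->getChildnodes ) {
        &proc_node_reversed( $child, $dist );
    }
}

The frequency totals come out the same either way, of course; the point is that nothing is pushing data at us, so we're free to walk the tree in whatever order suits the task.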
XML::XPath
We've seen examples of parsers that dutifully deliver the entire document to you. Often, though, you don't need the whole thing. When you query a database, you're usually looking for only a single record. When you crack open a telephone book, you're not going to sit down and read the whole thing. There is obviously a need for some mechanism of extracting a specific piece of information from a vast document. Look no further than XPath.
XPath is a recommendation from the folks who brought you XML.[2] It's a grammar for writing expressions that pinpoint specific pieces of documents. Think of it as an addressing scheme. Although we'll save the nitty-gritty of XPath wrangling for Chapter 8, we can tantalize you by revealing that it works much like a mix of regular expressions with Unix-style file paths. Not surprisingly, this makes it an attractive feature to add to parsers.
Matt Sergeant's XML::XPath module is a solid implementation, built on the foundation of XML::Parser. Given an XPath expression, it returns a list of all document parts that match the description. It's an incredibly simple way to perform some powerful search and retrieval work.
For instance, suppose we have an address book encoded in XML in this basic form:
<contacts>
  <entry>
    <name>Bob Snob</name>
    <street>123 Platypus Lane</street>
    <city>Burgopolis</city>
    <state>FL</state>
    <zip>12345</zip>
  </entry>
  <!--More entries go here-->
</contacts>
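Against a document like this, XPath expressions read much like filesystem paths with a dash of pattern matching thrown in. Here's a quick illustrative sketch (the expressions and variable names are ours; customers.xml is the file Example 3-7 reads below, and Chapter 8 covers the grammar in earnest) of a few expressions handed to XML::XPath's find( ) method:

use XML::XPath;

my $xp = XML::XPath->new( filename => 'customers.xml' );

# find( ) takes any XPath expression and returns the set of matching nodes
my $names      = $xp->find( '/contacts/entry/name' );   # every <name> down that exact path
my $all_zips   = $xp->find( '//zip' );                   # every <zip>, wherever it appears
my $floridians = $xp->find( "//entry[state='FL']" );     # only entries whose <state> is FL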
Suppose you want to extract all the zip codes from the file and compile them into a list. Example 3-7 shows how you could do it with XPath.
Example 3-7: Zip code extractor
use XML::XPath;

my $file = 'customers.xml';
my $xp = XML::XPath->new(filename=>$file);

# An XML::XPath nodeset is an object which contains the result of
# smacking an XML document with an XPath expression; we'll do just
# this, and then query the nodeset to see what we get.
my $nodeset = $xp->find('//zip');

my @zipcodes; # Where we'll put our results

if (my @nodelist = $nodeset->get_nodelist) {

    # We found some zip elements! Each node is an object of the class
    # XML::XPath::Node::Element, so I'll use that class's 'string_value'
    # method to extract its pertinent text, and throw the result for all
    # the nodes into our array.
    @zipcodes = map($_->string_value, @nodelist);

    # Now sort and prepare for output
    @zipcodes = sort(@zipcodes);
    local $" = "\n";
    print "I found these zipcodes:\n@zipcodes\n";

} else {
    print "The file $file didn't have any 'zip' elements in it!\n";
}
Run the program on a document with three entries and we'll get something like this:
I found these zipcodes:
03642
12333
82649
This module also shows an example of tree-based parsing, by the way, as its parser loads the whole document into an object tree of its own design and then allows the user to selectively interact with parts of it via XPath expressions. This example is just a sample of what you can do with advanced tree processing modules. You'll see more of these modules in Chapter 8.
XML::LibXML's element objects support a findnodes( ) method that works much like XML::XPath's, using the invoking Element object as the current context and returning a list of objects that match the query. We'll play with this functionality later in Chapter 10.
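For a taste, here's a minimal sketch that redoes the zip code hunt with findnodes( ) (the variable names are ours; customers.xml is the same file Example 3-7 reads). The call is made on the document element, so the relative path entry/zip is resolved with <contacts> as the context node:

use XML::LibXML;

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_file( 'customers.xml' );

# findnodes( ) evaluates the XPath expression relative to the invoking node
my @zips = $doc->getDocumentElement->findnodes( 'entry/zip' );

# textContent( ) returns the text inside each matched element
print $_->textContent, "\n" foreach @zips;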
1. For downloads and documentation, see https://www.libxml.org/.
2. Check out the specification at https://www.w3.org/TR/xpath.