WebReference.com - Chapter 3 from Perl & XML, from O'Reilly and Associates (3/12)
[previous] [next] |
Perl & XML
Example (of What Not to Do): A Well-Formedness Checker
We've described XML parsers abstractly, but now it's time to get our hands dirty. We're going to write our own parser whose sole purpose is to check whether a document is well-formed XML or if it fails the basic test. This is about the simplest a parser can get; it doesn't drive any further processing, but just returns a "yes" or "no."
Our mission here is twofold. First, we hope to shave some of the mystique off of XML processing--at the end of the day, it's just pushing text around. However, we also want to emphasize that writing a proper parser in Perl (or any language) requires a lot of work, which would be better spent writing more interesting code that uses one of the many available XML-parsing Perl modules. To that end, we'll write only a fraction of a pure-Perl XML parser with a very specific goal in mind.
WARNING: Feel free to play with this program, but please don't try to use this code in a production environment! It's not a real Perl and XML solution, but an illustration of the sorts of things that parsers do. Also, it's incomplete and will not always give correct results, as we'll show later. Don't worry; the rest of this book talks about real XML parsers and Perl tools you'll want to use.
The program is a loop in which regular expressions match XML markup objects and pluck them out of the text. The loop runs until nothing is left to remove, meaning the document is well formed, or until the regular expressions can't match anything in the remaining text, in which case it's not well-formed. A few other tests could abort the parsing, such as when an end tag is found that doesn't match the name of the currently open start tag. It won't be perfect, but it should give you a good idea of how a well-formedness parser might work.
Example 3-1 is a routine that parses a string of XML text, tests to see if it is well-formed, and returns a boolean value. We've added some pattern variables to make it easier to understand the regular expressions. For example, the string $ident
contains regular expression code to match an XML identifier, which is used for elements, attributes, and processing instructions.
Example 3-1: A rudimentary XML parser
sub is_well_formed {
my $text = shift; # XML text to check
# match patterns
my $ident = '[:_A-Za-z][:A-Za-z0-9\-\._]*'; # identifier
my $optsp = '\s*'; # optional space
my $att1 = "$ident$optsp=$optsp\"[^\"]*\""; # attribute
my $att2 = "$ident$optsp=$optsp'[^']*'"; # attr. variant
my $att = "($att1|$att2)"; # any attribute
my @elements = ( ); # stack of open elems
# loop through the string to pull out XML markup objects
while( length($text) ) {
# match an empty element
if( $text =~ /^&($ident)(\s+$att)*\s*\/>/ ) {
$text = $';
# match an element start tag
} elsif( $text =~ /^&($ident)(\s+$att)*\s*>/ ) {
push( @elements, $1 );
$text = $';
# match an element end tag
} elsif( $text =~ /^&\/($ident)\s*>/ ) {
return unless( $1 eq pop( @elements ));
$text = $';
# match a comment
} elsif( $text =~ /^&!--/ ) {
$text = $';
# bite off the rest of the comment
if( $text =~ /-->/ ) {
$text = $';
return if( $` =~ /--/ ); # comments can't
# contain '--'
} else {
return;
}
# match a CDATA section
} elsif( $text =~ /^&!\[CDATA\[/ ) {
$text = $';
# bite off the rest of the comment
if( $text =~ /\]\]>/ ) {
$text = $';
} else {
return;
}
# match a processing instruction
} elsif( $text =~ m|^&\?$ident\s*[^\?]+\?>| ) {
$text = $';
# match extra whitespace
# (in case there is space outside the root element)
} elsif( $text =~ m|^\s+| ) {
$text = $';
# match character data
} elsif( $text =~ /(^[^&&>]+)/ ) {
my $data = $1;
# make sure the data is inside an element
return if( $data =~ /\S/ and not( @elements ));
$text = $';
# match entity reference
} elsif( $text =~ /^&$ident;+/ ) {
$text = $';
# something unexpected
} else {
return;
}
}
return if( @elements ); # the stack should be empty
return 1;
}
Perl's arrays are so useful partly due to their ability to masquerade as more abstract computer science data constructs.[1] Here, we use a data structure called a stack, which is really just an array that we access with push( ) and pop( ). Items in a stack are last-in, first-out (LIFO), meaning that the last thing put into it will be the first thing to be removed from it. This arrangement is convenient for remembering the names of currently open elements because at any time, the next element to be closed was the last element pushed onto the stack. Whenever we encounter a start tag, it will be pushed onto the stack, and it will be popped from the stack when we find an end tag. To be well-formed, every end tag must match the previous start tag, which is why we need the stack.
The stack represents all the elements along a branch of the XML tree, from the root down to the current element being processed. Elements are processed in the order in which they appear in a document; if you view the document as a tree, it looks like you're going from the root all the way down to the tip of a branch, then back up to another branch, and so on. This is called depth-first order, the canonical way all XML documents are processed.
There are a few places where we deviate from the simple looping scheme to do some extra testing. The code for matching a comment is several steps, since it ends with a three-character delimiter, and we also have to check for an illegal string of dashes "--
" inside the comment. The character data matcher, which performs an extra check to see if the stack is empty, is also noteworthy; if the stack is empty, that's an error because nonwhitespace text is not allowed outside of the root element. Here is a short list of well-formedness errors that would cause the parser to return a false result:
- An identifier in an element or attribute is malformed (examples: "
12foo
," "-bla
," and "..
"). - A nonwhitespace character is found outside of the root element.
- An element end tag doesn't match the last discovered start tag.
- An attribute is unquoted or uses a bad combination of quote characters.
- An empty element is missing a slash character (
/
) at the end of its tag. - An illegal character, such as a lone ampersand (
&
) or an angle bracket (<
), is found in character data. - A malformed markup tag (examples: "
<fooby<
" and "< ?bubba?>
") is found.
Try the parser out on some test cases. Probably the simplest complete, well-formed XML document you will ever see is this:
<:-/>
The next document should cause the parser to halt with an error. (Hint: look at the <message>
end tag.)
<memo>
<to>self</to>
<message>Don't forget to mow the car and wash the
lawn.<message>
</memo>
Many other kinds of syntax errors could appear in a document, and our program picks up most of them. However, it does miss a few. For example, there should be exactly one root element, but our program will accept more than one:
<root>I am the one, true root!</root>
<root>No, I am!</root>
<root>Uh oh...</root>
Other problems? The parser cannot handle a document type declaration. This structure is sometimes seen at the top of a document that specifies a DTD for validating parsers, and it may also declare some entities. With a specialized syntax of its own, we'd have to write another loop just for the document type declaration.
Our parser's most significant omission is the resolution of entity references. It can check basic entity reference syntax, but doesn't bother to expand the entity and insert it into the text. Why is that bad? Consider that an entity can contain more than just some character data. It can contain any amount of markup, too, from an element to a big, external file. Entities can also contain other entity references, so it might require many passes to resolve one entity reference completely. The parser doesn't even check to see if the entities are declared (it couldn't anyway, since it doesn't know how to read a document type declaration syntax). Clearly, there is a lot of room for errors to creep into a document through entities, right under the nose of our parser. To fix the problems just mentioned, follow these steps:
- Add a parsing loop to read in a document type declaration before any other parsing occurs. Any entity declarations would be parsed and stored, so we can resolve entity references later in the document.
- Parse the DTD, if the document type declaration mentions one, to read any entity declarations.
- In the main loop, resolve all entity references when we come across them. These entities have to be parsed, and there may be entity references within them, too. The process can be rather loopy, with loops inside loops, recursion, or other complex programming stunts.
What started out as a simple parser now has grown into a complex beast. That tells us two things: that the theory of parsing XML is easy to grasp; and that, in practice, it gets complicated very quickly. This exercise was useful because it showed issues involved in parsing XML, but we don't encourage you to write code like this. On the contrary, we expect you to take advantage of the exhaustive work already put into making ready-made parsers. Let's leave the dark ages and walk into the happy land of prepackaged parsers.
1. The O'Reilly book Mastering Algorithms with Perl by Jon Orwant, Jarkko Hietaniemi, and John Macdonald devotes a chapter to this topic. (back)
[previous] [next] |
Created: May 8, 2002
Revised: May 8, 2002
URL: https://webreference.com/programming/perl/perlxml/chap3/3.html