WebReference.com - Chapter 3 from Perl & XML, from O'Reilly and Associates (12/12)

[previous]

Perl & XML

Encoding Conversion

If you use a version of Perl older than 5.8, you'll need a little extra help when switching from one encoding to another. Fortunately, your toolbox contains some ratchety little devices to assist you.

iconv and Text::Iconv

iconv is a library and program available for Windows and Unix (inlcuding Mac OS X) that provides an easy interface for turning a document of type A into one of type B. On the Unix command line, you can use it like this:

$ iconv -f latin1 -t utf8 my_file.txt > my_unicode_file.txt

If you have iconv on your system, you can also grab the Text::Iconv Perl module from CPAN, which gives you a Perl API to this library. This allows you to quickly re-encode on-disk files or strings in memory.

Unicode::String

A more portable solution comes in the form of the Unicode::String module, which needs no underlying C library. The module's basic API is as blissfully simple as all basic APIs should be. Got a string? Feed it to the class's constructor method and get back an object holding that string, as well as a bevy of methods that let you squash and stretch it in useful and amusing ways. Example 3-12 tests the module.

Example 3-12: Unicode test

use Unicode::String;
my $string = "This sentence exists in ASCII and UTF-8, but not UTF-16. Darn!\n";
my $u = Unicode::String->new($string);
# $u now holds an object representing a stringful of 16-bit characters
# It uses overloading so Perl string operators do what you expect!
$u .= "\n\nOh, hey, it's Unicode all of a sudden. Hooray!!\n"
# print as UTF-16 (also known as UCS2)
print $u->ucs2;
# print as something more human-readable
print $u->utf8;

The module's many methods allow you to downgrade your strings, too--specifically, the utf7 method lets you pop the eighth bit off of UTF-8 characters, which is acceptable if you need to throw a bunch of ASCII characters at a receiver that would flip out if it saw chains of UTF-8 marching proudly its way instead of the austere and solitary encodings of old.

WARNING: XML::Parser sometimes seems a little too eager to get you into Unicode. No matter what a document's declared encoding is, it silently transforms all characters with higher Unicode code points into UTF-8, and if you ask the parser for your data back, it delivers those characters back to you in that manner. This silent transformation can be an unpleasant surprise. If you use XML::Parser as the core of any processing software you write, be aware that you may need to use the convertion tools mentioned in this section to massage your data into a more suitable format.

Byte order marks

If, for some reason, you have an XML document from an unknown source and have no idea what its encoding might be, it may behoove you to check for the presence of a byte order mark (BOM) at the start of the document. Documents that use Unicode's UTF-16 and UTF-32 encodings are endian-dependent (while UTF-8 escapes this fate by nature of its peculiar protocol). Not knowing which end of a byte carries the significant bit will make reading these documents similar to reading them in a mirror, rendering their content into a garble that your programs will not appreciate.

Unicode defines a special code point, U+FEFF, as the byte order mark. According to the Unicode specification, documents using the UTF-16 or UTF-32 encodings have the option of dedicating their first two or four bytes to this character.^[1] This way, if a program carefully inspecting the document scans the first two bits and sees that they're 0xFE and 0xFF, in that order, it knows it's big-endian UTF-16. On the other hand, if it sees 0xFF 0xFE, it knows that document is little-endian because there is no Unicode code point of U+FFFE. (UTF-32's big- and little-endian BOMs have more padding: 0x00 0x00 0xFE 0xFF and 0xFF 0xFE 0x00 0x00, respectively.)

The XML specification states that UTF-16- and UTF-32-encoded documents must use a BOM, but, referring to the Unicode specification, we see that documents created by the engines of sane and benevolent masters will arrive to you in network order. In other words, they arrive to you in a big-endian fashion, which was some time ago declared as the order to use when transmitting data between machines. Conversely, because you are sane and benevolent, you should always transmit documents in network order when you're not sure which order to use. However, if you ever find yourself in doubt that you've received a sane document, just close your eyes and hum this tune:

open XML_FILE, $filename or die "Can't read $filename: $!";
my $bom; # will hold possible byte order mark
# read the first two bytes
read XML_FILE, $bom, 2;
# Fetch their numeric values, via Perl's ord() function
my $ord1 = ord(substr($bom,0,1));
my $ord2 = ord(substr($bom,1,1));
if ($ord1 == 0xFE && $ord2 == 0xFF) {
  # It looks like a UTF-16 big-endian document!
  # ... act accordingly here ...
} elsif ($ord1 == 0xFF && $ord2 == 0xEF) {
  # Oh, someone was naughty and sent us a UTF-16 little-endian document.
  # Probably we'll want to effect a byteswap on the thing before working with it.
} else {
  # No byte order mark detected.
}

You might run this example as a last-ditch effort if your parser complains that it can't find any XML in the document. The first line might indeed be a valid <?xml ... > declaration, but your parser sees some gobbledygook instead.

1. UTF-8 has its own byte order mark, but its purpose is to identify the document at UTF-8, and thus has little use in the XML world. The UTF-8 encoding doesn't have to worry about any of this endianness business since all its characters are made of strung-together byte sequences that are always read from first to last instead of little boxes holding byte pairs whose order may be questionable. (back)