WebReference.com - Chapter 3 from Perl & XML, from O'Reilly and Associates (11/12)

[previous] [next]

Perl & XML

Character Sets and Encodings

No matter how you choose to manage your program's output, you must keep in mind the concept of character encoding--the protocol your output XML document uses to represent the various symbols of its language, be they an alphabet of letters or a catalog of ideographs and diacritical marks. Character encoding may represent the trickiest part of XML-slinging, perhaps especially so for programmers in Western Europe and the Americas, most of whom have not explored the universe of possible encodings beyond the 128 characters of ASCII.

While it's technically legal for an XML document's encoding declaration to contain the name of any text encoding scheme, the only ones that XML processors are, according to spec, required to understand are UTF-8 and UTF-16. UTF-8 and UTF-16 are two flavors of Unicode, a recent and powerful character encoding architecture that embraces every funny little squiggle a person might care to make.

In this section, we conspire with Perl and XML to nudge you gently into thinking about Unicode, if you're not pondering it already. While you can do everything described in this book by using the legacy encoding of your choice, you'll find, as time passes, that you're swimming against the current.

Unicode, Perl, and XML

Unicode has crept in as the digital age's way of uniting the thousands of different writing systems that have paid the salaries of monks and linguists for centuries. Of course, if you program in an environment where non-ASCII characters are found in abundance, you're probably already familiar with it. However, even then, much of your text processing work might be restricted to low-bit Latin alphanumerics, simply because that's been the character set of choice--of fiat, really--for the Internet. Unicode hopes to change this trend, Perl hopes to help, and sneaky little XML is already doing so.

As any Unicode-evangelizing document will tell you,^[1] Unicode is great for internationalizing code. It lets programmers come up with localization solutions without the additional worry of juggling different character architectures.

However, Unicode's importance increases by an order of magnitude when you introduce the question of data representation. The languages that a given program's users (or programmers) might prefer is one thing, but as computing becomes more ubiquitous, it touches more people's lives in more ways every day, and some of these people speak Kurku. By understanding the basics of Unicode, you can see how it can help to transparently keep all the data you'll ever work with, no matter the script, in one architecture.

Unicode Encodings

We are careful to separate the words "architecture" and "encoding" because Unicode actually represents one of the former that contains several of the latter.

In Unicode, every discrete squiggle that's gained official recognition, from A to α, has its own code point--a unique positive integer that serves as its address in the whole map of Unicode. For example, the first letter of the Latin alphabet, capitalized, lives at the hexadecimal address 0x0041 (as it does in ASCII and friends), and the other two symbols, the lowercase Greek alpha and the smileyface, are found in 0x03B1 and 0x263A, respectively. A character can be constructed from any one of these code points, or by combining several of them. Many code points are dedicated to holding the various diacritical marks, such as accents and radicals, that many scripts use in conjunction with base alphabetical or ideographic glyphs.

These addresses, as well as those of the tens of thousands (and, in time, hundreds of thousands) of other glyphs on the map, remain true across Unicode's encodings. The only difference lies in the way these numbers are encoded in the ones and zeros that make up the document at its lowest level.

Unicode officially supports three types of encoding, all named UTF (short for Unicode Transformation Format), followed by a number representing the smallest bit-size any character might take. The encodings are UTF-8, UTF-16, and UTF-32. UTF-8 is the most flexible of all, and is therefore the one that Perl has adopted.

UTF-8

The UTF-8 encoding, arguably the most Perlish in its impish trickery, is also the most efficient since it's the only one that can pack characters into single bytes. For that reason, UTF-8 is the default encoding for XML documents: if XML documents specify no encoding in their declarations, then processors should assume that they use UTF-8.

Each character appearing within a document encoded with UTF-8 uses as many bytes as it has to in order to represent that character's code point, up to a maximum of six bytes. Thus, the character A, with the itty-bitty address of 0x41, gets one byte to represent it, while our friend lives way up the street in one of Unicode's blocks of miscellaneous doohickeys, with the address 0x263A. It takes three bytes for itself--two for the character's code point number and one that signals to text processors that there are, in fact, multiple bytes to this character. Several centuries from now, after Earth begrudgingly joins the Galactic Friendship Union and we find ourselves needing to encode the characters from countless off-planet civilizations, bytes four through six will come in quite handy.

UTF-16

The UTF-16 encoding uses a full two bytes to represent the character in question, even if its ordinal is small enough to fit into one (which is how UTF-8 would handle it). If, on the other hand, the character is rare enough to have a very high ordinal, then it gets an additional two bytes tacked onto it (called a surrogate pair), bringing that one character's total length to four bytes.

TIP: Because Unicode 2.0 used a 16-bits-per-character style as its sole supported encoding, many people, and the programs they write, talk about the "Unicode encoding" when they really mean Unicode UTF-16. Even new applications' "Save As..." dialog boxes sometimes offer "Unicode" and "UTF-8" as separate choices, even though these labels don't make much sense in Unicode 3.2 terminology.

UTF-32

UTF-32 works a lot like UTF-16, but eliminates any question of variable character size by declaring that every invoked Unicode-mapped glyph shall occupy exactly four bytes. Because of its maximum maximosity, this encoding doesn't see much practical use, since all but the most unusual communication would have significantly more than half of its total mass made up of leading zeros, which doesn't work wonders for efficiency. However, if guaranteed character width is an inflexible issue, this encoding can handle all the million-plus glyph addresses that Unicode accommodates. Of the three major Unicode encodings, UTF-32 is the one that XML parsers aren't obliged to understand. Hence, you probably don't need to worry about it, either.

Other Encodings

The XML standard defines 21 names for character sets that parsers might use (beyond the two they're required to know, UTF-8 and UTF-16). These names range from ISO-8859-1 (ASCII plus 128 characters outside the Latin alphabet) to Shift_JIS, a Microsoftian encoding for Japanese ideographs. While they're not Unicode encodings per se, each character within them maps to one or more Unicode code points (and vice versa, allowing for round-tripping between common encodings by way of Unicode).

XML parsers in Perl all have their own ways of dealing with other encodings. Some may need an extra little nudge. XML::Parser, for example, is weak in its raw state because its underlying library, Expat, understands only a handful of non-Unicode encodings. Fortunately, you can give it a helping hand by installing Clark Cooper's XML::Encoding module, an XML::Parser subclass that can read and understand map files (themselves XML documents) that bind the character code points of other encodings to their Unicode addresses.

Core Perl support

As with XML, Perl's relationship with Unicode has heated up at a cautious but inevitable pace.^[2] Generally, you should use Perl version 5.6 or greater to work with Unicode properly in your code. If you do have 5.6 or greater, consult its perlunicode manpage for details on how deep its support runs, as each release since then has gradually deepened its loving embrace with Unicode. If you have an even earlier Perl, whew, you really ought to consider upgrading it. You can eke by with some of the tools we'll mention later in this chapter, but hacking Perl and XML means hacking in Unicode, and you'll notice the lack of core support for it.

Currently, the most recent stable Perl release, 5.6.1, contains partial support for Unicode. Invoking the use utf8 pragma tells Perl to use UTF-8 encoding with most of its string-handling functions. Perl also allows code to exist in UTF-8, allowing identifiers built from characters living beyond ASCII's one-byte reach. This can prove very useful for hackers who primarily think in glyphs outside the Latin alphabet.

Perl 5.8's Unicode support will be much more complete, allowing UTF-8 and regular expressions to play nice. The 5.8 distribution also introduces the Encode module to Perl's standard library, which will allow any Perl programmer to shift text from legacy encodings to Unicode without fuss:

use Encode 'from_to';
from_to($data, "iso-8859-3", "utf-8"); # from legacy to
utf-8

Finally, Perl 6, being a redesign of the whole language that includes everything the Perl community learned over the last dozen years, will naturally have an even more intimate relationship with Unicode (and will give us an excuse to print a second edition of this book in a few years). Stay tuned to the usual information channels for continuing developments on this front as we see what happens.

1. These documents include Chapter 15 of O'Reilly's Programming Perl, Third Edition and the FAQ that the Unicode consortium hosts at https://unicode.org/unicode/faq/. (back)

2. The romantic metaphor may start to break down for you here, but you probably understand by now that Perl's polyamorous proclivities help make it the language that it is. (back)