Refactoring HTML: Improving the Design of Existing Web Applications - Part 3
Refactoring HTML: Well-Formedness - Part 3
Elliotte Rusty Harold
Convert Text to UTF-8
Reencode all text as Unicode UTF-8.
Motivation
Pages that contain anything other than basic ASCII have cross-platform display problems. Windows encodings are not interpreted correctly on the Mac and vice versa. Web browsers guess which encoding a page uses, but they often guess wrong.
UTF-8 is a standard encoding that works across all web browsers and is supported by all major text editors and other tools. It is reasonably fast, small, and efficient. It can support all Unicode characters and is a good basis for internationalization and localization of pages.
Potential Trade-offs
You need to be able to control your web server's HTTP response headers to implement this properly. This can be problematic in shared hosting environments. Bad tools do not always recognize UTF-8 when they should.
Mechanics
There are two steps here. First, reencode all content in UTF-8. Second, tell clients that you've done that. Reencoding is straightforward, provided that you know what encoding you're starting with. You have to tell Tidy that you want UTF-8, for example with its --output-encoding utf8 option, but once you do, it will do the work.
TagSoup you don't have to tell. It just produces UTF-8 by default.
A number of command-line tools and other programs will also save content in UTF-8 if you ask, such as GNU recode, BBEdit, and jEdit. You should also set your editor of choice to save in UTF-8 by default.
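If none of those tools is at hand, the reencoding step can also be scripted. The following is a minimal sketch in Python, not a replacement for Tidy or recode; the function name reencode_to_utf8 and the file name in the usage note are hypothetical, and it assumes you already know the source encoding.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding):
    """Rewrite the file at `path` in place as UTF-8.

    `source_encoding` is the encoding the file is currently in
    (for example "cp1252" or "mac_roman"); it must be known in advance.
    """
    raw = Path(path).read_bytes()
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    # Single-byte encodings such as latin-1 accept every byte, so a wrong
    # guess there will not raise and will silently produce wrong characters.
    text = raw.decode(source_encoding)
    Path(path).write_bytes(text.encode("utf-8"))
    return text
```

For example, reencode_to_utf8("index.html", "cp1252") would convert a page saved in the Windows Western European encoding. Keep a backup or use version control, since the file is overwritten.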
The next step is to tell the browsers that the content is in UTF-8. There are three parts to this.
- Add a byte order mark.
- Add a meta tag.
- Specify the Content-type header.
The byte order mark is Unicode character U+FEFF, the zero-width no-break space, which UTF-8 encodes as the three bytes EF BB BF. When this is the first character in a document, the browser should recognize the byte sequence and treat the rest of the content as UTF-8. This shouldn't be necessary, but Internet Explorer and some other tools are more reliable if they have it. Some editors add this automatically, and some require you to request it.
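If your editor offers no such option, the byte order mark can be checked for and added with a few lines of script. This is a small sketch using Python's standard codecs module; the helper names has_utf8_bom and add_utf8_bom are made up for illustration, and the input is assumed to be content that is already encoded as UTF-8.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte order mark."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 byte order mark unless it is already present."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data
```

Calling add_utf8_bom twice is harmless, because the second call sees the mark and returns the bytes unchanged.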
The second step is to add a meta tag in the head, such as this one:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The charset=UTF-8 part warns browsers that they're dealing with UTF-8 if they haven't figured it out already.
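To audit existing pages for this tag, something like the following sketch may help. It assumes the tag takes the common http-equiv="Content-Type" form with a content attribute containing charset=...; the class and function names are hypothetical, and it relies only on Python's standard html.parser module.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset named by the first Content-Type meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # content looks like "text/html; charset=UTF-8"
        for part in (attributes.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value.strip():
                self.charset = value.strip().lower()


def declared_meta_charset(html_text):
    """Return the lowercased charset from the meta tag, or None if absent."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

A page that carries the tag with charset=UTF-8 yields "utf-8"; a page without one yields None, which flags it for repair.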
Finally, you want to configure the web server so that it too specifies that the content is UTF-8. This can be tricky. It requires access to your server's configuration files or the ability to override the configuration locally. This may not be possible on a shared host, but it should be possible on a professionally managed server. On Apache, you can do this by adding the following line to your httpd.conf file or your .htaccess file within the content directory:
AddDefaultCharset utf-8
You really shouldn't have to do all three of these. One should be enough. However, in practice, some tools recognize one of these hints but not the others, and the redundancy doesn't hurt as long as you're consistent.
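Consistency is the part that is easy to get wrong, so it can be worth checking mechanically. The sketch below is one possible approach, not an established tool: given a page's raw bytes, the charset from its meta tag, and the value of the Content-Type response header, it reports any hint that disagrees with UTF-8. The function names are hypothetical; the header parsing uses the standard email.message module.

```python
from email.message import Message


def header_charset(content_type_value):
    """Extract the lowercased charset parameter from a Content-Type value."""
    if content_type_value is None:
        return None
    message = Message()
    message["Content-Type"] = content_type_value
    return message.get_content_charset()


def charset_hint_problems(body, meta_charset, content_type_value):
    """Return a list of human-readable problems; empty means consistent.

    body               -- raw bytes of the page as served
    meta_charset       -- charset named in the meta tag, or None if absent
    content_type_value -- Content-Type header value, or None if absent
    """
    problems = []
    hints = (
        ("meta tag", meta_charset),
        ("Content-Type header", header_charset(content_type_value)),
    )
    for source, value in hints:
        # A missing hint is not a conflict; only a contradicting one is.
        if value is not None and value.strip().lower() not in ("utf-8", "utf8"):
            problems.append(f"{source} declares {value!r}, not UTF-8")
    try:
        body.decode("utf-8")
    except UnicodeDecodeError as error:
        problems.append(f"body is not valid UTF-8 at byte {error.start}")
    return problems
```

An empty list means the hints agree and the bytes really are UTF-8. Anything else points at the step that was missed.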
I do not recommend adding an XML declaration. XML parsers don't need it, and it will confuse some browsers.