January 26, 2000 - Unicode | WebReference

Yehuda Shiran January 26, 2000

Yehuda Shiran, Ph.D.
Doc JavaScript

The Unicode standard is a uniform character encoding scheme; in the form JavaScript uses, every character has a fixed-width 16-bit value. It is intended for the interchange and display of text in many different languages, including historic scripts, technical and mathematical symbols, and multilingual texts. The standard specifies each character's identity and its numeric value. In JavaScript, the 16-bit numeric value is written as four hexadecimal digits preceded by \u (a backslash followed by a lowercase u). The Unicode value \u0041, for example, represents the character A. The unique Unicode name for this character is LATIN CAPITAL LETTER A.
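As a minimal sketch of the escape notation described above, the snippet below compares the \u0041 escape with the literal character A and reads back its numeric value. The variable names are illustrative only and do not come from the original tip.

```javascript
// "\u0041" is the Unicode escape sequence for LATIN CAPITAL LETTER A.
var letterA = "\u0041";

// The escape sequence and the literal character produce the same string.
var sameAsLiteral = (letterA === "A");

// charCodeAt returns the 16-bit numeric value; toString(16) shows it in hexadecimal.
var numericValue = letterA.charCodeAt(0);
var hexValue = numericValue.toString(16);

console.log(letterA, sameAsLiteral, numericValue, hexValue); // A true 65 41
```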

Unicode is compatible with ASCII. The first 128 Unicode characters correspond to the ASCII characters and have the same numeric values: ASCII's 0x41 is the same as Unicode's \u0041. While ASCII's 128 characters support just the basic Latin alphabet, Unicode's more than 65,000 characters can support many different languages. Unicode is fully compatible with the ISO 10646-1 standard and its UCS-2 encoding. JavaScript programs are still written in ASCII characters, but you can use non-ASCII Unicode characters in comments and string literals.
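The ASCII correspondence can be checked directly. The sketch below assumes a hypothetical helper, toUnicodeEscape, which is not part of JavaScript or of the original tip; it formats a character's 16-bit value in the \u notation used here.

```javascript
// Hypothetical helper: format the first 16-bit value of a string as a \uXXXX escape.
function toUnicodeEscape(text) {
  var hex = text.charCodeAt(0).toString(16).toUpperCase();
  while (hex.length < 4) {
    hex = "0" + hex; // pad to four hexadecimal digits
  }
  return "\\u" + hex;
}

// ASCII's 0x41 and Unicode's \u0041 are the same numeric value.
var asciiValue = 0x41;
var sameValue = ("\u0041".charCodeAt(0) === asciiValue);

// A non-ASCII character inside a string literal, written as an escape.
var word = "caf\u00E9"; // ends with LATIN SMALL LETTER E WITH ACUTE

console.log(sameValue);                       // true
console.log(toUnicodeEscape("A"));            // \u0041
console.log(toUnicodeEscape(word.charAt(3))); // \u00E9
```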

The calculator accompanying this tip accepts a Unicode value (just the four hexadecimal digits, without the \u) and prints the corresponding character in the middle of the following sentence: "Unicode Demo:Netscape Corporation"
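The calculator's behavior can be approximated in a few lines. This is only a sketch: the function name unicodeDemo is hypothetical, and placing the character right after the colon is an assumption about what "the middle of the sentence" means, since the original widget's code is not shown.

```javascript
// Hypothetical stand-in for the calculator (not the original page's code).
// Takes four hexadecimal digits (no \u prefix) and returns the demo sentence
// with the corresponding character inserted after the colon (an assumption).
function unicodeDemo(hexDigits) {
  // Accept exactly four hexadecimal digits; reject anything else.
  if (!/^[0-9A-Fa-f]{4}$/.test(hexDigits)) {
    return null;
  }
  var character = String.fromCharCode(parseInt(hexDigits, 16));
  return "Unicode Demo:" + character + "Netscape Corporation";
}

console.log(unicodeDemo("0041")); // Unicode Demo:ANetscape Corporation
console.log(unicodeDemo("00A9")); // Unicode Demo:©Netscape Corporation
console.log(unicodeDemo("41"));   // null (not four hexadecimal digits)
```

Whether a given value displays as expected still depends on the fonts available to the browser and platform.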

Here are some common special characters and their Unicode values:

Unicode Value    Name               Symbol
\u0009           Tab                <TAB>
\u000B           Vertical Tab       <VT>
\u000C           Form Feed          <FF>
\u0020           Space              <SP>
\u000A           Line Feed          <LF>
\u000D           Carriage Return    <CR>
\u0022           Double Quote       <">
\u0027           Single Quote       <'>
\u005C           Backslash          <\>
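Each value in the table can be compared against JavaScript's shorter single-character escape or literal for the same character. The sketch below does that; the array and function names are illustrative only.

```javascript
// Each entry: [Unicode escape, equivalent short escape or literal, name].
var pairs = [
  ["\u0009", "\t", "Tab"],
  ["\u000B", "\v", "Vertical Tab"],
  ["\u000C", "\f", "Form Feed"],
  ["\u0020", " ",  "Space"],
  ["\u000A", "\n", "Line Feed"],
  ["\u000D", "\r", "Carriage Return"],
  ["\u0022", "\"", "Double Quote"],
  ["\u0027", "'",  "Single Quote"],
  ["\u005C", "\\", "Backslash"]
];

// Returns true only if every Unicode escape equals its short form.
function allPairsMatch(list) {
  for (var i = 0; i < list.length; i++) {
    if (list[i][0] !== list[i][1]) {
      return false;
    }
  }
  return true;
}

console.log(allPairsMatch(pairs)); // true
```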

You can experiment with the Unicode calculator and find many Unicode values that yield unexpected characters. Although Unicode can support more than 65,000 different characters, it is up to your browser and its fonts to display them, and Unicode fonts often do not cover all the Unicode characters. Beyond the client's (browser's) support, the client platform must support Unicode as well. Some platforms, such as Windows 95, provide only partial support for Unicode.

The other problem with Unicode is how to enter non-ASCII characters. Often, the only way to specify them is with Unicode escape sequences like those in the table above. The Unicode specification also allows a composite character to be written as a sequence of Unicode characters led by the base character. Many French characters, for example, are Latin letters with added diacritical marks such as acute accents, circumflexes, and cedillas. Under the Unicode specification, such a character can be written as the Latin letter followed by the Unicode value of the combining mark. The JavaScript implementation, like others, does not support this option: it does not interpret combining sequences. A single Unicode escape sequence for each French character is used instead.
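To illustrate the point about combining sequences, the sketch below writes the French letter é two ways: as the single value \u00E9, and as the letter e followed by \u0301 (COMBINING ACUTE ACCENT). JavaScript treats them as different strings. The variable names are illustrative only.

```javascript
// Single escape for LATIN SMALL LETTER E WITH ACUTE.
var precomposed = "\u00E9";

// Base letter e followed by COMBINING ACUTE ACCENT.
var combining = "e\u0301";

// JavaScript compares the 16-bit values one by one and does not interpret
// the combining sequence as equivalent to the single character.
console.log(precomposed.length);        // 1
console.log(combining.length);          // 2
console.log(precomposed === combining); // false
```

This is why the single escape sequence is the safer choice when strings will be compared or measured.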

Unicode support was introduced in JavaScript 1.3. Learn more about the features of JavaScript 1.3 in Column 25, JavaScript 1.3 Overview, Part I, and Column 26, JavaScript 1.3 Overview, Part II.