HTML Unleashed. SGML and the HTML DTD: Document Type Definition for HTML 4.0



Now that we've examined the SGML declaration and found answers to a number of general questions about HTML formation, it's time to get to the details of its tags, entities, and the related document structure. All of this is defined in the document type definition (DTD) for HTML 4.0.

The HTML DTD analyzed in this chapter is too long to be listed in its entirety. Instead of going through the DTD from top to bottom, I discuss the major concepts and syntax features of an SGML DTD in their logical order, illustrating them with excerpts from the HTML 4.0 DTD. This approach will let you understand any given part of the DTD without overloading the chapter.

Entities

Before we start investigating elements that form an HTML document and the tags that delimit these elements, let's discuss another SGML concept named entities. If tags can be likened to named styles in word processors, then entities are a direct analog of macros that may expand to text strings or markup instructions.

In HTML documents, entities are used to invoke characters that either are absent on a computer keyboard (such as é) or have special meaning and thus cannot be typed directly (such as <). In the DTD itself, as you'll see later, entities play a more important role, helping to make all sorts of declarations more concise and readable. The entities used in the DTD are called parameter entities, as opposed to general entities, which are intended for use in HTML documents and not in the DTD. These two types of entities are declared in slightly different ways, as shown in the next three sections.

Parameter Entities

The very first declaration in the HTML 4.0 DTD is an entity that expands into a formal reference (in this case, a URL) of the DTD:

<!ENTITY % HTML.Version
 "https://www.w3.org/pub/WWW/MarkUp/Cougar/Cougar.dtd"
   -- Typical usage:
      <!DOCTYPE HTML SYSTEM
      "https://www.w3.org/pub/WWW/MarkUp/Cougar/Cougar.dtd">
      <html>
      ...
      </html>
   --
  >

Let's consider, in this example, the syntax of an entity declaration. It uses the ENTITY statement that, like all other SGML statements, requires a ! after the start delimiter <. After the ENTITY keyword comes the % character indicating that the entity in question is a parameter entity rather than a general entity.

Separated from % by one or more spaces is the entity name that is later used to invoke the entity. Note that the name contains a period, thus making use of the NAMING section settings in the SGML declaration. Also recollect that entity names are different from element names in that they are case sensitive.

The last obligatory component of an entity declaration is the string enclosed in quotation marks (data string) that shows what this entity stands for and what it will expand to when invoked. Here's how the entity we have defined can be used later in the DTD:

%HTML.Version;

Note that this time, there is no space between the % and the entity name. The trailing semicolon may be omitted in certain contexts.
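The expansion of parameter entity references can be sketched in a few lines of Python. This is only an illustration of the substitution idea, not how a real SGML parser is organized: the `PARAMETER_ENTITIES` table, the `expand` function, and the shortened `font` and `text` entries are assumptions made for the example.

```python
import re

# Hypothetical table of parameter entities; names are case sensitive.
PARAMETER_ENTITIES = {
    "HTML.Version": "https://www.w3.org/pub/WWW/MarkUp/Cougar/Cougar.dtd",
    "font": "TT | I | B",          # shortened for the example
    "text": "#PCDATA | %font",     # one entity may refer to another
}

# "%" followed directly by a name; the trailing ";" is optional.
REFERENCE = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")


def expand(text, entities=PARAMETER_ENTITIES, depth=0):
    """Recursively replace %name; references with their data strings."""
    if depth > 20:
        raise ValueError("entity references nested too deeply (possible cycle)")

    def substitute(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)  # leave unknown references untouched
        return expand(entities[name], entities, depth + 1)

    return REFERENCE.sub(substitute, text)


print(expand("%HTML.Version;"))
print(expand("(%text)*"))
```

Because the pattern requires a name character right after `%`, a string such as `50%` is left alone in this sketch, while `%text` is expanded with or without the semicolon.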

Unfortunately, this part of SGML syntax clashes with one of Netscape's HTML extensions, namely the use of the % character to specify sizes of images and other elements as percentages of window dimensions. This is why HTML validators that check an HTML document against a DTD sometimes have trouble with this feature.

Public Identifiers

The last part of the %HTML.Version; entity declaration is a comment reminding us of the requirement (stated unambiguously in the HTML specification) to start any HTML document that is intended to be a valid SGML document with a DOCTYPE declaration. This tells an SGML parser at once that the structure and tags of the document it's about to process are described in the DTD identified (in our case) by its URL. Of course, HTML (not SGML) browsers could also use this information to select the level of HTML support needed for the document (although only a few of them really do).

The use of a URL as a DTD identifier is rather unusual (it is probably explained by the fact that, at the time of this writing, the HTML 4.0 DTD was still evolving). More often, to refer to external information sources, SGML documents use public identifiers of a special form. For example, in the HTML 3.2 DTD the %HTML.Version; entity expands into the string -//W3C//DTD HTML 3.2 Final//EN, which is the public identifier of this version's DTD. Another example is the identifier string of a character set standard used for the BASESET parameter in the SGML declaration (see "CHARSET Section," earlier in this chapter). Any DTD or related standard is assigned a unique public identifier so that other SGML documents can refer to it. Such references are usually made via parameter entities.

If the data string in an entity declaration is preceded by the additional keyword PUBLIC, this means that the string is not the entity value but a public identifier pointing to an external information source. For example, the HTML 4.0 DTD is accompanied by a set of general entities for accessing characters of ISO Latin-1, kept in a separate document with its own unique public identifier. Here's how this document is incorporated into the HTML DTD:

<!ENTITY % HTMLlat1 PUBLIC
   "-//W3C//ENTITIES Latin1//EN//HTML">

Here, the string in quotes contains the public identifier of the external resource whose contents will be substituted for each occurrence of the entity %HTMLlat1;. To make the mentioned document part of the DTD, it is now enough to invoke the defined entity (actually this is done right after its declaration):

%HTMLlat1;

Formal rules for constructing public identifiers need not be detailed here. There exists a fairly complete catalog of public identifiers.

General Entities

General entities are declared in the DTD similarly to parameter entities, but they have a number of differences:

A general entity cannot be used in the DTD but only in the documents conforming to this document type (in our case, the HTML documents).
A general entity does not have the % character in its declaration.
A general entity is invoked using the & character rather than the % character used for parameter entities---for instance,
```
&lt;
```
As with parameter entities, the trailing semicolon may sometimes be omitted (although I wouldn't recommend doing this).
A general entity usually contains the CDATA keyword, inserted in its declaration before the data string. This keyword indicates that the string should not be interpreted as SGML data; that is, any markup instructions it might contain should be ignored and treated as ordinary text characters.

For an example, consider the entity declarations provided in the DTD for accessing four special characters:

<!ENTITY amp  CDATA "&#38;" -- ampersand    -->
<!ENTITY gt   CDATA "&#62;" -- greater than -->
<!ENTITY lt   CDATA "&#60;" -- less than    -->
<!ENTITY quot CDATA "&#34;" -- double quote -->

This example also shows a special kind of entity called a character reference, which does not require any declaration. If the entity opening delimiter & is immediately followed by the # character and a number, this number is interpreted as a character code (from the document character set as defined in the SGML declaration; see "CHARSET Section") and the whole reference is replaced by the character having this code. This is one of the two methods of accessing characters that are beyond the reach of a computer keyboard; the other method uses the mnemonic character entities defined in the DTD, such as &amp; or &eacute;.

You might wonder how the entities in the preceding example could expand to special characters if the CDATA keyword prohibits any SGML instructions, character references included, from having effect in the data string. The answer is that this string is in fact read twice: the first time when the entity declaration is interpreted, and the second time when the entity is used in the document and its data string is substituted. The CDATA keyword affects only the first reading. As a result, the DTD is protected from the special characters, while in the document the references are expanded to the characters intended.
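To see what the four declarations above finally resolve to, here is a small Python sketch. The helper `decode_character_references` is a hypothetical name introduced for this example and handles decimal references only. The last line uses `html.unescape` from the Python standard library, which follows modern HTML rules rather than the SGML rules described in this chapter, so treat it as an approximation.

```python
import html
import re


def decode_character_references(text):
    """Replace decimal references such as &#60; with the character they name."""
    return re.sub(r"&#([0-9]+);?", lambda m: chr(int(m.group(1))), text)


# The four data strings from the declarations above.
special = {
    name: decode_character_references(data)
    for name, data in [("amp", "&#38;"), ("gt", "&#62;"),
                       ("lt", "&#60;"), ("quot", "&#34;")]
}
print(special)

# The standard library also knows the mnemonic names.
print(html.unescape("caf&eacute; &lt;P&gt;"))
```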

Elements

As I've already mentioned in "How to Define an SGML Application," a document marked up with an SGML application is thought of as consisting of a hierarchy of nested elements. A marked up element is usually enclosed in a pair of start and end tags. The ELEMENT statement in SGML defines both start and end tags (but not their attributes) and prescribes what may be the content of this element by defining its content model.

Here's an example of element declaration:

<!ELEMENT P - O (%text)*>

Here, P is the element name (short for Paragraph). The two characters following the element name are minimization indicators specifying whether it is possible to omit start and/or end tags for this element. The first indicator refers to the start tag, and the second, to the end tag.

In place of a minimization indicator, you can put either a hyphen (-), meaning that the tag is obligatory, or the letter O, meaning that the tag is omissible. Thus, the preceding statement declares that a P element (a paragraph) must be preceded by the <P> start tag, while the </P> end tag can be omitted.

It is possible to have both start and end tags omissible. For example, the declaration

<!ELEMENT HTML O O (%html.content)>

indicates that both <HTML> and </HTML> tags around the content of an HTML document can be dropped.
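As a rough illustration of how the parts of an ELEMENT declaration line up, the following Python sketch splits a simple declaration into its name, the two minimization indicators, and the content model. The regular expression and the `describe` function are assumptions made for this example; they do not strip comments or cover every form a declaration can take.

```python
import re

# Name, start-tag indicator, end-tag indicator, then the content model.
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>\S+)\s+(?P<start>[-O])\s+(?P<end>[-O])"
    r"\s+(?P<model>.*?)\s*>",
    re.DOTALL,
)


def describe(declaration):
    """Return the components of a simple ELEMENT declaration as a dict."""
    match = ELEMENT_DECL.match(declaration)
    if match is None:
        raise ValueError("not a simple ELEMENT declaration")
    return {
        "name": match.group("name"),
        "start_tag_omissible": match.group("start") == "O",
        "end_tag_omissible": match.group("end") == "O",
        "content_model": match.group("model"),
    }


print(describe("<!ELEMENT P - O (%text)*>"))
print(describe("<!ELEMENT HTML O O (%html.content)>"))
```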

Content Model Keywords

The last component in the element declarations above is the content model specification. (Here, it is done via parameter entities, and to see what they expand to we should find the corresponding ENTITY statements in the DTD.) A content model declares what can, what must, and what must not go inside the element.

The simplest type of content model is specified by a single keyword from the following list:

CDATA

Stands for Character DATA. This keyword means that the SGML parser suspends its processing for the content of the element. Whatever other tags or entities are contained in the element, they won't have any effect and will be treated as ordinary data characters. The only tag that the SGML parser reacts to when skipping over CDATA content is the end tag of the element that switched to CDATA mode.

The HTML DTD uses the CDATA content model for the obsolete elements XMP, LISTING, and PLAINTEXT, which were intended for inserting preformatted text into an HTML document without the need to escape any special characters. The CDATA mode is also used for the STYLE and SCRIPT elements, whose content is to be processed by external programs rather than by the SGML parser.

RCDATA

Stands for Replaceable Character DATA. This keyword introduces a content model that differs from CDATA only in that it expands all general entities and character references, while still ignoring markup statements. RCDATA is not used in the HTML DTD.

EMPTY

Means that the content of the element is empty. Naturally, this is always accompanied by the permission to omit the end tag. For example:

<!ELEMENT IMG - O EMPTY -- Embedded image -->

ANY

Allows any markup and data characters within the element. ANY is not used in HTML DTD.

Content Model Groups

Sometimes, however, it is necessary to be more specific in defining the content model of an element. This is done via content model groups, whose syntax deserves a more thorough examination.

The simplest model group is one element name enclosed in parentheses, which means that the element being defined must contain one occurrence of the element specified in content model and nothing else. This is a rather artificial situation, as more often a model group contains two or more element names---for example,

<!ELEMENT HTML O O  (HEAD, BODY)>

Here, the comma between HEAD and BODY is a connector used to indicate the relations between the elements listed. Possible connectors include the following:

A comma (,) indicates that the elements listed in the content model should both be present within the element exactly in the order specified.
A vertical bar (|) is the "exclusive or" connector. It indicates that one and only one of the elements can occur. However, it is often more practical to use the "simple or" relation allowing any one, or both, or even none of the elements to be present. This is why | is often combined with the occurrence indicator *, for example:
```
<!ELEMENT APPLET - - (PARAM | %text)*>
```
Here the content model specification says that within the APPLET element, any number of PARAM elements mixed with any number of text fragments (this is what the %text; entity effectively expands to) may occur.
An ampersand (&) is the "and" connector. It indicates that all of the elements listed must occur, but in any order. It is often combined with the ? occurrence indicator. Here's how the DTD defines the %head.content; parameter entity that is later used in content model specification for the HEAD element:

<!ENTITY % head.content "TITLE & ISINDEX? & BASE?">

Here's the list of occurrence indicators used to show how many times the elements can occur in a content model:

A question mark (?) means that the element may occur either once or not at all.
A plus sign (+) means that the element may occur one or more times---for example,
```
<!ELEMENT OL - -  (LI)+>
```
This means that an OL element may consist of an arbitrary number of LI elements, but at least one must be present in any case.
An asterisk (*) means that the element may occur any number of times or not at all.

Model groups can be nested, and the occurrence indicators may apply to an entire group rather than a single element:

<!ELEMENT DL - - (DT|DD)+>

This means that within a DL (Definition List) element, at least one (but possibly more) DT or DD elements must be present.
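One way to get a feel for connectors and occurrence indicators is to translate a model group into a regular expression over the sequence of child element names. The Python sketch below does this for the comma and vertical bar connectors and the ?, +, and * indicators. It is an illustration under simplifying assumptions, not a validator: the function names are hypothetical, parameter entities must already be expanded, and the & connector is deliberately rejected because it has no direct regular expression counterpart.

```python
import re

# A name (optionally starting with "#", as in #PCDATA) or a single delimiter.
TOKEN = re.compile(r"\s*(#?[A-Za-z][A-Za-z0-9.-]*|[(),|?+*])")


def model_to_regex(model):
    """Translate a simple model group into a compiled regular expression."""
    model = model.strip()
    parts = []
    pos = 0
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        token = match.group(1)
        pos = match.end()
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # sequence is plain concatenation in a regex
        elif token in {")", "|", "?", "+", "*"}:
            parts.append(token)
        else:
            # Each child is encoded as its name followed by one space.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def is_valid(model, children):
    """Check a list of child element names against a model group."""
    encoded = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(encoded) is not None


print(is_valid("(DT|DD)+", ["DT", "DD", "DD"]))
print(is_valid("(DT|DD)+", []))
print(is_valid("(HEAD, BODY)", ["BODY", "HEAD"]))
```

The first call accepts a list with one DT and two DD children, the second rejects an empty DL, and the third rejects HEAD and BODY in the wrong order.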

Besides element names, you can use the #PCDATA (Parsed Character DATA) keyword in model groups. It refers to "usual" characters of the document without any markup tags and can be used to explicitly allow or disallow plain text within an element.

It is different, however, from the CDATA keyword discussed earlier. First, #PCDATA can be used only within a model group and not on its own as CDATA can (that is, #PCDATA should be enclosed in parentheses even when it stands alone). Second, #PCDATA does not imply ignoring markup; if a tag is encountered in a context where only #PCDATA is allowed, a compliant SGML parser should report an error rather than ignore the tag.

Together with the connectors and occurrence indicators listed, #PCDATA can limit the set of elements allowed inside another element without prohibiting plain text from appearing there. For example, here's how the %text; entity is defined via a number of subordinate classifying entities:

<!ENTITY % font "TT | I | B  | U | S | BIG | SMALL | SUB | SUP">
<!ENTITY % phrase "EM | STRONG | DFN | CODE | SAMP | KBD | VAR | CITE">
<!ENTITY % special
   "A | IMG | APPLET | OBJECT | FONT | BASEFONT | BR | SCRIPT |
    MAP | Q | SPAN | INS | DEL | BDO | IFRAME">
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
<!ENTITY % text "#PCDATA | %font | %phrase | %special | %formctrl">

Thus the %text; entity stands for, in plain English, "either a chunk of text or one of all these listed elements." Obviously, it'll most often be used with the * occurrence indicator. For an example, see how the preceding declarations are used once more to define quite a number of elements in one snap:

<!ELEMENT (%font|%phrase) - - (%text)*>

As you see here, both parameter entities and groups can be used for specifying element names in declarations, not only in their content models.

SGML syntax also allows notation of the addition or subtraction of model groups, which is very convenient if these groups are specified via entity references. For instance, the FORM element is allowed to contain anything that can occur within a block-level element (that is, an element that starts a new paragraph) except for the FORM element itself (that is, FORMs cannot be nested). Rather than define the new content group from scratch, we can make use of the already defined %block.content; entity by subtracting the single FORM element from it:

<!ELEMENT FORM - - %block.content
                        -(FORM)>

Analogously, we can sum up two model groups:

<!ELEMENT HEAD O O (%head.content)
                        +(%head.misc)>

Attributes

An element is not fully described by its name and content model. Many elements have associated attributes that serve to provide additional information for rendering the element. Attributes for each element should be declared in the DTD via ATTLIST statements.

Here's a typical attribute declaration for an element:

<!ATTLIST AREA
  shape       %SHAPE    rect      -- controls interpretation of coords --
  coords      %COORDS   #IMPLIED  -- comma separated list of values --
  href        %URL      #IMPLIED  -- this region acts as hypertext link --
  target      CDATA     #IMPLIED  -- where to render linked resource --
  nohref      (nohref)  #IMPLIED  -- this region has no action --
  alt         CDATA     #REQUIRED -- description for text only browsers --
  tabindex    NUMBER    #IMPLIED  -- position in tabbing order --
  onClick     %script   #IMPLIED  -- intrinsic event --
  onMouseOver %script   #IMPLIED  -- intrinsic event --
  onMouseOut  %script   #IMPLIED  -- intrinsic event --
  >

Right after the ATTLIST keyword, the name of the element for which we're defining attributes is specified. Next comes a number of three-component groups, each defining one attribute. The first identifier in each group is the attribute name. The other two specify the type of value for the attribute and its default value, as detailed in the next sections.

Type of Attribute Value

After the name of each attribute in the ATTLIST declaration comes a keyword describing its type. This keyword is usually taken from the following list:

CDATA: Here again, CDATA means that the value of this attribute may be any string of characters (as well as an empty string) and should be ignored by the parser. CDATA is used in situations where it is impossible to force more strict limitations on the attribute value with one of the following keywords.
NAME: This keyword indicates that the value of the attribute is a name conforming to SGML naming rules as defined by the SGML declaration. (See "Naming Rules Declaration," earlier in this chapter.) The following fragment of an ATTLIST declaration is an example:

<!ATTLIST META
...
  http-equiv NAME #IMPLIED  -- HTTP response header name --
  name       NAME #IMPLIED  -- metainformation name --
...
  >

NMTOKEN: This keyword is similar to NAME with the exception that there's no requirement to start the name with the name start character. (See "Naming Rules Declaration," earlier in this chapter.) This keyword is not used in HTML 4.0 DTD.
NUMBER: This keyword allows the parameter to take numeric values. The following ATTLIST fragment is an example:

<!ATTLIST OL -- ordered lists --
...
  compact (compact) #IMPLIED  -- reduced interitem spacing --
  start   NUMBER    #IMPLIED  -- starting sequence number --
...
  >

ID: This keyword indicates that the attribute value is an identifier satisfying two requirements: first, it is a valid SGML name (as in the case of NAME), and second, it is unique across the document (that is, the same value cannot be assigned to any other ID attribute within the same document). This value type is specified for the ID attribute of the style sheets mechanism applicable to the majority of HTML elements.

Besides these keywords, you can specify the list of possible values directly using the group notation that you've already seen applied for model groups in this chapter. Thus, in the preceding ATTLIST declaration for the OL element, the COMPACT attribute may only take as value the character string "compact" or have no value at all, as in the example

<OL START=1 COMPACT>

which is equivalent to

<OL START=1 COMPACT=COMPACT>
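The equivalence of the two forms can be mimicked with a short Python sketch. It assumes a hypothetical `ENUMERATED` table holding the value group of the COMPACT attribute and a hypothetical `expand_minimized` function; a real parser would derive this information from the ATTLIST declarations, and the lower-casing of enumerated values here is a simplification.

```python
# Value groups taken from the ATTLIST fragment for OL shown above.
ENUMERATED = {
    "OL": {"compact": ["compact"]},
}


def expand_minimized(element, tokens):
    """Turn tokens such as ["START=1", "COMPACT"] into a name -> value dict."""
    groups = ENUMERATED.get(element.upper(), {})
    attributes = {}
    for token in tokens:
        if "=" in token:
            name, value = token.split("=", 1)
            name = name.lower()
            # Enumerated values are compared without regard to case here;
            # all other values are kept exactly as typed.
            attributes[name] = value.lower() if name in groups else value
            continue
        # A bare token must be a listed value of some attribute.
        owners = [name for name, allowed in groups.items()
                  if token.lower() in allowed]
        if not owners:
            raise ValueError(f"{token!r} is not an enumerated value for {element}")
        attributes[owners[0]] = token.lower()
    return attributes


short_form = expand_minimized("OL", ["START=1", "COMPACT"])
long_form = expand_minimized("OL", ["START=1", "COMPACT=COMPACT"])
print(short_form, short_form == long_form)
```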

Here's an example from the DTD with an attribute taking one of three possible values:

<!ATTLIST table
...
     align  (left|center|right)  #IMPLIED
...
>

Default Value Specification

Finally, for each attribute in an ATTLIST declaration, either a default value is provided or a keyword is specified indicating whether this attribute is changeable and/or required. In this position, character strings need not be enclosed in parentheses (although they should be put in quotes if they contain spaces or delimiters), but the keywords require using a # escape character as in the #PCDATA keyword mentioned earlier.
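The effect of this third component can be sketched in Python using three attributes from the AREA declaration shown earlier. The `REQUIRED` and `IMPLIED` sentinels, the `AREA_DEFAULTS` table, and the `resolve` function are assumptions introduced for the example; #FIXED is left out to keep the sketch short.

```python
REQUIRED = object()  # stands in for the #REQUIRED keyword
IMPLIED = object()   # stands in for the #IMPLIED keyword

# Third component of three attributes from the AREA declaration.
AREA_DEFAULTS = {
    "shape": "rect",     # literal default value
    "coords": IMPLIED,   # optional, no default
    "alt": REQUIRED,     # must be supplied by the author
}


def resolve(given, defaults=AREA_DEFAULTS):
    """Combine the attributes written in a tag with the declared defaults."""
    resolved = {}
    for name, default in defaults.items():
        if name in given:
            resolved[name] = given[name]
        elif default is REQUIRED:
            raise ValueError(f"the {name} attribute is required")
        elif default is IMPLIED:
            continue  # absent and optional: nothing to fill in
        else:
            resolved[name] = default
    return resolved


print(resolve({"alt": "Home"}))
print(resolve({"alt": "Home", "shape": "circle"}))
```

In the first call the missing shape attribute is filled in from the declared default, while calling `resolve({})` would raise an error because alt is required.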

Here's a part of ATTLIST for TH and TD elements showing default values for ROWSPAN and COLSPAN attributes:

<!ATTLIST (th|td)        -- header or data cell --
      ...
      rowspan NUMBER  1  -- number of rows spanned by cell --
      colspan NUMBER  1  -- number of cols spanned by cell --
      ...
>

More often, however, you'll see in place of the default value a keyword from the following list:

#FIXED

This keyword must precede the actual default value and is used to specify that the value cannot be changed by the user. It is used by the DTD only once, in the declaration for the VERSION attribute of the HTML element:

<!ATTLIST HTML
      VERSION CDATA #FIXED "%HTML.Version;"
...
>

This means that the only possible value of the VERSION attribute is the string substituted for the %HTML.Version; parameter entity. (See "Parameter Entities," earlier in this chapter).

#IMPLIED

This keyword indicates that the attribute is optional.

#REQUIRED

This keyword indicates that the attribute is obligatory. For example:

<!ATTLIST PARAM
        name    CDATA    #REQUIRED  -- property name --
        value   CDATA    #IMPLIED   -- property value --
...
>

Deprecated Features

Sometimes, a part of the DTD must be processed differently from the rest of it. For this, SGML offers the generic mechanism of marked sections, which make it possible to isolate any markup statements and declarations in order to control their processing. The HTML DTD uses this mechanism to mark its deprecated features, which should be avoided in documents but are kept in the DTD for backward compatibility. Here's what a marked section looks like:

<![ %HTML.Deprecated [
 <!ENTITY % preformatted "PRE | XMP | LISTING">
]]>

The %HTML.Deprecated; entity expands into a special keyword that tells the parser what to do with the contents of the section. The two keywords used in various HTML DTDs are IGNORE and INCLUDE. The IGNORE keyword tells the parser to skip the marked section completely, and the INCLUDE keyword tells it to process the section's contents on equal terms with the rest of the DTD. So, to get a "strict" version of a DTD, all you need to do is to change the declaration