HTML Unleashed. SGML and the HTML DTD: Document Type Definition for HTML 4.0
Document Type Definition for HTML 4.0
Now that we've examined the SGML declaration and found answers to a number of general questions about HTML formation, it's time to get to the details of its tags, entities, and the related document structure. All of this is defined in the document type definition (DTD) for HTML 4.0. The HTML DTD analyzed in this chapter is too long to be listed in its entirety. Instead of going through the DTD from top to bottom, I discuss the major concepts and syntax features of an SGML DTD in their logical order, illustrating them with excerpts from the HTML 4.0 DTD. This approach will let you understand any given part of the DTD without overloading the chapter.
Entities
Before we start investigating the elements that form an HTML document and the tags that delimit these elements, let's discuss another SGML concept: entities. If tags can be likened to named styles in word processors, then entities are a direct analog of macros that may expand to text strings or markup instructions. In HTML documents, entities are used to invoke characters that either are absent from a computer keyboard (such as é) or have special meaning and thus cannot be typed directly (such as <). In the DTD itself, as you'll see later, entities play a more important role, helping to make all sorts of declarations more concise and readable. The entities used in the DTD are called parameter entities, as opposed to general entities, which are intended for use in HTML documents and not in the DTD. These two types of entities are declared in slightly different ways, as shown in the next three sections.
Parameter Entities
The very first declaration in the HTML 4.0 DTD is an entity that expands into a formal reference (in this case, a URL) of the DTD:
<!ENTITY % HTML.Version
        "https://www.w3.org/pub/WWW/MarkUp/Cougar/Cougar.dtd"
        -- Typical usage:
            <!DOCTYPE HTML SYSTEM "https://www.w3.org/pub/WWW/MarkUp/Cougar/Cougar.dtd">
            <html>
            ...
            </html>
        --
        >
Let's consider, in this example, the syntax of an entity declaration.
It uses the ENTITY statement that, like all other SGML
statements, requires a ! after the start delimiter
<. After the ENTITY keyword comes the
% character indicating that the entity in question is a
parameter entity rather than a general entity.
Separated from % by one or more spaces is the entity name that is later used to invoke the entity. Note that the name contains a period, thus making use of the NAMING section settings in the SGML declaration. Also recall that entity names differ from element names in that they are case sensitive. The last obligatory component of an entity declaration is the string enclosed in quotation marks (the data string), which shows what this entity stands for and what it will expand to when invoked. Here's how the entity we have defined can be used later in the DTD:
%HTML.Version;
Note that this time, there is no space between the % and the entity name. The trailing semicolon may be omitted in certain contexts. Unfortunately, this part of SGML syntax clashes with one of Netscape's HTML extensions, namely the use of the % character to specify the sizes of images and other elements as percentages of window dimensions. This is why HTML validators that check an HTML document against a DTD sometimes have trouble with this feature.
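To make the expansion mechanism concrete, here is a minimal Python sketch of how a parser might substitute parameter entity references. It is an illustration under simplifying assumptions, not the algorithm of any real SGML parser: the function name, the name pattern, and the shortened entity table (modeled on declarations that appear later in the DTD) are all hypothetical.

```python
import re

# Matches references such as %text; or %text. The name pattern and the
# always-optional semicolon are simplifications of the real SGML rules.
PARAM_REF = re.compile(r"%([A-Za-z][A-Za-z0-9.-]*);?")


def expand_parameter_entities(text, entities, max_passes=20):
    """Replace parameter entity references until no more can be expanded."""
    for _ in range(max_passes):
        # Unknown names are left untouched instead of raising an error.
        expanded = PARAM_REF.sub(
            lambda m: entities.get(m.group(1), m.group(0)), text
        )
        if expanded == text:
            return text
        text = expanded
    raise ValueError("too many expansion passes (circular reference?)")


# Hypothetical entity table, shortened for the example.
entities = {
    "font": "TT | I | B",
    "phrase": "EM | STRONG",
    "text": "#PCDATA | %font | %phrase",
}

expanded_model = expand_parameter_entities("(%text)*", entities)
print(expanded_model)
```

With this table, the content model (%text)* expands to (#PCDATA | TT | I | B | EM | STRONG)*. Note that an entity's data string may itself contain references, which is why the sketch repeats the substitution until the text stops changing.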
Public Identifiers
The last part of the %HTML.Version; entity declaration is a comment reminding us of the requirement (unambiguously stated in the HTML specification) to start any HTML document that is intended to be a valid SGML document with a DOCTYPE declaration. This lets an SGML parser know at once that the structure and tags of the document it's about to process are described in the DTD identified (in our case) by its URL. Of course, HTML (not SGML) browsers could also use this information to select the level of HTML support needed for the document (although only a few of them really do).
The use of a URL as a DTD identifier is rather unusual (it is probably explained by the fact that, at the time of this writing, the HTML 4.0 DTD was still evolving). More often, to refer to external information sources, SGML documents use public identifiers of a special form. For example, in the HTML 3.2 DTD the %HTML.Version; entity expands into the string
-//W3C//DTD HTML 3.2 Final//EN
which is the public identifier of this version's DTD. Another example is the identifier string of a character set standard used for the BASESET parameter in the SGML declaration (see "CHARSET Section," earlier in this chapter).
Every DTD or related standard is assigned a unique public identifier so that other SGML documents can refer to it. Such references are usually made via parameter entities. If the data string in an entity declaration is preceded by the additional keyword PUBLIC, the string is not the entity value but a public identifier pointing to an external information source. For example, the HTML 4.0 DTD is accompanied by a set of general entities for accessing the characters of ISO Latin-1, kept in a separate document with its own public identifier. Here's how this document is incorporated into the HTML DTD:
<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin1//EN//HTML">
Here, the string in quotes contains the public identifier of the external resource whose contents will be substituted for each occurrence of the entity %HTMLlat1;. To make that document part of the DTD, it is now enough to invoke the defined entity (this is actually done right after its declaration):
%HTMLlat1;
The formal rules for constructing public identifiers need not be detailed here. There exists a fairly complete catalog of public identifiers.
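General Entities
General entities are declared in the DTD similarly to parameter entities, but they have a number of differences:
    The declaration has no % character between the ENTITY keyword and the entity name.
    The entity is invoked with the & delimiter rather than %, as in &name;.
    The data string may be preceded by a keyword, such as CDATA, that controls how the string is interpreted.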
For an example, consider the entity declarations provided in the DTD for accessing four special characters:
<!ENTITY amp CDATA "&#38;" -- ampersand -->
<!ENTITY gt CDATA "&#62;" -- greater than -->
<!ENTITY lt CDATA "&#60;" -- less than -->
<!ENTITY quot CDATA "&#34;" -- double quote -->
This example also shows us one special kind of entity, called a character reference, that does not require any declaration. If the entity opening delimiter & is immediately followed by the # character and a number, this number is interpreted as a character code (from the document character set as defined in the SGML declaration; see "CHARSET Section") and the whole reference is replaced by the character having this code. This is one of the two methods to access characters that are beyond the reach of a computer keyboard; the other method uses the mnemonic character entities defined in the DTD, such as &amp; or &eacute;.
You might wonder how the entities in the preceding example could expand to special characters if the CDATA keyword prohibits any SGML instructions, character references included, from having effect in the data string. The answer is that this string is in fact read twice: the first time when the entity declaration is interpreted, and the second time when the entity is used in the document and its data string is substituted. The CDATA keyword affects only the first reading. As a result, the DTD is protected from the special characters, while in the document the references are expanded to the characters intended.
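The effect of both kinds of reference in a document can be tried out with Python's standard html module. This is only an approximation: html.unescape follows HTML5 rules rather than an SGML DTD, and the sample string is made up for the example, but for the references discussed here the results coincide.

```python
import html

# A made-up fragment mixing mnemonic entities (&eacute;, &amp;, &lt;, &quot;)
# with numeric character references (&#232;, &#38;).
source = "caf&eacute; &amp; cr&#232;me: 2 &lt; 3, &quot;ok&quot;, &#38;"

decoded = html.unescape(source)
print(decoded)
```

Both &amp; and &#38; produce the same ampersand, which illustrates that the mnemonic entity is just a more readable name for the numeric character reference.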
Elements
As I've already mentioned in "How to Define an SGML Application," a document marked up with an SGML application is thought of as a hierarchy of nested elements. A marked-up element is usually enclosed in a pair of start and end tags. The ELEMENT statement in SGML defines both the start and end tags (but not their attributes) and prescribes what the content of the element may be by defining its content model. Here's an example of an element declaration:
<!ELEMENT P - O (%text)*>
Here, P is the element name (short for Paragraph). The two characters following the element name are minimization indicators specifying whether the start and/or end tags of this element may be omitted. The first indicator refers to the start tag, and the second to the end tag. In place of a minimization indicator, you can put either a hyphen (-), meaning that the tag is obligatory, or the letter O, meaning that the tag is omissible. Thus, the preceding statement declares that a P element (a paragraph) must begin with the <P> start tag, while the </P> end tag can be omitted.
It is possible to have both the start and end tags omissible. For example, the declaration
<!ELEMENT HTML O O (%html.content)>
indicates that both the <HTML> and </HTML> tags around the content of an HTML document can be dropped.
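The parts of an element declaration can be picked apart mechanically. The following Python sketch does so with a regular expression; it is a deliberately simplified illustration (the function name and pattern are hypothetical, and comments or name groups in the declaration are not handled), not a real DTD parser.

```python
import re

# Simplified pattern: one element name, two minimization indicators, then
# the content model up to the closing ">".
ELEMENT_DECL = re.compile(
    r"<!ELEMENT\s+(?P<name>[A-Za-z][A-Za-z0-9]*)\s+"
    r"(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+(?P<model>[^>]+?)\s*>"
)


def parse_element_declaration(declaration):
    """Split an ELEMENT declaration into its name, indicators, and model."""
    match = ELEMENT_DECL.fullmatch(declaration.strip())
    if match is None:
        raise ValueError("not a simple ELEMENT declaration: " + declaration)
    return {
        "name": match["name"],
        # The letter O marks a tag as omissible; a hyphen makes it obligatory.
        "start_tag_omissible": match["start"].upper() == "O",
        "end_tag_omissible": match["end"].upper() == "O",
        "content_model": match["model"],
    }


paragraph = parse_element_declaration("<!ELEMENT P - O (%text)*>")
document = parse_element_declaration("<!ELEMENT HTML O O (%html.content)>")
print(paragraph)
print(document)
```

For the P declaration the sketch reports an obligatory start tag and an omissible end tag; for HTML both tags come out as omissible.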
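Content Model Keywords
The last component in the element declarations above is the content model specification. (Here, it is done via parameter entities, and to see what they expand to we should find the corresponding ENTITY statements in the DTD.) A content model declares what can, what must, and what must not go inside the element. The simplest type of content model is specified by a single keyword from the following list:
    EMPTY: the element has no content and therefore no end tag, for example:
<!ELEMENT IMG - O EMPTY -- Embedded image -->
    CDATA: the content is character data in which markup is not recognized.
    RCDATA: like CDATA, except that entity and character references are still expanded.
    ANY: the content may include character data and any elements defined in the DTD.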
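Content Model Groups
Sometimes, however, it is necessary to be more specific in defining the content model of an element. This is done via content model groups, whose syntax deserves a more thorough examination. The simplest model group is one element name enclosed in parentheses, which means that the element being defined must contain one occurrence of the element specified in the content model and nothing else. This is a rather artificial situation, as more often a model group contains two or more element names, for example:
<!ELEMENT HTML O O (HEAD, BODY)>
Here, the comma between HEAD and BODY is a connector used to indicate the relations between the elements listed. Possible connectors include the following:
    , (comma): all of the connected elements must occur, in the order listed.
    | (vertical bar): one and only one of the connected elements must occur.
    & (ampersand): all of the connected elements must occur, but in any order, for example:
<!ENTITY % head.content "TITLE & ISINDEX? & BASE?">
Here's the list of occurrence indicators used to show how many times the elements can occur in a content model:
    ? (question mark): the element is optional (zero or one occurrence).
    * (asterisk): the element may occur any number of times, including zero.
    + (plus sign): the element must occur at least once.
An element with no occurrence indicator must occur exactly once.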
Model groups can be nested, and the occurrence indicators may apply to an entire group rather than a single element:
<!ELEMENT DL - - (DT|DD)+>
This means that within a DL (Definition List) element, at least one (but possibly more) DT or DD elements must be present.
Besides element names, you can use the #PCDATA (Parsed Character DATA) keyword in model groups. It refers to "usual" characters of the document without any markup tags and can be used to explicitly allow or disallow plain text within an element. It is different, however, from the CDATA keyword discussed earlier. First, #PCDATA can be used only within a model group and not on its own like CDATA (that is, #PCDATA should be enclosed in parentheses even when it stands alone). And second, #PCDATA does not imply ignoring markup; if a tag is encountered in a context where only #PCDATA is allowed, a compliant SGML parser should report an error rather than ignore this tag.
Together with the connectors and occurrence indicators listed, #PCDATA can limit the set of elements allowed inside another element without prohibiting plain text from appearing there. For example, here's how the %text; entity is defined via a number of subordinate classifying entities:
<!ENTITY % font "TT | I | B | U | S | BIG | SMALL | SUB | SUP">
<!ENTITY % phrase "EM | STRONG | DFN | CODE | SAMP | KBD | VAR | CITE">
<!ENTITY % special "A | IMG | APPLET | OBJECT | FONT | BASEFONT | BR | SCRIPT | MAP | Q | SPAN | INS | DEL | BDO | IFRAME">
<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">
<!ENTITY % text "#PCDATA | %font | %phrase | %special | %formctrl">
Thus the %text; entity stands for, in plain English, "either a chunk of text or any one of these listed elements." Obviously, it'll most often be used with the * occurrence indicator. For an example, see how the preceding declarations are used once more to define quite a number of elements in one go:
<!ELEMENT (%font|%phrase) - - (%text)*>
As you see here, both parameter entities and groups can be used for specifying element names in declarations, not only in their content models.
SGML syntax also provides a notation for adding to or subtracting from model groups, which is very convenient if these groups are specified via entity references. For instance, the FORM element is allowed to contain anything that can occur within a block-level element (that is, an element that starts a new paragraph) except for the FORM element itself (that is, FORMs cannot be nested). Rather than define the new content group from scratch, we can make use of the already defined %block.content; entity by subtracting the single FORM element from it:
<!ELEMENT FORM - - %block.content -(FORM)>
Analogously, we can sum up two model groups:
<!ELEMENT HEAD O O (%head.content) +(%head.misc)>
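One way to get a feel for model groups is to notice that the comma and vertical bar connectors, together with the occurrence indicators, behave much like regular expression operators. The Python sketch below exploits that similarity to check a sequence of child elements against a model group. It is an informal illustration only: the function names are hypothetical, the & connector and the addition and subtraction notation are not handled, and entity references are assumed to be already expanded.

```python
import re

# An element name, or the #PCDATA keyword.
NAME = re.compile(r"#?[A-Za-z][A-Za-z0-9]*")


def model_to_regex(model):
    """Translate a model group using , | ? * + into a Python regex."""
    if "&" in model:
        raise ValueError("the & connector is not handled by this sketch")
    compact = "".join(model.split())
    # Each name becomes a group that matches that name followed by a space.
    pattern = NAME.sub(
        lambda m: "(?:%s )" % re.escape(m.group(0).upper()), compact
    )
    # A comma means "in sequence", which in a regex is plain concatenation.
    return pattern.replace(",", "")


def children_match(model, children):
    """Check a sequence of child element names against a model group."""
    sequence = "".join(name.upper() + " " for name in children)
    return re.fullmatch(model_to_regex(model), sequence) is not None


print(children_match("(HEAD, BODY)", ["HEAD", "BODY"]))
print(children_match("(HEAD, BODY)", ["BODY", "HEAD"]))
print(children_match("(DT|DD)+", ["DT", "DD", "DD"]))
print(children_match("(DT|DD)+", []))
print(children_match("(#PCDATA | EM | STRONG)*", ["#PCDATA", "EM"]))
```

The first, third, and fifth checks succeed; the second fails because the comma connector fixes the order, and the fourth fails because + demands at least one DT or DD.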
Attributes
An element is not fully described by its name and content model. Many elements have associated attributes that serve to provide additional information for rendering the element. Attributes for each element should be declared in the DTD via ATTLIST statements. Here's a typical attribute declaration for an element:
<!ATTLIST AREA
        shape       %SHAPE    rect      -- controls interpretation of coords --
        coords      %COORDS   #IMPLIED  -- comma separated list of values --
        href        %URL      #IMPLIED  -- this region acts as hypertext link --
        target      CDATA     #IMPLIED  -- where to render linked resource --
        nohref      (nohref)  #IMPLIED  -- this region has no action --
        alt         CDATA     #REQUIRED -- description for text only browsers --
        tabindex    NUMBER    #IMPLIED  -- position in tabbing order --
        onClick     %script   #IMPLIED  -- intrinsic event --
        onMouseOver %script   #IMPLIED  -- intrinsic event --
        onMouseOut  %script   #IMPLIED  -- intrinsic event --
        >
Right after the ATTLIST keyword, the name of the element for which we're defining attributes is specified. Next come a number of three-component groups, each defining one attribute. The first identifier in each group is the attribute name. The other two specify the type of value for the attribute and its default value, as detailed in the next sections.
Type of Attribute Value
After the name of each attribute in the ATTLIST declaration comes a keyword describing its type. This keyword is usually taken from the following list:
    CDATA: the value is an arbitrary character string.
    NAME: the value is a name, that is, a string that starts with a letter and contains only the characters allowed in names, for example:
<!ATTLIST META
        ...
        http-equiv NAME #IMPLIED -- HTTP response header name --
        name       NAME #IMPLIED -- metainformation name --
        ...
        >
    NUMBER: the value consists of digits only, for example:
<!ATTLIST OL -- ordered lists --
        ...
        compact (compact) #IMPLIED -- reduced interitem spacing --
        start   NUMBER    #IMPLIED -- starting sequence number --
        ...
        >
Besides these keywords, you can specify the list of possible values directly using the group notation that you've already seen applied for model groups in this chapter. Thus, in the preceding ATTLIST declaration for the OL element, the COMPACT attribute may only take as value the character string "compact" or have no value at all, as in the example
<OL START=1 COMPACT>
which is equivalent to
<OL START=1 COMPACT=COMPACT>
Here's an example from the DTD with an attribute taking one of three possible values:
<!ATTLIST table
        ...
        align (left|center|right) #IMPLIED
        ...
        >
Default Value Specification
Finally, for each attribute in an ATTLIST declaration, either a default value is provided or a keyword is specified indicating whether this attribute is changeable and/or required. In this position, character strings need not be enclosed in parentheses (although they should be put in quotes if they contain spaces or delimiters), but the keywords require using a # escape character as in the #PCDATA keyword mentioned earlier. Here's a part of ATTLIST for TH and TD elements showing default values for ROWSPAN and COLSPAN attributes:
<!ATTLIST (th|td) -- header or data cell --
        ...
        rowspan NUMBER 1 -- number of rows spanned by cell --
        colspan NUMBER 1 -- number of cols spanned by cell --
        ...
        >
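More often, however, you'll see in place of the default value a keyword from the following list:
    #REQUIRED: the attribute must be specified explicitly in every start tag of the element.
    #IMPLIED: the attribute may be omitted; no default value is supplied, and the application decides what to do in its absence.
    #FIXED: the value given after the keyword is the only one allowed, so the attribute cannot be changed in a document.
For example:
<!ATTLIST PARAM
        name  CDATA #REQUIRED -- property name --
        value CDATA #IMPLIED  -- property value --
        ...
        >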
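To see how these default value specifications might be applied, consider the following Python sketch. It is a simplified model rather than the behavior of any particular parser: the ATTLISTS table is a hypothetical in-memory form of two declarations quoted in this chapter, and it assumes the standard SGML meaning of the keywords (#REQUIRED attributes must be written out, #IMPLIED attributes may simply be absent).

```python
REQUIRED, IMPLIED = "#REQUIRED", "#IMPLIED"

# Hypothetical in-memory form of two ATTLIST declarations from this chapter:
# attribute name -> default value or keyword.
ATTLISTS = {
    "TD": {"rowspan": "1", "colspan": "1"},
    "PARAM": {"name": REQUIRED, "value": IMPLIED},
}


def resolve_attributes(element, given):
    """Apply ATTLIST defaults to the attributes written on a start tag."""
    resolved = {}
    for name, default in ATTLISTS[element].items():
        if name in given:
            # An explicitly written value always wins.
            resolved[name] = given[name]
        elif default == REQUIRED:
            raise ValueError("%s requires the %s attribute" % (element, name))
        elif default != IMPLIED:
            # A literal default from the DTD fills in the missing value.
            resolved[name] = default
    return resolved


cell = resolve_attributes("TD", {"colspan": "2"})
param = resolve_attributes("PARAM", {"name": "speed"})
print(cell)
print(param)
```

The TD cell picks up rowspan from the DTD default while keeping the colspan written in the tag; the PARAM element is accepted without a value attribute, but calling resolve_attributes("PARAM", {}) would raise an error because name is #REQUIRED.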
Deprecated Features
Sometimes, a part of the DTD must be processed in a different way than the rest of it. For this, SGML offers the generic mechanism of marked sections, which make it possible to isolate any markup statements and declarations in order to control their processing. The HTML DTD uses this mechanism to mark its deprecated features, which should be avoided in documents but are kept in the DTD for backward compatibility. Here's what a marked section looks like:
<![ %HTML.Deprecated [
<!ENTITY % preformatted "PRE | XMP | LISTING">
]]>
The %HTML.Deprecated; entity expands into the special keyword that tells the parser what to do with the contents of the section. The two keywords used in various HTML DTDs are IGNORE and INCLUDE. The IGNORE keyword tells the parser to skip the marked section completely, and the INCLUDE keyword tells it to process the section's contents on equal terms with the rest of the DTD. So, to get a "strict" version of a DTD, all you need to do is to change the declaration
<!ENTITY % HTML.Deprecated "INCLUDE">
to
<!ENTITY % HTML.Deprecated "IGNORE">
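The switch between the "loose" and "strict" readings of the DTD can be imitated in a few lines of Python. The sketch below is an illustration under stated assumptions, not a real SGML implementation: the function name is hypothetical, marked sections are assumed not to be nested, and the status keyword is assumed to be written either literally or as a single parameter entity reference.

```python
import re

# Simplified: <![ keyword-or-%entity; [ ...contents... ]]>, no nesting.
MARKED_SECTION = re.compile(r"<!\[\s*(%?[\w.]+;?)\s*\[(.*?)\]\]>", re.DOTALL)


def apply_marked_sections(dtd_text, entities):
    """Keep the contents of INCLUDE sections and drop IGNORE sections."""
    def replace(match):
        keyword = match.group(1)
        if keyword.startswith("%"):
            # Look up the parameter entity that holds the status keyword.
            keyword = entities[keyword.strip("%;")]
        if keyword == "INCLUDE":
            return match.group(2)
        if keyword == "IGNORE":
            return ""
        raise ValueError("unsupported marked section keyword: " + keyword)

    return MARKED_SECTION.sub(replace, dtd_text)


dtd = '<![ %HTML.Deprecated [ <!ENTITY % preformatted "PRE | XMP | LISTING"> ]]>'

loose = apply_marked_sections(dtd, {"HTML.Deprecated": "INCLUDE"})
strict = apply_marked_sections(dtd, {"HTML.Deprecated": "IGNORE"})
print(repr(loose))
print(repr(strict))
```

With INCLUDE the declaration of %preformatted; survives; with IGNORE the whole section disappears, which is exactly the one-line change that produces the "strict" DTD.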
Revised: Jun. 16, 1997
URL: https://www.webreference.com/dlab/books/html/3-5.html