The SGML reference concrete syntax has two great advantages over most other ways of making concrete a view of the abstract structure of a markup language: everything is delimited (bracketed) explicitly, and very few special characters are needed. As we have already seen, entity references are delimited explicitly by the ampersand character and the semicolon. [See note 7] In the same way, element occurrences within an SGML document are explicitly delimited in the reference concrete syntax by named tags. There are two kinds of tag: start-tags, which indicate the beginning of an element, and end-tags, which indicate its end. The tags themselves are delimited by special characters: ``<'' to mark the beginning of a start-tag, and ``</'' to mark the beginning of an end-tag. In either case, the character ``>'' is used to indicate the end of a tag. Between these delimiters is given a name identifying the type of element delimited by the start- and end-tag pair. For example, an embedded name element in a text might be tagged as follows:
Call me <name>Ishmael</name>.
This is by no means the only way of indicating the presence of an SGML element within a text; it is however the most explicit, and hence that into which other representations are most generally mapped.
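The explicit delimiting described above can be illustrated with a minimal Python sketch that splits a tagged string into start-tags, end-tags and running text. The names `TAG` and `tokenize` are invented for this illustration, the pattern ignores attributes and entity references, and it should not be mistaken for a conforming SGML parser.

```python
import re

# Matches a start-tag such as <name> or an end-tag such as </name>.
# Attributes and entity references are deliberately ignored in this sketch.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")


def tokenize(text):
    """Split text into ('start', name), ('end', name) and ('text', data) tokens."""
    tokens = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            # Running text between the previous tag and this one.
            tokens.append(("text", text[pos:match.start()]))
        kind = "end" if match.group(1) else "start"
        tokens.append((kind, match.group(2)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


tokens = tokenize("Call me <name>Ishmael</name>.")
```

Because every element is bracketed by a named start-tag and end-tag, the token list is enough to recover where the <name> element begins and ends without any knowledge of the text itself.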
The content of a document element of a particular type (that is, the portion between the start and end tags) may consist simply of running text, perhaps including entity references. More usually, it will contain other embedded document elements; occasionally it may have no content at all. The ability of SGML to specify rules about how elements can nest within other elements is one of its chief strengths and is discussed further below. Here we simply note that elements of one type typically contain elements of another: for example, a parish register consists of a mixture of birth, marriage and death records, each of which contains elements such as names, dates and details of an event. We might thus expect to find such records encoded in SGML with different tags for <birth>, <marriage> and <death> elements, within each of which might be found <name> and <date> elements. In exactly the same way, a document such as this one might be encoded as a series of <paper> elements, each of which begins with a <title>, followed optionally by an <abstract>, and at least one (and probably several) <section>s, each composed of <paragraph>s.
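As a rough illustration of what a nesting rule amounts to, the following Python sketch checks the children of a <paper> or a <section> against the structure just described. The table `CONTENT_MODELS` and the function `children_allowed` are hypothetical; the requirement that a section holds at least one paragraph is an assumption made here; and a real SGML system would read such rules from a document type definition rather than from regular expressions.

```python
import re

# Hypothetical content models, written as regular expressions over the
# space-terminated names of an element's children.
CONTENT_MODELS = {
    # a title, then an optional abstract, then one or more sections
    "paper": r"(title )(abstract )?(section )+",
    # assumed here: a section holds at least one paragraph
    "section": r"(paragraph )+",
}


def children_allowed(parent, children):
    """Return True if the sequence of child names fits the parent's model."""
    model = CONTENT_MODELS[parent]
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(model, sequence) is not None


ok = children_allowed("paper", ["title", "abstract", "section", "section"])
missing_title = children_allowed("paper", ["abstract", "section"])
```

Here `ok` is true and `missing_title` is false, since the second sequence lacks the obligatory <title>.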
An empty element (one which has no content) may seem like a contradiction: what use can it be simply to tag a specific point in a text, especially if there is no way of associating information with it? At the very least, it should be possible to supply a name or other identifier to distinguish one such empty point in a text from another. Fortunately, SGML does provide a mechanism for adding such ``extra-textual'' information to the elements of a text: that of attributes, discussed in the next section.
Call me <name type=Biblical>Ishmael</name>.
Here ``type'' is the name of an attribute associated with any occurrence of the <name> element; ``Biblical'' is the value defined for this attribute in the case of the example <name> shown above. [See note 8]
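The way an attribute name and value sit inside a start-tag can be sketched in Python as follows. The function `parse_start_tag` and the patterns it uses are invented for this illustration; they accept quoted or unquoted values but make no attempt to cover the full range of attribute syntax that SGML permits.

```python
import re

# An attribute value may be quoted with ' or ", or left unquoted.
VALUE = r"""'([^']*)'|"([^"]*)"|([\w.-]+)"""
ATTRIBUTE = re.compile(r"([\w.-]+)\s*=\s*(?:" + VALUE + ")")
START_TAG = re.compile(r"<([A-Za-z][\w.-]*)(.*)>", re.DOTALL)


def parse_start_tag(tag):
    """Return (element name, {attribute name: value}) for one start-tag."""
    match = START_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a start-tag: {tag!r}")
    attributes = {}
    for attr in ATTRIBUTE.finditer(match.group(2)):
        name, *values = attr.groups()
        # Exactly one of the three value groups matched; keep that one.
        attributes[name] = next(v for v in values if v is not None)
    return match.group(1), attributes


element, attributes = parse_start_tag("<name type=Biblical>")
```

For the example above, `element` is `"name"` and `attributes` is `{"type": "Biblical"}`: the attribute travels with the tag, not with the text it encloses.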
Attributes are used for two related purposes: they enable an identifying number or name to be associated with a particular element occurrence within a text (which might otherwise be missing), and they enable additional information missing from a text to be added to it without violating its integrity.
As an example of the first usage, consider the page or folio numbering of a historical source. There is a sense in which the individual pages of a source might be regarded as distinct elements within it. This is not however generally the primary focus of interest for those using it: in most cases, the page number is of importance only as a means of documenting where the other elements of the text occur. Moreover, the page numbers may not appear at all in the original source. In such cases, a tag <pb> (for ``page break'') may conveniently be used to mark the point in the text at which a new page begins. An attribute (say, n for ``number'') would then provide a convenient means of indicating the number of the page just begun: thus
text of page 3 ends here <pb n=4> text of page 4 starts here
As an example of the second usage, consider the common need for normalisation in prosopographical studies. One way of achieving this might be to associate an attribute such as ``key'' with each occurrence of <name> elements in a text, the value of which would be a regularized and encoded form of the name, which could also serve as an identifying key in a database derived from the text. For example:
<name key='SMITJ04'>Jack Smyth</name>
Attribute values may be defaulted, taken from a controlled list or specified freely, the only constraint being that they cannot contain markup.
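To suggest how a ``key'' attribute might feed a database derived from the text, the following Python sketch groups the spellings found in a text under their regularized key. The function `index_names` and the sample sentence are invented for this illustration (in particular, the second spelling is hypothetical), and the pattern assumes that every <name> carries a single-quoted key attribute and nothing else.

```python
import re
from collections import defaultdict

# Assumes the form <name key='...'>...</name> with no other attributes.
NAME = re.compile(r"<name\s+key='([^']*)'>(.*?)</name>", re.DOTALL)


def index_names(text):
    """Map each key value to the list of spellings tagged with it."""
    index = defaultdict(list)
    for key, spelling in NAME.findall(text):
        index[key].append(spelling)
    return dict(index)


# Hypothetical sample: two spellings regularized to the same key.
sample = (
    "<name key='SMITJ04'>Jack Smyth</name> witnessed the deed; "
    "<name key='SMITJ04'>John Smith</name> signed it."
)
index = index_names(sample)
```

The original spellings remain untouched in the text; the normalisation lives entirely in the attribute values collected in `index`.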
The most common use for attributes in the TEI and other SGML schemes is not however to categorise element occurrences in this way, but to identify them. In the TEI scheme, for example, every element is defined to have an ID attribute, which supplies a unique identifier for that particular textual element within the text. This makes possible the encoding of links between individual elements of a text in a simple and economical way. This facility is very commonly used in document preparation systems (such as TeX or Scribe) in order to link cross-references (such as ``see section 3 above'') within a text with the sections of a text to which they refer, when the section number is not known or may be dynamic. In SGML, such a system is completely generalizable. For example, let us suppose that we wish to encode a register of names in which the following passage occurs:
John Smith, baker. Mary Smith, seamstress, wife of the above.
In this example we have two <entry>s, each containing a <name> and a <trade>. The second entry however contains an additional clause which states a relationship between it and another element. We begin by tagging the elements so far identified: [See note 9]
<register>
  <entry>
    <name>John Smith</name>
    <trade>baker</trade>
  </entry>
  <entry>
    <name>Mary Smith</name>
    <trade>seamstress</trade>
    <relation>wife of the above</relation>
  </entry>
  ...
</register>
Clearly ``wife of the above'' is meaningless as a relation unless we have some way of pointing to the entry with which it is linked. Let us assume that the referent of ``the above'' is the whole of John Smith's entry rather than just the name within it; the assumption does not affect the argument. What is needed is some way of identifying that entry uniquely; that identifying number can then be supplied as the target of the relationship. In other words, we need an identifying attribute (call it ``id'') that can be attached to any <entry> and a pointer attribute (call it ``target'') which can be attached to any <relation>. Using these, and inventing an arbitrary value for the identifier, we can encode the link implicit in the above text as follows:
<register>
  <entry id=E1234>
    <name>John Smith</name>
    <trade>baker</trade>
  </entry>
  <entry>
    <name>Mary Smith</name>
    <trade>seamstress</trade>
    <relation target=E1234>wife of the above</relation>
  </entry>
  ...
Here we have allocated the arbitrary name or identifier ``E1234'' to the baker's entry. By supplying that same identifier as the value for the target attribute associated with the <relation> element of the seamstress's entry, we assert both the existence of the relationship itself and its target. This simple solution to a well-known problem has several attractive features, but perhaps the most attractive is that it makes explicit the fact that the target of the relationship is an interpretation brought to bear on the text by the encoder of it, leaving the text itself unchanged. Other attributes (say, ``certainty'' or ``authority'') may also be imagined which might carry additional interpretative information associated with the link.
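How software might follow such a link can be sketched in Python. The sketch below borrows the standard library's `html.parser` module as a tolerant stand-in for an SGML parser, which it is not; the class `LinkCollector` and the function `resolve_links` are invented for this illustration, and the sample register is closed with an end-tag for the sake of a complete example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Record the name in each identified entry and every relation's target."""

    def __init__(self):
        super().__init__()
        self.entries = {}     # id value -> text of the <name> in that entry
        self.relations = []   # (target value, text of the <relation>)
        self._stack = []      # names of the currently open elements
        self._entry_id = None
        self._target = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self._stack.append(tag)
        if tag == "entry":
            self._entry_id = attrs.get("id")
        elif tag == "relation":
            self._target = attrs.get("target")

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
        if tag == "entry":
            self._entry_id = None
        elif tag == "relation":
            self._target = None

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        current = self._stack[-1]
        if current == "name" and self._entry_id is not None:
            self.entries[self._entry_id] = text
        elif current == "relation" and self._target is not None:
            self.relations.append((self._target, text))


def resolve_links(sgml_text):
    """Pair each relation's text with the name in the entry it points to."""
    collector = LinkCollector()
    collector.feed(sgml_text)
    collector.close()
    return [(text, collector.entries.get(target))
            for target, text in collector.relations]


register = """<register>
<entry id=E1234> <name>John Smith</name> <trade>baker</trade> </entry>
<entry> <name>Mary Smith</name> <trade>seamstress</trade>
<relation target=E1234>wife of the above</relation> </entry>
</register>"""
links = resolve_links(register)
```

For this register, `links` pairs ``wife of the above'' with ``John Smith''. The resolution depends only on the matching of the id and target values, so the encoder's interpretation can be revised by changing an attribute without touching the transcribed words.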