Introductory Notes on the TEI Guidelines SGML

TEI TUTORIAL NO. 2: INTRODUCING SGML

One of the TEI's basic goals, as originally formulated in 1987, was the definition of an encoding scheme expressive enough to allow texts in all widely-used encoding schemes to be translated into it without loss of information. This poses two technical challenges for any encoding scheme: it must be extensible, to allow hitherto-unheard-of analyses of texts to be expressed in tags, and it must be able to represent any structural relationship among text segments which can be tagged in any scheme.

Already at the first planning meeting, several people suggested Standard Generalized Markup Language (ISO 8879, SGML) as the most promising basis for such an encoding scheme. Others were concerned that SGML was not powerful enough to meet the needs of the TEI, or not sufficiently well understood, or too verbose or of too dubious parentage or inadequately supported. Only time will tell how far these concerns were justified: at present, no alternative candidate has been proposed which comes even close to the acceptability or applicability of SGML.

Chapter two of the Guidelines contains a "gentle introduction to SGML" intended for readers who have not encountered it before. Many such introductions spend most of their time on the origins and historical development of SGML or on polemics against alternative models of text processing. Our own experience reading these introductions is that they don't always succeed in describing very clearly what SGML actually is; we try to provide such a description and leave the polemics aside for a while.

TEXT AND ITS STRUCTURE

So what, in fact, is SGML? Well, just as Fortran, C or Pascal are languages in which algorithms may be expressed, so SGML is a language in which text structures may be expressed. It allows you to name particular types of textual object (such as plays, novels, reports ...
) and their constituent elements (speeches, chapters, paragraphs ...). It allows you to state rules about how occurrences of those types may legally appear in real texts (for example, that poems may contain stanzas, but paragraphs cannot). These and other types of rule make up what is called in the jargon a document type definition (DTD); this allows a suitable piece of software to check that texts match their intended definition, and also to take advantage, in interpreting the text, of the knowledge about the text encapsulated by the definition.

Text is not, for the TEI, a mass of unruly strings to be reorganized into nice patterns on paper. It is a tissue of nested objects of various kinds, the order and structure of which are crucial to its understanding: an "ordered hierarchy of content objects" as DeRose and company call it in their recent and highly recommended article "What is Text, Really?", Journal of Computing in Higher Education, 1.2 (1990): 3-26. This concern with text as a complex structure is one reason for preferring SGML to other, perhaps better known, markup systems, in which structural complexity is recognized only insofar as it affects the printing of a text, or insofar as it affects the precision and recall with which it may be recovered from a text retrieval system.

How is this done? For the full story, you will have to read elsewhere. For a superficial overview, read on. If (like me) you have read enough superficial overviews to last you a lifetime, accept my apologies, and tell us what you do want to read on this list!

WHAT IT LOOKS LIKE

In SGML terms, a text (or "document") is composed of content and markup, which are distinguished by special flag characters. The text is made up of elements which contain other elements or just text. The markup identifies the boundaries of the different elements in which the content is held.
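To make the idea of a document type definition a little more concrete, here is a minimal sketch of how the rule mentioned earlier (poems may contain stanzas, but paragraphs cannot) might be declared, followed by a small text that obeys it. The element names (poem, title, stanza, line, para) and the content models are assumptions made up for illustration; they are not taken from the TEI Guidelines.

```sgml
<!DOCTYPE poem [
<!-- Hypothetical element types, for illustration only.              -->
<!-- The two hyphens in each declaration say that neither the start  -->
<!-- tag nor the end tag of the element may be left out.             -->
<!ELEMENT poem   - - (title?, stanza+)>
<!ELEMENT stanza - - (line+)>
<!ELEMENT para   - - (#PCDATA)>
<!ELEMENT (title | line) - - (#PCDATA)>
]>
<poem>
<title>An example</title>
<stanza>
<line>first line of the stanza</line>
<line>second line of the stanza</line>
</stanza>
</poem>
```

Given these declarations, a validating parser should reject a stanza placed inside a para, since the content model for para allows only character data.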
Consider a mail message like the one you are reading at the moment: it could be considered as a single object, called a "message". Like everything else, it has a beginning and an end: SGML requires us to mark these explicitly. We might also say that messages always have two parts: a header (all the chattering between networks at the top), and a body (the rest). Again, SGML requires us to mark explicitly the boundaries of these two constituents. We might further subdivide both the header (it contains a "from", a "date", a "subject" etc.) and the body (it consists of paragraphs, may have a title or a signature etc.).

The SGML standard suggests, for reference purposes, one particular way of marking these boundaries: you introduce a start tag (which looks like <this>) at the beginning of each "this" object, and an end tag (which looks like </this>) at the end. This method (which rejoices in the name of "the reference concrete syntax") is not however obligatory: you can use square brackets and numbers, binary escape sequences, big red stars in the margin or musical notes if your keyboard will allow you. You can even use conventional punctuation or layout information, provided that this can be unambiguously mapped onto the element structure.
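The message structure just described could be written down as DTD rules along the following lines. This is a sketch under assumptions: the text names the parts but not their order or whether each is required, so the content models below are guesses, the "etc." items are left out, and the paragraph and signature elements are given the short names para and sig to stay within the eight-character name limit of the reference concrete syntax.

```sgml
<!-- Hypothetical DTD fragment for the mail message described above. -->
<!-- Order and optionality of the parts are assumptions.             -->
<!ELEMENT message - - (header, body)>
<!ELEMENT header  - - (from, date, subject)>
<!ELEMENT body    - - (title?, para+, sig?)>
<!ELEMENT (from | date | subject | title | para | sig) - - (#PCDATA)>
```

Longer names such as paragraph are possible, but only with an SGML declaration that raises the NAMELEN quantity.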
So, tagged (more fully than might commonly be done) in SGML, the start of this message might look something like this:

    <message>
    <header><from>Lou</from>
    <to>Rest of World</to>
    <date>20 Aug <year>1990</year></date>
    <subject><abbrev>SGML</abbrev> - the basics</subject>
    </header>
    <body>
    <title>TEI Tutorial no 2: SGML</title>
    <paragraph>
    <sentence>One of the <abbrev>TEI</abbrev>'s basic goals, as originally formulated in <date>1987</date>, was the definition of an encoding scheme expressive enough to allow texts in all widely-used encoding schemes to be translated into it without loss of information.</sentence>
    <sentence>This poses two technical challenges for any encoding scheme: it must be extensible, to allow hitherto-unheard-of analyses of texts ...
    </body></message>

Notice that we use exactly the same sort of markup for subdivisions of the header (which would probably be regarded as obvious candidates for information retrieval fields) as we do for the body of the text. Notice also that some types of object (year and abbrev for example) can appear in more than one type of enclosing object.

Explicitly labelled and delimited objects, described by a hierarchically organized grammar: that is almost all there is to know about SGML. Two further wrinkles remain to be discussed: entities and attributes.

ENTITIES

For many years, a central problem in the encoding and interchange of scholarly texts has been to agree standards for the representation of the graphemes used in all the languages and scripts of importance to the TEI community, as Michael defined it in our previous posting. Because of the unaccountable lack of interest shown by standards bodies and computing industry alike for mediaeval manuscripts or Old High Glagolitic, ad hoc transliteration schemes have proliferated. For many people, defining just such standards is one of the major jobs for the TEI.
In a later posting, we will discuss what TEI has actually proposed in this connexion: here I just want to point to the remarkably simple solutions offered by SGML itself.

Firstly, SGML requires that a DTD specify the character set in which texts using it are encoded: this will normally be an internationally agreed one such as ISO 646 (not quite, but almost, the same as ASCII) but if you insist on rolling your own, SGML at least offers you a means of telling the world what you have done.

Secondly, SGML includes a general purpose mechanism for string substitution. Arbitrary text segments known as "entities", ranging in size from individual characters to whole chunks of text, may be named in a DTD, and then invoked by reference at any point when it is impossible or inconvenient to enter them directly into a document. An entity reference in a text is conventionally preceded by an ampersand and followed by a semicolon, thus taking the form &name;. An SGML processor is provided by the DTD with both the names of all such entities used in a document and translations for them appropriate to the particular machine environment in which the processor is running. Standard sets of names have been proposed by ISO for such things as accented letters, mathematical and typographic symbols etc., and TEI will follow these wherever possible.

ATTRIBUTES

It is occasionally useful to be able to record some information associated with a textual element, but not regarded as being a textual element in its own right. Examples include identifying numbers, canonical references, status markers etc. Attributes are the mechanism provided by SGML for this purpose. The DTD defines attribute names, their possible values (within some limits) and the elements to which they can be attached. In an SGML-tagged text, attribute values must be supplied within the opening tag for an element.
For example, assume the element MESSAGE has an attribute ID, used to supply a unique identifying number for each MESSAGE element. The start of message number 42 would then be indicated:

    <message id=n42>

The use of attributes is slightly controversial, in that they are formally redundant: anything that can be done with them can also be done without them. However, they are very widely used in most existing SGML systems. Moreover, without them, markup of concurrent structures becomes unbearably complicated. But that is another story.

Lou Burnard
Editor, TEI