Introductory Notes on the TEI Guidelines
SGML
TEI TUTORIAL NO. 2: INTRODUCING SGML
One of the TEI's basic goals, as originally formulated in 1987, was
the definition of an encoding scheme expressive enough to allow texts in
all widely-used encoding schemes to be translated into it without loss
of information. This poses two technical challenges for any encoding
scheme: it must be extensible, to allow hitherto-unheard-of analyses of
texts to be expressed in tags, and it must be able to represent any
structural relationship among text segments which can be tagged in any
scheme.
Already at the first planning meeting, several people suggested Stan-
dard Generalized Markup Language (ISO 8879, SGML) as the most promising
basis for such an encoding scheme. Others were concerned that SGML was
not powerful enough to meet the needs of the TEI, or not sufficiently
well understood, or too verbose or of too dubious parentage or inade-
quately supported. Only time will tell how far these concerns were jus-
tified: at present, no alternative candidate has been proposed which
comes even close to the acceptability or applicability of SGML.
Chapter two of the Guidelines contains a "gentle introduction to
SGML" intended for readers who have not encountered it before. Many
such introductions spend most of their time on the origins and histori-
cal development of SGML or on polemics against alternative models of
text processing. Our own experience reading these introductions is that
they don't always succeed in describing very clearly what SGML actually
is; we try to provide such a description and leave the polemics aside
for a while.
TEXT AND ITS STRUCTURE
So what, in fact, is SGML? Well, just as Fortran, C or Pascal are
languages in which algorithms may be expressed, so SGML is a language in
which text structures may be expressed. It allows you to name particu-
lar types of textual object (such as plays, novels, reports ... ) and
their constituent elements (speeches, chapters, paragraphs ...). It
allows you to state rules about how occurrences of those types may
legally appear in real texts (for example, that poems may contain stan-
zas, but paragraphs cannot). These and other types of rule make up what
is called in the jargon a document type definition (DTD); this allows a
suitable piece of software to check that texts match their intended def-
inition, and also to take advantage, in interpreting the text, of the
knowledge about the text encapsulated by the definition.
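To make this a little more concrete, here is a minimal sketch of how such
rules might be written in the declaration syntax SGML uses for DTDs. The
element names (poem, title, stanza, line, p) are invented for illustration
and are not taken from the TEI Guidelines.

```sgml
<!-- Illustrative DTD fragment; all element names are assumptions. -->
<!-- A poem is an optional title followed by one or more stanzas. -->
<!ELEMENT poem   - - (title?, stanza+)>
<!-- A stanza is one or more lines. -->
<!ELEMENT stanza - - (line+)>
<!-- Titles and lines hold plain character data. -->
<!ELEMENT (title | line) - - (#PCDATA)>
<!-- A paragraph holds character data only, so a stanza inside it is an error. -->
<!ELEMENT p      - - (#PCDATA)>
```

Given these declarations, a validating parser should accept a poem made
of stanzas, but report an error for a stanza placed inside a p.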
Text is not, for the TEI, a mass of unruly strings to be reorganized
into nice patterns on paper. It is a tissue of nested objects of
various kinds, the order and structure of which are crucial to its
understanding: an "ordered hierarchy of content objects", as DeRose and
company call it in their recent and highly recommended article "What is
Text, Really?", Journal of Computing in Higher Education, 1.2 (1990):
3-26. This concern with text as a complex structure is one reason for
preferring SGML to other, perhaps better known, markup systems, in which
structural complexity is recognized only insofar as it affects the
printing of a text, or insofar as it affects the precision and recall
with which it may be recovered from a text retrieval system.
How is this done? For the full story, you will have to read else-
where. For a superficial overview, read on. If (like me) you have read
enough superficial overviews to last you a lifetime, accept my apolo-
gies, and tell us what you do want to read on this list!
WHAT IT LOOKS LIKE
In SGML terms, a text (or "document") is composed of content and
markup, which are distinguished by special flag characters. The text is
made up of elements which contain other elements or just text. The
markup identifies the boundaries of the different elements in which the
content is held. Consider a mail message like the one you are reading
at the moment: it could be considered as a single object, called a
"message". Like everything else, it has a beginning and an end: SGML
requires us to mark these explicitly. We might also say that messages
always have two parts: a header (all the chattering between networks at
the top), and a body (the rest). Again, SGML requires us to mark
explicitly the boundaries of these two constituents. We might further
subdivide both the header (it contains a "from", a "date", a "subject"
etc.) and the body (it consists of paragraphs, may have a title or a
signature etc.).
The SGML standard suggests, for reference purposes, one particular
way of marking these boundaries: you introduce a start tag (which looks
like <this>) at the beginning of each "this" object, and an end tag
(which looks like </this>) at the end. This method (which rejoices in
the name of "the reference concrete syntax") is not however obligatory:
you can use square brackets and numbers, binary escape sequences, big
red stars in the margin or musical notes if your keyboard will allow
you. You can even use conventional punctuation or layout information,
provided that this can be unambiguously mapped onto the element struc-
ture.
So, tagged (more fully than might commonly be done) in SGML, the
start of this message might look something like this:
<message>
<header>
<from>Lou</from>
<to>Rest of World</to>
<date>20 Aug <year>1990</year></date>
<subject><abbrev>SGML</abbrev> - the basics</subject>
</header>
<body>
<title><abbrev>TEI</abbrev> Tutorial no 2: <abbrev>SGML</abbrev></title>
<p>One of the <abbrev>TEI</abbrev>'s basic goals, as
originally formulated in <year>1987</year>, was the definition
of an encoding scheme expressive enough to allow texts in all
widely-used encoding schemes to be translated into it without
loss of information. This poses two
technical challenges for any encoding scheme: it must be
extensible, to allow hitherto-unheard-of analyses of texts ...
Notice that we use exactly the same sort of markup for subdivisions
of the header (which would probably be regarded as obvious candidates
for information retrieval fields) as we do for the body of the text.
Notice also that some types of object (year and abbrev for example) can
appear in more than one type of enclosing object.
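A DTD for messages of this kind might be sketched roughly as follows.
This is only an illustration: the element names follow the description
above (message, header, body, from, date, subject, title, signature, year,
abbrev), while "to" and "p" and the exact content models are assumptions.

```sgml
<!-- Sketch of a DTD for the mail message; content models are assumptions. -->
<!ELEMENT message   - - (header, body)>
<!ELEMENT header    - - (from, to, date, subject)>
<!ELEMENT body      - - (title?, p+, signature?)>
<!-- Mixed content: text interspersed with abbrev and year elements. -->
<!ELEMENT (from | to | subject | title | p | signature)
                    - - (#PCDATA | abbrev | year)*>
<!ELEMENT date      - - (#PCDATA | year)*>
<!-- abbrev and year are declared once but permitted in several parents. -->
<!ELEMENT (abbrev | year) - - (#PCDATA)>
```

The last declaration is what lets year turn up both in the date of the
header and in a paragraph of the body without being defined twice.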
Explicitly labelled and delimited objects, described by a
hierarchically organized grammar, are almost all there is to know about
SGML. Two
further wrinkles remain to be discussed: entities and attributes.
ENTITIES
For many years, a central problem in the encoding and interchange of
scholarly texts has been to agree standards for the representation of the
graphemes used in all the languages and scripts of importance to the TEI
community, as Michael defined it in our previous posting. Because of
the unaccountable lack of interest shown by standards bodies and comput-
ing industry alike for mediaeval manuscripts or Old High Glagolitic, ad
hoc transliteration schemes have proliferated. For many people, defin-
ing just such standards is one of the major jobs for the TEI. In a lat-
er posting, we will discuss what TEI has actually proposed in this con-
nexion: here I just want to point to the remarkably simple solutions
offered by SGML itself.
Firstly, SGML requires that a DTD specify the character set in which
texts using it are encoded: this will normally be an internationally
agreed one such as ISO 646 (not quite, but almost, the same as ASCII)
but if you insist on rolling your own, SGML at least offers you a means
of telling the world what you have done.
Secondly, SGML includes a general purpose mechanism for string sub-
stitution. Arbitrary text segments known as "entities", ranging in size
from individual characters to whole chunks of text, may be named in a
DTD, and then invoked by reference at any point when it is impossible or
inconvenient to enter them directly into a document. An entity refer-
ence in a text is conventionally preceded by an ampersand and followed
by a semicolon, thus taking the form &name;. An SGML processor is
provided by the DTD with both the names of all such entities used in a
document and with translations for them appropriate to the particular
machine environment in which the processor is running. Standard sets of
names have been proposed by ISO for such things as accented letters,
mathematical and typographic symbols etc. which TEI will follow wherev-
er possible.
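As a rough sketch of what such declarations look like in a DTD: the
entity name "tei" below is invented for illustration, while the public
identifier is the one ISO 8879 gives for its "Added Latin 1" entity set,
which declares names such as eacute.

```sgml
<!-- Illustrative declarations; the entity name tei is an assumption. -->
<!-- A general entity standing for a longer string. -->
<!ENTITY tei "Text Encoding Initiative">
<!-- Pull in the ISO Added Latin 1 set, which declares eacute among others. -->
<!ENTITY % ISOlat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">
%ISOlat1;
```

A document using this DTD could then contain a sentence such as
"The &tei; met in a caf&eacute;." and leave it to the processor to supply
the appropriate expansion for each reference on the local machine.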
ATTRIBUTES
It is occasionally useful to be able to record some information asso-
ciated with a textual element, but not regarded as being a textual ele-
ment in its own right. Examples include identifying numbers, canonical
references, status markers etc. Attributes are the mechanism provided
by SGML for this purpose. The DTD defines attribute names, their possi-
ble values (within some limits) and the elements to which they can be
attached. In an SGML tagged text, attribute values must be supplied
within the opening tag for an element. For example, assume the element
MESSAGE has an attribute ID, used to supply a unique identifying number
for each MESSAGE element. The start of message number 42 would then be
indicated:
<message id=42>
The use of attributes is slightly controversial, in that they are
formally redundant: anything that can be done with them, can also be
done without them. However, they are very widely used in most existing
SGML systems. Moreover, without them, markup of concurrent structures
becomes unbearably complicated. But that is another story.
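For the MESSAGE example, the corresponding declaration in the DTD might
look something like the sketch below. The status attribute and its values
are invented for illustration. Note also the assumption that ID here is
declared as a NUMBER: SGML's own ID attribute type requires a value
beginning with a letter, so a bare 42 would not qualify.

```sgml
<!-- Illustrative attribute declarations for the message element. -->
<!-- id: a required number. status: one of two values, defaulting to draft. -->
<!ATTLIST message
          id     NUMBER          #REQUIRED
          status (draft | final) draft>
```

With this in place, a start tag such as <message id=42 status=final> is
legal, and a message tagged without an id would be reported as an error.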
Lou Burnard
Editor, TEI