TEI MLW2 - On the elimination of attributes
CMSMcQ, 23 March 89
The draft Sandy circulated in mid-February has, I think, a better
treatment of the question of attributes than we have had before. The
guidelines for choosing between attributes and tags still don't persuade
me, though, and the text sounds as though we thought the no-attributes
approach were a viable one for the example given (verb tags). In
reality, I think the example given is a very persuasive demonstration of
why attributes are necessary -- not only for clarity and elegance in the
tag set, but for processing.
Here I try to show why I think so, by working through a fairly obvious
case of the same example. If we wish to encode the traditional analysis
of the
surface grammar of a standard Latin text, ignoring complications like
variant forms, deponents, periphrastic forms, participles, exceptions,
and the study of grammar since about the fifth century, we might say:
1 Each word has a part of speech: noun, adjective, verb, adverb,
pronoun, preposition, conjunction, or interjection. (I have the nasty
feeling I'm leaving one out: there should be nine, but maybe I'm
thinking of English, which adds articles.)
2 Each noun or adjective is distinguished by number, gender, case,
and declension pattern.
3 Each verb form is distinguished by person, number, voice, tense,
mood, and conjugation pattern.
4 Each pronoun is distinguished by person, number, and case. Third
person pronouns are additionally marked for gender and demonstrative
class (unmarked, close, middle, distant).
5 Each conjunction is either coordinating or subordinating.
6 Each preposition governs a specific case, or either the ablative or
the accusative, but we can omit that from our encoding because it does
not vary from use to use (and when it does, we can look at the case of
the preposition's object).
7 Adverbs and interjections we can just tag with their part of
speech.
If we encode with attributes, our declaration will look something like
this (assuming #PCDATA is what these elements should contain):
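A sketch in SGML DTD syntax (the element and attribute names, and the
value abbreviations, are illustrative only, not proposed TEI names):

```sgml
<!ELEMENT (noun | adj | verb | pron | adv | prep | conj | interj)
          - - (#PCDATA) >
<!ATTLIST (noun | adj)
          number  (sg | pl)                             #REQUIRED
          gender  (m | f | n)                           #REQUIRED
          case    (nom | gen | dat | acc | abl | voc)   #REQUIRED
          decl    (d1 | d2 | d3 | d4 | d5)              #REQUIRED >
<!ATTLIST verb
          person  (p1 | p2 | p3)                        #REQUIRED
          number  (sg | pl)                             #REQUIRED
          voice   (act | pass)                          #REQUIRED
          tense   (pres | impf | fut | perf | plup | futp)  #REQUIRED
          mood    (ind | subj)                          #REQUIRED
          conj    (c1 | c2 | c3 | c4)                   #REQUIRED >
<!ATTLIST pron
          person  (p1 | p2 | p3)                        #REQUIRED
          number  (sg | pl)                             #REQUIRED
          case    (nom | gen | dat | acc | abl | voc)   #REQUIRED
          gender  (m | f | n)                           #IMPLIED
          demons  (unmarked | close | middle | distant) #IMPLIED >
<!ATTLIST conj
          type    (coord | subord)                      #REQUIRED >
```

An encoded word then carries its whole analysis as attribute values,
e.g. <noun number=sg gender=f case=nom decl=d1>puella</noun>.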
That makes forty-seven discrete values in twelve categories:
1 generic identifier 8 values
2 number 2
3 gender 3
4 case 6
5 declension 5
6 person 3
7 voice 2
8 tense 6
9 mood 2
10 conjugation 4
11 demonstrative class 4
12 conjunction type 2
-----
(total) 47
The alternative simplifies from twelve categories to one, by eliminating
attributes and moving their information into the tag. Since we need a
separate tag for each combination of attributes, the count K of tags
needed for each class of word is the product of the number of possible
values for each attribute:
K(adverbs) = 1
K(prepositions) = 1
K(interjections) = 1
K(conjunctions) = 2
K(nouns) = number * gender * case * declension
= 2 * 3 * 6 * 5
= 180
K(adjectives) = number * gender * case * declension
= 2 * 3 * 6 * 5
= 180
K(pronouns) = non-third-person + third-person
= person * number * case + number * case * gender * demons
= 2 * 2 * 6 + 2 * 6 * 3 * 4
= 24 + 144
= 168
K(verbs) = person * number * voice * tense * mood * conj
= 3 * 2 * 2 * 6 * 2 * 4
= 576
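The arithmetic above can be sketched in a few lines of Python (the
variable names are mine, introduced only for this check):

```python
# Tag counts if attributes are folded into the generic identifier.
# Category sizes follow the table above: number 2, gender 3, case 6,
# declension 5, person 3, voice 2, tense 6, mood 2, conjugation 4,
# demonstrative class 4.
number, gender, case, decl = 2, 3, 6, 5
person, voice, tense, mood, conj, demons = 3, 2, 6, 2, 4, 4

K = {
    "adverbs":       1,
    "prepositions":  1,
    "interjections": 1,
    "conjunctions":  2,
    "nouns":         number * gender * case * decl,
    "adjectives":    number * gender * case * decl,
    # two non-third persons, plus third-person forms marked
    # additionally for gender and demonstrative class
    "pronouns":      2 * number * case + number * case * gender * demons,
    "verbs":         person * number * voice * tense * mood * conj,
}
print(sum(K.values()))  # prints 1109
```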
This yields a total of 1109 discrete values in one category. If we
pushed this a little harder (adding the three other Indo-European cases,
worrying about the twenty-five or so cases of Finnish or Hungarian,
adding gender to verbs to handle Hebrew, adding aspect to verbs to
handle Greek and Russian, adding a 'common' gender for modern
Scandinavian and Dutch, adding 'objective' or 'oblique' as a value for
modern English and other degenerate case systems), we could surely push
it up to four or five thousand tags for morphology, without working very
hard -- and without coming close enough to completeness for the scheme
to be seriously usable for the languages we are pledged to support.
If we add the need for less-than-full specification of morphology (e.g.
for cases where the analyst does not yet wish to decide whether a noun
is accusative or dative), we have to generate not the set of all
possible combinations of attributes but the set of all possible
combinations of values and missing values. Add one to each factor in
the multiplications above. The results:
K'(adverbs) = 1
K'(prepositions) = 1
K'(interjections) = 1
K'(conjunctions) = 3
K'(nouns) = 504
K'(adjectives) = 504
K'(pronouns) = 483
K'(verbs) = 3780
Total: 5277 tags.
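The same check can be run with underspecification added (again in
Python, names mine): each morphological category gains one "undecided"
value, so every factor in the products grows by one.

```python
# Tag counts when each morphological category may also be left
# unspecified: add one to every factor in the products above.
def k_prime(factors):
    result = 1
    for f in factors:
        result *= f + 1
    return result

counts = {
    "adverbs":       k_prime([]),            # no morphological categories
    "prepositions":  k_prime([]),
    "interjections": k_prime([]),
    "conjunctions":  k_prime([2]),           # coordinating/subordinating
    "nouns":         k_prime([2, 3, 6, 5]),  # number, gender, case, decl
    "adjectives":    k_prime([2, 3, 6, 5]),
    # pronouns are a sum of two products, one per person class
    "pronouns":      k_prime([2, 2, 6]) + k_prime([2, 6, 3, 4]),
    # person, number, voice, tense, mood, conjugation
    "verbs":         k_prime([3, 2, 2, 6, 2, 4]),
}
print(sum(counts.values()))  # prints 5277
```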
This seems kind of unwieldy to me: too large not just for practical use
in manual encoding, but for practical processing. Aesthetically, too,
it's rather disappointing. It reminds me of a column in Jon Bentley's
book Programming Pearls:
Most programmers have seen them, and most good programmers
realize they've written at least one. They are huge, messy,
ugly programs that should have been short, clean, beautiful
programs. I once saw a COBOL program whose guts were
IF THISINPUT IS EQUAL TO 001 ADD 1 TO COUNT001.
IF THISINPUT IS EQUAL TO 002 ADD 1 TO COUNT002.
...
IF THISINPUT IS EQUAL TO 500 ADD 1 TO COUNT500.
[... The program] contained about 1600 lines of code: 500 to
define the variables COUNT001 through COUNT500, the above 500 to
do the counting, 500 to print out how many times each integer
was found, and 100 miscellaneous statements. Think for a minute
about how you could accomplish the same task with a program just
a tiny fraction of the size [...].
I haven't mentioned yet that a basic task in such work is to provide the
dictionary form for any inflected token. Since there is no chance of
providing that information in the tag, it has to be provided either as
content or as an attribute value. It seems to me, intrinsically, to be
an attribute value.
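For instance (tag and attribute names purely illustrative), the
dictionary form sits naturally beside the morphological attributes:

```sgml
<verb person=p3 number=pl voice=act tense=impf mood=ind conj=c1
      lemma="amo">amabant</verb>
```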
I'll stop here, rather than pound this any harder. If I seem a bit
obdurate in my insistence that attributes be allowed, this kind of
combinatorial explosion is one reason. It needs to be borne in mind
that we have tagging of almost precisely this sort and level of detail
for fairly substantial bodies of ancient Greek, Hebrew, and probably
Latin. If the TEI scheme is to support existing taggings, what I have
sketched above is a *minimal* apparatus.
It is fair to admit that one can get by with far fewer tags in English:
the Brown corpus uses under 200. But the failure to separate concepts
like part-of-speech and inflectional information means the Brown scheme
lacks all generality -- I can't use it, for example, to tag my texts
in Middle High German.
My conclusion is that attributes are essential to serious tagging
of literary and linguistic material. We would have to use them
even if they did not help make clear, beautiful, compact tagging
schemes.