.sr docfile = &sysfnam. ;.sr docversion = 'Draft';.im teigmlp1
.* Document proper begins.
Minutes of Work Group AI1
<title>Baltimore, 7-8 January 1991
<author>C. M. Sperberg-McQueen
<!-- and Terry Langendoen -->
<!-- I think you should be identified as the sole author. -TL -->
<docnum>TEI &docfile.
<date>&docdate.
</titlep>
<!>
</frontm>
<!>
<body>
<p>
Present: Stephen R. Anderson (SRA), Nicoletta Calzolari (NC), D. Terence Langendoen (TL), Geoffrey R. Sampson (GRS), Gary F. Simons (GFS), C. M. Sperberg-McQueen (MSM).
<!>
<h1>1 Agenda
<p>
TL proposed the following agenda, which was adopted without change:
<ol>
<li>review of lexical markup schemata (word class and morphology)
<li>clean up some technical problems in SGML definitions of linguistic tags
<ol>
<li>underspecification (what does omission of a feature mean?)
<li>ambiguity and resolution of ambiguity
<li>degree of certainty (especially in ambiguity resolution)
<li>simplifications of TEI scheme for special grammars
<li>sample taggings of linguistic examples
<li>mixed-content models
<li>review DTD and mechanism of inclusion
<li>interface to work by committee TR (s unit, punctuation, etc.)
</ol>
</ol>
<!>
<h1>2 Lexical Markup
<h2>2.1 Feasibility and Goals
<p>
The committee began by considering the feasibility of defining feature sets for part of speech and morphological information in the style of existing commonly used schemes (Brown, LOB, et al.). GRS was of the opinion that consensus might be reached on feature sets for the major classes only, such that the features required for an individual language would be a subset of the whole.
<p>
For reference, MSM proposed two Venn diagrams, each showing two sets labeled L1 and L2, symbolic of an indeterminate number of sets L1 to Ln, which overlap in part. In one, a set U is defined as the union of L1 and L2 (and ... Ln). In the other, a set I is defined as the intersection of L1 and L2 (and ... Ln).
<fn>The diagrams as actually drawn used the terms F (drawn as the universe of discourse within which L1 and L2 exist) and F' (drawn as a subset of the intersection of L1 and L2), but later discussion introduced the terms U and I and defined them as the union and intersection of L1 ... Ln, respectively.</fn>
MSM inquired whether GRS meant that for the major categories, we might succeed in generating set U, but that for the minor categories we might only find set I. GRS replied that he did, but that the method by which one discovers distinctive word-class features was also important.
<p>
After discussion, the committee agreed that some set of common features for lexical markup could and should be devised. At the very least, the committee would arrive at a set I of features which a tagger could supplement with others (it is important to make clear to funding agencies and others that extension of the feature set is legal and expected); if it proves possible to construct a set U from which all features currently thought important for lexical markup could be drawn, that would be attempted as well. SRA stressed the importance of allowing individual encoders to vary the level of granularity in their tagging.
<!>
<h2>2.2 Mechanism for Boilerplate
<p>
The feature sets are to be made available to users as boilerplate, that is, pre-defined feature-value pairs which can be embedded within feature structures using entity references, and possibly pre-defined feature structures, which can be embedded within texts using entity references. MSM discussed the form to be taken by the invocation of such boilerplate, and the ways in which a user could override the predefined values. A need was seen for two different files: one containing entity declarations for simple feature-value pairs, and one containing entity declarations for complex feature-value pairs and for feature structures.
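<p>
The two-file arrangement and the overriding of predefined values can be pictured with a small model. The following is only a sketch in Python, not SGML, and not a proposal for the actual entity files: the names <term>PRIMITIVE</term>, <term>COMPOUND</term>, and <term>expand</term>, and all the feature names and values, are hypothetical and chosen purely for illustration.

```python
# Hypothetical sketch: every name and value here is illustrative only,
# not a definition taken from the TEI starter set.

# "File 1": simple feature-value pairs, keyed by an entity-like name.
PRIMITIVE = {
    "noun": ("category", "noun"),
    "sg": ("number", "singular"),
    "pl": ("number", "plural"),
    "nom": ("case", "nominative"),
    "gen": ("case", "genitive"),
}

# "File 2": predefined feature structures, each a list of references
# to the simple pairs declared in file 1.
COMPOUND = {
    "noun.sg.nom": ["noun", "sg", "nom"],
}


def expand(name, overrides=None):
    """Expand a predefined structure into a dict of feature -> value.

    Overrides are applied last, so a user-supplied value replaces
    the predefined one for the same feature.
    """
    structure = dict(PRIMITIVE[ref] for ref in COMPOUND[name])
    structure.update(overrides or {})
    return structure


base = expand("noun.sg.nom")
genitive = expand("noun.sg.nom", overrides={"case": "genitive"})
print(base)
print(genitive)
```

<p>
In this model an override replaces a predefined value rather than adding a second one; whether the real mechanism should behave that way is an assumption of the sketch, not a decision of the committee.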
<p>
A difficulty with predefined feature structures is that they do not allow the specification of unique IDs for the individual feature structures in the running text. NC asked whether it would not be simpler to use the SGML attribute declaration mechanism to define features and their possible values. This proposal was found to have some advantages, but on the whole to be less flexible than the use of the feature-structure tags, and so was not adopted.
<!>
<h2>2.3 Long-Term Goals for TEI Starter Set of Features
<p>
The committee defined its first goal as the definition of a set of features and their values, its second goal as the definition of a set of feature structures equivalent in information content to the tags of LOB and other commonly used annotation schemes, and its third goal as the formulation of recommended minimal feature sets for lexical markup. In terms of the mechanism defined, this appeared to require a single file of feature-value pairs (<q>primitive grammatical features</q>) and one file of feature structures (<q>compound grammatical features</q>) for each language to be specified.
<!>
<h2>2.4 Languages Considered
<p>
After a discussion of nouns (for results, see document AI1 W2), GFS suggested that the committee specify a set of languages to take into account in its first pass at a general set of features, then consult authoritative grammars for those languages to see what features and values would be needed for them and to determine the standard terminology, and finally take the set union of features and values.
<p>
The committee agreed to take into account the nine official languages of the EEC and Russian, and assigned the following members to check respected grammars of them for features missing from the analysis produced by the meeting and for values missing from the value sets specified at the meeting.
<!-- Add the following.
-TL --> In addition, TL will assign a graduate student the task of checking a grammar of each of the ten languages except English and include the results of those checks in his report. <!-- End insert. -TL --> <ul> <li>Danish: SRA <li>Dutch: TL <li>English: GRS <li>French: MSM <li>German: MSM (originally assigned to GFS) <li>(Modern) Greek: NC <li>Italian: NC <li>Portuguese: GRS <li>Russian: SRA <li>Spanish: TL </ul> <action> <who>Work Group AI 1 (as specified) <act>to consult grammars and revise AI1 W2 on feature sets <duedate>31 January 1991 </action> <!> <h2>2.5 Features for Specific Word Classes <p> Much of the meeting was concerned with the development of feature sets for specific parts of speech: nouns, adjectives, verbs, etc. Results of this discussion are contained in working paper AI1 W2, <q>List of Common Morphological Features for Inclusion in TEI Starter Set of Grammatical-Annotation Tags,</q> and will not be repeated here. <!> <h2>2.6 Features for Tokens and for Types (Lemmata) <p> During consideration of specific word classes, the committee repeatedly discussed the possibility of including or excluding information associated not with the specific occurrence of a word, but with the lemma itself. After much deliberation, the committee determined not <!-- 'much' replaces 'mature' in preceding line. -TL --> to include, in the <q>starter set</q> being constructed, features fixed for a lemma once and for all. Nevertheless, in practice many grammatical annotations do include lemma-level information, and any division has implications for the markup of the lexicon. Some liaison with the work group on computational lexica will be necessary to ensure that the two groups do not work at cross purposes. <p> The definitions to be provided in the <q>starter set</q> are for word classes and for features which some language (of those considered) expresses morphologically. 
<q>Morphology</q> here is to be understood as including typographic features, so that the consistent capitalization of proper nouns in English, for example, allows (requires) the inclusion of a feature PROPER (or at least WORD-INITIAL-CAP) for nouns so marked. <!> <h2>2.7 Multiple Word-Class Assignments (Source Class, Usage Class) <p> For the not infrequent cases in which a word of one class is used with the function of another class (adjective-as-noun and participle-as-adjective are two very common examples), the committee decided to allow tagging according to the word class of the usage (in the examples: noun, adjective) or according to the source class (adjective, verb), at the discretion of the tagger. Arguments may be found in favor of either decision. <p> Where features associated with both classes are to be tagged, the committee decided to provide a universally available feature <term>source-class</term>, which would take as its value a feature structure valid for the source or lexical class of the word. The source-class feature should be embedded within a feature structure for the word class for the word's syntactic function. Within the committee's <q>starter set</q>, therefore, adjectives used as nouns may be tagged <ul> <li>as nouns (usage or functional class) <li>as adjectives (source or lexical class) <li>as nouns, with an embedded feature <term>source-class</term> whose value is a feature structure for adjectives </ul> Explicitly considered and rejected was a corresponding universal feature <term>function</term>, which would allow the functional-class feature structure to be embedded within the source-class feature structure. If both feature structures are to be encoded, the committee decided, the outer structure should be for the functional class. <p> How best to handle such double-tagging in the starter set remains uncertain, however. 
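<p>
The three tagging options listed above for an adjective used as a noun can be sketched with nested dictionaries standing in for feature structures. This is an illustration in Python only, under stated assumptions: the function names and the features <term>number</term> and <term>degree</term> are hypothetical, and only the feature name <term>source-class</term> comes from the committee's decision.

```python
# Illustrative only: feature names other than "source-class" and all
# values are assumptions, not part of the committee's starter set.

def tag_as_usage_class():
    """Option 1: tag by usage (functional) class only."""
    return {"category": "noun", "number": "plural"}


def tag_as_source_class():
    """Option 2: tag by source (lexical) class only."""
    return {"category": "adjective", "degree": "positive"}


def tag_with_source_class():
    """Option 3: the outer structure describes the functional class;
    the source-class feature holds a structure for the lexical class."""
    outer = tag_as_usage_class()
    outer["source-class"] = tag_as_source_class()
    return outer


def classes(structure):
    """Return (outer class, embedded source class or None)."""
    inner = structure.get("source-class")
    source = inner["category"] if inner else None
    return (structure["category"], source)


print(classes(tag_with_source_class()))  # ('noun', 'adjective')
print(classes(tag_as_usage_class()))     # ('noun', None)
```

<p>
The rejected alternative, a <term>function</term> feature embedded in the source-class structure, would simply invert the nesting; the sketch follows the committee's choice of the functional class as the outer structure.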
<p>
As an example, the committee formulated the following feature structure for the word <ital>ihres</ital> 'her' in the German phrase <ital>die Titel ihres Buches</ital> 'the title of her book'.
<xmp>
[category = adjective
 gender = neuter
 number = singular
 case = genitive
 +possessive
 -interrogative
 source-category = [category = pronoun
                    person = 3
                    number = singular
                    gender = feminine
                    case = n/a ] ]
</xmp>
<!>
<h1>3 SGML Technical Matters
<p>
At this point the committee divided into two subgroups, one on lexical markup (SRA, NC, and GRS) and one on SGML issues (TL, GFS, and MSM). The latter group considered underspecification, ambiguity and its resolution, the problem of mixed-content models in SGML, and the DTD mechanisms used for embedding linguistic analysis in a document. The former group continued the specification of feature sets and considered the relevance of work by the TR committee for lexical markup.
<!>
<h2>3.1 Underspecification
<p>
Five possible interpretations were found for the omission of a feature specification from a feature set:
<ol>
<li id=any>The feature value is unrestricted.
<li id=dft>The feature takes some default value.
<li id=unk>The feature value is unknown.
<li id=dna>The feature does not apply (has no value).
<li id=huh>No claim is made about the feature's value or applicability.
</ol>
In order to disambiguate these, the following universal feature values were agreed on:
<gl>
<gt>ANY <gd>The word is compatible with (unifies with) all values <!-- Parenthetical inserted. -TL --> of the feature.
<gt>DEFAULT <gd>The feature takes some specific default value, which can be inferred by an analysis, but the value thus found is not stated here.
<gt>? <gd>The feature value is unknown.
<gt>N/A <gd>The feature does not apply (has no value).
<gt>NO CLAIM<gd>No claim is made about the feature's value or applicability.
</gl>
Simple omission of a feature from a structure may legitimately be used for the senses <term>no claim</term>, <term>n/a</term>, or <term>?</term> but should not be used for <term>any</term> or <term>default</term>. The value <term>no claim</term> should not be used if it is known whether the feature is applicable or not (use <term>?</term> or <term>n/a</term> instead).
<!>
<h2>3.2 Feature System Declaration
<p>
GFS pointed out that unknown values and inapplicable features could be unambiguously determined without use of <term>?</term> and <term>n/a</term> values, if it were possible to specify what features are applicable under what circumstances.
<!-- Insert the following. -TL -->
He also pointed out that such a specification would give a specific interpretation to <tag>f.s.not</tag>. For example, if the legitimate values for <term>case</term> are <term>nominative</term>, <term>dative</term> and <term>accusative</term>, then the interpretation of:
<xmp>
<![ CDATA [
<feature name=case>
  <f.s.not><atomic>nominative</atomic></f.s.not>
</feature>
]]>
</xmp>
is equivalent to that of:
<xmp>
<![ CDATA [
<feature name=case>
  <f.s.or><atomic>dative</atomic>
          <atomic>accusative</atomic>
  </f.s.or>
</feature>
]]>
</xmp>
<!-- End of insertion. -TL -->
It appears that the minimum functionality required is
<ol>
<li>a specification of legal feature names and legal values for them of the logical form
<xmp font=text>
F1 = (a | b | c | d | ...) [ x ]
</xmp>
(where F1 is a feature name, a ... d are legal values, and x is the optional default value)
<li>a specification of what features (and what subset of their values) are applicable under what circumstances, of the logical form
<xmp font=text>
F1 [= v] --> F2 [= (a | b | ...)] [x] & F3 [= (d | e | ...)] [y] ...
</xmp>
(where F1, F2, and F3 are feature names, a, b, d, and e are legal values for F2 and F3 if F1 is present with the value v, and x and y are default values for F2 and F3 if F1 = v).
The specification of v is optional; if omitted, the implication holds for any value specified for F1. The specification of the legal ranges and defaults for F2 and F3 is also optional; if omitted, they are taken from the global specifications of the first form of feature system declaration. <fn> N.B. the forms shown for the specifications are solely to illustrate the abstract syntax required: the first form requires a feature name, a range of values, and a default; the second a set of feature names, value ranges, and defaults. The concrete syntax used for the example was improvised for discussion and is not a proposal for the concrete syntax of the feature system declaration, which should have other material present (e.g. prose documentation) and should take the form of an SGML document. The writing system declaration already defined by the TEI gives a good example of what is needed. <!-- I replaced the word 'expected' by 'needed'. -TL --> </fn> </ol> Other additional specifications (global values, non-enumerated value types, etc.) may prove desirable as well. After discussion, GFS agreed to draft a document AI1 W3 on feature system declarations, covering at least the minimal specifications and any enhancements he finds useful and simple. <action> <who>GFS <act>to draft AI1 W3 on Feature System Declarations <duedate>asap </action> <!> <h2>3.3 DTD Mechanisms <p> The overall mechanism for inclusion of linguistic analysis in a document was reviewed. GFS proposed that the element <tag>linguistic.analysis</tag> (in all its spelling variations) be omitted in favor of a disjunction of the various types of material it may contain (<tag>f.struct</tag>, <tag>forest</tag>, <tag>unit</tag>, and <tag>alignment</tag>). This was agreed to. <!-- Might help to spell out the parameter entity definition for --> <!-- %linguistic.analysis. I'd do it myself if I knew how and Eanass --> <!-- isn't around to help me out. -TL --> <!-- Agreed. 
-MSM --> Thus the element declaration of TEI P1 version 1.1: <xmp> <![ CDATA [ <!ELEMENT ling.analysis - - (f.struct | alignment | forest | unit)* > ]]> </xmp> should be replaced by a corresponding parameter entity declaration: <xmp> <![ CDATA [ <!ENTITY % ling.analysis "f.struct | alignment | forest | unit" > ]]> </xmp> and the element declaration for <tag>text</tag> should be modified to include the parameter entity <term>ling.analysis</term>, not the element <tag>ling.analysis</tag>, in its list of inclusion exceptions: <xmp> <![ CDATA [ <!ELEMENT text - - (front?, body, back?) +(%f.empty; | %ling.analysis; | var | app | %rendition;) > ]]> </xmp> <!> <h2>3.4 Mixed-content Models and Other Technicalities <p> Owing to SGML's parsing rules, mixed-content models such as that specified in TEI P1 version 1 for <tag>f.struct</tag> are interpreted in a way that prohibits white space from appearing in many places where it might be desired for clear presentation. The examples in TEI P1, as a result, are not legal SGML as they stand, but become legal if all excess white space is suppressed. <p> The following changes were agreed to: <p> The element <tag>word</tag> should be suppressed (it survives in TEI P1 version 1 only through editing errors) and in its place an element <tag>atomic</tag> should be defined as part of the parameter entity <term>f.value.simple</term>. <tag>Atomic</tag> values should accept #PCDATA (or the parameter entity <term>%broth</term> -- i.e. parsed <!-- Don't you mean %broth? instead of %soup? -TL --> <!-- Right. We probably don't really need lists or crystals -MSM --> character data interspersed with optional phrase-level elements like emphasis or quotation) as their content. <p> The elements <tag>f.name</tag> and <tag>f.struct.name</tag> should be suppressed. In their stead, the elements <tag>feature</tag> and <tag>f.struct</tag> should have an additional attribute defined for their name. 
It should accept character-data values and require no value, and thus may be defined:
<xmp>
<![ CDATA [
<!ATTLIST (feature, f.struct)
          %global.attributes;
          name CDATA #IMPLIED >
]]>
</xmp>
At this point, GFS departed; TL and MSM continued the subgroup's work.
<!>
<h2>3.5 Ambiguity and Its Resolution
<p>
The current draft of TEI P1 is contradictory in its treatment of ambiguity resolution (as pointed out in Jan Hajic's comments). Two mechanisms appear: a <att>path</att> attribute which is said to appear on the <tag>f.ptr</tag> element (but is not defined in the DTD) and an element <tag>f.s.choice</tag> which is used in examples as content for <tag>f.ptr</tag> elements but is not defined.
<p>
It was agreed:
<ol>
<li>The <att>path</att> attribute should be renamed <att>choice</att> and added to the DTD.
<li>The <att>choice</att> attribute should take the declared value <term>IDREFS</term>.
<li>The <att>choice</att> attribute should be used on an <tag>f.ptr</tag> element when the <att>target</att> attribute on the same <tag>f.ptr</tag> points to an ambiguous analysis (one containing or constituted by an <tag>f.s.OR</tag>).
<li>If the ambiguity is completely resolved, the <att>choice</att> attribute value should be the ID of the chosen analysis.
<li>If the ambiguity is partially resolved, the <att>choice</att> attribute value should be a list including the IDs of all analyses still considered possible.
<li>If the analysis has multiple disjunctions, the <att>choice</att> value should be a list including the IDs of all chosen interpretations in any order.
<li>Any disjunctions which have no daughters included in the <att>choice</att> value are unaffected by the disambiguation.
</ol>
<p>
For example, consider the following feature structure:
<!-- Editorial change in preceding line. -TL -->
<xmp>
[ id = F8
  cat = verb
  OR: (id = F8o1
       [ id = F8a1 ... ]
       [ id = F8a2 ... ]
       [ id = F8a3
         OR: (id = F8a3o
              [ id = F8a3a1 ... ]
              [ id = F8a3a2 ... ] ) ] )
  OR: (id = F8o2
       [ id = F8a4 ... ]
       [ id = F8a5 ... ] )
  OR: (id = F8o3
       [ id = F8a6 ... ]
       [ id = F8a7 ... ] ) ]
</xmp>
Or, drawn as a tree:
<!-- infl branch removed and value of cat indicated. -TL -->
<xmp>
                   F8
                    |
+---------+--------------+--------------+
|         |              |              |
cat       OR             OR             OR
|      id=F8o1        id=F8o2        id=F8o3
verb      |              |              |
     +----+----+      +--+--+        +--+--+
     |    |    |      |     |        |     |
   F8a1   |   F8a2  F8a4   F8a5    F8a6   F8a7
          OR
      id=F8a3o
          |
     +----+----+
     |         |
  F8a3a1    F8a3a2
</xmp>
<p>
To continue the example, the specification <tag>f.ptr target='F8' choice='F8a3a1 F8a3o F8a2 F8a4'</tag> in an analysis would indicate that the disjunction F8a3o is resolved in favor of its left-hand child F8a3a1, and F8o2 is resolved in favor of F8a4. The disjunction F8o1 is partially resolved by the elimination of F8a1 (and with the disambiguation of F8a3o, F8o1 is now effectively an ambiguity between F8a3a1 and F8a2). F8o3 is unaffected and remains ambiguous.
<p>
As a second example, note that if the nested disjunction F8a3o were omitted from the <att>choice</att> value, leaving 'F8a3a1 F8a2 F8a4', then F8o1 would be completely disambiguated, because only one of its children is listed, namely F8a2. Since F8a3o would not be kept, its disambiguation would be purely of academic interest.
<p>
An application that wishes to prune the tree of possible interpretations may follow this algorithm:
<ol>
<li>Flag all nodes listed in the <att>choice</att> attribute value. These should all be immediate daughters of an <tag>f.s.OR</tag> element; if not, the <att>choice</att> value is invalid.
<li>Check the siblings of each flagged feature structure: if they are unflagged, then delete them.
<li>Check each <tag>f.s.OR</tag>: if it has exactly one daughter remaining, then
<ol>
<li>delete the <tag>f.s.OR</tag> itself, and
<li>promote the remaining daughter to the position of the mother.
</ol>
<!-- The preceding two steps replace the following 3. Made the -->
<!-- change because the old mechanism treats nested ambiguities -->
<!-- one way if they are nested directly in an outer ambiguity -->
<!-- and another way if they are nested deeply within one -->
<!-- branch of the outer disjunction. We now never flag ORs; -->
<!-- instead we look at all of them for single flagged -->
<!-- daughters. -->
<!-- -->
<!-- < li >flag the parent of each chosen node. (Assert: the -->
<!-- parent is an < tag >f.s.OR< /tag >.) -->
<!-- < li >within each flagged f.s.OR , delete every -->
<!-- daughter node which is not flagged. -->
<!-- < li >check each flagged < tag >f.s.OR< /tag >: if it has -->
<!-- only one daughter, delete the < tag >f.s.OR< /tag > node -->
<!-- and promote the daughter to its position. -->
</ol>
In the first example given, flagging the chosen nodes <!-- and their parent nodes --> leaves nodes F8a3a1, F8a3o, F8a2, and F8a4 flagged. Weeding the unflagged siblings of flagged analyses removes F8a1, F8a3a2, and F8a5 from the tree. Promotion of only daughters results in elimination of F8a3o and F8o2 from the tree. The pruned tree would look like this:
<!-- infl removed as before. -TL -->
<xmp>
                   F8
                    |
+---------+--------------+--------------+
|         |              |              |
cat       OR            F8a4            OR
|      id=F8o1                       id=F8o3
verb      |                             |
     +----+----+                     +--+--+
     |         |                     |     |
  F8a3a1     F8a2                  F8a6   F8a7
</xmp>
<p>
In the second example, with F8a3o omitted from the <att>choice</att> value, only F8a3a1, F8a2, and F8a4 would be flagged. F8a3a2, F8a1, F8a3o, and F8a5 would be removed as unflagged siblings of flagged analyses. F8o1 would be replaced by F8a2, F8o2 by F8a4. If the tree is processed top-down, some entire branches may be removed before they are processed: in this case, the removal of F8a3o would eliminate the need to search for siblings of F8a3a1. The pruned tree would look like this:
<xmp>
                   F8
                    |
+---------+--------------+--------------+
|         |              |              |
cat      F8a2           F8a4            OR
|                                    id=F8o3
verb                                    |
                                     +--+--+
                                     |     |
                                   F8a6   F8a7
</xmp>
<!-- Add the following.
-TL --> <p> A similar mechanism may be desirable for the explicit disambiguation of representations using the <tag>tree</tag> and <tag>subtrees</tag> notation. TL will look into this. <p> Finally, it should be noted that the mechanism of explicitly marking disambiguation is particularly useful if a text is associated with a lexicon in which words are organized under lemmata, in which the various interpretations are grouped using <tag>f.s.OR</tag>. If a lexical occurrence of a word is disambiguated by its context, then the foregoing mechanism provides a straightforward way of pointing to the intended interpretation. <!-- End of insert. -TL --> <!> <h2>3.6 Interface to TR Work <p> The existing TEI tags <tag>formula</tag>, <tag>foreign</tag>, <tag>number</tag> seem to be usable for the objects they describe and so no provision was found necessary for marking these features using the feature structure notation or for including feature structure notations for them in the <q>starter set</q>. <!> <!> <h1>4 Unfinished Business <p> The following agenda items were not taken up for lack of time. <!-- Editorial change in preceding line. -TL --> <ul> <li>degree of certainty in resolving ambiguity <li>simplifications for special grammars <li>sample taggings </ul> </body> <!> </gdoc>