.sr docfile = &sysfnam. ;.sr docversion = 'Draft';.im teigmlp1 .* Document proper begins. .sr docdate '2 December 1991' The construction of pointers for lexical encoding <author>D. Terence Langendoen <docnum>TEI &docfile. <date>&docdate. </titlep> <!> </frontm> <!> <body> <h1>Introduction <h2>Background <p>In <cit>TEI AI1 W9</cit>, Eanass Fahmy and I proposed a method for constructing and using entities for representing the morphological structures of the words in a document. At the Myrdal meeting, Steven DeRose suggested that using pointers would provide for more control over validation, make it easier to make changes and corrections to marked-up texts, and speed up parsing and processing generally. This suggestion met with approval at the Myrdal meeting, and I agreed to recast the method in terms of pointers. While I have, in the interests of time, limited the domain of application to the same sets of feature structures, features and values as in <cit>TEI AI1 W9</cit>, I have chosen to use a two-character, rather than a one-character, coding scheme for feature names and values, in anticipation of the need to extend the method to a larger subset of the features and values in <cit>TEI AI1 W2</cit>. I also represent the person and number features associated with verbs as part of a feature structure which is the value of a feature named <term>agreement</term>, as in <cit>TEI AI1 W2</cit>. <h2>Various assumptions about certain elements and attributes <p>In this document, the following elements and attributes have been renamed as follows. <ol> <li><tag>f.struct</tag> becomes <tag>fs</tag> <li><tag>feature</tag> becomes <tag>f</tag> <li><tag>atomic</tag> becomes <tag>atm</tag> <li><tag>xref</tag> becomes <tag>x</tag> <li><tag>f.s.or</tag> becomes <tag>or</tag> <li><term>target</term> becomes <term>t</term> </ol> <p>The following new tags and attributes are used. 
<ol> <li><tag>fs.lib</tag> <li><tag>f.lib</tag> <li><tag>any</tag> <li><tag>default</tag> <li><tag>unknown</tag> <li><tag>not.applicable</tag> <li><tag>no.claim</tag> <li><term>element</term>, an attribute of <tag>x</tag>, <tag>and</tag>, <tag>not</tag> and <tag>or</tag>, which takes as values the names of elements. </ol> <p>I propose <tag>fs.lib</tag> and <tag>f.lib</tag> to group together in <term>libraries</term> sets of predefined <tag>fs</tag>s and <tag>f</tag>s, with <term>id</term>s that can be pointed to from the text or from within the libraries. I propose <tag>any</tag>, <tag>default</tag>, <tag>unknown</tag>, <tag>not.applicable</tag> and <tag>no.claim</tag> as empty tags to designate the predefined values for underspecification spelled out in <cit>TEI AI1 W2</cit>. <h1>Conventions for forming ID values <h2>Two-character coding scheme for feature names and values <p>First I give a two-character code for each of the feature names and values in <cit>TEI AI1 W9, Coding of word-level features and values, Coding of recurrent features and values, and Feature and value codes for single categories</cit>. 
<dl tsize=25> <dthd>feature name or value <ddhd>code <dt>category (part-of-speech) <dd>ps <dt>noun <dd>nn <dt>pronoun <dd>pn <dt>adjective <dd>aj <dt>article <dd>ar <dt>adverb <dd>av <dt>verb <dd>vb <dt>preposition <dd>pp <dt>coordinator <dd>cd <dt>subordinator <dd>sb <dt>particle <dd>pt <dt>interjection <dd>ij <dt>punctuation <dd>pu <dt>person <dd>pe <dt>first <dd>r1 <dt>second <dd>r2 <dt>third <dd>r3 <dt>number <dd>nu <dt>singular <dd>sg <dt>plural <dd>pl <dt>gender <dd>ge <dt>feminine <dd>fe <dt>masculine <dd>ma <dt>neuter <dd>ne <dt>case <dd>ca <dt>nominative <dd>nm <dt>genitive <dd>gt <dt>dative <dd>dt <dt>accusative <dd>ac <dt>degree <dd>de <dt>positive <dd>g1 <dt>comparative <dd>g2 <dt>superlative <dd>g3 <dt>initial-capital <dd>il <dt>minus <dd>c0 <dt>plus <dd>c1 <!-- < term>common < /term> replaces < term>proper < /term> --> <!-- to simplify the coding scheme --> <dt>common <dd>co <dt>minus <dd>m0 <dt>plus <dd>m1 <dt>possessive <dd>po <dt>minus <dd>s0 <dt>plus <dd>s1 <dt>agreement <dd>ag <!-- takes an < tag>fs< /tag> as value whose < tag>f< /tag>s are --> <!-- < term>person< /term> and < term>number< /term> --> <dt>first singular <dd>r1sg <dt>second singular <dd>r2sg <dt>third singular <dd>r3sg <dt>first plural <dd>r1pl <dt>second plural <dd>r2pl <dt>third plural <dd>r3pl <dt>tense <dd>te <dt>present <dd>pr <dt>past <dd>pa <dt>mood <dd>md <dt>indicative <dd>in <dt>imperative <dd>im <dt>subjunctive <dd>sj <dt>verb-form <dd>vf <dt>finite <dd>fi <dt>infinitive <dd>nf <dt>gerund <dd>gr <dt>participle <dd>pc <dt>auxiliary <dd>au <dt>minus <dd>x0 <dt>plus <dd>x1 <dt>orientation <dd>or <dt>open <dd>op <dt>close <dd>cl <dt>matched <dd>mt <dt>unary <dd>un </dl> <p>Second, the underspecification tags are given the following codes. 
<dl tsize=35>
<dthd>feature name or value <ddhd>code
<dt>any <dd>00
<dt>default <dd>01
<dt>unknown <dd>97
<dt>not.applicable <dd>98
<dt>no.claim <dd>99
</dl>
<h2>Formation of identifiers
<h3>Identifiers for Fs
<p>Identifiers for <tag>f</tag>s are formed as in <cit>TEI AI1 W2</cit> by joining the code for the <term>name</term> and the code for the <term>value</term>, but without the hyphen. For example, the <tag>f</tag> whose name is <term>number</term> and whose value is <term>singular</term> receives the identifier <term>nusg</term>, as in the following illustration.
<xmp><![ CDATA [
<f id=nusg name=number><atm>singular
]]>
</xmp>
<p>Here is another example, this time using an underspecified value.
<xmp><![ CDATA [
<f id=nu01 name=number><default>
]]>
</xmp>
<h3>Identifiers for FSs
<p>Identifiers for <tag>fs</tag>s are formed as in <cit>TEI AI1 W2</cit> by joining the codes for the values of the enclosed <tag>f</tag>s. <fn>However, if an underspecified value is used, the code for the name and the second digit of the value must be used if the underspecified value is to appear in the <tag>fs</tag> identifier.</fn> For example, the <tag>fs</tag> that designates a singular common noun with no initial capital, and which is understood by default as third person (such as <cited>antiquity</cited>), can receive the identifier <term>nnsgm1c0</term>, as in the following illustration. <fn>I assume that no coding for the default person value is necessary in this situation.</fn>
<xmp><![ CDATA [
<fs id=nnsgm1c0>
   <x f t=psnn>
   <x f t=nusg>
   <x f t=com1>
   <x f t=ilc0>
   <x f t=pe01>
</fs>
]]>
</xmp>
<p>As the preceding illustration shows, the <tag>fs</tag> encloses <tag>x</tag>s which point to the <tag>f</tag>s that it comprises. <fn>The <term>f</term> value is the value of the <term>element</term> attribute of <tag>x</tag>.</fn> The intended semantics is that the <tag>f</tag>s that are pointed to are copied (except for their <term>id</term> attributes) into the positions of the <tag>x</tag>s.
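<p>The copying semantics just described can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the proposal: the dictionary <term>F_LIB</term> is a hypothetical in-memory stand-in for an <tag>f.lib</tag>, and the function name <term>resolve_fs</term> is likewise an assumption made for the example. Each pointer is replaced by a copy of the <tag>f</tag> it points to, with the <term>id</term> left behind.

```python
# Hypothetical stand-in for an <f.lib>: maps each f identifier to its
# (name, value) pair. The identifiers follow the name-code + value-code rule.
F_LIB = {
    "psnn": ("category", "noun"),
    "nusg": ("number", "singular"),
    "com1": ("common", "plus"),
    "ilc0": ("initial-capital", "minus"),
    "pe01": ("person", "default"),
}


def resolve_fs(pointers, f_lib):
    """Replace each <x f t=...> pointer by a copy of the f it points to.

    The copy keeps the name and value but drops the id, as the text specifies.
    """
    resolved = []
    for target in pointers:
        if target not in f_lib:
            raise KeyError("pointer target %r is not in the library" % target)
        name, value = f_lib[target]
        resolved.append({"name": name, "value": value})
    return resolved


# The fs for "antiquity" from the illustration above.
antiquity = resolve_fs(["psnn", "nusg", "com1", "ilc0", "pe01"], F_LIB)
for f in antiquity:
    print(f["name"], "=", f["value"])
```

A pointer whose target is missing from the library raises an error here; that is one way the validation benefit of pointers mentioned in the introduction might be realized.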
The <tag>x</tag>s could be given their own <term>id</term> attributes if desired.
<p>The situation with <tag>f</tag>s which have <tag>fs</tag>s as values is somewhat more complicated. Suppose that an <tag>fs</tag> has an <tag>f</tag> whose name is <term>agreement</term> and whose value is another <tag>fs</tag>, which encloses two <tag>f</tag>s, one whose name is <term>person</term> with the value <term>third</term> and the other whose name is <term>number</term> with the value <term>singular</term>. The outer <tag>fs</tag> will have an identifier with the sequence <term>r3sg</term> in it, and will contain an <tag>x</tag> which points to an <tag>f</tag> with the identifier <term>agr3sg</term>. The latter <tag>f</tag> will contain an <tag>x</tag> pointing to an <tag>fs</tag> with the identifier <term>r3sg</term>, and this <tag>fs</tag> will contain two <tag>x</tag>s, one pointing to an <tag>f</tag> with the identifier <term>per3</term> and the other pointing to an <tag>f</tag> with the identifier <term>nusg</term>. The situation is illustrated as follows.
<xmp><![ CDATA [
<fs id=...r3sg...>
   <x f t=...>
   ...
   <x f t=agr3sg> --------------+
   <x f t=...>                  |
   ...                          |
</fs>                           |
                                |
<f id=agr3sg name=agreement> <--+
   <x fs t=r3sg> --------+
                         |
<fs id=r3sg> <-----------+
   <x f t=per3> ------------+
   <x f t=nusg> ------------x---+
</fs>                       |   |
                            |   |
<f id=per3 name=person> <---+   |
   <atm>third                   |
                                |
<f id=nusg name=number> <-------+
   <atm>singular
]]>
</xmp>
<p>For example, if the structure to be encoded is that of a present tense, indicative, auxiliary verb showing third person singular agreement, such as <cited>has</cited> or <cited>is</cited>, the incomplete <tag>fs</tag> in the preceding illustration could be fleshed out as follows.
<xmp><![ CDATA [
<fs id=vbr3sgprinx1>
   <x f t=psvb>
   <x f t=agr3sg>
   <x f t=tepr>
   <x f t=mdin>
   <x f t=vf01>
   <x f t=aux1>
</fs>
]]>
</xmp>
<h3>Identifiers for ambiguous structures
<p>Suppose one wished to mark up an occurrence of <cited>thinner</cited> as ambiguous between an interpretation as a singular noun and one as a comparative adjective. One way would be as a disjunction of two <tag>x</tag>s with its own identifier formed by concatenation from the identifiers for the <tag>x</tag>s, as in the following illustration; the <tag>x</tag>s in turn point to the <tag>fs</tag>s which make up the choice.
<xmp><![ CDATA [
<or x id=nnsgajg2>
   <x fs t=nnsg>
   <x fs t=ajg2>
</or>
]]>
</xmp>
<p>A more elaborate example is provided by <cited>fast</cited>, which one might wish to mark up as a singular noun, a verb of some sort, an adjective or an adverb. A possible encoding of this structure is the following.
<xmp><![ CDATA [
<or x id=nnsgvbvf9ajg1avg1>
   <x fs t=nnsg>
   <x fs t=vbvf9>
   <x fs t=ajg1>
   <x fs t=avg1>
</or>
]]>
</xmp>
<p>Next suppose one wished to mark up an occurrence of <cited>was</cited> as showing either first or third person agreement. The <tag>or</tag> can be used to group either <tag>atm</tag>s or <tag>f</tag>s, the latter requiring us to repeat the name of the <tag>f</tag>. Since we have not seen the need to provide identifiers for <tag>atm</tag>s (though it is certainly possible to do so), we will take the latter option, despite its wordiness. First, we construct the value of the <term>agreement</term> <tag>f</tag>.
<xmp><![ CDATA [
<fs id=r1r3sg>
   <or x id=per1per3>
      <x f t=per1>
      <x f t=per3>
   </or>
   <x f t=nusg>
</fs>
]]>
</xmp>
The <term>agreement</term> <tag>f</tag> in turn looks as follows.
<xmp><![ CDATA [
<f id=agr1r3sg name=agreement>
   <x fs t=r1r3sg>
]]>
</xmp>
<p>A possible encoding of the structure for <cited>was</cited> would then be the following.
<xmp><![ CDATA [
<fs id=vbr1r3sgpainx1>
   <x f t=psvb>
   <x f t=agr1r3sg>
   <x f t=tepa>
   <x f t=mdin>
   <x f t=vf01>
   <x f t=aux1>
</fs>
]]>
</xmp>
<p>Next, we consider how an encoding for <cited>have</cited> as a present-tense inflected form can be constructed. We assume that the agreement structure represents a choice between any person with plural number, and either first or second person with any number. The first of these agreement structures has the following form.
<xmp><![ CDATA [
<fs id=pe0pl>
   <x f t=pe00>
   <x f t=nupl>
</fs>
]]>
</xmp>
The second structure has the following form.
<xmp><![ CDATA [
<fs id=r1r2nu0>
   <or x id=per1per2>
      <x f t=per1>
      <x f t=per2>
   </or>
   <x f t=nu00>
</fs>
]]>
</xmp>
<p>The agreement structure, then, is an <tag>or</tag> as follows.
<xmp><![ CDATA [
<or x id=pe0plr1r2nu0>
   <x fs t=pe0pl>
   <x fs t=r1r2nu0>
</or>
]]>
</xmp>
<p>It is pointed to by an <term>agreement</term> <tag>f</tag> as follows.
<xmp><![ CDATA [
<f id=agpe0plr1r2nu0 name=agreement>
   <x or t=pe0plr1r2nu0>
]]>
</xmp>
<p>The entire structure is as follows.
<xmp><![ CDATA [
<fs id=vbpe0plr1r2nu0prinx1>
   <x f t=psvb>
   <x f t=agpe0plr1r2nu0>
   <x f t=tepr>
   <x f t=mdin>
   <x f t=vf01>
   <x f t=aux1>
</fs>
]]>
</xmp>
<p>Finally, we consider the encoding of <cited>have</cited> as ambiguous among indicative, imperative, subjunctive and infinitive forms. As an indicative form, it has the structure just given. As an imperative form, it can be assumed to show second person, any number agreement. As a subjunctive form, it can be assumed to show agreement with any person and number. As an infinitive form, agreement is not applicable. We give here just the encoding of the outer <tag>fs</tag> and the <tag>or</tag> it encloses. The details of the encoding can actually be inferred from the structure of the identifiers. The character <q>.</q> is used to separate the codes understood to be joined by <tag>and</tag>, and the identifier for an <tag>and</tag> ends in that character.
<xmp><![ CDATA [
<fs id=vb.pe0plr1r2nu0prin.r2nu0prim.pe0nu0prsj.ag8te8md8nf.x1>
   <x f t=psvb>
   <or and id=pe0plr1r2nu0prin.r2nu0prim.pe0nu0prsj.ag8te8md8nf.>
      <and x id=pe0plr1r2nu0prin.>
         <x f t=agpe0plr1r2nu0>
         <x f t=tepr>
         <x f t=mdin>
         <x f t=vf01>
      </and>
      <and x id=r2nu0prim.>
         <x f t=agr2nu0>
         <x f t=tepr>
         <x f t=mdim>
         <x f t=vf01>
      </and>
      <and x id=pe0nu0prsj.>
         <x f t=agpe0nu0>
         <x f t=tepr>
         <x f t=mdsj>
         <x f t=vf01>
      </and>
      <and x id=ag8te8md8nf.>
         <x f t=ag98>
         <x f t=te98>
         <x f t=md98>
         <x f t=vfnf>
      </and>
   </or>
   <x f t=aux1>
</fs>
]]>
</xmp>
<p>To avoid the use of <tag>and</tag> in cases like this, one could require the use of an <tag>f</tag> named, say, <term>inflection</term> (code <term>if</term>), whose value is an <tag>x</tag> which points to an <tag>fs</tag> which encloses pointers to the various <tag>f</tag>s which are represented by inflectional morphology, such as <term>agreement</term>, <term>tense</term>, <term>mood</term>, <term>verb-form</term>, <term>number</term>, <term>person</term>, etc. Schematically, the result would look as follows.
<xmp><![ CDATA [
<fs id=vbpe0plr1r2nu0prinr2nu0primpe0nu0prsjag8te8md8nfx1>
  <x f t=psvb>
  <x f t=ifpe0plr1r2nu0prinr2nu0primpe0nu0prsjag8te8md8nf> -----------------+
  <x f t=aux1>                                                              |
</fs>                                                                       |
                                                                            |
<f id=ifpe0plr1r2nu0prinr2nu0primpe0nu0prsjag8te8md8nf name=inflection> <---+
  <x or t=pe0plr1r2nu0prinr2nu0primpe0nu0prsjag8te8md8nf> -----+
                                                                |
<or x id=pe0plr1r2nu0prinr2nu0primpe0nu0prsjag8te8md8nf> <------+
  <x fs t=pe0plr1r2nu0prin> ---+
  <x fs t=r2nu0prim> ----------x---+
  <x fs t=pe0nu0prsj> ---------x---x---+
  <x fs t=ag8te8md8nf> --------x---x---x---+
</or>                          |   |   |   |
                               |   |   |   |
<fs id=pe0plr1r2nu0prin> <-----+   |   |   |
  <x f t=agpe0plr1r2nu0>           |   |   |
  <x f t=tepr>                     |   |   |
  <x f t=mdin>                     |   |   |
  <x f t=vf01>                     |   |   |
</fs>                              |   |   |
                                   |   |   |
<fs id=r2nu0prim> <----------------+   |   |
  <x f t=agr2nu0>                      |   |
  <x f t=tepr>                         |   |
  <x f t=mdim>                         |   |
  <x f t=vf01>                         |   |
</fs>                                  |   |
                                       |   |
<fs id=pe0nu0prsj> <-------------------+   |
  <x f t=agpe0nu0>                         |
  <x f t=tepr>                             |
  <x f t=mdsj>                             |
  <x f t=vf01>                             |
</fs>                                      |
                                           |
<fs id=ag8te8md8nf> <----------------------+
  <x f t=ag98>
  <x f t=te98>
  <x f t=md98>
  <x f t=vfnf>
</fs>
]]>
</xmp>
<p>Systematic use of the <term>inflection</term> <tag>f</tag> would significantly change the other encoding proposals made here, however.
<h1>Libraries of FSs and Fs
<p>The <tag>fs</tag>s, and the <tag>or</tag>s that group pointers to <tag>fs</tag>s, can be collected together in one place, either externally as a feature-structure library or internally within an <tag>fs.lib</tag>. The <tag>f</tag>s, and the <tag>or</tag>s that group pointers to <tag>f</tag>s, can be collected together in a similar way, with the internal collection occurring within an <tag>f.lib</tag>.
<h1>Sample text markup
<p>Here is a markup of the first sentence of <cit>Mary Robinson, Thoughts on the Condition of Women</cit> corresponding to what appears in <cit>TEI AI1 W9, Sample document instance showing lexical markup with implied word lemmatization</cit>.
In the following illustration, the lemmatization is made explicit by means of <tag>s</tag>s. The <tag>hilited</tag>s and <tag>lb</tag>s have been omitted. <xmp><![ CDATA [ <s id=w1>Custom<x fs t=nnsgm1c1></s> <s id=w2>, <x fs t=puun></s> <s id=w3>from <x fs t=pp></s> <s id=w4>the <x fs t=ar></s> <s id=w5>earlie&st; <x fs t=avg3></s> <s id=w6>periods; <x fs t=nnplm1c0></s> <s id=w7>of <x fs t=pp></s> <s id=w8>antiquity<x fs t=nnsgm1c0></s> <s id=w9>, <x fs t=puun></s> <s id=w10>has <x fs t=vbr3sgprinx1></s> <s id=w11>endeavored <x fs t=vbpapcx0></s> <s id=w12>to <x fs t=sb></s> <s id=w13>place <x fs t=vbnfx0></s> <s id=w14>the <x fs t=ar></s> <s id=w15>female <x fs t=ajde8></s> <s id=w16>mind <x fs t=nnsgm1c0></s> <s id=w17>in <x fs t=pp></s> <s id=w18>the <x fs t=ar></s> <s id=w19>subordinate <x fs t=ajde8></s> <s id=w20>rank <x fs t=nnsgm1c0></s> <s id=w21>of <x fs t=pp></s> <s id=w22>intellectual <x fs t=ajde8></s> <s id=w23>sociability<x fs t=nnsgm1c0></s> <s id=w24>. <x fs t=puun></s> ]]> </xmp> <p>The <tag>fs.lib</tag> and <tag>f.lib</tag> remain to be provided. 
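<p>Since the details of an encoding can be inferred from the structure of its identifier, the <tag>fs</tag> identifiers pointed to in the sample markup can be unpacked mechanically even before the libraries are written. The following Python sketch illustrates the idea under stated assumptions: the table and function names (<term>VALUES_BY_FEATURE</term>, <term>decode_fs_id</term>) are hypothetical, the codes are taken from the tables above, and the sketch handles only simple identifiers. It returns a flat list, so it does not rebuild the nested <term>agreement</term> structure, and it does not mark alternatives in identifiers for ambiguous structures.

```python
# Feature-name codes (used in fs identifiers only before an underspecified value).
NAME_CODES = {
    "ps": "category", "pe": "person", "nu": "number", "ge": "gender",
    "ca": "case", "de": "degree", "il": "initial-capital", "co": "common",
    "po": "possessive", "ag": "agreement", "te": "tense", "md": "mood",
    "vf": "verb-form", "au": "auxiliary", "or": "orientation",
}

# Value codes, grouped by the feature they belong to.
VALUES_BY_FEATURE = {
    "category": {"nn": "noun", "pn": "pronoun", "aj": "adjective",
                 "ar": "article", "av": "adverb", "vb": "verb",
                 "pp": "preposition", "cd": "coordinator",
                 "sb": "subordinator", "pt": "particle",
                 "ij": "interjection", "pu": "punctuation"},
    "person": {"r1": "first", "r2": "second", "r3": "third"},
    "number": {"sg": "singular", "pl": "plural"},
    "gender": {"fe": "feminine", "ma": "masculine", "ne": "neuter"},
    "case": {"nm": "nominative", "gt": "genitive", "dt": "dative",
             "ac": "accusative"},
    "degree": {"g1": "positive", "g2": "comparative", "g3": "superlative"},
    "initial-capital": {"c0": "minus", "c1": "plus"},
    "common": {"m0": "minus", "m1": "plus"},
    "possessive": {"s0": "minus", "s1": "plus"},
    "tense": {"pr": "present", "pa": "past"},
    "mood": {"in": "indicative", "im": "imperative", "sj": "subjunctive"},
    "verb-form": {"fi": "finite", "nf": "infinitive", "gr": "gerund",
                  "pc": "participle"},
    "auxiliary": {"x0": "minus", "x1": "plus"},
    "orientation": {"op": "open", "cl": "close", "mt": "matched",
                    "un": "unary"},
}
VALUE_CODES = {code: (feature, value)
               for feature, codes in VALUES_BY_FEATURE.items()
               for code, value in codes.items()}

# Second digit of the underspecification codes 00, 01, 97, 98, 99.
UNDERSPECIFIED = {"0": "any", "1": "default", "7": "unknown",
                  "8": "not.applicable", "9": "no.claim"}


def decode_fs_id(fs_id):
    """Split a simple fs identifier into (feature name, value) pairs."""
    features = []
    i = 0
    while i < len(fs_id):
        code = fs_id[i:i + 2]
        digit = fs_id[i + 2:i + 3]
        if code in NAME_CODES and digit in UNDERSPECIFIED:
            # Name code plus one digit: an underspecified value.
            features.append((NAME_CODES[code], UNDERSPECIFIED[digit]))
            i += 3
        elif code in VALUE_CODES:
            features.append(VALUE_CODES[code])
            i += 2
        else:
            raise ValueError("cannot decode %r at position %d" % (fs_id, i))
    return features


for fs_id in ["nnsgm1c0", "vbr3sgprinx1", "ajde8", "puun"]:
    print(fs_id, decode_fs_id(fs_id))
```

The decoding is unambiguous for these identifiers because the sets of name codes and value codes do not overlap, and no value code begins with a digit.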
</body>
<!>
<appendix>
<h1>List of noncompound feature names and values by code
<dl tsize=25>
<dthd>feature name or value <ddhd>code
<dt>any <dd>00
<dt>default <dd>01
<dt>unknown <dd>97
<dt>not.applicable <dd>98
<dt>no.claim <dd>99
<dt>accusative <dd>ac
<dt>agreement <dd>ag
<dt>adjective <dd>aj
<dt>article <dd>ar
<dt>auxiliary <dd>au
<dt>adverb <dd>av
<dt>initial-capital minus <dd>c0
<dt>initial-capital plus <dd>c1
<dt>case <dd>ca
<dt>coordinator <dd>cd
<dt>close <dd>cl
<dt>common <dd>co
<dt>degree <dd>de
<dt>dative <dd>dt
<dt>feminine <dd>fe
<dt>finite <dd>fi
<dt>positive <dd>g1
<dt>comparative <dd>g2
<dt>superlative <dd>g3
<dt>gender <dd>ge
<dt>gerund <dd>gr
<dt>genitive <dd>gt
<dt>interjection <dd>ij
<dt>initial-capital <dd>il
<dt>imperative <dd>im
<dt>indicative <dd>in
<dt>common minus <dd>m0
<dt>common plus <dd>m1
<dt>masculine <dd>ma
<dt>mood <dd>md
<dt>matched <dd>mt
<dt>neuter <dd>ne
<dt>infinitive <dd>nf
<dt>nominative <dd>nm
<dt>noun <dd>nn
<dt>number <dd>nu
<dt>open <dd>op
<dt>orientation <dd>or
<dt>past <dd>pa
<dt>participle <dd>pc
<dt>person <dd>pe
<dt>plural <dd>pl
<dt>pronoun <dd>pn
<dt>possessive <dd>po
<dt>preposition <dd>pp
<dt>present <dd>pr
<dt>category (part-of-speech) <dd>ps
<dt>particle <dd>pt
<dt>punctuation <dd>pu
<dt>first <dd>r1
<dt>second <dd>r2
<dt>third <dd>r3
<dt>possessive minus <dd>s0
<dt>possessive plus <dd>s1
<dt>subordinator <dd>sb
<dt>singular <dd>sg
<dt>subjunctive <dd>sj
<dt>tense <dd>te
<dt>unary <dd>un
<dt>verb <dd>vb
<dt>verb-form <dd>vf
<dt>auxiliary minus <dd>x0
<dt>auxiliary plus <dd>x1
</dl>
</appendix>
</gdoc>