Feature-structure markup for presentation at Oxford and Brown workshops

D. Terence Langendoen
Eanass Fahmy
Department of Linguistics
University of Arizona
Tucson, AZ 85721 USA
Email: langendt@arizona.edu

September 25, 1991

Document Number: TEI AI1 W9

ABSTRACT

This document is a conflation of the material presented by Terry Langendoen at the European TEI Workshop that was held at Oxford University in June 1991, and by Eanass Fahmy at the North American TEI Workshop that was held at Brown University in July 1991. Our task was to show how to use the "feature structure" markup proposals of the TEI Guidelines (we actually used a slightly revised version of those proposals that grew out of the Lexical Encoding Working Group meeting in Baltimore in January 1991) for the encoding of lexical items (words) in running text, and to demonstrate the applicability of the feature structure markup conventions for other than strictly linguistic purposes.

1 ENCODING USING FEATURE STRUCTURES

The most important tagging mechanism proposed in Chapter 6, "Analytic and Interpretive Information", of the TEI Guidelines, TEI P1, Version 1.1, is the one proposed for feature structures. A feature structure may be thought of as a bundle of one or more features, each of which may have a name, and each of which must have a value. The value may itself be another feature structure. Thus feature structures are recursive in nature.

Since the preparation of TEI P1, the TEI Analysis and Interpretation Committee has recommended certain changes in the encoding of feature structures. First, the name of a feature is no longer enclosed within a tag of its own but is instead the value of the name attribute of the feature tag. Second, if the value of a feature is not a structured object (e.g., another feature structure), then it is enclosed in an atomic tag. Thus, the sample feature structure markup provided on page 130 of TEI P1 would now look as follows.

... noun ...
In particular, the changes in the DTD for linguistic markup that support these revisions mean that feature-structure markup can be entered into a document using indentation and line breaks, as in this example, thus improving readability. It was always the committee's intention to provide for readability of markup, but the initial recommendations were written without the benefit of an SGML validator to test whether they worked precisely as intended. That defect has since been corrected.

One of the most important reasons for providing recommendations for linguistic markup is to provide an interchange mechanism for the various linguistic encoding schemes, particularly for lexical markup, that have been developed at various research centers around the world. The feature-structure mechanism proposed in TEI P1 is both general and flexible enough to accommodate any current and, we believe, any conceivable linguistic markup scheme. The generality and flexibility of the mechanism also mean that it can be used for the encoding of textual information of a not strictly grammatical nature as well. Indeed, as we now argue, it is a kind of universal markup language that can be tailored for just about any data analysis and retrieval purpose imaginable.

Features and feature structures are like records and fields

Note: The material in this section, including that in "More articulated feature-structure representation" through "Textual references to feature structures using the XREF tag", was presented by Langendoen at the Oxford workshop. Some minor post-workshop corrections have also been made. The sample markup assumes SHORTTAG has been set on.

The feature-structure tag can be thought of as a data-base record, where nnn is the record number. The feature tag can be thought of as a data-base field, where xyz specifies the name of the field. For example, consider the simple data-base record structure in Figure 1.
+----------------------------------------------------------------------+

   Rec-No          1
   Last-Name       O'Connor
   First-Name      Janet
   Middle-Initial  R
   Home-Phone      301-555-8639
   Birthday        26 November

   Figure 1: Simple data-base record

+----------------------------------------------------------------------+

This record can be "translated" into the feature structure in Figure 2.

+----------------------------------------------------------------------+

   O'Connor
   Janet
   R
   301-555-8639
   26 November

   Figure 2: Feature-structure translation of data-base structure

+----------------------------------------------------------------------+

More articulated feature-structure representation

Because features can have feature structures as values, we can create more highly articulated representations of the information in Figure 2, as shown in Figure 3.

+----------------------------------------------------------------------+

   O'Connor
   Janet
   R

   301
   555-8639

   26
   November

   Figure 3: More articulated feature-structure translation

+----------------------------------------------------------------------+

Further articulation of feature-structure representation

The representation in Figure 3 can be embedded, for example, in a list of similar representations, as shown in Figure 4.

+----------------------------------------------------------------------+

   ...

   Figure 4: Further articulation of feature-structure representation

+----------------------------------------------------------------------+

Textual references to feature structures using the XREF tag

Given structures containing information about individuals such as Janet O'Connor and Fred Saunders, cross-references (using xref) to those structures can be placed in a text file next to occurrences of expressions that refer to those individuals.

+----------------------------------------------------------------------+

   Dear Janet:

   Happy Birthday!

   Love,

   Fred

   Figure 5: Use of XREF pointers to feature structures in a text file

+----------------------------------------------------------------------+

Encoding an almanac fragment using feature structures

Note: This material was presented by Eanass Fahmy at the Brown workshop.

The following fragment shows how to encode information contained in an almanac using f.struct
Excerpts from World Exploration and Geography taken from The World Almanac and Book of Facts (1991), using TEI tagging. The World Almanac and Book of Facts, 1991. Transcribed and marked up July 1991 by Eanass Fahmy.
World Exploration and Geography Early Explorers of the Western Hemisphere

The first men to discover the New World or Western Hemisphere are believed to have walked across a land bridge from Siberia to Alaska, an isthmus since broken by the Bering Strait. From Alaska, these ancestors of the Indians spread through North, Central, and South America. Anthropologists have placed crossings at between 18,000 and 14,000 B.C.; but evidence found in 1967 near Puebla, Mexico, indicates mankind reached there as early as 35,000-40,000 years ago.

At first, these people were hunters using flint weapons and tools. In Mexico, about 7000-6000 B.C., they founded farming cultures, developing corn, squash, etc.

1497   John Cabot            Italian-English   Nova Scotia
1501   Rodrigo de Bastidas   Spanish           South America
1609   Henry Hudson          English-Dutch     Hudson River, Hudson Bay

Problems with using such a general-purpose mechanism for text analysis

There is not much point in using the feature-structure mechanism for text analysis and interpretation unless:

1. the information so represented can be easily accessed in ways that are of use and interest to scholars;

2. the process of entering and modifying data in those structures can be made simple and reliable.

The solution to problem 1 depends on the development of applications software which "understands" TEI-conformant markup, and is outside the scope of the TEI project as currently constituted. On the other hand, since some of the problems with data entry can be taken care of within SGML by providing easy-to-use entity definitions for complex feature structures, the solution to those problems is within the scope of the project. We now illustrate the process of developing a way of creating entity definitions for lexical markup, in which the entities function as abbreviations for feature-structure representations with a uniform interpretation.

2 USING FEATURE-STRUCTURE MARKUP FOR LEXICAL ENCODING

Recommendations in the document TEI AI1W2 for lexical encoding

Also on page 130 of TEI P1, Version 1.1, one reads:

   Work is in progress on a set of entity definitions for commonly used parts of speech and other grammatical information of the sort commonly recorded in tagged corpora.

In January, a working committee of the Analysis and Interpretation Committee met in Baltimore to come up with a list of grammatical features found in the nine official languages of the European Community plus Russian.
The committee decided to limit the features to those that can reasonably be said to be morphologically marked in at least one of those languages. It was agreed that these features should be used as the basis for developing the initial or starter set of entity definitions referred to in the above quotation.

On 24 May the working committee released the document TEI AI1W2, which contains a suggested list of features for the ten languages in question. The features are organized into three types, as follows.

* word-level features (features that may be associated with any lexical item, regardless of its type)

* recurrent features (features that are associated with lexical items of various types, but not with all lexical items)

* individual features (features that are associated with lexical items of exactly one type)

By type, we meant essentially the traditional notion of part of speech or word class. After some debate, we settled on the name category, which we construed as the name of one of the word-level features, whose value may be any of the parts of speech or word classes of any of the languages in question.

The committee also recognized the need to allow encoders to underspecify the values of particular features, and proposed a mechanism for doing that, which includes identifying a designated set of values for underspecification, and enabling the encoder to propose a set of feature structure declarations that specify exactly what combinations of features and feature values are allowable in the document under analysis; see TEI AI1W3 for details. So far, no one has attempted to implement any feature structure declarations.

In addition, the feature-structure encoding mechanisms in TEI P1 provide for arbitrary boolean combinations of features, so that one can specify that a particular lexical item is not third-person singular-number or that it is either masculine-gender or neuter-gender.
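The meaning of such boolean combinations can be modelled outside SGML. The following is a minimal sketch in Python, not TEI P1 markup: a feature structure is assumed to be a plain dictionary, and the helper names (has, negate, all_of, any_of) are hypothetical names chosen for illustration. An omitted feature is simply absent from the dictionary.

```python
# A feature structure is modelled here as a dict of feature name -> value.
# All helper names below are illustrative assumptions, not TEI vocabulary.

def has(name, value):
    """Constraint satisfied when the feature `name` has exactly `value`."""
    return lambda fs: fs.get(name) == value

def negate(constraint):
    """Constraint satisfied when `constraint` is not."""
    return lambda fs: not constraint(fs)

def all_of(*constraints):
    """Conjunction of constraints."""
    return lambda fs: all(c(fs) for c in constraints)

def any_of(*constraints):
    """Disjunction of constraints."""
    return lambda fs: any(c(fs) for c in constraints)

# "not third-person singular-number"
not_third_singular = negate(
    all_of(has("person", "third"), has("number", "singular"))
)

# "either masculine-gender or neuter-gender"
masculine_or_neuter = any_of(
    has("gender", "masculine"), has("gender", "neuter")
)

third_plural_verb = {"category": "verb", "person": "third", "number": "plural"}
third_singular_verb = {"category": "verb", "person": "third", "number": "singular"}
neuter_pronoun = {"category": "pronoun", "gender": "neuter"}

print(not_third_singular(third_plural_verb))    # constraint holds
print(not_third_singular(third_singular_verb))  # constraint fails
print(masculine_or_neuter(neuter_pronoun))      # constraint holds
```

The sketch treats constraints as functions from feature structures to truth values, so negation and disjunction compose freely; how such combinations are written in the markup itself is a separate matter.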
In what follows, we begin the development of a method of constructing entity definitions for lexical encoding based on the work reported in TEI AI1W2.

The construction of entity definitions for lexical encoding

With each feature value and feature name, we associate an alphanumeric code. The codes for feature names must all be distinct, whereas those for feature values need only be distinct among the features associated with a particular category. That is, within a feature structure containing a feature whose name is category and which has a particular value (e.g. noun), we require that the codes for all the values of the other features be distinct.(1) The codes for the values used for underspecification also cannot be used for any other value. We suggest the following codes be reserved for the underspecification values.

Code   Feature value
____   _____________
6      any
7      default
8      unknown
9      not-applicable
0      no-claim

In these presentations, however, we do not show how entity definitions can be created for underspecified values.

To construct an entity definition for a particular feature-value pair, one concatenates the code for the name of the feature, the delimiter "-", and the code for the value of the feature. For example, suppose C is the code for the feature name category and N is the code for the feature value noun. Then the entity that represents this feature name-value pair is that shown in (1).

(1) &C-N;

More precisely, this entity represents the structure given in (2).

(2) noun

Similarly, suppose K is the code for the feature name case and N is the code for the feature value nominative. Then the entity that refers to this feature name-value pair is (3).

(3) &K-N;

This entity represents the structure given in (4).

(4) nominative

Since the feature-name codes are all distinct, each name-value combination is associated with a distinct entity. Now suppose we wish to encode the feature structure for a noun whose case is nominative.
The full feature-structure representation is shown in (5).

(5) noun nominative

This structure can be encoded, using the entities &C-N; and &K-N;, as in (6).

(6) &C-N;&K-N;

Since we have required that the codes for the feature values all be distinct within a given word class, we can represent entire feature structures representing lexical information, such as (6), by entities that are formed by concatenating the value codes only. Thus (6) can be represented by (7).

(7) &NN;

Similarly, we can represent a more highly specified feature structure, such as that for a singular, feminine, nominative noun, with initial capital and not proper, as in (8), assuming the coding to be presented below.

(8) &NSFNU5;

The feature name-value space proposed in TEI AI1W2 is enormous, and would require at least a two-character code for each feature name and value. For illustrative purposes, we have substantially pared down the feature name and value combinations that are derivable from TEI AI1W2, so that a reasonably mnemonic one-character alphanumeric code for feature names and values can be provided. As one can see, a very large number of lexical distinctions can still be encoded, using a simple and relatively easy to remember encoding scheme.

Coding of word-level features and values

The only word-level feature that has been retained as such from TEI AI1W2 is category. The initial-capital feature has been made into a recurrent feature for the categories noun, pronoun and adjective only. The other word-level features have been omitted. Here are the codes for this feature and its values.

Code   Feature name
____   ____________
C      category

Code   Feature value
____   _____________
N      noun
R      pronoun
J      adjective
A      article
D      adverb
V      verb
P      preposition
C      coordinator
S      subordinator
T      particle
I      interjection
K      punctuation

Coding of recurrent features and values

We use the following very small subset of the recurrent features and values provided for in TEI AI1W2.
Code   Feature name
____   ____________
P      person
          Code   Feature value
          ____   _____________
          1      first
          2      second
          3      third
N      number
          S      singular
          P      plural
G      gender
          F      feminine
          M      masculine
          T      neuter
K      case
          N      nominative
          G      genitive
          D      dative
          A      accusative
D      degree
          1      positive
          2      comparative
          3      superlative
I      initial-capital
          U      plus
          L      minus

Feature and value codes for single categories

Here are the codes for a small subset of the features and values associated with particular word classes in TEI AI1W2.

Nouns

Recurrent features (values and codes as in "Coding of recurrent features and values"):

* number
* gender
* case
* initial-capital

Individual features and values:

Code   Feature name
____   ____________
E      proper
          Code   Feature value
          ____   _____________
          4      plus
          5      minus

There are 540 distinct encodings for nouns using this feature set.

Pronouns

Recurrent features:

* person
* number
* gender
* case
* initial-capital

Individual features and values:

Code   Feature name
____   ____________
S      possessive
          Code   Feature value
          ____   _____________
          4      plus
          5      minus

There are 2,160 distinct encodings for pronouns using this feature set.

Adjectives

Recurrent features:

* number
* gender
* case
* initial-capital
* degree

There are 720 distinct encodings for adjectives using this feature set.

Articles

Recurrent features:

* number
* gender
* case

There are 60 distinct encodings for articles using this feature set.

Adverbs

Recurrent feature:

* degree

There are 4 distinct encodings for adverbs using this feature set.

Verbs

Recurrent features:(2)

* person
* number
* gender

Individual features and values:

Code   Feature name
____   ____________
T      tense
          Code   Feature value
          ____   _____________
          R      present
          A      past
          U      future
M      mood
          D      indicative
          R      imperative
          J      subjunctive
V      verb-form
          E      finite
          I      infinitive
          G      gerund
          C      participle
A      auxiliary
          4      plus
          5      minus

There are 10,820 distinct encodings for verbs using this feature set.
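The counts quoted for each category can be reproduced under one assumption that the text does not state outright: each feature either takes one of its listed values or is left out, so a feature with n values contributes n + 1 choices. The Python sketch below (the names VALUE_COUNTS, CATEGORY_FEATURES and count_encodings are ours, for illustration only) multiplies these choices. Under that assumption it reproduces the figures quoted for nouns, pronouns, adjectives, articles and adverbs; for verbs the same rule gives 11,520 rather than the 10,820 quoted, so that figure may contain a slip or rest on a restriction not listed here.

```python
from math import prod

# Number of values listed for each feature in the coding tables.
VALUE_COUNTS = {
    "person": 3, "number": 2, "gender": 3, "case": 4, "degree": 3,
    "initial-capital": 2, "proper": 2, "possessive": 2,
    "tense": 3, "mood": 3, "verb-form": 4, "auxiliary": 2,
}

# Features retained for each category (the category feature itself is fixed).
CATEGORY_FEATURES = {
    "noun": ["number", "gender", "case", "initial-capital", "proper"],
    "pronoun": ["person", "number", "gender", "case",
                "initial-capital", "possessive"],
    "adjective": ["number", "gender", "case", "initial-capital", "degree"],
    "article": ["number", "gender", "case"],
    "adverb": ["degree"],
    "verb": ["person", "number", "gender",
             "tense", "mood", "verb-form", "auxiliary"],
}

def count_encodings(category):
    """Count encodings, assuming each feature may also be left unspecified."""
    return prod(VALUE_COUNTS[f] + 1 for f in CATEGORY_FEATURES[category])

for category in CATEGORY_FEATURES:
    print(f"{category:10} {count_encodings(category):>6,}")
```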
Prepositions, coordinators, subordinators, particles, interjections

All features associated with these categories in TEI AI1W2 have been omitted.

Punctuation

Individual features and values:

Code   Feature name
____   ____________
O      orientation
          Code   Feature value
          ____   _____________
          O      open
          C      close
          M      matched
          U      unary

There are 5 distinct encodings for punctuation marks using this feature set.

(1) If we require that all feature-value codes must also be distinct, we simplify the encoding of feature structures when the feature category is omitted or underspecified. For present purposes, we assume that this feature is never omitted or underspecified.

(2) In TEI AI1W2, these features are contained in a feature structure which is the value of the agreement feature. For simplicity, we ignore this nicety here, and simply represent these features directly as features of the main feature structure.

------------------------------------------------------------------------

Appendix A

Entity definitions for single feature-value pairs

FIRST" >
SECOND" >
THIRD" >
SINGULAR" >
PLURAL" >
NOMINATIVE" >
GENITIVE" >
DATIVE" >
ACCUSATIVE" >
FEMININE" >
MASCULINE" >
NEUTER" >
POSITIVE" >
COMPARATIVE" >
SUPERLATIVE" >
" >
" >
" >
" >
" >
" >
PRESENT" >
PAST" >
FUTURE" >
INDICATIVE" >
IMPERATIVE" >
SUBJUNCTIVE" >
FINITE" >
INFINITIVE" >
GERUND" >
PARTICIPLE" >
" >
" >
OPEN" >
CLOSE" >
MATCHED" >
UNARY" >
NOUN" >
PRONOUN" >
ADJECTIVE" >
ARTICLE" >
ADVERB" >
VERB" >
PREPOSITION" >
COORDINATOR" >
SUBORDINATOR" >
PARTICLE" >
INTERJECTION" >
PUNCTUATION" >

------------------------------------------------------------------------

Appendix B

Entity definitions for lexical feature structures needed for sample markup

&C-N;">
&C-N;&N-S;&E-5;">
&C-N;&N-P;&E-5;">
&C-N;&N-P;&I-U;&E-4;">
&C-R;">
&C-R;&P-3;&N-S;&G-F;&S-4;">
&C-R;&P-1;&N-S;&K-N;&S-5;">
&C-R;&P-3;&K-N;&S-5;">
&C-R;&P-3;&S-4;">
&C-R;&P-3;&G-T;&S-5;">
&C-R;&P-3;&N-S;&G-T;&S-5;">
&C-J;">
&C-J;&I-U;">
&C-A;">
&C-A;&N-S;">
&C-D;">
&C-D;&D-3;">
&C-V;">
&C-V;&P-3;&N-S;&T-R;&M-D;&V-E;&A-4;">
&C-V;&T-A;&V-C;&A-4;">
&C-V;&T-A;&V-C;&A-5;">
&C-V;&P-3;&N-P;&T-R;&M-D;&V-E;&A-4;">
&C-V;&V-I;&A-4;">
&C-V;&V-I;&A-5;">
&C-V;&T-U;&M-D;&V-E;&A-4;">
&C-V;&P-3;&N-P;&M-D;&V-E;&A-4;">
&C-V;&P-3;&N-S;&M-D;&V-E;&A-4;">
&C-V;&T-R;&M-D;&V-E;&A-4;">
&C-V;&P-3;&N-S;&T-R;&M-D;&V-E;&A-5;">
&C-V;&T-A;&M-D;&V-E;&A-5;">
&C-P;">
&C-C;">
&C-S;">
&C-T;">
&C-I;">
&C-K;">
&C-K;&O-O;">
&C-K;&O-C;">
&C-K;&O-M;">
&C-K;&O-U;">

------------------------------------------------------------------------

Appendix C

Sample document instance showing lexical markup with implied word lemmatizat

First five sentences of Mary Robinson: Thoughts on the Condition of Women: an electronic excerpt in TEI tagging Transcribed by the Brown Women Writers Project, edited for demonstration purposes by C. M. Sperberg-McQueen, and marked up with toy lexical tagging by D. T. Langendoen. Mary Robinson, Thoughts on the Condition of Women, and on the Injustice of Mental Subordination. Second Edition. London: Printed for T. N. Longman, and O. Rees, 1799. Lexical tags capitalized 26 June 1991. One typo corrected; changed analysis of "will". Prepared 21 June 1991 by Terry Langendoen.

Custom&NS5;,&KU; from&P; the&A; earlie&st;&D3; periods&NP5; of&P; antiquity&NS5;,&KU; has&V3SRDE4; endeavoured&VAC5; to&S; place&VI5; the&A; female&J; mind&NS5; in&P; the&A; subordinate&J; ranks&NP5; of&P; intellectual&J; sociability&NS5;.&KU;
Woman&NS5; has&V3SRDE4; ever&D; been&VAC4; considered&VAC5; as&P; a&AS; lovely&J; and&C; fascinating&J; part&NS5; of&P; the&A; creation&NS5;,&KU; but&C; her&R3SF4; claims&NP5; to&P; mental&J; equality&N; have&V3PRDE4; not&D; only&D; been&VAC4; questioned&VAC5;,&KU; by&P; envious&J; and&C; interested&J; sceptics&NP5;;&KU; but&C;,&KU; by&P; a&AS; barbarous&J; policy&NS5; in&P; the&A; other&J; sex&NS5;,&KU; considerably&D; depressed&J;,&KU; for&P; want&NS5; of&P; liberal&J; and&C; classical&J; cultivation&NS5;.&KU;
I&R1SN5; will&VUDE4; not&D; expatiate&VI5; largely&D; on&P; the&A; doctrines&NP5; of&P; certain&J; philosophical&J; sensualists&NP5;,&KU; who&R3N5; have&V3PDE4; aided&VAC5; in&P; this&AS; destructive&A; oppression&NS5;,&KU; because&S; an&AS; illustrious&J; British&JU; female&NS5;,&KU; (&KO;whose&R34; death&NS5; has&V3SDE4; not&D; been&VAC4; sufficiently&D; lamented&VAC5;,&KU; but&C; to&P; whose&R34; genius&NS5; posterity&NS5; will&VUDE4; render&VI5; justice&NS5;)&KC; has&V3SRDE4; already&D; written&VAC5; volumes&NP5; in&P; vindication&NS5; of&P; The&A; <lb>Rights&NP5; of&P; Woman&NS5;.&KU;
The&A; writer&NS5; of&P; this&AS; letter&NS5;, though&S; avowedly&D; of&P; the&A; same&J; school&NS5;,&KU; disdains&V3SRDE5; the&A; drudgery&NS5; of&P; servile&J; imitation&NS5;.&KU;
The&A; same&J; subject&NS5; may&VRDE4; be&VI4; argued&VAC5; in&P; a&AS; variety&NS5; of&P; ways&NP5;;&KU; and&C; though&S; this&AS; letter&NS5; may&VRDE4; not&D; display&VI5; the&A; philosophical&J; reasoning&NS5; with&P; which&R3T5; The&R; Rights&NP5; of&P; <lb>Woman&NS5; abounded&VADE5; it&R3ST5; is&V3SRDE4; not&D; less&D; suited&VAC5; to&P; the&A; purpose&NS5;.&KU;
For&T; it&R3ST5; requires&V3SRDE5; a&AS; legion&NS5; of&P; Wollstonecrafts&NPU4; to&S; undermine&VI5; the&A; poisons&NP5; of&P; prejudice&NS5; and&C; malevolence&NS5;.&KU;
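
The composite codes used in this sample can be unpacked mechanically from the coding tables in section 2. The following is a minimal sketch in Python, not part of the TEI proposals: the table layout and the function names (lookup, decode, pair_entities) are hypothetical choices made for illustration. It assumes the category code always comes first and that value codes are distinct within a category. The tables list R for both present tense and imperative mood; since Appendix B pairs R with tense in the verb definitions, the sketch resolves R to present and leaves imperative out.

```python
# Each feature is (feature-name code, feature name, {value code: value}),
# transcribed from the coding tables in section 2.
PERSON = ("P", "person", {"1": "first", "2": "second", "3": "third"})
NUMBER = ("N", "number", {"S": "singular", "P": "plural"})
GENDER = ("G", "gender", {"F": "feminine", "M": "masculine", "T": "neuter"})
CASE = ("K", "case", {"N": "nominative", "G": "genitive",
                      "D": "dative", "A": "accusative"})
DEGREE = ("D", "degree", {"1": "positive", "2": "comparative",
                          "3": "superlative"})
INITIAL_CAPITAL = ("I", "initial-capital", {"U": "plus", "L": "minus"})
PROPER = ("E", "proper", {"4": "plus", "5": "minus"})
POSSESSIVE = ("S", "possessive", {"4": "plus", "5": "minus"})
TENSE = ("T", "tense", {"R": "present", "A": "past", "U": "future"})
# Assumption: imperative (also listed as R) is omitted to avoid the clash.
MOOD = ("M", "mood", {"D": "indicative", "J": "subjunctive"})
VERB_FORM = ("V", "verb-form", {"E": "finite", "I": "infinitive",
                                "G": "gerund", "C": "participle"})
AUXILIARY = ("A", "auxiliary", {"4": "plus", "5": "minus"})
ORIENTATION = ("O", "orientation", {"O": "open", "C": "close",
                                    "M": "matched", "U": "unary"})

# Category code -> (category value, features retained for that category).
CATEGORIES = {
    "N": ("noun", [NUMBER, GENDER, CASE, INITIAL_CAPITAL, PROPER]),
    "R": ("pronoun", [PERSON, NUMBER, GENDER, CASE,
                      INITIAL_CAPITAL, POSSESSIVE]),
    "J": ("adjective", [NUMBER, GENDER, CASE, INITIAL_CAPITAL, DEGREE]),
    "A": ("article", [NUMBER, GENDER, CASE]),
    "D": ("adverb", [DEGREE]),
    "V": ("verb", [PERSON, NUMBER, GENDER, TENSE, MOOD,
                   VERB_FORM, AUXILIARY]),
    "P": ("preposition", []),
    "C": ("coordinator", []),
    "S": ("subordinator", []),
    "T": ("particle", []),
    "I": ("interjection", []),
    "K": ("punctuation", [ORIENTATION]),
}

def lookup(category_code, value_code):
    """Return (name code, feature name, value) for a value code."""
    for name_code, name, values in CATEGORIES[category_code][1]:
        if value_code in values:
            return name_code, name, values[value_code]
    raise ValueError(
        f"code {value_code!r} is not defined for category {category_code!r}"
    )

def decode(code):
    """Expand a composite code such as 'NSFNU5' into a feature dictionary."""
    structure = {"category": CATEGORIES[code[0]][0]}
    for value_code in code[1:]:
        _, name, value = lookup(code[0], value_code)
        structure[name] = value
    return structure

def pair_entities(code):
    """Spell a composite code as a run of name-value entity references."""
    pairs = [f"&C-{code[0]};"]
    for value_code in code[1:]:
        name_code, _, _ = lookup(code[0], value_code)
        pairs.append(f"&{name_code}-{value_code};")
    return "".join(pairs)

print(decode("NSFNU5"))
print(pair_entities("V3SRDE4"))
```

For the codes checked against Appendix B (for example NS5, R3SF4, VAC5 and V3SRDE4), pair_entities returns the same sequence of name-value entities that appears there as the replacement text.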