Welcome home! I hope you enjoyed the meeting as much as I did. There follows my best attempt at some minutes, dutifully if experimentally tagged. Please read solicitously for anything I have forgotten or totally misunderstood. I can send you a copy of the notes I actually transcribed if you want them - they are slightly longer as to content but much less perspikyuous as to expression. I have not added institutional affiliation for anyone: please do so if you think it necessary. And then post.

best wishes

Lou

------------------------------------------------------------

&AI; TEI-AI-M1

Minutes of the meeting held at Oxford University Press, September 21, 1989

Lou Burnard

Present: Robert Amsler (RSA); Steve Anderson (SA); Bram Boguraev (BB); Lou Burnard (LB); Nicoletta Calzolari (NC); Nancy Ide (NI); Terry Langendoen (TL), chair; Winfried Lenders (WL); Nelleke Oostdijk (NO); Bill Poser (BP); Beatrice Santorini (BS), replacing Mitch Marcus; Gary Simons (GS); Michael Sperberg-McQueen (MSM); Donald Walker (DW); Antonio Zampolli (AZ)

Introductory business

TL welcomed members of the committee and drew their attention to the agenda previously circulated.

Expenses

Committee members were reminded to send *all* receipts for travel to MSM at UIC. A maximum per diem of $20 was payable without receipts; with receipts up to 50% of travel costs would be reimbursed. North Americans should send originals and would be reimbursed (up to 50%) in US dollars; Europeans may send copies and would be reimbursed in ECUs.

Timetable

Three meetings were scheduled for the current funding cycle. A draft of the committee's guidelines was due by March 1990. The next meeting could conveniently (except for some Europeans) be held at the LSA/MLA meeting in Washington during the week between Christmas and New Year. The final meeting would be held, probably in Tucson, at the end of February. MSM gave an overview of the TEI timetable; funding for the second cycle had already been requested of NEH and a decision would be made by the end of April 1990.

Committee Brief

TL stressed that the committee's task in the current phase was more to decide *what* to mark up than exactly how to do it. He had had the opportunity of discussing SGML with the chair of the Metalanguage Committee (David Barnard) and saw no technical reason why it should not support the committee's requirements, however imperspicuously. It was clear that the committee's brief would take SGML into corners it had not as yet visited, but the Metalanguage Committee would have the job of identifying and then supporting any extensions to the standard which might be required. There was no requirement that the committee adopt SGML for its own internal working: its job was to produce tagsets. He likened the committee's role to that of the body which defined IPA: it had to define a universal language for text structure, capable of supporting any one of a number of possibly inconsistent or mutually exclusive theoretical frameworks. No unification of theories was expected of it, nor would it be responsible for implementing any part of its proposals. As with IPA again, the object was to define an interchange format rather than an application-specific one.

General Discussion

RSA drew a parallel between the committee's task and that of defining purely typographic markup. It should focus on tagsets rather than on theoretical discussions. BP asked whether linguistic rules should not be tagged as such, since they would necessarily form a part of the content of some texts to be tagged. TL said that a full analysis of formulae as such should be deferred to the second cycle. The goal should be to tag such things adequately to ensure their correct representation, independent of any particular formatter. If two theories are disjoint with respect to a particular feature (traces for example), it should be possible but not obligatory to supply it. BP stressed the importance of making provision for less orthodox theories using entirely different representations. TL said subcommittees should aim to be catholic, identifying as many theories as possible. NC asked whether, when considering existing markup schemes, e.g. TOSCA or LOB, it would be necessary to identify conversion rules, grouping tags according to their function. TL said that such mappings were the responsibility of the Metalanguage Committee. The main focus should be on linguistic issues and alphabetic texts. BP asked about polylinguistic texts and non-Roman alphabets, for example Gardiner's Egyptian Grammar, or Japanese texts with embedded Chinese characters. TL replied that the Text Representation committee would address these problems, although the question of syntactic features linking parallel structures was within the A&I committee's brief. MSM said that the question of synchronised structures was a good example of an area where several committees would need to work, as were cross-reference and the markup of arbitrary discontinuous (`fuzzy') text segments. He stressed the importance of good communication between the committee heads and the editors in this respect.

TL stated that linguistic markup should include all of semantics and pragmatics but, acknowledging Lakoff's view that all domain-specifying categories were artificial, it should also be possible to smear all these distinctions.

WL asked whether the intention was to provide a language-independent superset of tagsets, citing the MATER standard (ISO 6156) as one which had found the need to include a German-specific appendix. TL stated that his personal preference was for a superset. DW pointed out that no particular tagset would use all available tags and MSM that the Steering Committee had already decided tagsets should be extensible, and re-nameable.

Subcommittee Work

Dictionary Encoding Subcommittee

RSA reported on the Dictionary Encoding Standard worked out in collaboration with Frank Tompa at Bellcore, and circulated copies of their joint paper describing it. Two areas of activity were proposed for the subcommittee: one was to broaden the work to include multilingual (non-English, polylingual) dictionaries and the other to consider etymology. Frank Abate had volunteered to investigate the latter. Alain Pierrot was collaborating with NI on a DTD for Hachette's dictionaries. John Fought and Carol Van Ess-Dykema were developing a multilingual standard, with language-specific extensions for each language.

AZ asked whether this subcommittee would also deal with machine-tractable dictionaries, or electronic lexica. RSA replied that there was considerable overlap of interests among the proposed membership. A brief discussion of the wisdom, or lack of it, of recoding lexica expressed in LISP in SGML ensued. AZ opined that it was organisationally preferable for this subcommittee to focus only on printed dictionaries, aiming at a neutral interchangeable format. Output from other subcommittees (morphology for example) would be useful at a later stage in the project. [Since the meeting, a separate subcommittee to address the issue of defining a standard for encoding electronic lexica has been proposed, under the chairmanship of Bob Ingria]

Other members proposed for the subcommittee included Susan Warwick (ISSCO, Geneva), Carol Van Ess-Dykema (NSA), John Fought (?) and representatives from groups at Pisa, Bellcore, IBM, and IKP (Bonn). Communication between these and commercial publishers was important. RSA said that the paper had been tabled to invite discussion: it needed many more examples and more descriptive text. He would circulate any comments received on extensions to it.

Phonetics/Phonology Subcommittee

The work of this group, chaired by Bill Poser, would to some extent overlap with that of the Text Representation committee. It would additionally address such issues as hesitation, intonation, and overlapping, and the correspondence between phonemes and graphemes, but would need to prioritize these carefully. AZ reminded the committee of the importance of supporting the needs of the speech synthesis community in this context. RSA asked whether the dictionary encoding should attempt to define phonemic equivalences in IPA or something else: his view was that it should not.

After lunch, the following were suggested as possible members for this subcommittee: Ken Church; Mark Liberman; Henry Thompson; Janet Bernstein/Jared Brunstein (?); Lauri Lamell (MIT); Janet Pier-Humber[t] (NWestern); Brian MacWhinney; Paul Roussin; Bob Mercer; Jan Svartvik; John McCarthy. TL asked whether there was not a need for an "ordinary working phonologist" on the group and undertook to find one.

BP said that the work of the subcommittee should include "gestural stuff": its task was not to propose a Klatt-style "arpabet" but to make it possible for anyone who wished to use one to define such a code in a portable way. He asked where the kind of phonemic markup employed should be specified and whether its semantics should be specified with reference to e.g. IPA. MSM saw this as another area where this committee's work overlapped with that of another: the Text Documentation committee would provide a space into which declarations of this kind could be placed, but little more. If texts include application-specific data (for example F0 values) it was clearly necessary to provide portable ways of interpreting them.

Morphology Subcommittee

This subcommittee currently comprises Steve Anderson (chair), Winfried Lenders and Gary Simons. It will address such standard issues as the delimiting and classifying of words, aiming at generic solutions rather than value lists. SA suggested that many substantive categories such as dialectal or usage variants are not morphological but lexical information. The subcommittee should focus on the representation of the internal structure of words, recognising however that simply delimiting morphemes would be inadequate for discontinuous segments (e.g. in Arabic) or for the use of such tricks as ablaut in Germanic languages, or metathesis in Salish to render aspectual distinctions. SA suggested that the most promising line was to identify and generalise the relationships existing between different forms, regarding morphology as the internal syntax of words.

Members proposed for the subcommittee included: Martin Chodorow (?); Richard Sproat (AT&T); Kimmo Koskenniemi (Helsinki); Lauri Karttunen (Xerox-PARC); Jorge Hankamer (UC Santa Cruz); Mark Aronoff (SUNY Stony Brook); John McCarthy; Lisa Selkirk (UMass at Amherst); Burkhard Schaeder (Siegen).

There was some discussion of the level of generalisation appropriate to the subcommittee's work. NC and AZ pointed out that for most people identifying the lexical item (lemma) appropriate to a surface form was of far more importance than its internal structure. AZ asked whether compound words would also be considered. SA replied that these were a special case of the general rule. TL said it was important to support different levels of analysis. RSA suggested that some redundancy in the encoding would be a helpful way of supporting this. He also recommended that as much language-specific information as possible should be identified and shared amongst members of the subcommittee. BP remarked on the existence of many large corpora of Amerindian languages exhibiting many unusual features. RSA recommended the use of consultants with expertise in these areas, mentioning David Nash for Australian Aboriginal languages.

Resuming the earlier discussion, MSM pointed out that the coding schemes used by existing tagged corpora often blurred lexical, syntactic and morphological distinctions. He felt that it was enough to identify places where a value could be recorded without attempting to unravel its semantics. RSA noted that the DEI often specified alternative ways of encoding a given feature; GS that tags were treated isomorphically with data in the Brown Corpus. AZ was firmly of the opinion that a well-defined set of values should be identified, for example, for part of speech, rather than an open-ended set. RSA remarked that SGML gave us a better notation than that available to earlier projects, which often needed to attach attribute values to every token because they lacked the notion of markup distributed throughout a text.

Syntax Subcommittee

Beatrice Santorini reported on behalf of Mitch Marcus, who had been asked to form this subcommittee, together with herself and Nelleke Oostdijk. They felt that whatever was to be provided should be able to specify both ambiguous and hierarchic syntactic structures, to cope with a variety of re-analysis phenomena and other syntactic ambiguities. A single word (e.g. the Japanese causative) might require a bi-clausal analysis. Multiple simultaneous representations of a string might be needed, for example "(to take advantage [of) John]." TL remarked that David Barnard had stated that such things could be managed by SGML. SA asked whether it was also capable of Postal-style arc-pair grammar. On ambiguity, RSA highlighted the need to distinguish the deliberately ambiguous from the merely vague, contrasting "The duck is ready to eat" with "light house keeping". NO asked whether idiomatic and figurative phrases belonged in this subcommittee: most present agreed with her that they did not. Idiomatic phrases formed a convenient unit, but were not in fact phrases. They also had multiple class membership. TL mentioned the need to support inheritance of properties within a hierarchy by placing attributes as high as possible in the tree: if tense is only marked as a property of verbs, it becomes difficult to deduce the tense of a sentence.

The following people/syntactic theory pairs were suggested for this subcommittee: Annie Zaenen (lexical functional grammar); Pustejovsky (lexical semantics); Geoffrey Leech; Geoffrey Sampson (Firthian); Robin Fawcett (systemics); Beth Levin (government binding); Wehrli (GB); Gazdar (GPSG); Don Hindle (Bell, Murray Hill); Hans Uszkoreit; Bob Ingria; Elisabet Engdahl.

Actions

  1. Subcommittee chairs should co-opt people to their committees and produce interim reports by Halloween, 1989.
  2. Subcommittee chairs were requested to save and precis for distribution all correspondence with potential members. TL offered the services of graduate students for assistance with TEI work.
  3. All working documents should be sent to TL who would assign numbers and post them on the TEI-ANA server.
  4. The Amsler/Tompa paper will be given a number shortly, and comments were requested as soon as possible, particularly with respect to extending its scope to include polylingual and non-English dictionaries and to discussing the etymology problem.
  5. The committee should agree the structure of an overall interim report. It was agreed to defer discourse analysis to the second cycle.
  6. MSM requested that draft documents be distributed using some form of descriptive markup to simplify their later reuse. He and LB are working on a tagset for this purpose, but committee members should not await its appearance before putting finger to keyboard.
  7. Minutes of the meeting to be distributed by the end of September 1989.
  8. TL to investigate causes of Hans Uszkoreit's silence.
  9. NI to circulate details of the MLA/LSA meeting.

Next Meeting

In conjunction with LSA/MLA meetings in Washington, probably following the TEI presentation on 30 December 1989. All subcommittee chairs would be attending one or both of these meetings.