.* TEI Document No: TR M2
.* Title: Minutes of TR meeting, Luxembourg, Oct. 89
.* Drafted: LB 29 Oct 89
.* Revised: MSM into TEI tags 24 Apr 90
.*
.im gmlpaper ;.* Use GMLPAPER or GMLGUIDE (or -MLA)
.sr docfile = &sysfnam. ;.sr docversion = 'Corrected'
.im teigml
.* Document proper begins.
<gdoc>
<frontm>
<titlep>
<title>Minutes of the Meeting of the Text Representation Committee
<title>Held at the Commission of the European Communities, Luxembourg
<title>October 23-24, 1989
<author>Lou Burnard
<docnum>TEI &docfile.
<date>&docdate.
<attend>Present: Lou Burnard (LB); Roberto Cencioni (RC); Robin Cover (RCC); Susan Hockey (SH); Stig Johansson (SJ), chair; Francisco Marcos-Marin (FM); Michael Sperberg-McQueen (MSM).
</attend></titlep>
</frontm>
<!>
<body>
<!>
<h1>Agenda
<!>
SJ proposed that as many as possible of the working papers before the committee should be discussed and tabled a list of the five so far received. MSM and LB were requested to report on the work of the other committees so far. RC commented on the quantity of failed and irrelevant messages on the TEI-L listserver; there was little that could be done about filtering messages from the public list, but steps had been taken to avoid some of the failure messages.
<!>
<h1>Progress Reports
<!>
<p>LB summarised briefly the work carried out in the Metalanguage Committee the previous week. There had been some useful procedural decisions concerning majority decisions, document numbering and document processing which he recommended to the TR committee. The ML committee had recommended that other committees should use SGML wherever possible and should assume the existence of full SGML parsing software. It had asked the TR committee to consider the problems of translating sets of tagnames between one European language and another and of representing eight-bit encoded tagnames in 7-bit character sets. [Full minutes of the meeting are now available from the ListServer as TEI ML M10.]
<p>MSM described the progress of work by himself and LB in defining the structure and content of a database to document the textual features which the various working committees were defining and the names proposed for them. A description of the data to be collected and a suitable form to record it in would be made available as soon as possible.
<action><who>MSM, LB <act>to document database proposal <docref>ED P3 <duedate>15 Nov 89 </action>
<h1>Working Paper W7: DeRose on character sets
<p>The committee disagreed with the assumption that language implied a particular character set. The same language might use many different scripts. Several graphemes might be used by many different languages.
<p>There was much discussion of transliteration schemes. SGML allows the declaration of different character sets (defined with reference to a nominated base declaration, by default ISO 646) mainly for parsing purposes. It was not clear where an alphabet (regarded as a set of collating and tokenizing rules and mapped to the bytes of a character set) should be defined: this might be considered application-dependent. The bibliographic information defined by the Text Documentation committee would provide one slot for it, but it would not then be available to the SGML parser. Further study of this question was requested, with reference to relevant standards.
<action><who>SH <act>to produce a list of commonly used transliteration schemes <docref>TR W7 <duedate>1 Dec 89 </action>
<p>The committee felt that entity names should be chosen from existing sets rather than re-invented. Accented letters in particular should be represented by this means wherever possible. Where no name existed for an accented character, a standard entity name for the accent might be used. Contrary to DeRose's suggestion, it was proposed that accents and diacritics should precede the base letter and be listed top to bottom, left to right.
A list of standard names for languages, characters, accents, diacritics and other symbols should be built up, with reference to ISO 639, 3166, 8879, 6937, 646, and ANSI C3.59 inter alia.
<action><who>RC <act>to produce a list of standard entity names <docref>TR W7 <duedate>1 Dec 89 </action>
<p>Some disquiet was expressed at the implication in the paper that only ASCII was a legal code. Discussion should be character set neutral and English should not be assumed as the default language. References to the ISO 646 character set should specify use of the non-national characters only.
<p>As requested by the ML committee, the committee debated whether tagnames should be drawn from a more restricted character set than that used in document instances. No consensus was reached. Clarification was requested on the effect during parsing of tagnames containing shifted characters (as in ISO 6937).
<p>SJ raised the question of ambiguous punctuation marks. MSM suggested that entity references might be used to provide some disambiguation. Further information was requested.
<action><who>SJ <act>to provide a list of ambiguous punctuation marks <docref>TR W7 <duedate>1 Dec 89 </action>
<action><who>SDR <act>to redraft recommendations on character set usage in light of the committee's discussion <docref>TR W7 <duedate>? </action>
<h1>Document W8: Chesnutt on core tags
<p>The committee felt that tags should be interpreted in a context sensitive way. There was no need to include the level of a tag in its name, and not doing so would greatly simplify the process of markup.
<p>A list of proposed core text features was still to be derived: the paper had not begun to address this problem.
<h1>Document W3: Johansson on Corpora
There was a long discussion of points arising from this paper.
SJ suggested that defining characteristics of a corpus (rather than a collection) were a uniform encoding scheme and a uniform reference structure by which the origins of its component text units might be identified. It was noted that the `coding keys' of SJ's paper would be adequately represented by a set of suitable declarations in a DTD. A need for a tag to identify uncoded or non-linguistic text features (such as dingbats) was identified.
<p>Errors in the source text might be silently corrected and documented in the text header. It would be preferable to tag explicitly text features which the encoder regarded as erroneous. SH distinguished three possibilities: possibly erroneous, possibly erroneous and corrected, definitely erroneous. These might be tagged as follows:
<sl>
<li>[sic]eroneous[/sic]
<li>[correction by=SJ was=eroneous]erroneous[/correction]
<li>[error]eroneous[/error]
</sl>
<p>The need to tag features (for example long quotations) which did not nest within paragraphs was discussed. MSM said that the CONCUR feature could handle this.
<p>Further consideration of the need for a referencing system led to much ontological discussion. SJ felt that paragraphs and sentences were useful units of study in their own right rather than simply convenient reference points. However, it was important to know the principles by which they had been defined, since they were not well-defined linguistic concepts. In the absence of any formal definition for `sentence', the principles by which sentences had been identified should be documented in the text header. SJ noted that tagging sentences was more important, in his view, than tagging typographically obvious features such as lists. It was agreed that, provided clear principles had been stated which covered 95% of cases, explicit tagging might be needed only for the remaining 5%. There was some doubt about the best method of doing this: guidance was sought from the ML Committee.
<action><who>MSM <act>to raise with ML the problems of sporadically tagged sentence division <duedate>? </action>
<p>Numbers in lists could be regarded as titles or simply as attributes of each list item. The choice depended on the application.
<p>It was felt that embedded sentences, e.g. in direct speech, should not pose any particular problem. It was noted that the speech/narrative hierarchy could be distinguished from the sentence hierarchy.
<p>The need for consistent guidelines on the tagging of `foreign' words was identified. The typographical conventions used in sources were not always consistent. Sometimes a shift into a foreign language should imply a shift of alphabet, but not in all cases. Because change of language was always associated with some other element, at the least a word, MSM suggested that an attribute would be an appropriate way of encoding it:
<xmp>
[highlighted.phrase style=italic lang=french]comme ceci
[/highlighted.phrase]
</xmp>
<p>There was considerable discussion of the need to identify explicitly typographic features of the source. Some felt that such features were always of analytic importance, others that they might be of use only during an intermediate stage of text production, and yet others that they should be shunned entirely. It was argued that the need to reproduce the appearance of the source would eventually be met by standards such as DSSSL and SPDL and that complete physical description of a source was impossible. It was pointed out that however much TEI guidelines might deprecate the use of tags such as [italic], people would in any case invent them. In the absence of any consensus, a short list of basic typographic features with examples was requested in order to focus further discussion of their analytic functions.
<action><who>SH <act>to draft a sample list of up to ten typographic features <docref>TR W3 <duedate>1 Dec 89 </action>
<p>It was agreed that features such as numbers, fractions and dates needed special treatment. Hyphens in date ranges and stops in numbers needed attention, and it might be useful to normalise dates, for example [date, iso=19461209]9 December 1946[/date]. Further examples were requested.
<action><who>RC <act>to draft a paper summarising problems in representing numbers, dates etc. <docref>TR W3 <duedate>? </action>
<p>The need to tag subject matter and type of document was identified. If a closed set of descriptors was in use, its name should be specified. Recommendations on the advisability of using a controlled vocabulary would be helpful, and the document should call attention to existing thesaural lists wherever possible. SJ noted that text classification could also be deduced by investigating feature usage within a text, citing a recent book by Douglas Biber.
<action><who>SJ <act>to continue work on the points raised in the paper in light of the preceding discussion <docref>TR W3 <duedate>? </action>
<h1>Working Paper R7: DeRose on the AAP Standard
<p>This paper was agreed to be a good basis for the core tagset requested as TR W6. MSM repeated his early opposition to the naming conventions proposed, in particular to the prohibition on long names at the expense of comprehensibility.
<p>It was noted as a general rule that different tag names should be used only for features with different content models. If the content models of two features did not differ, they should be distinguished by using an attribute. The committee agreed with the paper's expressed preference for this method over the other three specified on page 4.
<p>The committee suggested that the discussion of tables could be usefully extended by reference to other schemes, notably that used by FORMEX.
<action><who>SJ <act>to ask SDR to add a section comparing the FORMEX tags for table definition <docref>TR R7 <duedate>? </action>
<p>The committee agreed to accept the list of features in the paper as a preliminary list of core elements. David Chesnutt should be asked to comment specifically on which of these tags should be included at the base level. Steven DeRose would be asked to include the features defined here in the TEI tag database as soon as this became available.
<action><who>SJ <act>to ask DC and SDR to collaborate on improving W6 <docref>TR R7, TR W6 <duedate>? </action>
<h1>Material prepared by Cencioni on office document standards
<p>RC began discussion of the material by asking whether proprietary encoding schemes should have been included (e.g. EBCDIC, the IBM PC `world trade alphabet', or even the Macintosh codeset), noting that an additional character set declaration added only one line to a DTD.
<p>The tendency of the materials studied to concentrate on document management facilities rather than document analysis was noted.
<p>The AAP reference manual was commended as a suitable example for the presentation of the TEI results. The FORMEX tag scheme seemed well structured but rather limited in scope. The CALS standard provided a more ambitious and well-designed complex SGML application, although in an area different from that targeted by TEI.
<p>The committee agreed that the material gathered in this working paper would be useful in the task of preparing feature sets for inclusion in the TEI tag database as soon as this became possible.
<action><who>RC <act>to collate relevant features from W1 into an ordered feature list <docref>TR W1 <duedate>? </action>
<h1>Working Paper W4: Mylonas on literary texts
<p>The committee noted that the basic distinction between prose and verse was not applicable to all texts, notably dramatic ones. Line-break problems did not necessarily need to be handled by a separate hierarchy and support for CONCUR was available.
<p>It was suggested that membership of the working group mentioned in the paper should be broadened to include representatives with more detailed knowledge of a wider range of literary genres.
<p>There was some discussion of the possible uses for physical line information, other than as reference information, in prose texts. Available mechanisms mentioned were CONCURrent hierarchies; empty tags at segment boundaries; explicit reference points.
<p>It was agreed that the document provided a good basis for further work but needed considerable expansion.
<action><who>MSM <act>to provide a list of genres for consideration by the working group <duedate>? </action>
<action><who>LB <act>to provide further comments <duedate>1 Dec 89 </action>
<action><who>SJ <act>to ask EM to continue her work on encoding of literary texts <duedate>15 Jan 90 </action>
<h1>Notes by Marcos-Marin on Cross references
<p>FM gave a brief tutorial on the cross-referencing mechanism used in IBM's GML markup scheme, which is almost identical to that available via the ID/IDREF mechanism of SGML. MSM noted that by requiring that id values be unique the SGML idref mechanism allows a parser to validate footnote references, which is impossible if they are simply given as content of an element.
<h1>Next Meeting
<p>This item was taken up at this point because of MSM's imminent departure.
<p>The next meeting would be held in Oxford around the end of February 1990. There was some discussion of the most profitable way of organising what would be the last funded meeting in the current cycle, e.g. by organising small focussed parallel discussion groups. MSM noted that material not submitted to the editors by the end of March could not be included in the draft Guidelines due for publication by the end of May. SH offered to arrange visiting lectures or seminars for those wishing to stay over in Oxford. Dates of the meeting would be confirmed before the end of November 1989.
<action><who>SH <act>to confirm next meeting date and time <duedate>1 Dec 89 </action>
<h1>Working Paper W2: Cover on biblical texts
<p>Due to lack of time the paper as a whole was not discussed in detail. RCC circulated some hair-raising examples of the complex textual structures typifying polyglot bibles, syllabaries etc. This led to a detailed discussion of different ways of handling textual variations. It was stressed that the appearance of the apparatus was not the primary object of such an exercise: variation was the feature to be tagged. Transcribing exactly a syntactically rigorous apparatus might achieve this end as well as creating multiple full sources with tags to indicate the points at which variation occurred.
<p>SH wished to distinguish textual variation of this kind from version control, of the kind which would typically be addressed by the Text Documentation committee. MSM argued that version control was exactly equivalent to critical apparatus. RCC wished to distinguish variation such as that introduced by translation from redactional variation. SH commented that different referencing schemes (RCC's 4th problem factor) were examples of parallel synchronous structures. It was agreed that the paper provided an excellent basis for discussion of these and related issues, for which a number of possible solutions existed. RCC was asked to convene an electronic working group, membership to include W. Ott, D. Durand, P. Robinson (Oxford) and others to be identified.
<action><who>RCC <act>to convene working group to discuss problems of versioning and apparatus <docref>TR W2 <duedate>? </action>
<h1>Working Paper W5: Renear on Philosophical Texts
<p>There was general agreement with the theoretical position advanced by the paper and with its modus operandi, though not everyone agreed that publishers' style sheets would provide as much useful information as the paper seemed to imply.
RC questioned the portability of tags at the level of description proposed by the paper. LB stated that the doctype containing the features defined by this paper would be additional to a generally agreed framework, not a substitute for it. Clearly there could be many such specialised doctypes.
<p>The committee wished to see lists of the tags so far identified by the group, including examples of their application. The form in which philosophical elements should be identified should be no different from that of the textual features to be identified by other working committees: as input to the TEI tag database.
<action><who>SJ <act>to ask Alan Renear to continue the work plan outlined in TR W5, providing the feature set input requested <docref>TR W5 <duedate>? </action>
<h1>Closing discussion
<p>There was some further discussion of character set problems. Guidelines on when entity references should be used in preference to private (or public) character set shifts were felt to be necessary. The interdependence of language and character sets was recognised. SH asked whether the international phonetic alphabet should be regarded as a character set or an alphabet.
<action><who>SDR <act>to consider status of IPA <duedate>? </action>
<p>It was agreed that the next meeting would be largely devoted to analysing the lists of features produced in the interim with a view to identifying omissions and any duplications not already filtered out by the editors.
<action><who>MSM, LB <act>to provide initial report from the TEI tag database <docref>? <duedate>1 Feb 90 </action>
<p>In conclusion, RC urged the committee to concentrate on the identification of helpful structural features rather than to reiterate interpretational problems. He also urged the TEI to seek registration of its work with the relevant standards bodies as soon as possible. LB stated that an action was already outstanding from the ML committee to raise this matter with the Steering Committee.
</body>
</gdoc>