TEI CE M 01: TEI Character Set Work Group Meeting Minutes, 2002-07-23/24, Tübingen, Germany
Initials used for people present
- DA Deb Anderson
- SB Syd Bauman
- MB Michael Beddow
- DB David Birnbaum
- BB Brian Bruya
- LB Lou Burnard
- PD Patrick Durusau
- TM Tone Merete Bruvik
- MT Masayuki Toyoshima
- CW Christian Wittern
Note that TM and BB were only present during day 2, and that PD missed the first hour or two of day 1.
Initials used for people not present
- EO Espen Ore
Meeting took place in ZDV, University of Tübingen, Tuesday 23 and Wednesday 24 July, 2002.
Contents
- Introduction, administrative announcements
- Review of relevant sections in P4
- Review of use cases and current practice
- Architectural issues: Processing model for TEI texts, modularization of WSD.
- Closing remarks, thanks to the hosts.
Introduction, administrative announcements
SB taking minutes.
Thanks to local organizers.
Building we are in closes at 16:30, but we can stay later 1 .
LB hands out P4 CDs. Note that paper insert still says ‘1999’, ignore. It's new.
CW suggests that when viewing Guidelines 2 it would be good if reader could tell what text has been modified. General agreement. Editors agree in principle; they point out that there is a not-quite-yet-published errata list for P4. 3
Welcome to visitor MT.
Review of relevant sections in P4
Editors thankful to CW & CE-WG for effort put into P4.
CW asks if effort should be put into P4 or P5 (answer from LB, with SB concurring, is that except for errors and the like, we're done with P4, concentrate on P5.) The question then arises as to whether the requisite chapters in P5 should be rewrites of P4 or start afresh. LB answers either, hope there's a base in P4 that's usable.
CW asks if P5 will support SGML at all. LB responds there will be resistance to dropping SGML. But WG could recommend that P5 SGML projects use Unicode as the base character encoding.
SB points out SGMLers could still use P4, of course.
Note to XML migration WG to consider character set issues, particularly transcoding.
- in front of every document in subset or header;
- in separate free-standing document & link to it either by explicit or implicit link;
- use current external entity mechanism.
SB asks should we work on end of Ch 25?
MB points out 25 not used often, but 4 is pedagogically wrongish, too much ‘how we got here from there’ stuff.
Editors point out audience for P4 was P3 users, but P5 can start with new user in mind.
MB states that in particular the reader is told about variants 4 too much too early.
DB suggests Unicode's ‘gentle intro to characters and glyphs’. CW asserts it is geared towards software developers, not encoders. DB says he has some experience with it with his students, and still recommends it.
LB points out that CH is unlike rest of Guidelines because it is introductory not a reference. LB recommends, analogous to SG, CH presents that introduction necessary to understand Unicode as you need to know it for TEI, but written as reference not as tutorial.
MT asks why not divide P5 into a reference and a separate supporting (tutorial, introduction?) part. In response PD asked which parts, if any, are not needed. SB thought the details of byte-order; LB pointed out that MB was claiming the history was not needed in a tutorial. CW suggested that the history is needed in the tutorial. 5
LB suggests WG thinks about what goes into P5 in place of CH. A complete rethink may be in order, stating principles up front (may want 2 chapters, one with how to, other with whys and wherefores).
CW points out that completely free-standing is less visible; LB that he wasn't excluding included & separate (like SG).
CW suggests that we further nail down principles underlying character sets (CH & WD) of P5, then find volunteers to draft.
On the strength of MB's volunteering for part of the task, and on CW's recommendation, the WG appointed MB to draft first an outline (by mid- to late August), and then a draft of the replacement(s) for CH and WD.
After MB expressed some trepidation, as he does not feel qualified to draft several of the subsections that would be involved, LB and others assured him that he was not being left responsible for the entirety — he should feel free to leave ‘this section to be written by someone else’ as the content of some of the subsections. LB noted that the subsection headings should still be included in the outline. 6
The next question discussed was how to divide up the various topics that need to be covered. MB points out that there is lots of stuff TEI has to do even if other bodies had done their jobs perfectly. LB suggests not dividing at drafting stage, but rather cover topics, and then go through draft 7 .
MB suggests avoiding clashes by paying attention to MSS WG work.
Responding to SB's query CW says CH needs to fill in gaps. Many projects answered ‘we use English, so we have no problem’. PD raised the question, if they don't think they have a problem, do they?
LB: need a section ‘why this chapter is needed’
DB suggests we pick up Lou's suggestion and outline what CH should look like, and then suck in parts from P4 as needed.
MB to draft new P4 during mid-Aug with PD & tei-chars help; DB suggests outline early
CW suggests we defer this discussion until later in meeting
Under ‘items we need to cover’, reviewed CW's presentation at Pisa that reviewed the Berkeley meeting.
CW says we will table until later.
Review of use cases and current practice
Besides TEI-L, requests were sent to unicode, unicore, linguist, and some other list. DA says that there was also a poor response from these lists, but what results we have indicate that some projects are using the PUA.
MB: tagalog to use Unicode?
MENOTA has decided, as far as possible, to do diplomatic transcription at the character level. They produced extensive documentation of how they plan to encode & document use of PUA for this. They have a system to generate entity names, and have created names even for characters that already have ISO names. LB considers it a mistake. MB says that we need to give guidance; LB that we need to warn against making this mistake.
DB says it's not just presentation & encoding (glyph & character), but two different levels of information. General agreement.
CW: We need to provide entity-less solutions for non-valid XML documents (which are by definition well-formed).
Ascertain: are there entities with the various schema flavors?
CW presents his use-case, <gaiji> .
The question was raised as to what is the difference between using <orig> with reg vs a WSD solution to normalized characters? CW: difference is that there are multiple modern versions.
PD asks if by doing this in WSD are we limiting future use.
DB points out if you don't make a distinction in the document, you won't be able to do this kind of normalization.
MT asks about using multiple WSDs; ...
Do we need more use cases? PD suggests looking at the encoding of various sites without eliciting a case from the project.
CW: we need to develop an inventory of 'em, if not an actual test suite, but we don't need to discuss them here & now.
Lou is interim TEI webmaster. CW & LB to have TEI site mirror CW's documents. E.g., this document will be CEM01. WG documents should be submitted in TEI P4 Lite XML.
Architectural issues: Processing model for TEI texts, modularization of WSD.
Issues related to the extension/modification of the document character set.
Problem: once a character code point is entered, there's no way to recover which of variant glyphs was used. (CW demonstrates with slide from his paper.)
PD points out that you can already do this with current WSD.
- you point into font at the ‘right’ spot, but the glyph you want isn't there;
- 8
CW thinks LB's examples represent problems that should be handled in stylesheet.
DB points out that WSD does have mechanism to point into abstract glyph inventory. (AFII — although CW notes it probably will be nuked)
DB points out that as long as we handle many-to-one and one-to-many character to glyph cases, 1-1 will follow; also warns we need to come up with processing model that can actually do something ...
LB suggests nothing has changed, we need to create WSD-NG that is expressible in XML.
- put character into document, i.e., being able to define characters
- define character semantics
- define linguistic properties
It was suggested that we will need two passes: special TEI parse either 1st or 2nd; perhaps WSD as XSLT stylesheet first. DB states that we will need, at a minimum, a triplet of information: unicode_char_point+, glyphs+, unique_entity_name
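One way to picture DB's triplet is as a record in a lookup table. The following is only a minimal sketch under stated assumptions: the class name `CharacterEntry`, the helper `build_table`, and the sample entry are hypothetical and were not discussed at the meeting. It shows one or more code points, one or more glyph identifiers, and an entity name that must be unique within the table.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CharacterEntry:
    """Hypothetical record holding the triplet named in the minutes."""
    entity_name: str              # unique_entity_name
    code_points: Tuple[int, ...]  # unicode_char_point+ (one or more)
    glyphs: Tuple[str, ...]       # glyphs+ (one or more glyph identifiers)

    def __post_init__(self):
        # The "+" in the minutes means "at least one" of each.
        if not self.code_points:
            raise ValueError("at least one code point is required")
        if not self.glyphs:
            raise ValueError("at least one glyph is required")


def build_table(entries: Iterable[CharacterEntry]) -> Dict[str, CharacterEntry]:
    """Index entries by entity name, rejecting duplicate names."""
    table: Dict[str, CharacterEntry] = {}
    for entry in entries:
        if entry.entity_name in table:
            raise ValueError("duplicate entity name: " + entry.entity_name)
        table[entry.entity_name] = entry
    return table


# Assumed sample data: the letters D and z rendered by a single glyph.
SAMPLE = CharacterEntry("dz-digraph", (0x44, 0x7A), ("glyph-dz-1",))
```

Keeping the code points and glyphs as sequences is what lets the same record cover the many-to-one and one-to-many cases DB mentions, with 1-1 as the special case of length one on both sides.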
DB presents defining ‘<!ENTITY lou "GOLOOKITUP-lou">’ solution.
LB presents problem: <c attr-that-points-to-table-entry="it">Dz</c>. 10
DB suggests that TEI could insist that if you need this level of character control you can't use <sic> / <corr> . 11
If we're going to go that way ( <c> with attribute), CW says we should look at SVG & MathML carefully.
MT asks about composed characters — CW points out you can have multiple characters as content of <c> , and that the pointer can point to composed characters.
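The <c>-with-pointer idea can be sketched as a small second-pass lookup. This is an illustration only: the element and attribute names used here (`wsd`, `char`, `id`, `name`, `glyph`, and `ref` on <c>) are assumptions made for the example and were not agreed at the meeting. The sketch shows a <c> element whose content is more than one character and whose pointer resolves to a table entry.

```python
import xml.etree.ElementTree as ET

# Assumed sample document; all element and attribute names are hypothetical.
DOC = """<text>
  <wsd>
    <char id="dz1" name="dz-ligature" glyph="glyph-0042"/>
  </wsd>
  <p>A word with <c ref="dz1">Dz</c> in it.</p>
</text>"""


def resolve_c_elements(xml_text):
    """Return (content, name, glyph) for every <c> element in the document."""
    root = ET.fromstring(xml_text)
    # First pass: collect the table entries, keyed by their id.
    table = {ch.get("id"): ch.attrib for ch in root.iter("char")}
    resolved = []
    # Second pass: follow each <c> pointer into the table.
    for c in root.iter("c"):
        ref = c.get("ref")
        if ref not in table:
            raise KeyError("<c> points to undeclared entry: %r" % ref)
        entry = table[ref]
        resolved.append((c.text or "", entry["name"], entry["glyph"]))
    return resolved
```

Because the lookup needs element content, this approach cannot be used inside attribute values, which is the limitation the group returns to later in the discussion.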
WSD addressed problem of single characters -> multiple glyphs, multiple characters -> single glyph by overloading entity references.
LB points out that NDATA declared entities remain unparsed.
- use NDATA unparsed entities (if it works — looks like no)
- <c> element with pointer attribute to point to WSD data (problem: need to create non-attribute versions of <sic> / <corr>)
- TEI-markup that flags character
- generating entity tables from WSDs: more complicated parsing, but works in attribute values
Could use a driver file to combine them. Won't work with many XML processors that read whole document first.
Considered model 3 above in more detail, including using PI before DOCTYPE declaration.
Need to make sure that it is implementable.
DA — Unicode will be using variant selectors as a fall-back position.
Do we want to have a mechanism that duplicates creating new Unicode character or variation selector? DB says yes, but because of fringe cases not because the Unicode Consortium would be nasty about it.
CW thinks definitional and semantic properties should be on different levels.
SB: Extra elements we need to create to get around <sic> / <corr> and the WSD-like <teiHeader> elements could be an additional tagset that most TEI users won't need.
solutions in XML
DB presents his work of last night: proof-of-concept implementations of each of our 4 solutions in XSLT.
possible solutions
- Solutions that use no elements (and thus can be used in attribute values):
  - TEI-specific entity-like references
  - Using PUA code point as an index into WSD-NG. 12
- Solutions that require elements:
  - The NDATA solution: declare the WSD as an unparsed NDATA entity
  - The ‘replacement-in-the-instance’ solution: use the <c> element
  - The ‘table-in-the-header’ solution.
Discussion of inability to put icky characters in attribute values. What about n or rend? Some feel it is quite reasonable to want icky characters there.
Multiple solutions? Requires that document exchangers be able to process each solution.
PUA? Have a PUA value for each <gaiji> 13 ; processor has to read WSD first, of course.
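The PUA approach, in which the processor reads the WSD first and then treats each PUA code point as an index into it, might look roughly like this. It is a minimal sketch: the function names, the dictionary form of the WSD, and the sample replacement string are all assumptions made for illustration. Only the Basic Multilingual Plane private use range (U+E000 to U+F8FF) is handled.

```python
# Private Use Area in the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF


def is_bmp_pua(ch):
    """True if the character lies in the BMP Private Use Area."""
    return PUA_START <= ord(ch) <= PUA_END


def expand_pua(text, wsd):
    """Replace each PUA character with the description the WSD gives for it.

    `wsd` is an assumed mapping from PUA code point (int) to a replacement
    string; it has to be loaded before any document text is processed.
    """
    out = []
    for ch in text:
        if is_bmp_pua(ch):
            if ord(ch) not in wsd:
                raise ValueError("U+%04X is not declared in the WSD" % ord(ch))
            out.append(wsd[ord(ch)])
        else:
            out.append(ch)
    return "".join(out)


# Assumed sample table with a single declared private-use character.
SAMPLE_WSD = {0xE000: "[gaiji: variant-001]"}
```

Since a PUA character is ordinary character data, the same lookup works unchanged on attribute values, which is why this solution is listed among those that use no elements.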
things to do
DB to write CE W 02, XSLT proof-of-concept of possible solutions.
CW to write CE W 03, use cases.
Topics still to consider: xml:lang and declaration & description of languages.
DA to find out how others use PUA characters.
MT points out that Japanese dictionary group had used PUA approach but this year abandoned that approach and now just puts an image in. But LB points out this does not invalidate it for our purposes.
MB wants us to define the ‘base’ (I think he means non-WSD) method by which a general user encodes a non-Unicode character.
MB points out that xslt2 will provide writing and node-set conversion, either of which would enable two-pass processing.
- Provide element alternatives to attributes (thus allowing for element-required solutions).
- Tell users they can't have special characters in attributes.
- Can do it in attributes, but you have to parse it yourself, either by using the special TEI-markup entity-like reference (‘HI-LOU’ in DB's examples) or by using standard markup escape (<).
- Do nothing. So there!
Issues related to the declaration/description etc. of the language(s) used in the document.
Other issues
LB raised the question of one project trying to accept documents from two other projects that each use their own entity names and PUA code points: overlap (using the same name or code point for the same character) could occur with either or both, and collisions (using the same name or code point for a different character) could occur with either or both. Thus the recipient would have to exhaustively compare each value to all other values. Not ideal. The recipient also has to be able to differentiate PUA from normal code points.
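The overlap and collision cases can be made concrete with a small comparison routine. This is a sketch under assumptions: each project's declarations are taken to be a plain mapping from key (an entity name or a PUA code point, one kind per table) to some agreed identifier for the character, and the function names are hypothetical. It only examines keys that both projects use; it does not find the same character declared under two different keys, which would still need the exhaustive comparison described above.

```python
import unicodedata


def compare_tables(a, b):
    """Classify the keys shared by two projects' tables.

    Returns (overlaps, collisions): an overlap is a shared key that means the
    same character in both tables, a collision is a shared key that means a
    different character in each.
    """
    shared = a.keys() & b.keys()
    overlaps = sorted(k for k in shared if a[k] == b[k])
    collisions = sorted(k for k in shared if a[k] != b[k])
    return overlaps, collisions


def is_private_use(ch):
    """True if Unicode assigns the character the general category Co."""
    return unicodedata.category(ch) == "Co"


# Assumed sample data for two projects' entity-name tables.
PROJECT_A = {"lou": "char-X", "dz": "dz-ligature"}
PROJECT_B = {"lou": "char-X", "dz": "other-char", "new": "char-Y"}
```

The `Co` category test is one way to differentiate PUA from normal code points, and it covers the private use planes as well as the BMP range.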
Test of markup strategies
Future planning
EO (PD as fall-back) to write CE W 04, problem description of 4.2 — in P3 lang does everything (both natural language & writing system); but we also have xml:lang. Berkeley meeting resulted in decision to divide them. How you indicate what language, how you define it, and how you uncouple it from the WSD.
CW to write CE W 05, content of WSD and its <gaiji> .
Drafts due Mon 2002-08-26.
LB to get web posting easier before 2002-08-26.
Closing remarks, thanks to the hosts.
CW points out, with complete consensus, that meeting has run very smoothly despite no local organizers being on the work group. Good job U of Tübingers!