TEI MI M 01 (draft): Draft of TEI XML Migration Task Force Meeting Minutes, 2002-10-13/14
Initials Used for People
- SB Syd Bauman
- AB Alejandro Bia
- LB Lou Burnard
- JH Jessica Hekman
- TR Tobias Rischer
- CR Christine Ruotolo
- NS Natalia Smith
- ST Syun Tutiya
- JU John Unsworth
- CW Christian Wittern
Meeting took place in the Claridge Hotel, Chicago IL, USA on Sunday 13 and Monday 14 October. All times listed are local time in Chicago.
Commenced ~13:28 with SB, LB, AB, JH, TR, CR, NS, ST; CW joined ~13:50.
Introductions.
Contents
- Objectives
- Survey
- Identify . . .
- Minimally invasive vs. canonical
- Processing environment
- Discussion of problems found in samples
- Case Studies
- Dividing up Labor for Writing up Reports
Objectives
Review of list of objectives from our charge.
CR Q: are DTDs in scope? Consensus is that they are, but because few people will need help here, low priority. CR: plan to have relatively vague suggestions in recommendation documents.
CR suggests our focus should be on P3->P4. Consensus is that outlining tasks for P3->P4(XML) will include all steps needed for P4 SGML -> P4 XML. Asks if we want to have more of an advocacy role. LB answers yes. SB agrees, but wonders if any advocacy is necessary. LB points out that (disregarding extensions) a P3 document is ipso facto a P4 document.
Brief discussion of why a project wants to move to XML: access to new technologies, new tools (XML); non-support of P3 (P).
We have 2 sections of document already! 1. Scoping; 2. Motivation.
Case studies.
SB asks do we need a test suite? Is it hard to make? JH is concerned we may not be able to think of 'em all.
LB points out difference between test suite and using samples. Thinks we need to ascertain what practices are via survey (#2).
CR asks whether or not we need a test suite. SB asks how hard is it to do? CW suggests starting with a list of differences between XML and SGML.
Question comes up as to who we are surveying: SB holds repository reps insufficient to survey.
Summary that we will not use test suite, but rather results of survey of real cases, perhaps augmented with a fabricated test if deemed necessary.
Although software development is not an output of this group, suggestions for areas ripe for new tools or modifications to existing ones are.
Modifications to ED W 76 made.
Survey
CR: not too many responses.
CW explains the recent experience of Character Set WG with its survey.
SB suggests that, as only 50 projects are listed on the TEI website, a phone survey might be feasible. Generally disliked, but LB counter-proposes e-mail, with caveats of privacy. LB likes e-mail and phone call. Question discussed about whether we just want files or answers to survey questions, too.
So, after the identification stage: a letter with a very brief survey, culminating in a request for only a small data sample (no DTD or other supporting files should be explicitly requested). Non-respondents to be contacted by phone. Respondents for whom we have questions to be followed up by e-mail. Also a thank-you.
- Identification of projects using TEI (SGML)
- Survey letter for collection of samples. Telephone follow-up of non-responders (repository group to help)
- Analysis stage: divvy up sample files and check for various features.
- Follow-up based on number and nature of samples — e.g., asking for DTDs when needed, getting info on technical, organizational challenges and opportunities
XML4LIB, TEI-L, HUMANIST, BIBLIOTECH, DIGLIB, LINGUIST, CORPORA, ANSAX-L,
Identify . . .
Split out technical to expert group, organizational to repository group.
Discussion of order of objectives in Charge. Decided charge is really unordered, not to worry about it. CR to provide order in work-plan.
Decided to discuss further issues (e.g. XPointer and other P5ish issues) in appendix to output reports.
Adjourned ~17:18.
Minimally invasive vs. canonical
Commenced ~09:15.
CR reviews discussion from list.
Discussion of whitespace. General agreement that we need to try to munge source whitespace so that parsed whitespace matches.
Discussion of character entity references. LB argues that in migration character entity references should be converted to characters or numeric character references. Consensus is to have prose discussing reasons for desiring this conversion (that later XML processes won't be able to handle character entity references), but to recommend it as an option.
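A minimal sketch of the conversion discussed above, in Python, for illustration only. It rewrites named character entity references as hexadecimal numeric character references and leaves the five entities predefined by XML alone. Assumptions (not from the meeting): the function name to_numeric_refs is hypothetical, and Python's html.entities.name2codepoint table merely stands in for the entity sets a real project would declare. The sketch does not skip comments or CDATA marked sections, and does not handle SGML references that omit the closing semicolon.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; these stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# A named entity reference: "&", a name, ";".
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def to_numeric_refs(text, table=name2codepoint):
    """Replace known named entity references with numeric character references.

    `table` maps entity names to code points. The default is a stand-in
    (an assumption); a real migration would build it from the project's
    own entity declarations.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in table:
            # Leave predefined and unknown references untouched.
            return match.group(0)
        return "&#x{:X};".format(table[name])

    return ENTITY_REF.sub(repl, text)


if __name__ == "__main__":
    print(to_numeric_refs("caf&eacute; &amp; cr&egrave;me"))
```

Unknown names are deliberately left as they are so that they can be reviewed by hand rather than silently dropped.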
Discussion of external entities.
Discussion on DTDs: yes, we need to keep 'em. XML tools that won't do well-formedness-only processing on files that specify a DOCTYPE declaration are broken, so it's not our problem.
Can address dirty hacks.
Comments: can't have comments inside other declarations; can't have multiple comments inside one comment declaration; empty comment declaration <!> not permitted.
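A rough sketch, in Python, of how a migrator might scan a file for the three comment constructs listed above. This is a regular-expression heuristic under stated assumptions, not a parser: the function name find_comment_problems and the problem labels are hypothetical, quoted literals containing "--" inside a declaration will be reported falsely, and SGML comments with whitespace before the closing ">" are not handled.

```python
import re

# Empty comment declaration, legal in SGML but not in XML.
EMPTY_DECL = re.compile(r"<!>")

# Anything that looks like an XML comment; "--" inside the body means the
# SGML source had several comments in one declaration.
XML_STYLE_COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)

# A named declaration (<!ENTITY ...>, <!ELEMENT ...>, ...) containing "--"
# before its closing ">", which suggests an embedded comment.
DECL_WITH_COMMENT = re.compile(r"<![A-Za-z][^>]*--")


def find_comment_problems(text):
    """Return a sorted list of (offset, label) pairs for suspect constructs."""
    problems = []
    for m in EMPTY_DECL.finditer(text):
        problems.append((m.start(), "empty-declaration"))
    for m in XML_STYLE_COMMENT.finditer(text):
        if "--" in m.group(1):
            problems.append((m.start(), "multiple-comments"))
    for m in DECL_WITH_COMMENT.finditer(text):
        problems.append((m.start(), "comment-in-declaration"))
    return sorted(problems)


if __name__ == "__main__":
    sample = '<!><!-- one -- -- two --><!ENTITY e "x" -- why --><!-- ok -->'
    for offset, label in find_comment_problems(sample):
        print(offset, label)
```

The sketch only reports locations; deciding how to rewrite each case is left to the migrator.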
‘strategies’ document will have things like advising migrators to think about issues of, say, XInclude v. external entities. ‘practices’ document will have advice on how to convert to XInclude or how to migrate without converting.
In strategy document we should probably point out that more migrations in the future are likely, but that if you're happy with P4, which TEI does plan to support, you could just stay there.
Specification of defaulted attributes: we'll recommend not to specify them (and hopefully point out ways to migrate without them) unless you really need them.
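One possible way to drop defaulted attributes from an already-converted XML instance, sketched in Python with the standard library's xml.etree.ElementTree. Assumptions (not from the meeting): the function name strip_defaulted is hypothetical, and the table of defaults is illustrative only; a real migration would derive it from the DTD actually in use.

```python
import xml.etree.ElementTree as ET


def strip_defaulted(root, defaults):
    """Remove attributes whose value equals the declared default.

    `defaults` maps (element name, attribute name) to the default value.
    Returns the number of attributes removed; `root` is modified in place.
    """
    removed = 0
    for element in root.iter():
        # Copy the names first, since we delete while looping.
        for name in list(element.attrib):
            if defaults.get((element.tag, name)) == element.attrib[name]:
                del element.attrib[name]
                removed += 1
    return removed


if __name__ == "__main__":
    # Illustrative defaults table (an assumption, not taken from any DTD here).
    defaults = {("p", "TEIform"): "p", ("q", "TEIform"): "q"}
    root = ET.fromstring(
        '<text><p TEIform="p" rend="it">x</p><q TEIform="q">y</q></text>'
    )
    print(strip_defaulted(root, defaults))
    print(ET.tostring(root, encoding="unicode"))
```

Attributes whose values differ from the default are kept, since those were presumably specified on purpose.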
Discussion of DTD conversion: we can't help those who did not use extension mechanism, but we should have a paragraph addressing the problems created by not doing so.
Strategic document should discuss the fact that migration may be an opportunity to improve your DTD.
- minimal conversion
- easy conversion
- conversion that maximizes XML tool usability
- conversion that is forward-looking to P5, or at least what we can predict of P5.
- in depth discussion of macro issues identified in samples
- characters that are in Unicode
- characters that are not in Unicode; solutions, à la P4 chap 4.2.1:
  - CDATA
  - PIs
  - markup (<c>)
- others:
  - ambiguous glyph
  - glyph exists in Unicode with different meaning in the document
  - temp data capture flags
Processing environment
LB points out difficulty in actually managing all the little pieces of a sample (or real) case. Corollary is that practices document needs to address catalog files.
Things to Consider
- instances
- DTD extension files
- catalog files
- style-sheets and other parts of processing environment
Add questions about processing environment to third round survey questions.
SDATA entities to be attacked by a separate individual in practices document.
Discussion of problems found in samples
TR: ??
LB: consultancy may be desirable. General agreement that a workshop on a specific issue, e.g. extension files, would be a good thing.
SB asks about recommending open source v. proprietary software. In resulting discussion LB points out that he'd prefer we say ‘this tool does this’ rather than make a recommendation ‘use this tool’.
- in-house development
- buy proprietary tools
- use open source
Case Studies
CR expects repository reps to write up a case study each. Recommendations for tools & strategies should be ready by mid- to late-December to give repository reps a month to work before joint meeting.
Dividing up Labor for Writing up Reports
Strategic document: MI W 02 Strategic Considerations in Migrating TEI documents from SGML to XML.
- Challenges, opportunities, and motivation.
- Types or scope of migration (P3->P4 or P4->P4)
- Areas of migration (instances, DTD extensions, catalog files, processing environment)
- Levels of migration, e.g. minimal surgery approach, get almost to P5 approach, etc.
- Appendix: potential impact of future versions of the Guidelines on migration issues.
MI W 03 Practical Guide to Migration of TEI Documents from SGML to XML
- DTD conversions
  - SDATA (ST)
  - Extension files (TR)
- Instance conversion: tools. Issues: whitespace & comments, prologue & file structure (e.g. external entities) (JH)
- Recommended work-flow (AB)
Section write-ups due 2002-12-02.
Adjourned ~16:00.