Notes from TEI Migration Task Force Combined Meeting 2003-02-07/08
TEI MI M 07

Initials Used for People

SB Syd Bauman
AB Alejandro Bia
LB Lou Burnard
TE Tomaz Erjavec
JH Jessica Hekman
AK Amit Kumar
CP Chris Powell
TR Tobias Rischer
CR Christine Ruotolo
SS Susan Schreibman
NS Natalia Smith
JW John Walsh
SW Sarah Wells
FW Frans Wiering

All times are local to MITH (i.e., -05).

Started late (due to weather) at ~14:15 with CR, SS, TE, FW, LB, JH, AB, SW, JW, SB. CP arrived @ ~15:07.

Group extends a gracious thank-you to SS, MITH, and also to Amit Kumar. SS and AK have gone above and beyond to see that this meeting works despite the closing of the University of Maryland due to snow.

Appendix B: Outline of MI W 02

Discussion of survey project; little done so far.

LB announces Alan Morrison looking into using new projects TEI webpage as springboard for survey.

Consensus is to perform a small survey using the list of projects available on the aforementioned TEI applications page; perhaps do larger survey later on a different or extended grant.

Strategic document should have an overview of how the process should be performed — who does what, whether to stop production, how to change workflow, etc.

FW: DTD extension problems; reports a bug in tei2tei.xsl that leaves the attribute name but not the value for defaulted attributes. SDATA problems: using SDATA entity references for Renaissance musical notation, some of which is not in Unicode.

Discussion of whether to discuss SX or osx — consensus is to discuss both, including difficulties of building osx.

LB reports that CE WG is working on this [FW's SDATA problem]. LB thinks only solution is going to be PUA use, and that WG is going to recommend the encoding thereof.
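As a rough illustration of the PUA approach LB describes, the sketch below rewrites SDATA-style entity references as numeric character references in the Unicode Private Use Area (U+E000 to U+F8FF). The entity names and code point assignments are hypothetical, invented for this example; they are not taken from any CE WG recommendation.

```python
import re

# Hypothetical assignments: entity name -> Private Use Area code point.
PUA_MAP = {
    "longa": 0xE000,
    "brevis": 0xE001,
}

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # Basic Multilingual Plane PUA range

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def to_pua_refs(text, mapping=PUA_MAP):
    """Replace known entity references with numeric PUA references."""
    def substitute(match):
        name = match.group(1)
        if name not in mapping:
            return match.group(0)  # leave unknown references untouched
        codepoint = mapping[name]
        if not PUA_FIRST <= codepoint <= PUA_LAST:
            raise ValueError(f"{name} is not mapped into the PUA")
        return f"&#x{codepoint:04X};"
    return ENTITY_REF.sub(substitute, text)
```

Any such mapping is private by definition, so it would have to be documented alongside the files and shipped with a font that draws glyphs at the same code points.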

FW asks how can one use a font to represent a PUA char. No one actually knows.

Action 1: JW investigate fonts & editors 2003-03-10

LB suggests Vusillus (by Ralph Hancock of TLG) for Classical Greek, and Cardo for medieval stuff.

Discussion of fonts to be incorporated into SDATA section of technical document.

TE asks about deprecation of named entities; big discussion on whether XML requires named entities to be declared or not. Consensus is that we will discuss the disadvantages of using named entities in SDATA section.
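One disadvantage can be shown directly: with no DTD to read, an XML parser knows only the five predefined entities, so any other named reference is a well-formedness error, whereas a numeric character reference always works. A minimal sketch using Python's standard library parser (the sample element and entity are chosen only for illustration):

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


named = "<p>caf&eacute;</p>"      # named entity, not declared anywhere
numeric = "<p>caf&#xE9;</p>"      # numeric character reference
predefined = "<p>R&amp;D</p>"     # one of XML's five predefined entities
```

Here parses(named) is False, while parses(numeric) and parses(predefined) are both True.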

SB suggests tools section admits that JH works on osx.

FW points out that XMetal can convert SGML to XML. (Discussion as to whether it does or not to be discussed on list later.)

Action 2: SB Ask Daniel Pitti about using Notetab Pro to convert 2003-02-15

We should include in our survey a question or small section asking people about tools they use.

TE: points out discrepancy in tools (sx v osx); also he felt there was nowhere to start, so he used checklist.

Group considers software for out-of-the-box TEI Lite (P3) to TEI Lite (P4 XML) something we'd like to be able to recommend.

SB suggests that MI W 04 be rolled into an appendix of MI W 03 and, similar to SS's suggestion, be referred to by the first steps of AB's list … "if you have complicated data, lots of it, anything you don't know [e.g., data that was created before you started working on the project]."

TE: documents don't discuss marked sections! (Marked sections in document instance, that is; LB suggests using general entity references declared based on the value of TEI.XML.)

TE: nowhere are SGML declarations mentioned. (Need to mention to use your local declaration for SGML processing.)

ACTION: LB to check whether osx reads SGML declaration, in particular whether it acts on NAMECASE GENERAL NO. Answer: 1.5 seems to do it right if you specify the SGML declaration on the commandline as you're supposed to.

The tech document should discuss that osx will only preserve case if you specify an SGML declaration that specifies case sensitivity.

Action 3: AK investigate what tidy does with ampersands (TE to send problematic file & output listing) 2003-03-04

Action 4: SB show how to use Emacs/psgml as pretty printer 2003-03-20

LB points out we should point out the disadvantages of using dirty hacks. We should put some effort into overcoming the obvious reasonable objections to the off-the-shelf tools.

Discussion on XSLT engines. Consensus is to state that we have successfully transformed an X big document with tei2tei.xsl using [software we use, probably xsltproc].

SB points out that the ‘@@ hack’ is not needed now that osx does not expand entities; CR points out that it's needed to protect expansion from XSLT stylesheet to correct case.

SB wonders why osx doesn't fix case. After explained to JH, she thinks this feature might be added in future.
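The kind of case correction discussed here amounts to a lookup from case-folded names to the mixed-case names TEI expects. The sketch below is not tei2tei.xsl, just an illustration in Python of the same idea, assuming the converter emitted upper-case element names; the lookup table is a tiny hypothetical excerpt, and attribute names are left alone for brevity.

```python
import xml.etree.ElementTree as ET

# Tiny excerpt of a folded-name -> TEI name table (illustrative only).
TEI_NAMES = {
    "TEI.2": "TEI.2",
    "TEIHEADER": "teiHeader",
    "REVISIONDESC": "revisionDesc",
    "PERSNAME": "persName",
}


def fix_case(root):
    """Rename elements in place; unknown names are simply lower-cased."""
    for element in root.iter():
        folded = element.tag.upper()
        element.tag = TEI_NAMES.get(folded, element.tag.lower())
    return root


doc = ET.fromstring("<TEI.2><TEIHEADER><PERSNAME>FW</PERSNAME></TEIHEADER></TEI.2>")
fixed = ET.tostring(fix_case(doc), encoding="unicode")
```

Lower-casing unknown names is an assumption made for the example; a real table would need to cover every element and attribute in the DTD, extensions included.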

LB reminds us (JH in particular) that fix to ‘attribute bloat’ ¹ problem is in stack, too.

CR & SB point out that the discussion of that batch script should be more plug & play. ²

TE points out that we do not mention anything about public identifiers; SB adds DOCTYPEs in general. LB points out that this is mentioned on sgml2xml page on site, could be used as a starting point.

Catalogs:

Action 5: JH research issue of XML catalogs — should we be recommending folks convert their catalogs too? 2003-03-29

At the very least, we'll need some sort of discussion of catalogs.
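To give the catalog question something concrete, here is a minimal sketch of what converting a catalog might involve: turning PUBLIC entries from an SGML Open style catalog into an OASIS XML Catalog. It handles only PUBLIC entries with both identifiers quoted, and the sample entry is made up.

```python
import re
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Matches: PUBLIC "public identifier" "system identifier"
PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]+)"\s+"([^"]+)"')


def sgml_catalog_to_xml(catalog_text):
    """Build an XML Catalog string from the PUBLIC entries found."""
    # Serialize the catalog namespace as the default namespace.
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element(f"{{{CATALOG_NS}}}catalog")
    for public_id, system_id in PUBLIC_ENTRY.findall(catalog_text):
        ET.SubElement(root, f"{{{CATALOG_NS}}}public",
                      {"publicId": public_id, "uri": system_id})
    return ET.tostring(root, encoding="unicode")


sample = 'PUBLIC "-//EXAMPLE//DTD Sample//EN" "dtd/sample.dtd"\n'
xml_catalog = sgml_catalog_to_xml(sample)
```

Other entry types (SGMLDECL, DOCTYPE, unquoted system identifiers, and so on) are ignored here, and some of them have no direct XML Catalog equivalent, which is part of what any recommendation would need to address.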

Commenced 10:30

JH reports on osx updates.

Action 6: SB send URL to archives of -E to -R 2003-02-15

CR raises issues with osx: [?...?]. Input files in EUC (a Unix Japanese double-byte encoding). osx can process them (with a -b switch); problem is that it gave an error message, even though it seemed to work. JH points out that ‘this is technically an OpenSP issue, not an osx issue. Which is to say, I will definitely not be able to fix it. I will take care of harassing the OpenSP folks about it, though.’ In some other case got gibberish out. When tried to convert to UTF-8 first, got lots of errors.

Action 7: CR send input & output of osx with EUC-JP encoding problems to JH

Suppressing output of built-in entity references has been written but not checked in; suppression of default attributes is on JH's ‘to-do’ list.
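Until that suppression lands in osx, the effect can be approximated after conversion. A minimal sketch, assuming you can list the DTD's defaulted attributes by hand; the two defaults shown are illustrative stand-ins and should be checked against your own DTD. Note that it cannot tell an explicitly keyed value apart from a defaulted one.

```python
import xml.etree.ElementTree as ET

# (element, attribute) -> default value; illustrative, check your DTD.
DEFAULTS = {
    ("div", "org"): "uniform",
    ("l", "part"): "N",
}


def strip_defaults(root, defaults=DEFAULTS):
    """Delete attributes whose value equals the declared default."""
    removed = 0
    for element in root.iter():
        for name in list(element.attrib):
            if defaults.get((element.tag, name)) == element.attrib[name]:
                del element.attrib[name]
                removed += 1
    return removed


doc = ET.fromstring('<div org="uniform" n="1"><l part="N">a</l><l part="I">b</l></div>')
count = strip_defaults(doc)
result = ET.tostring(doc, encoding="unicode")
```

In this example two attributes are dropped, while n="1" and the non-default part="I" survive.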

CP: using SX, has been relatively smooth as pretty simple data pretty well normalized. Points out that her parser complains about "<l/>" in the output.

Action 8: CP report parser & error or warning to list 2003-02-21

JW asks if putting up lists of entity names & Unicode codepoints for n2x would be helpful. SB says yes, but not much. SB points out that users find it difficult to find the ISO entity sets for Unicode on the website. Consensus is that MI W 03 should contain an explicit reference.
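A list of the kind JW proposes could be generated rather than typed. The sketch below uses the HTML 4 entity table that ships with Python as a stand-in; those names overlap with, but are not the same as, the ISO entity sets, so a real list for n2x would need the ISO sets as its source.

```python
from html.entities import name2codepoint


def entity_table(names):
    """Return tab-separated rows: entity name, code point, character."""
    rows = []
    for name in names:
        codepoint = name2codepoint[name]
        rows.append(f"{name}\tU+{codepoint:04X}\t{chr(codepoint)}")
    return rows


rows = entity_table(["eacute", "alpha", "mdash"])
```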

NS: TEILite, quite well normalized. Easy translation. Had used osx & xsltproc.

CP & CR bring up a company called Intellex; NS mentions Apex. CR thinks they might be helpful in taking a look at our documents and providing feedback. Perhaps raw SGML with lots of minimizations.

Action 9: editors raise with board the issue of getting small companies (Intellex, Chadwyck-Healey, Alexander Street Press, Harpweek (John Adler)) to contribute financially 2003-02-18

JW: VWWP is also TEILite, well normalized; created new entity sets using XHTML versions as a base and adding Greek with diacritics by themselves.

SB reports on WWP extension experience. FW & SB point out that

persName fix
globincl -> Incl
dual-purposing of entity sets (using %TEI.XML;)

should be explicitly mentioned in TR's section of MI W 03. LB points out some of this is already mentioned in P4:2002 Appendix C.2.

LB reports that information about the BNC migration is now on website ( http://www.tei-c.org/MI/Samples/BNC/ ), not much to add. OTA is working on this problem, but LB not equipped to report on it. (A lot of OTA stuff is not in TEI anyway.)

CR: main barrier to conversion is error-prone SGML on input. LB suggests that we have a discussion of document management issues: e.g., keeping track of changes (e.g., in <revisionDesc> ).

CR discusses what a repository rep report should contain.

MI W 06 will be the collected case reports, "Migration Case Study Reports".

Action 10: SB take CR's outline of MI W 06 and make a TEI Lite template. 2003-02-15

Action 11: SB Make outline of MI W 02 a TEI Lite template and send sections to section authors.

Action 12: -R group Submit case study write-ups to the list 2003-03-10

Next meeting tentatively Mon 16 & Tue 17 Jun 2003 (all day Mon 16, half day Tue 17) in Alicante, Spain. Thanks to AB for volunteering to host the next meeting. Note that Alicante is having a big festival on Mon 23 & Tue 24.

On the subject of MI W 06, ‘conversion that maximizes XML tool usability’ seems to mean whether you convert from external entities to XInclude type of stuff.
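Read that way, the conversion replaces each reference to an external entity with an XInclude element pointing at the same file. A rough sketch, where the entity names and file names are hypothetical and the entity declarations are assumed to have been collected from the DOCTYPE separately:

```python
import re

XI_NS = "http://www.w3.org/2001/XInclude"

# Hypothetical external entities: name -> system identifier.
EXTERNAL_ENTITIES = {
    "chap1": "chap1.xml",
    "chap2": "chap2.xml",
}

ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def entities_to_xinclude(text, entities=EXTERNAL_ENTITIES):
    """Swap references to known external entities for xi:include elements."""
    def substitute(match):
        href = entities.get(match.group(1))
        if href is None:
            return match.group(0)  # not an external entity we know about
        return f'<xi:include xmlns:xi="{XI_NS}" href="{href}"/>'
    return ENTITY_REF.sub(substitute, text)


body = "<body>&chap1;&chap2;&amp;</body>"
converted = entities_to_xinclude(body)
```

Whether this counts as maximizing tool usability depends on how well the tools in question support XInclude, which is itself something the case reports could record.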

Current survey plans (aka ‘small survey’):

LB sends mail to Projects on activities page
Action 13: LB send draft letter to lists 2003-02-28

Action 14: LB provide list of e-mail addresses of those on activities page to CR 2003-03-14
Of those that use P3 SB & CR send letter for general info
Those who are willing to help out will be divvied up among the MIGR group members, who will send 'em the survey, with the expectation of an ongoing dialogue.

Action 15: SB roll both lists (TEI-MIGR-E and -R) into one ASAR

Action 16: FW Investigate XMetal as a migration tool and report to list 2003-02-21

Action 17: authors submit sections of MI W 02 2003-04-05

Appendix A:

Adjourned ~16:15.

Appendix B: Outline of MI W 02

MIW02: Strategic Considerations in Migration of TEI Documents from SGML to
XML

Introduction [from beginning of MIW03d] [AB]

Challenges, opportunities, and motivation [SS,JW]
        Motivation
                Why it's a good idea
                Why do this now?
                Will make conversion to P5 possible
                comparison to cost of converting proprietary formats
        Opportunities
                Easier to find programmers and tech people
                Reduced production costs
                leverage related standards (X...)
                Software
        Challenges
                Expanded file size
                people time (including training)
                new software, processing system (new procedures, workflow)

Types or scope of migration [CP]
        P3->P4
        P4->P4
        Levels of encoding (eg TEIlite v. full P3 or P4)

Areas of migration [JH]
        Document instances
        DTD extensions
        Catalog files
        Processing environment

General recommendations [AB]
        Things to think about before you start
                Workflow issues
                Training
                Consider resources (staff, software, time)
        Use checklist
        Make a backup
        Use an incremental approach
        Check your migrated docs in your new processing environment

Special considerations in migration [CR]
        Easy conversion
        "minimally invasive" conversion
        conversion that maximizes XML tool usability

Appendix: potential impact of future versions of the Guidelines on
migration issues [Eds.]

Notes

¹ Term coined by CP: the problem of attributes that are declared as having a default value in the DTD being output in every instance of the element.

² By which we mean more interchangeable modules; e.g. ‘If you need to do X, use piece of code Y’.