TEI MI W 02: Strategic Considerations in Migration of TEI Documents from SGML to XML
Contents
- Introduction
- Motivation, Opportunities, and Challenges
- Areas of Migration
- General Recommendations
- Special Considerations in Migration
- Appendix: Potential Impact of Future Versions of the Guidelines on Migration Issues
Introduction
This report and its companion, TEI MIW03: Practical Guide to Migration of TEI Documents from SGML to XML (hereafter the Practical Guide), were created by the TEI Task Force on SGML to XML Migration. The Task Force, chartered by the TEI Council with funding from the National Endowment for the Humanities, comprised representatives from projects with significant TEI SGML experience, selected technical experts, and the TEI editors. The group worked for over a year to diagnose and document the problems, methods, and tools necessary to design and effect a migration of existing TEI resources from P3 SGML to P4 XML.
The two reports, MIW02 and MIW03, summarize the group's findings and recommendations. Although they complement one another, they are addressed to different audiences. This report is a strategic report, intended for administrators and project managers; it discusses migration issues from a managerial perspective, with an emphasis on planning and decision-making. The Practical Guide is a technical report that describes the mechanics of conversion in greater detail, providing solutions to specific conversion problems as well as a recommended conversion workflow, and it is written primarily for the technical staff who will implement the conversion. Its specific recommendations are augmented by a set of migration case study reports that discuss individual migration efforts undertaken by members of the Task Force.
- Motivation, Opportunities, and Challenges discusses some of the many excellent reasons for migrating legacy SGML data to XML, and argues that conversion is well worth the effort despite the challenges involved.
- Areas of Migration describes the components of a document production environment — document instances, DTD and customization files, catalog files, and the processing environment — and outlines in general terms how each area must be addressed in a migration to XML.
- General Recommendations describes the migration planning and workflow design process. It suggests strategies for analyzing legacy SGML data, allocating resources for migration, automating the conversion, and verifying the results.
- Special Considerations in Migration discusses different degrees of migration complexity, from easy conversions that aim for simple XML conformance to more robust conversions that look forward to advanced XML functionality and future versions of the TEI Guidelines.
- An appendix, Potential Impact of Future Versions of the Guidelines on Migration Issues, describes some of the changes that are likely to appear in P5, the next iteration of the TEI Guidelines, and how the anticipation of these changes might impact a project's migration strategy.
Motivation, Opportunities, and Challenges
Motivation
For individuals and repositories holding TEI P3 SGML-encoded texts, now is the opportune time to migrate texts into XML. The P3 specification has been superseded by the P4 specification, and is no longer formally supported by the TEI Consortium. TEI P4 was developed to be compatible with documents produced under earlier TEI specifications. Thus any document conforming to the original TEI P3 SGML DTD will be compatible with an XML version based on P4. Moreover, as can be seen from the Migration Case Studies, the conversion from P3 to P4 is relatively straightforward. This Working Group has identified appropriate software and scripts to help simplify the process, so now is an ideal time to migrate.
The ongoing development of P5, the next iteration of the Guidelines, will render P3 increasingly obsolete. TEI P5 will be XML-based and will not ensure backwards compatibility, so a P3 to P5 migration may be substantially more difficult than P3 to P4. In fact, the TEI Consortium does not intend to provide a conversion path from P3 to P5. Having TEI P4-conformant XML texts will make life much simpler should a P5 migration become necessary (see Potential Impact of Future Versions of the Guidelines on Migration Issues below for further details).
Opportunities
Migrating data can provide a number of benefits. Many projects have been working with the same SGML DTD for many years and may need to re-examine it. Migration provides an opportunity to revisit DTDs and encoding practices which were developed to facilitate searching or display in a particular SGML-based system but are no longer necessary in an XML-based system. It also creates an opportunity to parse data again and fix errors.
Perhaps one of the more compelling reasons for a project or individual to consider migrating data is the scarcity of SGML-aware software and tools and the relative abundance of XML-based tools. Indeed, as XML becomes the industry standard there is a real danger that SGML-aware software will no longer be supported. SGML also lacks a suite of related standards that allow full exploitation of the encoded data. XML, on the other hand, is accompanied by a number of related standards and specifications, including the following:
-
DOM and SAX
DOM and SAX are programming APIs for processing XML data. Multiple DOM and SAX implementations are available for most modern programming languages.
-
XML Schema (W3C, Relax NG, and others)
Newer schema technologies, such as W3C XML schema and Relax NG (ISO 19757-2:2002), may be used instead of, or in addition to, DTDs to define vocabulary constraints and validate documents. These schema technologies offer advantages and additional functionality (such as familiar XML syntax and strong data-typing) not found in DTDs.
-
XPath and XPointer
XPath and XPointer are two closely related technologies that provide languages for pointing to specific parts of an XML document. XPath is also an integral component of XQuery and XSL.
-
XQuery
XQuery provides a standard language for querying XML documents and data. XQuery is supported by the major XML and relational database vendors and open source projects. Like SQL, its counterpart in the relational database sector, XQuery should eliminate the need for developers to learn multiple proprietary query syntaxes.
-
XSL (XSLT and XSL-FO) and CSS
XSL and CSS provide greater possibilities for the display of data on the Web, conversion from a TEI DTD to other XML formats (such as XHTML, DocBook, or Open eBook), and conversion to non-XML formats such as PDF or PostScript.
-
XLink
XLink provides a standard, attribute-based method for describing links in XML documents. The XLink standard provides support for simple, single-direction links (similar to HTML's <a> element) as well as more complex bi- and multi-directional links.
-
XML Namespaces
XML Namespaces provide a mechanism for distinguishing between elements and attributes that have the same name but are from different XML vocabularies, allowing different vocabularies to be combined in a single XML document.
This is a partial list of major XML-related technologies, and there are many more that may well prove useful to developers and users of TEI content. XML, together with its host of related standards, provides a powerful platform for working with encoded texts. SGML cannot take advantage of these technologies.
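As a small illustration of what these technologies make possible, the following sketch uses the XPath subset supported by Python's standard xml.etree.ElementTree module to pull information out of a TEI XML document. The sample document is hypothetical and heavily abbreviated, and the function names are invented for this example; it is meant only to suggest the kind of processing that becomes routine once texts are in XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abbreviated TEI P4 XML document used only for illustration.
SAMPLE = """<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample Text</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter" n="1"><head>First Chapter</head><p>Some text.</p></div>
      <div type="chapter" n="2"><head>Second Chapter</head><p>More text.</p></div>
    </body>
  </text>
</TEI.2>"""


def document_title(xml_text):
    """Return the title recorded in the TEI header, or None if it is absent."""
    root = ET.fromstring(xml_text)
    return root.findtext("teiHeader/fileDesc/titleStmt/title")


def chapter_heads(xml_text):
    """Return the heading text of every <div type="chapter"> in the document."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset, including attribute predicates.
    return [head.text for head in root.findall(".//div[@type='chapter']/head")]


if __name__ == "__main__":
    print(document_title(SAMPLE))
    print(chapter_heads(SAMPLE))
```

A full XPath, XSLT, or XQuery processor offers far more than this subset, but the principle is the same: standard path expressions replace system-specific query syntax.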
Finally, due to many of the issues mentioned above, it is likely that migration from SGML to XML will result in lower production costs. There is an amazing amount of high-quality free and open source software for XML, some of which is readily available on the TEI Software page. The number of people with XML expertise is also growing rapidly, so reductions in training time and costs seem likely.
Challenges
While it is clear that there are substantial benefits in migrating TEI-encoded documents from SGML-conformant P3 to XML-conformant P4, there are some very real obstacles to be considered. The most profound of these is the expense, particularly in the short run. There will be training costs for new software, new encoding practices, and new skills (learning to write XSLT stylesheets rather than DSSSL, for example). Staff resources will be required for evaluating and testing new software or publishing systems, and for decisions regarding workflow during the conversion process. There is also the expense of migration into a new publishing system, and, indeed, the expense of researching and building the new system itself. (Workflow issues are discussed in the General Recommendations section of this report, as well as in the Case Studies.)
The long-term benefits, however, seem clear. XML has become an industry standard and provides real financial benefits to many humanities computing projects running on shoestring budgets. Not only are the various XML standards non-proprietary, but there are already several free native XML databases. The TEI Consortium has already made resources available for XML-encoded texts, such as XSLT stylesheets that convert TEI-encoded texts to HTML. As more projects adopt XML, more stylesheets, scripts, and other resources will undoubtedly become available.
Areas of Migration
The meat of your migration is likely to be converting your document instances, but there are other tasks to consider. You may need a new DTD, which might be the most recent TEI DTD or a custom model generated with the Pizza Chef tool (if that is how your old DTD was generated). You will also need to convert any DTD customizations and catalog files, and to prepare project members to work in a new processing environment. Some projects will find document instance conversion trivial but have trouble converting their DTD customizations. Others will have no significant customizations and will find conversion simple.
Document instances
A document instance is what is normally thought of as just ‘the document’. In this case, it is an SGML document marked up in TEI P3. Document conversion from SGML to XML is usually relatively straightforward, especially when prefaced by solid groundwork that analyzes mark-up and identifies potential problems. In many cases conversion tools can be used to run conversion automatically or in conjunction with an XSLT script. The process does require a certain degree of human intervention, however, and project staff should take the time to develop a reliable workflow.
DTD customizations
Please note that this documentation discusses converting DTD customizations, not the DTD itself. For the main DTD you should either use the latest version of the TEI DTD or generate a DTD using the Pizza Chef.
Some organizations and TEI users maintain DTD customizations to adapt TEI to their particular needs. A TEI DTD customization is implemented as two files written in DTD syntax. A customization file for P3 is therefore written in SGML DTD syntax and has to be converted to XML DTD syntax in order to work with P4. SGML to XML DTD conversion is more complex than instance conversion. XML DTD syntax does not support as many features as SGML DTD syntax, and it may not be possible to automatically translate an SGML DTD into a useful equivalent in XML DTD syntax. Some TEI element and attribute class names may have changed between P3 and P4, so the conversion would also need to address any resulting problems. For more information about migrating DTD customizations, see Migrating TEI DTD extensions to XML in the Practical Guide.
Catalog files
A catalog file contains information for mapping public identifiers (such as -//TEI//DTD TEI Lite P4 XML 2002-05//EN) to system locations (such as /usr/share/xml/tei/teixlite.dtd). Many SGML parsers use them to find the appropriate DTD when parsing a document instance or other entities, such as ISO sets. If you intend to use your SGML parser to parse your XML documents, you may continue to use your old catalog files. If you intend to use a new XML parser (which cannot parse SGML), you may need to convert your catalog files to XML catalog syntax.
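The conversion of simple catalog entries can be sketched as follows. This is a minimal illustration, not a complete converter: it assumes a catalog made up of PUBLIC entries with double-quoted public identifiers, ignores every other catalog keyword, and uses a function name invented for this example. The output uses the OASIS XML Catalogs namespace and its public element.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the OASIS XML Catalogs specification.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"
ET.register_namespace("", CATALOG_NS)

# Matches: PUBLIC "public identifier" "system location"
# The system location may also appear unquoted.
PUBLIC_ENTRY = re.compile(
    r'^\s*PUBLIC\s+"([^"]+)"\s+(?:"([^"]+)"|(\S+))', re.MULTILINE
)


def sgml_catalog_to_xml(sgml_catalog_text):
    """Convert the PUBLIC entries of an SGML catalog to XML catalog syntax.

    Assumption: only PUBLIC entries are handled; other keywords
    (SGMLDECL, DOCTYPE, CATALOG, and so on) are silently skipped.
    """
    catalog = ET.Element(f"{{{CATALOG_NS}}}catalog")
    for match in PUBLIC_ENTRY.finditer(sgml_catalog_text):
        public_id = match.group(1)
        location = match.group(2) or match.group(3)
        ET.SubElement(
            catalog,
            f"{{{CATALOG_NS}}}public",
            {"publicId": public_id, "uri": location},
        )
    return ET.tostring(catalog, encoding="unicode")


if __name__ == "__main__":
    old = 'PUBLIC "-//TEI//DTD TEI Lite P4 XML 2002-05//EN" "/usr/share/xml/tei/teixlite.dtd"\n'
    print(sgml_catalog_to_xml(old))
```

Any entries the sketch skips would need to be reviewed and converted by hand.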
Processing environment
Since XML is a subset of SGML, any SGML parser should be able to parse an XML document and its DTD. If you use an SGML parser, it will probably require an SGML declaration that is also appropriate for XML parsing; this declaration is packaged with many SGML parsers and some SGML to XML conversion tools, and is also available from the Task Force.
Many new parsers have become available with the advent of XML, and some organizations may prefer the functionality that is only available with new parsers (Java servlet integration, for example). The downside to changing parsers is the necessity of changing existing work processes (for example, the need to modify scripts that expect to talk to the old parser, or the need to re-educate users about different command-line arguments). Note also that XML parsers do not parse SGML.
XML parsers are more likely to stop processing upon encountering an error in the input than SGML processors (the result of some strictures in the XML specifications). It is therefore easier to use an SGML parser to collect errors, fix them all, then attempt to parse again. The more usual workflow with an XML parser may be to catch and fix errors one at a time.
Besides the parser, indexing tools, web output builders, and any tools that need to parse the input should also be evaluated. In addition, the names of the files containing the document instances in the new XML processing environment may change; SGML files typically use either a ‘.sgm’ or ‘.sgml’ suffix, whereas XML files typically use a ‘.xml’ suffix. For projects using source control, this will mean notifying the source control system of the changed filenames.
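Before touching the file system or the source control repository, it can help to draw up the list of filename changes in advance. The sketch below is one hypothetical way to do so: it takes a list of existing filenames and returns the proposed ‘.xml’ name for each SGML file. The function names are invented for this example, and the sketch does not rename anything or check for name collisions (for instance, ‘a.sgm’ and ‘a.sgml’ in the same directory).

```python
from pathlib import PurePosixPath

# Suffixes conventionally used for SGML document instances.
SGML_SUFFIXES = {".sgm", ".sgml"}


def xml_name(filename):
    """Return the proposed '.xml' filename, or None if this is not an SGML file."""
    path = PurePosixPath(filename)
    if path.suffix.lower() in SGML_SUFFIXES:
        return str(path.with_suffix(".xml"))
    return None


def plan_renames(filenames):
    """Map each SGML filename to its proposed XML filename; other files are skipped."""
    plan = {}
    for name in filenames:
        new_name = xml_name(name)
        if new_name is not None:
            plan[name] = new_name
    return plan


if __name__ == "__main__":
    for old, new in plan_renames(["texts/a.sgm", "texts/b.SGML", "notes.txt"]).items():
        print(old, "->", new)
```

The resulting plan can be reviewed by project staff and then applied with whatever rename command the source control system provides.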
Note that old SGML ISO entity sets will not work with your new XML documents unless they have been specifically altered for the purpose; they should be replaced with the XML equivalents, available in a variety of places, including the TEI website.
General Recommendations
This section identifies and discusses key issues that you should consider before migrating your documents. SGML to XML migration of literary documents is a data migration process like any other, and some care should be taken to anticipate possible problems and reduce the potential risks. A bigger and more complex document collection is more likely to have higher migration costs and risks, so a proportional amount of planning is required.
Homogeneity and collection size
The degree of difficulty a project will encounter in migration is determined at least in part by its technical diversity. If all of the documents in a collection use a common DTD (such as teilite.dtd) and have approximately the same level of markup complexity, then the migration process is fairly simple. You may be able to develop a single migration process for the whole collection and even automate it to run in batch mode. A project with multiple DTDs that use TEI-compatible customizations would be more difficult: document conversion from SGML to XML can still be automated, but some time should be reserved for building a new XML DTD with the required customizations.
The most challenging scenario is a big repository with several collections and multiple DTDs that are not fully TEI-compliant (e.g., DTDs with modifications that do not match the customizations mechanism). Document conversion in this case may be time-consuming and complex, so the project staff should be sure to reserve extra time and resources for unexpected problems.
Target production environment
It would be wise to define, test, and tweak your target XML processing environment before migrating whole sets of documents in order to avoid snags in processing and maintaining the resulting XML files. If possible, choose components for your working XML environment first, and be sure to test, integrate, and fine-tune all tools, so that you can start full production of new documents and maintain or reprocess old SGML documents immediately after migration.
Be sure to check your test documents in the new processing environment before migrating the whole lot and to make sure that your processes (e.g., validation, error-checking, printing procedures, transformation to HTML) will work in XML.
Training and XML tools
If the project staff is already comfortable with SGML, they should have no problem adjusting to XML. The new XML tools are more user-friendly, so your staff may actually find them easier to work with. They will still require proper training, however, and the chance to learn about related technologies such as XSL for rendering and transforming XML documents.
There are many good XML tools on the market (editors, parsers, scripting languages), but since new tools come out regularly and the Task Force does not wish to endorse any products, this document does not discuss tools in any detail. There is a tools page within the TEI website where specific tools are reviewed, and some migration tools are discussed in the section Instance Conversion: Tools in the Practical Guide.
Migration strategy
When planning your migration strategy, you should first consider factors such as the size of your SGML collection, available time and staff, and the complexity of your markup. As noted above, bigger collections and more complex mark-up will take longer to migrate. Early in the planning process, you will need to decide whether to migrate existing SGML texts while simultaneously producing new texts in XML or to stop production of new materials until all texts have been migrated and then restart production in XML only. The first option supports continuous production, but it takes longer and is harder to finish. However, if you have a large, complex set of documents and a tight production schedule, this may be the best choice. The second option may seem drastic, but if your migration set is small and relatively simple and your project can tolerate a halt to production, it is a reasonable choice. It may be helpful to take care of training needs and execute a few practice runs before stopping production. The migration may take longer than expected, but you will be able to quickly and efficiently deal with any unexpected complications.
There are a wide variety of possible workflow models for migration, and some will work better than others. See the Migration Case Study Reports for examples of various different scenarios.
Other recommendations
-
Checklists
Checklists are exceedingly useful for remembering details and enforcing strict attention to procedures. They are also helpful for delegating tasks and regulating workflow amongst a larger staff. A sample checklist to compare and classify sample TEI SGML documents and their properties is provided in the Technical Checklist for TEI/SGML documents (MIW04). This checklist is designed to help assess the overall complexity of the migration and to highlight any particular areas of difficulty.
-
Backup copies
If your project follows the usual procedures for data protection, you should already have a backup of your SGML files. If you don't, or if you don't trust the backup you have, make a backup copy of your SGML files before you start migrating them to XML. If something goes awry in the conversion process, you will be protected against any substantial loss of data.
-
Incremental approach
Design the migration process after a series of small tests. You might first test the procedure on a single small file, then adjust it and resolve detected problems until it performs reasonably well. Try it on a small set of files, then a bigger set, resolving problems as they arise, until it seems reliable.
-
Check the results
The XML output should be verified, in part to check for errors that may have been introduced during the migration process, and in part to confirm that all expected elements are in the expected locations. There are several features which should be verified:
- Well-formedness ensures that the document complies with the basic rules of XML and can therefore be processed without problems by XML tools. XML editors and parsers will usually do this basic verification automatically.
- Validity proves that the document complies with the markup rules stated by the corresponding DTD. This verification is performed by validating parsers, included in most XML editors. It would be a good idea to implement a batch process to automatically check well-formedness and validate all migrated documents.
- A robust quality assurance process will catch any errors introduced by the conversion that cannot be detected through parsing and validation alone. Your project should already have quality control procedures in place; these should be adapted for analyzing the newly converted XML documents. See Migration Workflow in the Practical Guide for some additional suggestions.
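The batch check for well-formedness mentioned above might be sketched as follows. This is a minimal illustration that uses the non-validating parser in Python's standard library, so it covers well-formedness only; checking validity against the DTD would require a validating parser, which is assumed to be run separately. The function names and the idea of passing documents as a name-to-text mapping are choices made for this example.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if the text is well-formed XML, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


def check_batch(documents):
    """Check a mapping of document name -> XML text.

    Returns a mapping of name -> error message for the documents that fail,
    so that all problems can be collected in one pass rather than one at a time.
    """
    failures = {}
    for name, text in documents.items():
        error = check_well_formed(text)
        if error is not None:
            failures[name] = error
    return failures


if __name__ == "__main__":
    batch = {"good.xml": "<p>ok</p>", "bad.xml": "<p><hi>unclosed</p>"}
    for name, error in check_batch(batch).items():
        print(name, "is not well-formed:", error)
```

In practice the mapping would be built by reading the migrated files from disk, and the report of failures would feed back into the conversion scripts.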
Special Considerations in Migration
The migration strategies in this report should be broadly applicable to all TEI users. However, some TEI repositories are larger and more complex than others, and data managers have varying amounts of money, time, and programming expertise to devote to migration. Different goals also suggest different approaches — some projects will require only a minimal conversion (to allow them to get started with XSLT, for example), while others will undertake a more comprehensive conversion, perhaps to facilitate a future transition to TEI P5. Each institution will want to weigh the constraints and desired results and plan their conversion accordingly.
For example, an institution with a very large collection of legacy SGML data or a small technical staff may decide that migration to XML is feasible only if the conversion process can be fully automated. The procedure for document instance conversion given in the Practical Guide describes a generic batch script that uses out-of-the-box tools (osx and an XSLT processor) to automatically transform a batch of source files with minimal human intervention. This batch process is simple, effective, easily customizable, and requires very little technical expertise to implement. Such a process would not reap all of the potential rewards of a more thorough conversion, but would suffice to bring SGML collections into XML so that new tools and XML-related technologies could be exploited.
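The general shape of such a batch process can be sketched as below. This is not the script from the Practical Guide; it is a hypothetical outline that assumes osx is invoked with the SGML file as its argument and writes XML to standard output, that xsltproc is the XSLT processor, and that a local stylesheet named cleanup.xsl exists. The intermediate ‘.raw.xml’ filename is likewise an invention of this example. Real collections will usually also need a catalog and an SGML declaration to be configured for osx.

```python
import subprocess
from pathlib import PurePosixPath


def conversion_commands(sgml_file, stylesheet="cleanup.xsl"):
    """Build the two-step command plan for one SGML file.

    Each step is (argv, capture_to): when capture_to is not None, the
    command's standard output is to be saved in that file.
    """
    src = PurePosixPath(sgml_file)
    raw = src.with_name(src.stem + ".raw.xml")  # intermediate osx output
    out = src.with_suffix(".xml")               # final converted document
    return [
        (["osx", str(src)], str(raw)),
        (["xsltproc", "-o", str(out), stylesheet, str(raw)], None),
    ]


def run(commands):
    """Execute a command plan, stopping at the first step that fails."""
    for argv, capture_to in commands:
        if capture_to is None:
            subprocess.run(argv, check=True)
        else:
            with open(capture_to, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    # Print the plan rather than running it; call run() once the tools are installed.
    for argv, capture_to in conversion_commands("texts/a.sgm"):
        print(" ".join(argv), ">" + capture_to if capture_to else "")
```

Separating the plan from its execution makes it easy to review exactly what will be run on a large collection before any files are written.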
While this approach will accurately preserve the information set of the source documents, it may alter certain layout and formatting characteristics, cause defaulted attributes from the DTD to be included in the output, or expand internal and external entities in a way that alters the structure of the output documents. The resulting XML files are technically equivalent to the source files, but they may be less ‘human-readable’ and therefore more difficult for data managers and their staff to edit and maintain. When converting legacy data to XML, many repository managers want to preserve the appearance and structure of the original SGML files as much as possible. In some cases, they rely on whitespace or other features outside the information set for local processing and delivery; such workflows are notoriously unreliable and should ideally be phased out as part of the conversion process. However, preserving readability and modularity are legitimate goals that require a ‘minimally invasive’ approach to conversion. Such an approach makes only those changes necessary for XML conformance and generates result documents that are as much like the source documents as possible with respect to whitespace, comments, attributes, and entities. The Migration Workflow section of the Practical Guide describes many of the techniques used in such a conversion.
While an easy or minimal conversion will result in basic XML compliance, the migration process can provide an opportunity to thoroughly re-evaluate legacy data and make more comprehensive changes. Some projects will therefore use this opportunity to undertake a robust conversion that looks forward to future developments in XML and its related standards and prepares for the changes that will be introduced in the P5 guidelines.
Appendix: Potential Impact of Future Versions of the Guidelines on Migration Issues
Even as we work on the issues involved in migration from TEI P3 (SGML) to TEI P4 (XML), the TEI is also busy working on revisions to the Guidelines that will eventually become P5. While we cannot predict exactly what P5 will be like, and therefore cannot pinpoint the exact migration issues that will be encountered, we can predict some issues, and give some general advice.
Although P5 will not describe the same encoding system, but rather a new, somewhat different, and (we hope) improved one, it will nonetheless describe an XML system, and hence we anticipate that much of the migration from P4 (XML) instances to P5 will be easily automatable using XML-to-XML transforms (e.g., XSLT). Migration from P4 (SGML) to P5 would therefore likely begin by transforming to XML. For this reason the current set of migration recommendations is almost exclusively about migration to P4 (XML) and not P4 (SGML).
Schema Language
Just as P4 is itself written in XML using DTDs, and its encoding system can be (partially) machine-enforced using XML DTDs, P5 will be itself written in XML using RelaxNG, and its encoding system will be (partially) machine-enforced using RelaxNG, XML DTDs, or W3C XML Schema (XSD); the base language in which the constraints will be expressed will be RelaxNG.
While switching schema languages for such a large project as the TEI involves an enormous amount of work, the advantages should be quite significant. First and foremost, RelaxNG and XSD will permit better machine validation of TEI documents. These schema languages are capable of expressing a variety of constraints that DTDs cannot: e.g., that the content of a <date> first child of a <change> in the <revisionDesc> is in fact a valid date, or that if a to attribute is specified on some sort of linking element, then a from attribute must be specified also. (The latter is currently a restriction placed on the elements of the class xPointer by the prose of P4, but unenforceable by the DTDs.)
Furthermore we expect that one important result of using RelaxNG as the base constraint language will be that the TEI system will be easier to read, understand, modify, and extend. In particular, the TEI class system which was (brilliantly) jury-rigged via an intricate system of DTD parameter entities should be much more naturally expressible in RelaxNG, and therefore easier for end users to understand and thus take advantage of.
Entity references have typically been used for three purposes:
- to represent unusual characters (e.g., &eacute; may be used to represent ‘é’)
- to easily manage chunks of boiler-plate text (e.g., &copyright; may be used to encode ‘Copyright © 2001 Tax Executives Institute, Inc.’)
- to manage keeping a large document in separate files (e.g., &license; may be used to refer to the system file /usr/share/common-licenses/GPL-2)
Representing unusual characters has become a very different animal with the advent of XML, as XML files generally use ISO 10646 (Unicode) character encoding. For many of the characters for which entity references have been used in the past, direct entry of the character may now be possible (e.g., just typing in the ‘©’). For those system and character combinations for which direct entry is not possible or not desirable, a numeric character reference can be used instead (e.g., &#xA9;). While there is no doubt that an entity reference like &eacute; is far more mnemonic than the equivalent numeric character reference, each method should produce the desired result.
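Replacing character entity references with numeric character references is easy to automate. The sketch below is one hypothetical approach: it relies on the entity names known to Python's standard html.entities module, which overlap with the ISO entity sets but are not identical to them, so a real migration would need a mapping built from the entity sets the project actually uses. The five entities predefined by XML and any unrecognized names (such as project-specific boilerplate entities) are left untouched.

```python
import re
from html.entities import name2codepoint

# Entities predefined by XML itself; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# A simple pattern for named entity references such as &eacute;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def to_numeric_references(text):
    """Replace known character entity references with hexadecimal numeric references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own entities and unknown (e.g. boilerplate) entities alone.
            return match.group(0)
        return f"&#x{name2codepoint[name]:X};"

    return ENTITY_REF.sub(replace, text)


if __name__ == "__main__":
    print(to_numeric_references("caf&eacute; &copy; 2001 &amp; &license;"))
```

Because the substitution works on the raw text, it should be applied with care to files that contain CDATA sections or entity declarations.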
When entity references were used to refer to entire system files, it is likely that users of P5 who wish to validate with RelaxNG or W3C Schema will want to replace them with either XInclude elements ( http://www.w3.org/TR/xinclude/ ) or some combination of XInclude and XIndirect ( http://www.w3.org/TR/XIndirect ).
When entity references have been used to manage boilerplate text, users who no longer wish to use entities will likely either wish to use XInclude elements or locally created special elements and stylesheets.
ID/IDREF, the native XML mechanism for pointing within a document, is not available to users of RelaxNG or W3C Schema. (Note however that there exists an approved extension to RelaxNG which does support ID/IDREF, for which there is at least one implementation.) Thus other mechanisms, most likely schemes of XPointer, will be available in P5 for pointing. Document systems created using these other mechanisms will require validation of their pointers with software other than the XML parser. While it is not yet entirely clear what TEI pointers will look like, it is very likely that software to validate them will be available prior to the publication of P5, and that migration of instances using ID/IDREF to the new system will be automatable.
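To give a sense of what validating pointers outside the XML parser involves, the following sketch checks that every pointer in a document resolves to an identifier declared somewhere in the same document. It assumes P4-style markup, in which an id attribute carries the identifier and a target attribute carries one or more space-separated identifier values; the eventual P5 pointer syntax may well differ, and the function name is invented for this example.

```python
import xml.etree.ElementTree as ET


def dangling_pointers(xml_text, id_attr="id", pointer_attr="target"):
    """Return (element name, pointer value) pairs whose target id does not exist."""
    root = ET.fromstring(xml_text)
    # First pass: collect every identifier declared in the document.
    ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    # Second pass: check each space-separated pointer value against that set.
    missing = []
    for el in root.iter():
        value = el.get(pointer_attr)
        if value is None:
            continue
        for token in value.split():
            if token not in ids:
                missing.append((el.tag, token))
    return missing


if __name__ == "__main__":
    sample = (
        '<text><body><p>See <ref target="n1">note 1</ref> and '
        '<ref target="n9">note 9</ref>.</p>'
        '<note id="n1">A note.</note></body></text>'
    )
    print(dangling_pointers(sample))
```

A check of this kind reproduces, for a single document, the guarantee that a validating parser gives for ID/IDREF attributes declared in a DTD.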
While supporting user customizations written in any of the available schema languages (which will be DTD, RelaxNG, and XSD for the foreseeable future) is a laudable long-term goal, it is almost assured that at initial release of P5 only mechanisms for writing customizations in RelaxNG will be in place. How long it takes to support other languages depends in large part on whether or not there is user demand, and how difficult it turns out to be.
- http://www.relaxng.org/tutorial-20011203.html
- http://www.relaxng.org/compact-tutorial-20030326.html
- http://www.zvon.org/xxl/XMLSchemaTutorial/Output/index.html
Modules
The TEI system comprises a variety of tag-sets or modules. For P5 we expect the entire system to be somewhat simpler (e.g., no more auxiliary modules), but we also expect there to be a few new modules. In particular, we are hoping that modules for manuscript description, graphics and multimedia, and document authoring will be available. While the existence of new modules will not be an impediment to simple migration, it may well be the case that users will find some of the new material useful, and will prefer to re-encode some features using the new modules instead of performing a straightforward migration.
Several modules, including those for feature structures and terminological databases, are undergoing or likely to undergo major revisions. It is even possible that the module for terminological databases will be dropped completely, although it seems unlikely. More likely some sort of recommendations that are compatible with ISO 16642 (Terminological Markup Framework) will replace the current chapter.
Migrating to P5
The first thing to remember when considering the migration to TEI P5 is that P4 is not going to disappear. Unlike P3, P4 will continue to be supported for years after P5 is published. However, no new development work will go into P4. All new encoding solutions will appear for P5 without corresponding P4 modules. So if you are not anxious to take advantage of the superior validation available via the schemas of P5, do not need to take advantage of any of the new material in P5 (e.g. manuscript description), and do not plan to change your customizations (which will be easier in P5), there will be no rush to migrate at all.
It is possible that P5 instances will look quite different from their P4 counterparts, and it is a given that no instance that is valid against P4 will be valid against P5 without significant changes. However it is also likely that these significant changes will be all but completely automatable via an XML-to-XML transform (e.g. with XSLT).
As observed above, however, P4 user DTD customizations will need to be rewritten in RelaxNG. Although this is unlikely to be an easily automatable process, it is likely to be far easier than writing them in the DTD language was in the first place.
If you are migrating from a solid existing TEI P3 (SGML) base, you may be wondering whether to migrate to P4 at all, and be considering going directly to P5. While there is nothing specifically wrong with going directly to P5, it is likely that very little, if anything, would be lost by going through P4 first, and the potential gain (being able to use XML-based tools in the interim) is significant. Furthermore, the TEI does not have any plans to provide direct support for P3 to P5 migrations. Since it will be months before P5 is finalized, any attempt to migrate to P5 now runs a high risk of requiring alteration later. Migrating to P4 now, and then to P5 at a leisurely pace later on, is likely to be a lot less work.