MI W 03: Practical Guide to Migration of TEI Documents from SGML to XML
Contents
- Introduction
- Migration workflow
- Instance Conversion: Tools
- Handling SDATA entities in the conversion process
- Migrating TEI DTD extensions to XML
Introduction
This report contains practical recommendations for migrating TEI data from P3 SGML to P4 XML. It provides instructions for performing the conversions described in TEI MIW02: Strategic Considerations in Migration of TEI Documents from SGML to XML, and is written for the programmers or technical staff who will perform the actual conversion. We suggest that non-technical readers should read MIW02 first, so as to better understand the issues involved in the migration. The workflows described in this document are general and may be applied to all TEI projects, so readers interested in specific examples of scripts, batch conversion tools, and the like should consult the software tools folder and the migration case studies that comprise TEI MIW06: Migration Case Study Reports.
As described in the Areas of Migration section of MIW02, a data migration involves several distinct steps: converting document instances from SGML to XML, obtaining an XML DTD and converting any DTD extensions, and modifying the processing environment (including catalog files and applications such as parsers and editors) to accommodate XML. Because instance conversion is often the most substantial part of the migration process, the bulk of this report discusses that topic. Specifically, the first section presents a recommended workflow for instance conversion, while the second section discusses conversion tools. The third section discusses conversion of SDATA entities, which can be one of the trickier aspects of instance conversion. The final section of this report provides a tutorial in converting DTD extensions to XML.
Migration workflow
This section discusses recommended procedures for migrating your data from SGML to XML. It focuses mainly on a schematic workflow for individual document instances but also briefly addresses some other considerations related to your processing environment and DTDs.
Instance conversion workflow
There are four distinct steps for migrating document instances from SGML to XML. First, of course, the documents need to be converted. You will then probably need to normalize the case in your tags (i.e., be sure that the tags use proper capitalization), since XML is case-sensitive and your SGML may not be. You may also wish to format the files to make them easy to read. Finally, we strongly suggest that you develop procedures for checking your results for any unexpected bugs that may have been introduced during the migration process. These steps are discussed below.
SGML to XML conversion
There are various tools available for converting SGML documents to XML, some of which are discussed below. We currently recommend osx as the best available tool. However, osx does not preserve non-significant whitespace in its output, so the resulting files may be difficult to read; you may wish to run them through a "pretty printing" program to reformat them.
Correcting case and formatting your output
SGML is not necessarily case sensitive: the SGML declaration sets NAMECASE GENERAL to either YES or NO. If it is set to YES, then generic identifiers (i.e., element names), attribute names, and attribute values (that are tokens) do not need to follow the same case usage as the DTD, so <TEIHEADER> , <teiHeader> , or even <teIHEAder> will parse correctly. If it is set to NO, they will be case-sensitive and the parser will complain about incorrect capitalization. Similarly, the declaration sets NAMECASE ENTITY to either YES or NO, specifying case-sensitivity of entity names. The TEI SGML declaration set NAMECASE GENERAL to YES and NAMECASE ENTITY to NO, so unless a project specifically changed these settings, entity names will be in the correct case, but generic identifiers, attribute names, and attribute value tokens need not be.
XML, on the other hand, is case sensitive. Element names, attribute names, and attribute values that are not CDATA, NMTOKEN, or NMTOKENS must always have correct capitalization. If your SGML documents did not follow appropriate case usage or your XML conversion software did not preserve the correct case, the resulting XML documents will not be valid.
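As a rough illustration of case normalization, the following Python sketch rewrites element names to a canonical capitalization. The `CANONICAL` list is a hypothetical subset chosen for the example; a real table would have to be derived from the DTD in use. The regular expression looks only at start and end tags, so it is an assumption of this sketch that the input contains no tag-like text inside comments or marked sections, and attribute names are not handled.

```python
import re

# Hypothetical subset of canonical element names; a real table would be
# generated from the XML DTD actually used by the project.
CANONICAL = ["TEI.2", "teiHeader", "fileDesc", "titleStmt", "persName", "div1", "p"]
BY_FOLDED = {name.lower(): name for name in CANONICAL}

# Matches the name in a start tag or end tag, e.g. "<TEIHEADER" or "</P".
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9.\-]*)")


def normalize_tag_case(text):
    """Rewrite element names to the capitalization listed in CANONICAL.

    Names that are not in the table are left unchanged.
    """
    def fix(match):
        prefix, name = match.groups()
        return prefix + BY_FOLDED.get(name.lower(), name)

    return TAG_NAME.sub(fix, text)


print(normalize_tag_case("<TEIHEADER><P>x</P></teIHEAder>"))
```

A stylesheet-based approach works on the parsed document and is therefore more robust than this text-level substitution; the sketch is only meant to make the nature of the problem concrete.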
A normal but often unexpected side effect of instance conversion is the inclusion of default attributes from the DTD in the XML output. For instance, every TEI element has a TEIform attribute, for which the default value is the same as the generic identifier of the element. This attribute rarely needs to be specified in a typical TEI document. However, many conversion tools, including osx, cannot distinguish between defaulted attributes from the DTD and those explicitly defined in the document instance, and will include both in the XML output. Thus <P>Hello World!</P> will become <p TEIform="p">Hello World!</p>. The inclusion of default attributes will not affect validation or XML processing of the files, but it can significantly (and unnecessarily) increase their size, and make the files harder for humans to read.
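The effect of removing defaulted attributes can be sketched in a few lines of Python using the standard library's ElementTree. The function name is an assumption made for this example, and the sketch only handles TEIform, only when its value equals the element name. Note that ElementTree expands entity references while parsing and will reject references to undeclared entities, so this approach suits files in which entities have already been resolved.

```python
import xml.etree.ElementTree as ET


def strip_default_teiform(root):
    """Remove TEIform attributes whose value equals the element name.

    Returns the number of attributes removed. A TEIform value that differs
    from the element name carries information and is left in place.
    """
    removed = 0
    for element in root.iter():
        if element.attrib.get("TEIform") == element.tag:
            del element.attrib["TEIform"]
            removed += 1
    return removed


doc = ET.fromstring('<p TEIform="p">Hello World!</p>')
strip_default_teiform(doc)
print(ET.tostring(doc, encoding="unicode"))  # prints <p>Hello World!</p>
```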
You can use an XSLT stylesheet, such as tei2tei.xsl, to normalize case and clean up your XML files. It will convert TEI element and attribute names into their proper XML mixed case, remove default attributes such as TEIform, and format the output nicely. If you use osx for your conversion, tei2tei.xsl will probably prove useful (please see the discussion of post-processing tools, below, for more information). Otherwise, you can write your own stylesheet to correct any problems.
Checking your results
It is always a good idea to run a bug check after any data conversion. Certain basic syntax features will be checked by XML validation, but it is likely worthwhile to perform at least spot-checks on random pieces of your output. After ensuring that the output is well-formed XML and then valid TEI, it may be informative to compare the ESIS output from a parse of the original SGML with the ESIS output from a parse of the new XML. You may also want to run the output through XSLT stylesheets that verify certain types of data, format the XML so that the editors can check the output, or even generate HTML pages from the XML.
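The ESIS comparison mentioned above could be automated along the following lines. This Python sketch assumes the line-oriented output format of nsgmls or onsgmls, in which each line begins with a command character: `(` and `)` for element start and end, `A` for an attribute, and `-` for data. It folds names to lower case, drops implied attributes and (by assumption) TEIform, and ignores all other commands. The function names are invented for this example, and differences in the case of attribute value tokens or in whitespace would still be reported.

```python
import difflib


def normalize_esis(esis_text, ignore_attributes=("teiform",)):
    """Reduce ESIS output to a list of case-folded structural events."""
    events = []
    for line in esis_text.splitlines():
        if not line:
            continue
        command, rest = line[0], line[1:]
        if command in "()":
            # Element start or end: fold the generic identifier.
            events.append(command + rest.lower())
        elif command == "A":
            name, _, value = rest.partition(" ")
            if value == "IMPLIED" or name.lower() in ignore_attributes:
                continue
            events.append("A" + name.lower() + " " + value)
        elif command == "-":
            events.append(line)
        # All other commands (e.g. the final "C") are ignored in this sketch.
    return events


def esis_differences(sgml_esis, xml_esis):
    """Return unified-diff lines between two normalized ESIS streams."""
    return list(difflib.unified_diff(
        normalize_esis(sgml_esis), normalize_esis(xml_esis),
        "sgml", "xml", lineterm=""))
```

An empty result suggests that the two parses agree on element structure, explicit attributes, and character data; any non-empty result is a starting point for manual inspection rather than proof of an error.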
Batch scripts
You may want to use a batch script to automate these conversion steps, especially if you are working with a large group of documents. For example, you might write a Perl or shell script that backs up a directory of SGML files, then runs each file through an osx conversion, pipes the results of that process through an XSLT processor using the tei2tei.xsl stylesheet, and validates the output before writing the XML file to a new directory. Depending on your particular platform, software, and documents, you may need to take additional steps to preserve entity references through the XSLT process. While it is impossible to include an all-purpose batch script here, a selection of sample scripts developed by the migration work group members is available on the Tools page.
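A batch script of the kind described above might be organized as in the following Python sketch. It is not one of the work group's scripts. The use of `xsltproc` as the XSLT processor and of `xmllint --valid --noout` for validation are assumptions of this example (any equivalent tools would do), as are the file naming scheme and the function names. The only osx option used is the preserve-case option discussed later in this report.

```python
import subprocess
from pathlib import Path


def build_commands(sgml_file, out_dir, stylesheet="tei2tei.xsl"):
    """Return (command, output_path) pairs for one SGML file.

    output_path is None for steps that only report errors.
    """
    sgml_file, out_dir = Path(sgml_file), Path(out_dir)
    raw = out_dir / (sgml_file.stem + ".raw.xml")
    final = out_dir / (sgml_file.stem + ".xml")
    return [
        # 1. SGML to XML; osx writes the XML to standard output.
        (["osx", "--xml-output-option=preserve-case", str(sgml_file)], raw),
        # 2. Normalize case, drop defaulted attributes, indent.
        (["xsltproc", stylesheet, str(raw)], final),
        # 3. Validate the result against the DTD named in its DOCTYPE.
        (["xmllint", "--valid", "--noout", str(final)], None),
    ]


def convert(sgml_file, out_dir):
    """Run the pipeline, stopping at the first step that fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command, target in build_commands(sgml_file, out_dir):
        result = subprocess.run(command, capture_output=True, check=True)
        if target is not None:
            target.write_bytes(result.stdout)
```

Because `check=True` treats any non-zero exit status as a failure, a project whose converter reports warnings through its exit status may prefer to inspect the status and the captured error output itself.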
Processing environment and DTDs
The processing environment refers to the software that you use to manipulate your documents. It includes editing tools, parsers, transformation engines, stylesheets, and catalogs. You and your staff should carefully consider what kind of tools will best suit your immediate migration needs as well as your project's long-term development. Tools are discussed below.
Catalogs
A catalog maps external identifiers to URI references. If you have an SGML catalog, you will need to convert it to XML syntax or write a new XML catalog. We will not address this task in any detail, but there is a detailed discussion of XML catalogs in the specifications from the OASIS Entity Resolution Technical Committee.
Character entity references
You may need to take special steps to convert your SGML character entity references to XML, especially if they do not expand to Unicode. Please see Handling SDATA entities in the conversion process, below.
The DTD
You must, of course, have an XML DTD for your new XML documents, so you will need either to migrate your SGML DTD or generate a new XML DTD. If you have been using an unmodified TEI DTD or one generated from Pizza Chef, it is a simple process.
- If you used the SGML TEI Lite DTD (teilite.dtd), you can simply substitute the XML version (teixlite.dtd). The new DTD should be usable to validate your new XML files with no problem.
- If you are using a flattened TEI SGML DTD generated from the Pizza Chef and are not using extensions, you can now go back to Pizza Chef and generate an XML-compliant DTD.
- If you used an SGML DTD with extensions, you will need to migrate the extensions manually and then perhaps regenerate a flat DTD from the Pizza Chef. This is discussed in detail in Migrating TEI DTD extensions to XML, below.
- If you made custom modifications to your SGML DTD by hand, you will have to redo those modifications in an XML DTD. You would be better off using the migration as a chance to review the modifications and determine whether or not they are still necessary. If you decide to keep them, you will want to recreate them using the DTD modification procedures outlined in chapter 29 of the TEI Guidelines, Modifying and Customizing the TEI DTD.
Instance Conversion: Tools
By far the most widely accepted tool in SGML to XML conversion is osx, based on James Clark's sx. 1 Besides being generally accepted, osx is free, readily available, open source software. Therefore, this section will concentrate on osx. Other tools are addressed later, but particular conversion issues using them are not included.
osx
General Information
James Clark originally wrote sx in C++ as a command-line tool and part of his SP package. His version is still available at http://www.jclark.com/sp/index.htm . However, he no longer actively maintains SP; OpenSP, the current recommended version, is maintained as part of the SourceForge OpenJade project. In this distribution, sx is called osx. This task force recommends at least version 1.5.1 of osx.
The osx tool converts only instances, not DTDs. Note that in some cases, such as when notations are declared, osx will generate an internal DTD subset in its output.
Comments
By default, osx does not preserve comments. However, there is an --xml-output-option=comment switch which does preserve comments.
Prolog
The osx tool will output an XML declaration (i.e., <?xml version="1.0" encoding="UTF-8"?>) if your encoding is UTF-8; other output encodings may be requested. When circumstances require the inclusion of an internal subset (on account of entity declarations, notation declarations, etc.), osx will include that as well, with the appropriate root element name, and, by default, no SYSTEM or PUBLIC identifier specified. The user may specify a --dtd_location=dtd-file option, in which case osx will use SYSTEM dtd-file as the external identifier in the DOCTYPE declaration, if one is being output.
Entity preservation
By default, osx resolves all entities (internal and external) and includes the processed result in its output. If the SGML input file includes references to many external entities, the default result will nevertheless be a single output file. The user may request that the file structure be preserved by specifying --xml-output-option=no-expand-external, in which case all included files will be converted and the appropriate entities will be declared and referenced. There is a corresponding --xml-output-option=no-expand-internal to request the preservation of internal entities and their declarations.
SDATA entities
XML does not support SDATA entities, and there is no widely accepted means of expressing them in XML. When asked to preserve internal entities, by default osx treats SDATA entities as general internal entities (i.e., simply replaces their declarations with the equivalent declaration of a general internal entity and preserves all references to them as in the original). The --xml-output-option=sdata-as-pis switch requests that osx instead replace their definitions with a general internal entity, the content of which is a processing instruction (<?sdataEntity entityName entityReplacementText ?>). If osx is expanding internal entities (replacing references to internal entities with the entity replacement text) and SDATA entities are referenced inside attribute values (where markup is not allowed), requesting that an SDATA entity be replaced with a processing instruction may result in output which is not well-formed, so this option should be used with caution. For more information about the issues involved in converting SDATA entities, see Handling SDATA entities in the conversion process, below.
Casing
By default, osx will convert all element names to uppercase. To avoid this incorrect behavior, the --xml-output-option=preserve-case option (available only in versions 1.5.1 and later) should always be specified.
Post-processing
The osx tool will not preserve non-significant whitespace (e.g., space between elements) in its output. If you want your converted files to have neatly wrapped lines and indented elements for human readability, you should use another tool to reformat osx's output. Many XML editing tools will do this.
Sebastian Rahtz's tei2tei.xsl stylesheet will provide some other post-processing (case normalization and removal of attributes which have default values). 3 It will indent the output nicely, but it will also resolve entity references, which may not be a desired behavior. You could use xmllint to escape entity references first and then tei2tei.xsl to create output with entity references in their original state. Some projects may wish to develop their own stylesheets for post-processing. Note that not all systems will be able to apply an XSLT stylesheet to very large input documents, due to memory limitations.
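One way to carry entity references through an XSLT step unchanged is to hide them behind a textual placeholder beforehand and restore them afterwards. The Python sketch below illustrates that idea; it is a substitute technique shown for illustration, not a description of what xmllint does. The `{{ent:name}}` placeholder syntax and the function names are assumptions of this example, and the sketch presumes that the placeholder string does not otherwise occur in the documents.

```python
import re

# The five entities predefined by XML are left alone, as are numeric
# character references (which do not match the pattern below).
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")
PLACEHOLDER = re.compile(r"\{\{ent:([A-Za-z_][\w.\-]*)\}\}")


def protect_entities(text):
    """Replace named entity references with inert placeholders."""
    def hide(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return "{{ent:%s}}" % name

    return ENTITY_REF.sub(hide, text)


def restore_entities(text):
    """Turn placeholders back into entity references."""
    return PLACEHOLDER.sub(lambda m: "&%s;" % m.group(1), text)
```

The protected file no longer references any undeclared entities, so it can be processed by a stylesheet without a DTD; the restored output, of course, needs the entity declarations again before it can be parsed.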
Other tools
Arbortext's Epic
Arbortext's Epic SGML editor will perform XML conversion, though the conversions cannot be batched and must be run by hand inside the editor.
n2x
n2x is an open source SGML to XML conversion tool written in Python. Instead of accepting SGML input, it expects the output of nsgmls (a James Clark SP tool — the same parser used by sx) or onsgmls (part of the OpenSP distribution), which is an ESIS stream. It converts that stream into XML. It reads an sdata.py file to map SDATA entities into hexadecimal Unicode characters, a useful alternative to osx's policy of resolving them as dictated by the DTD, which is often not what the user wants. As with osx, there is a runtime option not to resolve SDATA entities but instead preserve them as references. n2x is missing options that are available in osx, so while it is not a substitute for osx it may be a good alternative for some situations.
XMetaL
XMetaL, an XML editor, will convert individual SGML instances to XML, though there are some issues with the process. This tool is best used for smaller-scale conversion projects. For more information, go to XMetaL's Help menu, choose the "Contents" tab, then click "Working with files", "Saving a document", and "Switching between XML and SGML." XMetaL works best with TEI Lite or a DTD generated by the Pizza Chef; users have reported problems with other DTDs. By default, XMetaL will output all tags in upper case. You can modify the SGML declaration to prevent this or you can massage the output with tei2tei.xsl, which will normalize case. Unlike many conversion tools, XMetaL will not add defaulted attributes and it does preserve all entity references. You will need to edit the XML output document to remove the SGML internal subset.
Handling SDATA entities in the conversion process
General remarks
SDATA entities are "special entity references" which were available in SGML but do not exist in XML. This section will give some simple recommendations for handling them in the migration from SGML to XML.
In the SGML world, SDATA entities have been used mainly to provide a handle to characters that were not available in the coded character set used by a document instance. Accordingly, this use will be discussed in some detail and other cases will be briefly mentioned.
Public entity reference sets
In XML the document character set is specified as Unicode, which currently includes almost 100,000 characters. John Cowan prepared a list of SGML public entity mappings to Unicode, which is available via ftp at the Unicode website and from OASIS. For entity references in this list the conversion process is straightforward, but other cases may require more tweaking.
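For entity names that do have a Unicode mapping, the conversion amounts to a table lookup. The Python sketch below replaces named references with hexadecimal numeric character references and reports the names it could not map. The four-entry table is an illustrative subset typed in for this example, not an extract from the mapping file mentioned above; in practice the table would be loaded from such a file. The function name is likewise an assumption.

```python
import re

# Illustrative subset of ISO entity names and their Unicode code points.
ISO_TO_UNICODE = {
    "eacute": 0x00E9,  # LATIN SMALL LETTER E WITH ACUTE
    "agrave": 0x00E0,  # LATIN SMALL LETTER A WITH GRAVE
    "nbsp": 0x00A0,    # NO-BREAK SPACE
    "mdash": 0x2014,   # EM DASH
}
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def entities_to_char_refs(text, mapping=ISO_TO_UNICODE):
    """Return (converted_text, unmapped_names).

    Mapped references become numeric character references; predefined XML
    entities and unmapped names are left untouched.
    """
    unmapped = set()

    def replace(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        if name in mapping:
            return "&#x%04X;" % mapping[name]
        unmapped.add(name)
        return match.group(0)

    return ENTITY_REF.sub(replace, text), unmapped
```

The set of unmapped names is the useful by-product: it is the list of references that need one of the more involved treatments discussed in the rest of this section.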
Conversion of SDATA entities representing characters that exist in Unicode
Conversion of SDATA entities representing characters that do not exist in Unicode
It should also be noted that a significant number of characters have been added to Unicode since John Cowan's file was created. Specifically, a large number of mathematical symbols have been added, some of which did appear in the ISO public entity sets but did not have a mapping to Unicode when the mapping file was created. A good place to look up characters is the search interface at the Letter database. Another starting point for this type of search is the Unicode Code Charts.
- Assign code points from the private use area (PUA)
- Use markup constructs to represent these characters
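If a project chooses the private use area, the assignments need to be stable and recorded. The following Python sketch hands out code points from the Basic Multilingual Plane private use range (U+E000 to U+F8FF) in order of first appearance and never changes an existing assignment. The function name and the entity names in the usage example are hypothetical.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF


def assign_pua(entity_names, table=None):
    """Extend a name-to-code-point table with private use assignments.

    Existing assignments in `table` are kept; new names receive the next
    free code point after the highest one already used.
    """
    table = dict(table or {})
    next_code_point = max(table.values(), default=PUA_START - 1) + 1
    for name in entity_names:
        if name in table:
            continue
        if next_code_point > PUA_END:
            raise ValueError("private use area exhausted")
        table[name] = next_code_point
        next_code_point += 1
    return table


table = assign_pua(["yoghvar", "abbrsign", "yoghvar"])
table = assign_pua(["newsign", "abbrsign"], table)
for name, code_point in table.items():
    print("%s U+%04X" % (name, code_point))
```

Private use code points have no meaning outside an agreement between the parties exchanging the data, so the resulting table should be documented and distributed with the files.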
Handling SDATA entities that do not represent characters
Besides being used to represent characters, SDATA entities have seen a variety of other, less frequent usages. Due to the variety of uses, which range from recording specific information during data capture or data conversion, to highly technical application-specific information, no general solution can be outlined here. In many cases, however, the conversion to general XML internal entities (which osx will do for you if --xml-output-option=no-expand-internal is specified) should be good enough.
Migrating TEI DTD extensions to XML
General remarks
This section is for projects that have modified the TEI DTD and want to migrate these modifications from SGML to XML (i.e., want to use the XML-based P4 DTD with equivalent modifications). We begin with some general remarks, then describe a sample DTD modification that covers the most important issues, then outline a recommended migration procedure and demonstrate the key steps using the example.
If the elements or content models that the TEI provides do not quite meet the requirements of your project, there is an official escape route: you can modify the DTD in a number of well-defined ways and your documents will remain ‘TEI conformant.’ This involves creating two extension files, setting some parameter entities, possibly defining new elements or redefining existing ones, and making these modifications known to the parser in the DTD subset at the beginning of the document.
Although the process is a lot simpler than it looks at first glance, many people have taken unofficial escape routes, especially the users of the TEI Lite DTD, who would have been required to first switch to a full TEI DTD before applying local extensions. It is admittedly simpler to just open your local copy of teilite.dtd and change a few lines. Only later will you find out why the TEI Guidelines strongly discourage this, and one of those moments could be the migration of your customized DTD to XML.
- Redo your modifications the official way for the P4 DTD, using extension files: find out what is changed in your local copy of the TEI Lite DTD, and create proper extension files for the TEI P4 DTD to the same effect. You will find it is probably easier than you expected, and any future migrations will be easier as well. You'll find useful advice for this process in section 29 of the Guidelines and in the rest of this section. This is the recommended procedure.
- Redo your modifications as before: find out what you changed in the SGML TEI Lite DTD, and apply the same changes to your local copy of the XML TEI Lite DTD. We do not advocate this procedure, but it is, of course, a practical possibility.
- Take a step back: are those modifications really still needed? Were they designed to work around a bug of TEI P3 and are no longer required for P4? (A list of bugs corrected in the P3:1999 edition is available in appendix C3.2 of the Guidelines.) Were they intended for a feature that was never used?
- Deletion of elements
- Renaming of elements
- Extension of classes
- Modification of content models or attribute lists
- redefinition of attribute lists
- modification of existing content models
- definition and integration of new elements (i.e., hanging the new elements into the existing tree)
The following is a short list of some critical issues involved. In the following subsections, we will work through a fictitious example that covers most of these issues.
- Element and attribute name case is significant in XML.
- It is likely that some of the modifications in your existing P3 extension files involved copying (and then probably modifying) pieces of the TEI DTD files. You should check whether those DTD pieces have changed from P3 to P4.
- Some people have made modifications to work around problems in the TEI P3 DTD; if they are fixed in P4, the workaround could cause errors (a notorious example is <persName> ).
- The SGML DTD syntax for element declarations requires two characters of ‘-’ or ‘O’ that indicate whether start and end tags are required or can be omitted. These indicators don't exist anymore in XML DTDs and your private DTD snippets need to be modified.
- The content model for XML elements is more restricted than for SGML elements. We won't go into fine detail, but the following two points deserve attention:
- The only type of character data is PCDATA. You cannot define CDATA content to bypass the parser.
- The inclusion exception and exclusion exception syntax does not exist in XML DTDs. In SGML, you could declare an element to be legal everywhere within element <X> and its children in a single line by using the inclusion exception syntax. This is not possible in XML. Instead, you have to add that element to the content models of <X> and of its children individually.
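Several of the issues listed above can be located mechanically before any rewriting begins. The Python sketch below scans element declarations in an extension file and reports three SGML-only constructs: tag minimization indicators, CDATA or RCDATA declared content, and inclusion or exclusion exceptions. It works on the raw text with regular expressions, so it assumes reasonably tidy declarations without embedded comments; the function name and message strings are inventions of this example.

```python
import re

ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)>", re.DOTALL | re.IGNORECASE)
MINIMIZATION = re.compile(r"[-Oo]\s+[-Oo]\s*")   # e.g. "- -" or "- O"
DECLARED = re.compile(r"(CDATA|RCDATA)\b")        # declared content keywords
EXCEPTION = re.compile(r"\s[+-]\(")               # e.g. " +(toDo)" or " -(p)"


def sgml_only_constructs(dtd_text):
    """Return (element_name, problem) pairs for SGML-only DTD syntax."""
    findings = []
    for name, body in ELEMENT_DECL.findall(dtd_text):
        body = body.strip()
        minimization = MINIMIZATION.match(body)
        if minimization:
            findings.append((name, "tag minimization indicators"))
            body = body[minimization.end():]
        if DECLARED.match(body):
            findings.append((name, "CDATA or RCDATA declared content"))
        if EXCEPTION.search(body):
            findings.append((name, "inclusion or exclusion exception"))
    return findings
```

The report is only a to-do list: removing an inclusion exception, in particular, requires deciding which content models the element really belongs in.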
A tutorial example
In this section we will do some simple TEI DTD modifications in SGML. This will then serve as a tutorial example for the migration to XML. While working on this example, the main problems in converting DTD extensions should be covered. Not everyone will need to take all of the steps treated here, and some needs might not be covered, but this should be an easy, hands-on starting point for most projects. 4
- Personal names shall be marked with the <persName> tag (this requires TEI extensions for names and dates and a workaround for a bug in the P3 DTD).
- The <pb> element shall have an extra attribute, ‘imageUrl’, that contains a URL for an image of the page.
- There shall be a new element <ps> for the postscripts of letters, containing normal phrase level content.
- The elements <div1> and <div2> shall be renamed to <volume> and <letter> because our source material is a collection of letters organized that way, and we want to keep that structure and make it explicit.
- An element <toDo> shall be available everywhere in the text for editorial meta-comments on the ongoing encoding. Therefore, the content shall be CDATA to allow easy typing of element names and entities that are talked about in these notes. Content tags and entity references in CDATA are not recognized by the SGML parser.
This example shall be migrated to TEI P4 XML in the next two subsections.
Suggested migration procedure
Although the following step-by-step list may sound over-cautious, we recommend this approach, especially for those new to the process. You are switching from P3 to P4, from SGML to XML DTDs, from SGML to XML documents, and from SGML to XML parsers at the same time, and it can be difficult to find your way through the many potential pitfalls.
- Pick an interesting test document from your repository and make sure you can parse it as it is (in SGML form) against your current DTD setup (TEI P3 with your extensions), perhaps making up an example as we did above.
- Set up a parallel SGML parser environment to parse against the TEI P4 DTD. Before you try to parse your sample file, make sure the parser works with very simple standard TEI files.
- Now try to parse your sample document against P4 (in SGML mode). Theoretically, this shouldn't be a problem, but you might encounter errors. Create an intermediate version of your SGML extension files to fix any problems (the cause is most likely that your extensions work around a bug of the P3 DTD that was repaired in P4).
- Create new XML extension files based on the SGML ones. We will do this for our example in the next section. If you want to continue supporting SGML documents, consider making the extension files ‘dual-use’ (i.e., compatible with XML and SGML, as described below) so you only need to maintain one set. If you had to make changes for the previous step (moving from P3 to P4 SGML), you have to decide whether the SGML part of your dual-use setup will be compatible with P3 or P4. Normally, you should be fine using P4 SGML, so this is the assumption in the remaining text.
- If you have created dual-use extension files, use them to parse your SGML document against P4 in SGML mode. Don't proceed until they are correct.
- Make sure that your XML parser is set up properly by parsing a minimal TEI XML document without extensions.
- Convert the test document to XML. A short fictional sample document will make this step easier and will avoid any extra confusion from errors arising in a large-scale text migration.
- Try to parse your converted test XML document. Errors at this stage may be a result of the document conversion or the migrated DTD extensions. Fix them all.
- If you have dual-use extensions, go back and try the SGML parse again.
Migrating the example DTD
We will now focus on rewriting the DTD extensions in XML, with the sample DTD modifications described earlier as a base. We will be creating files called my_xml.ent and my_xml.dtd.
Before we start, however, a strategic decision has to be made: Shall we burn bridges and support only XML in the future? The P4 DTD provides mechanisms to parse both XML and SGML and we can do the same for our customized DTD, if we need to support SGML in parallel for the time being. It takes a little more thought and effort, but in return you get the comfort of a safe transition period. The parameter entity TEI.XML gives you an important tool for accomplishing this: it is defined as INCLUDE in XML mode and as IGNORE otherwise, so that you can segregate XML and SGML entity and element definitions. DTD extensions that use this technique are called ‘dual-use’ extensions in this document. Our example demonstrates both dual-use and pure XML extensions. 5
One obvious syntactic difference between SGML and XML DTDs is the ‘omitted tag minimization parameters’ that appear as ‘-’ and ‘O’ in SGML element declarations to indicate whether start and end tags need to be present or not. They are superfluous in XML, where minimization is not allowed, and therefore are not used. The TEI P4 DTD provides and uses parameter entities om.RO and om.RR as a substitute. For SGML parsing, these entities expand to - O and - - respectively, and for XML parsing they expand to nothing. Elements that require only a start tag (mostly empty elements) use om.RO, and elements that require start and end tags use om.RR (non-empty elements should be defined that way). We can make use of this mechanism for our dual-use DTD extensions. 6
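The substitution can be applied mechanically to an extension file. The Python sketch below follows the usage rule stated above (om.RO where only the start tag is required, om.RR where both tags are required) and either inserts the parameter entity references for a dual-use file or drops the indicators for a pure XML file. The function name is an assumption, and the sketch deliberately refuses declarations with an omissible start tag, since only the two entities named above are considered here.

```python
import re

# An element declaration up to and including its two minimization indicators.
DECL = re.compile(r"(<!ELEMENT\s+\S+\s+)([-Oo])\s+([-Oo])(\s+)")


def rewrite_minimization(dtd_text, dual_use=True):
    """Replace SGML tag minimization indicators in element declarations.

    dual_use=True  -> "- -" becomes %om.RR; and "- O" becomes %om.RO;
    dual_use=False -> the indicators are removed (pure XML DTD).
    """
    def replace(match):
        head, start, end, space = match.groups()
        if not dual_use:
            return head
        if start != "-":
            raise ValueError("omissible start tag: no om.* entity assumed here")
        entity = "%om.RR;" if end == "-" else "%om.RO;"
        return head + entity + space

    return DECL.sub(replace, dtd_text)


print(rewrite_minimization("<!ELEMENT ps - - (#PCDATA)>"))
print(rewrite_minimization("<!ELEMENT ps - - (#PCDATA)>", dual_use=False))
```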
So let's go to work:
Now let's move towards XML: if we first check for consistent case of the element and attribute names in our DTD, we discover that element <ps> was once written in uppercase and once in lowercase. We decide that lowercase shall be the correct spelling.
Some things are easy: the renaming of <div1> and <div2> to <volume> and <letter> , and the declaration of the <ps> element can remain untouched, except that the ‘- -’ in the definition of <ps> needs to be either removed or (if we aim at a dual-use DTD) replaced by ‘%om.RR;’.
The dual-use decision comes up again with the <pb> tag. In XML, we can just write an ATTLIST containing only ‘imageUrl’ and it will be merged with the existing ATTLIST in the TEI DTD files; there is no need to suppress and copy the definition of <pb> . For continuing support of SGML, we have to suppress and redefine the element as before. We find it in the TEI DTD files (teicore2.dtd), copy the P4 definition and modify it, adding our extra attribute.
The most difficult problem is the <toDo> tag. For one, the content model CDATA needs to become PCDATA. This means that existing documents will most probably break, but there is no choice. It might be a solution to turn all the <toDo> content into CDATA marked sections with an automated search and replace as part of the document conversion, or to escape the contained markup to entity references using a similar procedure.
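The second option, escaping the markup inside <toDo>, might look like the following Python sketch. It operates on the document text rather than on a parse, because the content in question is exactly what a parser must not interpret. It assumes that <toDo> elements are not nested and that the result will be parsed against a DTD in which <toDo> has PCDATA content; the function name is invented for this example.

```python
import re
from xml.sax.saxutils import escape

# Start tag, content, and end tag of each toDo element (any case, as the
# SGML source need not be consistent).
TODO = re.compile(r"(<toDo\b[^>]*>)(.*?)(</toDo\s*>)", re.DOTALL | re.IGNORECASE)


def escape_todo_content(text):
    """Escape &, < and > inside toDo elements; leave everything else alone."""
    return TODO.sub(
        lambda m: m.group(1) + escape(m.group(2)) + m.group(3),
        text)


sample = "<p>a &amp; b <toDo>check <persName> and &eacute;</toDo></p>"
print(escape_todo_content(sample))
```

Running the function twice would escape the content twice, so it should be applied exactly once, to the original files.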
Also, the simple way of allowing <toDo> everywhere is no longer possible. This could be a good occasion to check how that element is actually used in practice and where it is really needed. A compromise in our example could be to add it to the class ‘Incl’ that is part of every content model within <text> . Sometimes, a more complex redefinition of content models could be necessary. If that is your situation, you may want to consult the unofficial TEI document ED W 69, chapter 8 for in-depth coverage.
The first runs with the XML parser result in many warnings because of redefined parameter entities; this is normal. Some syntax correction is required where XML is stricter than SGML: we forgot a semicolon in a parameter entity reference, and ‘%paraContent’ must not be in parentheses, while the ‘#PCDATA’ for <toDo> has to be. When you don't know how to get rid of an error, it can be useful to browse the TEI DTD files and compare with your own usage.
The cross-check of the dual-use version with the SGML parser exposes a small additional problem: the document now uses the character entities &lt; and &gt;, which are predefined in XML but not in SGML; once discovered, this is easily fixed. In the example file, you will find a solution that looks a little complicated but works flawlessly with SGML and XML.