TEI: Migrate to the P5 Guidelines



Because the TEI is constantly developing to support more advanced encoding and more complex data, at intervals TEI projects may need to migrate their data and systems forward into a new version of the TEI Guidelines. The last migration process accompanied the release of P4 in 2002, when the TEI changed its underlying representation from SGML to XML. With the release of P5 in November 2007, the Guidelines changed again. Some of the most significant changes are architectural: in P5 the Guidelines themselves are stored and written using a different technology and the TEI schemas are expressed not only as DTDs but also in the RELAX NG schema language. Some of the changes affect the vocabulary and constraints of the TEI encoding language: adding new elements, improving content models, and in some cases adding entire chapters covering new material. More detail on the changes is available at the P5 page.

The information below is intended to answer some basic questions about migrating from P4 to P5. For more detailed information, please post questions to the TEI-L discussion list, or search the list archives.

When should you migrate to P5?

If your current P4 system is working and you are happy with it, there is no rush to migrate. The TEI Consortium currently plans to support P4 for another five years, until November 2012. Conversely, as soon as some change in your environment (for instance, a new version of your XML editor or XSLT processor) causes your P4 system to break, or if you are planning improvements or changes to your current encoding environment, it is a good idea to consider migrating to P5 as an alternative to fixing or changing the P4 system.

While P4 will have formal support for five more years, it is likely that a significant majority of TEI users will migrate long before then. Thus if you are looking for answers to questions, or to use someone else's stylesheets, or to hire programming assistance, it will probably be easier to find other users who are familiar with P5 than with P4.

What's involved in migration?

For a full-scale encoding operation there are several steps involved in migration. Some of these may not apply to smaller-scale or individual users.

  1. First, ascertain whether you want to migrate your current encoding system exactly as it stands, or whether you want to make changes. Migration is a good time to revisit your encoding system and assess whether it still fits your needs. You may want to take advantage of the new features of P5 to capture additional information, or there may be features that you no longer find it useful to encode.
  2. Next, you may need to migrate your DTD and any extensions. That is, you need a P5 schema that is equivalent to your current P4 DTD. For TEI Lite projects this will typically involve no more than an hour or two with Roma, constraining the possible values of the @type attribute. For projects that have developed their own TEI extensions, schema migration will involve creating a new ODD file (using Roma or a similar tool) based on your .ent and .dtd extension files. This process will probably take up to a few days of work, and will require some familiarity with Roma and with the TEI extension and customization mechanisms. While this process probably cannot be automated, it is likely that those who have already done this will be quite willing to advise and assist, and we anticipate that the TEI-L list will be a useful forum for discussion on this issue.
  3. Next, you will need to make changes to any work processes that are specific to your P4 markup and DTDs, and adapt them to work with P5 markup and schemas. Depending on your work flow, this may include changes to stylesheets, automated pre- or post-processing, documentation, conversion scripts, and other tools and systems. XML editing tools that work with DTDs may not work (or may not work in the same way) with schemas and this may necessitate some changes to your work flow. As above, this may be an opportunity to rethink your work flow and take advantage of new tools and systems.
  4. The last practical step in migration is to migrate your XML data, and the difficulty or ease of this process depends on the decisions you made at the start concerning the nature of your intended P5 encoding. If you are simply converting P4 markup into a P5 equivalent, the process is largely automatable and not difficult. If you are planning to add markup or to restructure the way you encode certain features, this may or may not be automatable, and will depend on the nature of the new markup you intend.
  5. Finally, we recommend that you share the results of your efforts, even if only informally, by posting a report on the process to TEI-L or on your own site. This knowledge may be of great use to others undertaking the same kind of migration.

Migrating your instances

The TEI Consortium expects that, for many projects, most steps involved in migrating your XML data files from P4 to P5 will be automatable. Many simple documents can be migrated by making the following changes:
  • Changing the root element from <TEI.2> to <TEI>
  • Adding the TEI namespace declaration to <TEI>
  • Changing <xref> to <ref>, and <xptr> to <ptr>, and changing the link attribute to point to URLs
  • Changing pointer values on existing <ptr> and <ref> elements from IDREFs to URLs; this usually means prefixing the existing value with a # character
  • Converting the encoding of <abbr>, <orig>, <sic>, and their counterparts (<expan>, <reg>, and <corr>) to use the P5 <choice> element
  • Changing @id and @lang to @xml:id and @xml:lang
  • Changing the @value attribute of <date> to @when
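As a rough illustration, most of the mechanical changes listed above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch under stated assumptions, not the TEI's own converter: the function name migrate_simple is hypothetical, the url attribute on <xref> and <xptr> is assumed to follow TEI Lite practice, and the <choice> conversion is left out. The default parser also drops comments and processing instructions and will not resolve entities declared in an external P4 DTD, so real documents may need more care.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize the TEI namespace as the default namespace (no prefix).
ET.register_namespace("", TEI_NS)

# P4 element names and their P5 replacements.
RENAMED = {"TEI.2": "TEI", "xref": "ref", "xptr": "ptr"}


def migrate_simple(p4_source: str) -> str:
    """Apply the mechanical P4-to-P5 changes to one document (hypothetical helper)."""
    root = ET.fromstring(p4_source)
    for elem in root.iter():
        p4_name = elem.tag
        p5_name = RENAMED.get(p4_name, p4_name)
        # Every element moves into the TEI namespace.
        elem.tag = f"{{{TEI_NS}}}{p5_name}"

        # @id and @lang become @xml:id and @xml:lang.
        # The @xml:lang value itself may still need conversion by hand.
        for old in ("id", "lang"):
            if old in elem.attrib:
                elem.set(f"{{{XML_NS}}}{old}", elem.attrib.pop(old))

        # <date value="..."> becomes <date when="...">.
        if p5_name == "date" and "value" in elem.attrib:
            elem.set("when", elem.attrib.pop("value"))

        # Existing <ptr>/<ref>: IDREF values become "#"-prefixed URLs.
        if p4_name in ("ptr", "ref") and "target" in elem.attrib:
            ids = elem.attrib["target"].split()
            elem.set("target", " ".join("#" + i for i in ids))

        # Former <xptr>/<xref>: assumed to carry the link in a @url attribute.
        if p4_name in ("xptr", "xref") and "url" in elem.attrib:
            elem.set("target", elem.attrib.pop("url"))

    return ET.tostring(root, encoding="unicode")
```

Calling migrate_simple on the text of a P4 file returns the converted document as a string, which can then be validated against your P5 schema.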
However, there are a few steps that will require human intervention, either because they cannot be automated, or because no one has yet written a program to do so. Some such areas are:
  • The values of @lang must be converted to BCP 47 format for the values of @xml:lang; this is not automatable in the global sense, but typically will be automatable at the project level, if needed
  • Normalized dates (e.g., values of @value of <date>) that are not already in W3C or ISO 8601 formats will need to be changed. In general this is automatable, but as there are so many possible formats, TEI has no plans at present to supply a general-purpose conversion tool. Project-specific tools will probably not be difficult to write.
  • Although a proof-of-concept Perl script for converting extended pointers into XPointers exists, it is non-trivial to install, and has not been tested thoroughly.
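To give a sense of what a project-level tool for the first two items might look like, here is a minimal Python sketch. Everything project-specific in it is an assumption made for illustration: the LANG_TO_BCP47 table, the list of date formats, and the function names are hypothetical, and a real project would replace them with the codes and formats that actually occur in its own files. Values the sketch does not recognize are returned as None so that they can be reviewed by hand.

```python
import re
from datetime import datetime

# Hypothetical project table: legacy @lang codes -> BCP 47 language tags.
LANG_TO_BCP47 = {"eng": "en", "fra": "fr", "deu": "de", "lat": "la"}

# Hypothetical list of date formats known to occur in one project's data.
# Month names are matched in English under Python's default locale.
KNOWN_DATE_FORMATS = ("%d %B %Y", "%B %d, %Y", "%d/%m/%Y")

# Shape check only (YYYY, YYYY-MM, YYYY-MM-DD); it does not verify the calendar.
ISO_SHAPE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def to_bcp47(p4_lang):
    """Return the BCP 47 tag for a legacy code, or None if it needs review."""
    return LANG_TO_BCP47.get(p4_lang.strip().lower())


def normalize_date(value):
    """Return an ISO 8601 date string, or None if the format is unknown."""
    value = value.strip()
    if ISO_SHAPE.fullmatch(value):
        return value  # already in the expected shape
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None
```

Returning None rather than guessing keeps the automated pass conservative: anything the tables do not cover is left for a human to decide.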

TEI has two efforts in place already to assist in instance migration. The first is an XSLT stylesheet, p4top5.xsl, written by Sebastian Rahtz, which covers the simpler aspects of conversion in a single transformation.

In addition, the TEI has established a space on the TEI wiki dedicated to migration tools contributed by the TEI community. It includes a collection of small XSLT stylesheets, each intended to transform a single difference between P4 and P5. We welcome contributions of stylesheets to address specific migration issues that projects encounter.
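A transformation that addresses a single difference can be quite small. The following Python sketch (Python is used here for illustration; the community tools described above are XSLT) handles only the <choice> conversion, assuming the P4 practice of recording the alternative reading in an attribute, as in <abbr expan="Mister">Mr.</abbr>. The function name wrap_in_choice is hypothetical, and the sketch does not handle the unlikely case where the root element itself is one of the six elements.

```python
import xml.etree.ElementTree as ET

# P4 element -> attribute holding the alternative, which is also the
# name of the P5 sibling element created inside <choice>.
PAIRS = {
    "abbr": "expan", "expan": "abbr",
    "sic": "corr", "corr": "sic",
    "orig": "reg", "reg": "orig",
}


def _split(tag):
    """Split '{namespace}local' into ('{namespace}', 'local')."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return "{" + ns + "}", local
    return "", tag


def wrap_in_choice(root):
    """Rewrite attribute-based alternatives as P5 <choice> groups, in place."""
    # Snapshot (parent, child) pairs first, because the tree is modified below.
    targets = []
    for parent in root.iter():
        for child in list(parent):
            if not isinstance(child.tag, str):
                continue  # skip comments and processing instructions
            local = _split(child.tag)[1]
            if local in PAIRS and PAIRS[local] in child.attrib:
                targets.append((parent, child))

    for parent, child in targets:
        ns, local = _split(child.tag)
        alt_name = PAIRS[local]
        choice = ET.Element(ns + "choice")
        # The text following the original element now follows <choice>.
        choice.tail, child.tail = child.tail, None
        alt = ET.Element(ns + alt_name)
        alt.text = child.attrib.pop(alt_name)
        parent[list(parent).index(child)] = choice
        choice.append(child)
        choice.append(alt)
    return root
```

Because the function works on local names, it can be run either before or after the elements have been moved into the TEI namespace.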

Migrating P4 DTDs

Migrating P4 DTDs and DTD extensions is a more complex process and depends greatly on the kinds of choices represented in the DTD. For P4 DTDs that are constructed simply by including specific tag sets (for instance, the tag set for Names and Dates, and the tag set for Simple Analytic Mechanisms) the process is quite simple. Using Roma, you can make a similar selection of modules, and generate a P5 schema that represents essentially the same set of elements as your original P4 DTD. There will be some differences, of course, because the P5 modules include changes and updates (some of which are described above). Once you have generated your new P5 schema and done an initial conversion on your document instances, you should validate all of your files and see whether there are discrepancies that need to be addressed through further changes to your XML, or by customizing the P5 schema.

If your P4 DTD included any customizations, the process is somewhat more involved. If the extensions consist simply of eliminating elements from the DTD, or specifying attribute values, you can make these changes through the Roma interface. It is particularly important that you develop value lists for any attributes that were expressed as CDATA in P4; these are now classed as the datatype data.enumerated and must have a restricted set of values. If your DTD extension included any new elements, you will need to write an element specification for these using the ODD language.
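One practical way to start on value lists is to survey the values your existing files actually use. The sketch below is a hedged illustration only: the function name attribute_values is hypothetical, and it assumes each document is passed in as a string of well-formed XML that the standard-library parser can read without an external DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def attribute_values(documents, attribute="type"):
    """Map each element name to the set of values found for `attribute`."""
    found = defaultdict(set)
    for source in documents:
        for elem in ET.fromstring(source).iter():
            value = elem.get(attribute)
            if value is not None:
                found[elem.tag].add(value)
    return found
```

Run over a whole collection, the result gives a candidate list of values for each element, which you can review, tidy, and then enter as a value list in Roma or in your ODD file.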