A progress report from the TEI Character Encoding working group

Overview

Character encoding is the basic transportation layer for all texts
It encodes abstract characters, no other information (ideally:-)
In XML documents, the only choice for character encoding is Unicode
Some characters in Unicode are control characters or otherwise interfere with markup
More of this is discussed in: Martin Dürst and Asmus Freytag, Unicode in XML and other Markup Languages at: http://www.unicode.org/unicode/reports/tr20/
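
As an illustration of the point about control characters, here is a minimal Python sketch that tests text against the `Char` production of XML 1.0, which excludes most C0 control characters. The function names are made up for this example; the code point ranges are those of XML 1.0, and the sketch does not cover the further recommendations made in TR20.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_xml_chars(text: str) -> list[tuple[int, str]]:
    """List (offset, 'U+XXXX') for every character XML 1.0 does not allow."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ch)
    ]


# A vertical tab and a NUL character cannot appear in an XML 1.0 document.
print(illegal_xml_chars("line one\x0bline two\x00"))
```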

Unicode is a universal character set, jointly developed by The Unicode Consortium and ISO/IEC JTC 1/SC 2/WG 2
As of Unicode 3.2 (March 2002) more than 94000 characters are encoded.
Characters are identified by their names (except Chinese, Japanese, Korean characters)
XML can use a subset, but not something completely different.
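
The remark about character names can be tried out with Python's standard `unicodedata` module. This is only a sketch: the helper name `describe` is an assumption of this example, and the output depends on the Unicode version bundled with the Python interpreter. For CJK ideographs the module reports a name that is derived from the code point rather than a descriptive one.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in ("A", "\u00fc", "\u4e2d"):
    print(describe(ch))
```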

Since P3, the TEI Guidelines provide a mechanism to declare
- The language of a document or a part thereof
- The script used to write that language
- The encoding used to serialize that script into files
- The characters used beyond those provided by that encoding
All these functions are bundled together in the Writing System Declaration (WSD).

Language, script and encoding are lumped together and cannot be declared separately
The WSD mechanism is cumbersome and little used
A large part of the WSD has become obsolete with Unicode as the base character set
The extension mechanism relied partly on a glyph registry, which is now defunct

The TEI Guidelines are on track for their first major revision (P5)
No definite schedule has been set
Among other things, this will likely include a schema-based version of the constraints on document structure

The WG has been in operation since July 2001
There have been two face-to-face meetings (and a lot of email)
The WG plans to unbundle the language/script/encoding declaration of the WSD
Information about the work, along with current draft documents, is at http://www.tei-c.org/Activities/CE
Currently, there are three proposed modules of WSD-NG:
- A module to provide a syntax for defining new characters
- A module to define properties for characters
- A module for the linguistic description of writing systems

The first module, a syntax for defining new characters, was the item most hotly debated at the second meeting in Tuebingen
The following suggestions have been discussed:
1. Use Private Use Area (PUA) characters from Unicode (and escape/document them for interchange)
2. Use markup constructs (e.g. elements)
Implementation of both of the above changes largely depending on whether or not entity references are available to
The problem with using markup constructs is that they cannot be used in attribute values
Since this limitation affects not just characters but all language properties, the use of attributes in the TEI Guidelines might need some reconsideration
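
The two suggestions can be contrasted in a short Python sketch. It is only an illustration, not a WG proposal: the element name `g`, its `ref` attribute and the function names are hypothetical. Private use characters are recognised by their Unicode general category `Co`.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True for characters in a Unicode Private Use Area (category Co)."""
    return unicodedata.category(ch) == "Co"


def escape_pua(text: str) -> str:
    """Option 1: keep PUA characters, escaped as numeric character references."""
    return "".join(
        f"&#x{ord(ch):X};" if is_private_use(ch) else ch for ch in text
    )


def pua_to_element(text: str, tag: str = "g") -> str:
    """Option 2: replace each PUA character by an empty element.

    The element name and the 'ref' attribute are hypothetical.
    """
    return "".join(
        f'<{tag} ref="U+{ord(ch):04X}"/>' if is_private_use(ch) else ch
        for ch in text
    )


sample = "ab\ue000c"
print(escape_pua(sample))
print(pua_to_element(sample))
```

The output of `escape_pua` is still legal inside an attribute value, whereas the output of `pua_to_element` is not, which is the limitation noted above.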

Unicode defines a set of normative properties for its characters:
- Case
- Combining Classes
- Conjoining Jamo (1100-11FF)
- Decomposition (Canonical and Compatibility)
- Directionality
- Jamo Short Name
- Numeric Value
- Private Use
- Special Character Properties
- Surrogate
- Mirrored
- Unicode Character Names
In addition, there are some informative properties
Text encoders may wish to fine tune these properties
This WSD module should make it possible to associate new properties with characters or to override their existing properties
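
A rough sketch of what overriding properties could look like, using Python's standard `unicodedata` module for the Unicode values. The override table, its contents and the function names are hypothetical; no actual WSD syntax is implied.

```python
import unicodedata


def unicode_properties(ch: str) -> dict:
    """Collect some properties of ch from the Unicode Character Database."""
    return {
        "name": unicodedata.name(ch, ""),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "decomposition": unicodedata.decomposition(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
        "numeric_value": unicodedata.numeric(ch, None),
    }


# Hypothetical local declarations made by a text encoder:
# a PUA character is given a name and treated as a lowercase letter.
LOCAL_OVERRIDES = {
    "\ue000": {"name": "EXAMPLE LOCAL LETTER", "general_category": "Ll"},
}


def effective_properties(ch: str, overrides: dict = LOCAL_OVERRIDES) -> dict:
    """Unicode properties of ch, with local declarations taking precedence."""
    props = unicode_properties(ch)
    props.update(overrides.get(ch, {}))
    return props


print(unicode_properties("\ue000")["general_category"])
print(effective_properties("\ue000")["general_category"])
```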

Eric Albright's Design of an electronic method for describing writing systems serves as a good starting point for this module
Work has begun (see CEW05) to enumerate the features needed.
A lot more needs to be done here.
One of the issues here is that text encoders frequently have to deal with two instances of a given writing system:
- The writing system as it was when the text was written
- The modern version of the writing system
A frequent requirement for digital texts is to be able to use (at least) either of these for rendering.
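
The rendering requirement can be illustrated with a small Python sketch. The mapping table is a made-up example (long s, and u with a combining small e above) and the function name `render` is an assumption; a real writing system description would need far more than a substitution table.

```python
# Hypothetical mapping from historical forms to their modern counterparts.
MODERN_FORMS = {
    "\u017f": "s",        # LATIN SMALL LETTER LONG S -> s
    "u\u0364": "\u00fc",  # u + COMBINING LATIN SMALL LETTER E -> u with diaeresis
}


def render(text: str, mode: str = "original") -> str:
    """Return text as transcribed ('original') or in modern spelling ('modern')."""
    if mode == "original":
        return text
    if mode == "modern":
        for old, new in MODERN_FORMS.items():
            text = text.replace(old, new)
        return text
    raise ValueError(f"unknown rendering mode: {mode!r}")


transcription = "Wa\u017f\u017fer fu\u0364r alle"
print(render(transcription))
print(render(transcription, "modern"))
```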

We had a lot of discussion, mostly centering on the first module, ‘character representation’
A lot of work, especially in the other modules, still needs to be done.
It would help the further work of the WG if some of the architectural issues still open for P5 could be discussed and decided.

Last recorded change to this page: 2007-09-16