A progress report from the TEI Character Encoding working group
Christian Wittern
Contents
- Overview
- Markup and Character Encoding
- Unicode / ISO 10646
- Writing System Declaration (WSD)
- Problems with P3/P4 WSD
- TEI P5
- Towards a new WSD mechanism (WSD-NG)
- Defining new characters
- Character properties
- Linguistic description of writing systems
- Conclusions
Overview
- Markup and Character Encoding
- Extension mechanism in the current TEI Guidelines
- Proposed new extension mechanism
Markup and Character Encoding
- Character encoding is the basic transportation layer for all texts
- It encodes abstract characters and no other information (ideally :-)
- In XML documents, the only choice for character encoding is Unicode
- Some characters in Unicode are control characters or otherwise interfere with markup
- This is discussed further in: Martin Dürst and Asmus Freytag, Unicode in XML and other Markup Languages, at: http://www.unicode.org/unicode/reports/tr20/
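As a minimal sketch of the interference mentioned above: XML 1.0 does not allow every Unicode code point in a document, so a text can be checked against the `Char` production of XML 1.0 before it is marked up. The function names and the sample string are assumptions made for this illustration.

```python
from typing import List, Tuple


def is_xml10_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD       # includes the BMP Private Use Area
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_forbidden(text: str) -> List[Tuple[int, str]]:
    """Return (offset, 'U+XXXX') for each character XML 1.0 does not allow."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ch)
    ]


# Hypothetical sample: a form feed and a NUL hidden in plain text.
sample = "line one\x0cline two\x00"
print(find_forbidden(sample))  # → [(8, 'U+000C'), (17, 'U+0000')]
```

Characters that pass this check may still be unsuitable alongside markup; the report cited above discusses those cases.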
Unicode / ISO 10646
- A universal character set, jointly developed by The Unicode Consortium and ISO/IEC JTC 1/SC 2/WG 2
- As of Unicode 3.2 (March 2002) more than 94000 characters are encoded.
- Characters are identified by their names (except Chinese, Japanese, Korean characters)
- XML can use a subset, but not something completely different.
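The point about character names can be illustrated with Python's standard `unicodedata` module. This is only a sketch: the sample characters are chosen for illustration, and the data reflects whichever Unicode version ships with the Python interpreter, not Unicode 3.2.

```python
import unicodedata

# Ordinary characters carry an individually assigned name.
print(unicodedata.name("A"))       # LATIN CAPITAL LETTER A
print(unicodedata.name("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE

# A CJK unified ideograph has no individually assigned name;
# its name is derived from the code point.
print(unicodedata.name("\u4e00"))  # CJK UNIFIED IDEOGRAPH-4E00

# Names can also be used to look a character up.
print(unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE") == "\u00e9")  # True
```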
Writing System Declaration (WSD)
- Since P3, the TEI Guidelines have provided a mechanism to declare:
- The language of a document or a part thereof
- The script used to write that language
- The encoding used to serialize that script into files
- Any characters used beyond those provided by that encoding
- All these functions are bundled together in the WSD.
Problems with P3/P4 WSD
- Language/Script/Encoding are lumped together and cannot be declared separately
- The WSD mechanism is cumbersome and little used
- A large part of the WSD has become obsolete with Unicode as the base character set
- The extension mechanism relied partly on a glyph registry, which is now defunct
TEI P5
- The TEI Guidelines are on track for their first major revision (P5)
- No definite schedule has been set
- Among other things, this will likely include a schema-based version of the constraints on document structure
Towards a new WSD mechanism (‘WSD-NG’)
- The WG has been in operation since July 2001
- There have been two face-to-face meetings (and a lot of email)
- The WG plans to unbundle the language/script/encoding declaration of the WSD
- Information about the work and current draft documents is available at http://www.tei-c.org/Activities/CE
- Currently, there are three proposed modules of WSD-NG:
- A module to provide a syntax for defining new characters
- A module to define properties for characters
- A module for the linguistic description of writing systems
Defining new characters
- This was the item most hotly debated at the second meeting in Tuebingen
- The following suggestions have been discussed:
- Use Private Use Area (PUA) characters from Unicode (and escape/document them for interchange)
- Use markup constructs (e.g. elements)
- The implementation of both of the above changes considerably depending on whether or not entity references are available to
- The problem with using markup constructs is that these cannot be used in attribute values
- Since this limitation affects not just characters but all language properties expressed through markup, the use of attributes in the TEI Guidelines might need some reconsideration
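The difference between the two suggestions can be sketched with Python's standard XML parser. The PUA code point U+E000, the element name `g` and its `ref` attribute are assumptions made for this illustration; they are not the syntax agreed by the WG.

```python
import xml.etree.ElementTree as ET

PUA_CHAR = "\ue000"  # hypothetical private-use assignment for a missing character

# Suggestion 1: a PUA character is an ordinary character, so it can appear
# in element content and in attribute values alike.
w = ET.fromstring(f'<w lemma="{PUA_CHAR}">{PUA_CHAR}</w>')
print(w.get("lemma") == w.text == PUA_CHAR)  # True

# Suggestion 2: a markup construct (hypothetical <g ref="..."/>) works in content.
w2 = ET.fromstring('<w>ab<g ref="#char1"/>cd</w>')
print(w2.find("g").get("ref"))  # #char1


def is_well_formed(xml_text: str) -> bool:
    """True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The same markup construct inside an attribute value is not well-formed XML.
print(is_well_formed("<w lemma=\"<g ref='#char1'/>\"/>"))  # False
```

The PUA route keeps attribute values usable, at the price of documenting the private assignment separately for interchange.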
Character properties
- Unicode defines a set of normative properties for its characters:
- Case
- Combining Classes
- Conjoining Jamo (U+1100–U+11FF)
- Decomposition (Canonical and Compatibility)
- Directionality
- Jamo Short Name
- Numeric Value
- Private Use
- Special Character Properties
- Surrogate
- Mirrored
- Unicode Character Names
- In addition, there are some informative properties
- Text encoders may wish to fine-tune these properties
- This WSD module should make it possible to associate new properties with characters or to override existing ones
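A minimal sketch of what overriding properties could look like, using Python's `unicodedata` module as the source of the standard values. The override table, its keys and the sample PUA character are assumptions for illustration only; the WSD-NG syntax for this is not yet defined.

```python
import unicodedata


def unicode_properties(ch: str) -> dict:
    """Collect some of the standard Unicode properties of one character."""
    return {
        "name": unicodedata.name(ch, ""),           # "" if the character is unnamed
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "mirrored": bool(unicodedata.mirrored(ch)),
        "decomposition": unicodedata.decomposition(ch),
        "numeric_value": unicodedata.numeric(ch, None),
    }


# Hypothetical local declarations made by a text encoder.
LOCAL_OVERRIDES = {
    "\ue000": {"name": "EXAMPLE LOCAL CHARACTER", "general_category": "Lo"},
}


def effective_properties(ch: str) -> dict:
    """Standard properties, with any local declarations taking precedence."""
    props = unicode_properties(ch)
    props.update(LOCAL_OVERRIDES.get(ch, {}))
    return props


print(unicode_properties("\u00e9")["decomposition"])         # 0065 0301
print(unicode_properties("\ue000")["general_category"])      # Co
print(effective_properties("\ue000")["general_category"])    # Lo
```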
Linguistic description of writing systems
- Eric Albright's Design of an electronic method for describing writing systems serves as a good starting point for this module
- Work has begun (see CEW05) to enumerate the features needed.
- A lot more needs to be done here.
- One of the issues here is that text encoders frequently have to deal with two instances of a given writing system:
- The writing system as it was when the text was written
- The modern version of the writing system
- A frequent requirement for digital texts is to be able to use (at least) either of these for rendering.
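The rendering requirement can be sketched as a choice between two views of the same encoded text. The mapping table below (long s and r rotunda to their modern letters) and the function name are assumptions for illustration; a real writing system description would need far richer rules than a one-to-one character mapping.

```python
# Hypothetical mapping from historical letterforms to their modern equivalents.
HISTORICAL_TO_MODERN = str.maketrans({
    "\u017f": "s",  # LATIN SMALL LETTER LONG S
    "\ua75b": "r",  # LATIN SMALL LETTER R ROTUNDA
})


def render(text: str, system: str = "historical") -> str:
    """Return the text in the requested instance of the writing system."""
    if system == "historical":
        return text  # the text is stored as it was written
    if system == "modern":
        return text.translate(HISTORICAL_TO_MODERN)
    raise ValueError(f"unknown writing system instance: {system!r}")


word = "Wi\u017f\u017fen\u017fchaft"
print(render(word, "historical"))  # Wiſſenſchaft
print(render(word, "modern"))      # Wissenschaft
```

Storing the historical form and deriving the modern one keeps the encoding faithful to the source while still serving readers who expect modern spelling.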
Conclusions
- We had a lot of discussion, mostly centering on the first module, ‘character representation’
- A lot of work, especially in the other modules, still needs to be done.
- It would help the further work of the WG if some of the architectural issues still open for P5 could be discussed and decided.
Last recorded change to this page:
2007-09-16