CE W 03: Use cases for the Character Encoding Extensions
Contents
- Overview
- Chinese: Adding characters to the document character set (CBETA)
- Old Norse: Providing a common set of additional glyphs and ligatures (Menota)
- [not language specific] Precombined alternatives for characters expressible only as combined forms
Overview
In order to get a handle on the requirements and extent of a mechanism for extending the character encoding of a TEI document, this document collects use cases. Some of these cases are taken from actual existing projects and practice; others are constructed by the editors and contributors of this document.
In the following listing, each heading states which language is to be extended, what purpose the extension serves, and which project, if any, is actually using it. A prose description of the use case follows. The concluding paragraph attempts an evaluation, including any problems the approach might pose for text encoders.
Chinese: Adding characters to the document character set (CBETA)
The Chinese Buddhist Electronic Text Association (CBETA) is compiling an electronic version of the Chinese Buddhist Canon. This is a large-scale undertaking; so far, digitized versions of 56 volumes of printed text, or about 80 million characters, have been produced. The project uses an adapted version of the TEI DTD and makes the master XML files, as well as other derived formats, available on the Web and on CD-ROM. More information (for readers of Chinese) is available at www.cbeta.org. Characters not available in the encoding are represented by entity references; the project maintains a character database that records, for each such character:
- mappings to other, larger encoding schemes (Unicode and the Mojikyo Font Institute);
- information about "normalized" versions of the character;
- a glyph expression that describes the character using other characters and a small set of operators;
- radical, stroke count, four-corner number and other properties;
- dictionary references;
- readings of the character, if known.
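The per-character information listed above might be sketched as a record like the following. All field names and values are hypothetical illustrations of the kinds of information involved, not CBETA's actual schema:

```python
# Hypothetical sketch of one record in such a character database.
char_record = {
    "entity": "CB00123",              # project-internal entity name (hypothetical)
    "unicode": "U+6D17",              # mapping to a larger encoding (Unicode)
    "mojikyo": "M012345",             # mapping to a Mojikyo Font Institute number
    "normalized": "U+6C34",           # a "normalized" version of the character
    "glyph_expression": "⿰氵先",      # composition from components and operators
    "radical": 85,                    # Kangxi radical (water)
    "strokes": 9,                     # total stroke count
    "four_corner": "3411.2",          # four-corner lookup number (hypothetical)
    "dict_refs": ["Hanyu Da Cidian"], # dictionary references
    "readings": {"pinyin": "xǐ"},     # readings, if known
}
```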
Each of the texts includes an entity replacement table that gives expansions of the entity references. In the early stages of the project, the database was used to generate these tables directly before the text was parsed. Depending on the purpose of the parsing (for example, the target format), different expansions were used.
The conversion scripts then access the information they need from the database and use it directly.
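The target-dependent expansion described above can be sketched as follows; the entity name, the expansion tables, and the target labels are all hypothetical illustrations:

```python
import re

# Hypothetical expansion tables: the same entity reference expands
# differently depending on the target format.
expansions = {
    "unicode": {"CB00123": "\u6D17"},        # expand to a Unicode character
    "normalized": {"CB00123": "\u6C34"},     # expand to a normalized form
    "fallback": {"CB00123": "[water+xian]"}, # descriptive fallback for plain text
}

def expand_entities(text, target):
    """Expand project entity references for a given target format;
    unknown entities are left untouched."""
    table = expansions[target]
    return re.sub(r"&(CB\d+);",
                  lambda m: table.get(m.group(1), m.group(0)),
                  text)

print(expand_entities("the character &CB00123; occurs here", "fallback"))
# → the character [water+xian] occurs here
```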
In this case, much information is undocumented (or not presented in a coherent, machine-readable form), and a great deal is implicit in the business logic of the various scripts used. This is the kind of information that would need to move to some future WSD or similar extension mechanism to be usable according to a coherent model.
Old Norse: Providing a common set of additional glyphs and ligatures (Menota)
The Menota project has not only produced some excellent Guidelines for the encoding of Old Norse (see http://www.hit.uib.no/menota/guidelines/), but also gave birth to the ‘Medieval Unicode Font Initiative’ (MUFI), which tries to enumerate and systematize the encoding units needed for the transcription of Old Norse medieval manuscripts. The MUFI proposal allocates code points in the Unicode Private Use Area as follows:
No. | Name of range | Inventory | Allocated span
----|---------------|-----------|---------------
1 | Mixed script characters | 19 | E000–E0FF
2 | Precomposed diacritical characters | 183 | E100–E1FF
3 | Small capitals | 19 | E200–E2FF
4 | Enlarged minuscules | 28 | E300–E3FF
5 | Ligatures | 15 | E400–E4FF
6 | Punctuation marks | 4 | E500–E5FF
7 | Base line abbreviation marks | 15 | E600–E6FF
8 | Combining abbreviation marks | 11 | E700–E7FF
9 | Precomposed abbreviated characters | 8 | E800–E8FF
10 | Superscript (interlinear) characters | 22 | E900–E9FF
11 | Metrical symbols | 12 | EA00–EAFF
12 | Critical and epigraphical signs | 4 | EB00–EBFF
  | Total number of characters included in this proposal | 340 |
A closer look at the tables reveals that many of the encoded units can be represented in Unicode with ligature joiners, combining diacritics and the like. Others, such as small capitals, enlarged minuscules and superscript characters, combine a character with a specific rendering requirement; it seems to me that these features might be better represented with markup.
This project defines a large number of PUA code points, many of which could be regarded as syntactic sugar for features already expressible with Unicode combining characters and markup. To accommodate this kind of usage (which might very well be considered a user requirement), the WSD-NG will have to record the mapping between these shortcuts and the standard Unicode representation. Furthermore, the issue of how to decide whether to use specialized characters or markup should be raised.
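The mapping between such shortcut code points and their standard Unicode representations can be sketched as a simple substitution table. The PUA assignments below are hypothetical illustrations, not actual MUFI code points:

```python
# Hypothetical project-defined PUA code points mapped to standard
# Unicode sequences (ligature joiner, combining diacritics).
pua_to_standard = {
    "\uE401": "a\u200Da",        # "aa" ligature: a + ZERO WIDTH JOINER + a
    "\uE101": "e\u0323\u0301",   # e + COMBINING DOT BELOW + COMBINING ACUTE
}

def to_standard(text):
    """Replace project PUA code points by their standard Unicode
    equivalents for interchange; other characters pass through."""
    return "".join(pua_to_standard.get(ch, ch) for ch in text)
```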
[not language specific] Precombined alternatives for characters expressible only as combined forms
As a generalization of the Menota case, it could be said that most display engines and operating systems still assume a 1:1 mapping between code point and glyph in a font. While Unicode considers a base character together with combining characters a sufficient definition, in practice it is often required, or at least desirable, to have a precomposed form of a character. The TEI WSD-NG should therefore also be able to provide the mapping between the canonical form of the Unicode Standard and a project-specific precombined form (although in such cases the canonical Unicode form should be used for interchange).
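The relation between combined and precomposed forms can be illustrated with Unicode normalization: NFC composes a base character plus combining diacritic into a precomposed code point where the standard defines one, while a project-specific precombined PUA form has no such standard composition and must be mapped explicitly. A minimal sketch (the PUA assignment at the end is hypothetical):

```python
import unicodedata

# Where Unicode defines a precomposed character, normalization maps
# between the combined and precomposed forms in both directions.
combined = "e\u0301"                               # e + COMBINING ACUTE ACCENT
precomposed = unicodedata.normalize("NFC", combined)
assert precomposed == "\u00E9"                     # LATIN SMALL LETTER E WITH ACUTE
assert unicodedata.normalize("NFD", precomposed) == combined

# A project-specific precombined form in the PUA has no standard
# composition, so the mapping to the canonical sequence must be
# recorded explicitly (hypothetical PUA assignment):
project_precombined = {"\uE100": "e\u0323\u0301"}  # e + dot below + acute
```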