TEI META Task Force: Status Report [MEW01]
Contents
- Choice of schema language for the TEI
- Skeleton work plan for redesigning ODD
- Work plan beyond ODD: towards P5
- Work on ODD markup
- Datatyping in attributes
- Character encoding in attributes
- Namespaces and fragment inclusion
- A replacement for the Pizza chef
- Appendix A: Tables
Choice of schema language for the TEI
- XML DTD language as at present
- W3C Schema language
- OASIS Relax NG language
- A new notation of the TEI's devising
- Uses XML syntax, enabling easy validation and analysis
- Is very readable, and fairly easy to relate to DTD
- Is well-implemented by different processors, and so immediately useable
- Uses W3C schema datatyping
- Seems likely to be included in the forthcoming ISO DSDL
- Can be converted to W3C schema if needed
Sebastian Rahtz presented a paper at XML Europe 2002 on converting the TEI XML content models to RelaxNG. This work, slightly refined, is the basis for an experimental version of TEI P5. A set of derived sample TEI schemas is available for immediate use.
Skeleton work plan for redesigning ODD
- Clear up the details of <tagDoc>
- Revise (part 1) the TSD tagdoc and make it a standard ‘topping’
- Convert (part 1) the Guidelines to conform to that schema (*)
- Convert the <elemDecl> contents to RelaxNG schema (*)
- Convert attributes (where automatically possible) to use new datatyping scheme (*)
- Add new <entDoc> elements defining the datatypes (*)
- Examine and rework the <string> and <entDoc> elements to remove remaining SGML/XML material
- Rewrite and test the scripts which:
  - generate schemas (*)
  - generate DTDs
  - generate HTML version of the Guidelines
  - generate PDF version of the Guidelines
- Clear up the details of higher-level class system
- Revise (part 2) the TSD tagdoc
- Convert (part 2) the Guidelines to conform to that schema
Work plan beyond ODD: towards P5
- Make corrections of known errors
- Assess all the attribute datatypes and decide whether:
  - A new datatype should be created (when more than 2 or 3 attributes have the same pattern)
  - An attribute which is now simple text should be reconsidered as a tokenized attribute
  - Extra facets should be added to further refine datatypes
- Assess elements to see whether those with plain text bodies can be datatyped
- Consider all element content models to decide whether they are too restrictive or too loose; consider whether some of the simplifying facilities available in RelaxNG (e.g. <interleave>) should be used.
Work on ODD markup
<attList>
- <attDef> has a boolean attribute ‘required’
- the <default> element should only be used to hold genuine default strings or tokens. It will be optional. Some notation will be needed to encompass ‘%INHERITED;’
- <datatype> has a mandatory ‘target’ attribute, which points to an <entDoc> defining the datatype. This gives us an extra abstract layer over XML schema datatypes. Most token choice attributes would be boiled down to genuine datatypes, so all of ‘Y|N’, ‘yes | no’ and ‘true|false’ would be <datatype target="BOOLEAN">. In the <entDoc>, we expound on this and map to the relevant W3C Schema datatype (see section Datatyping in attributes). Where the choice is limited, e.g. ‘A | B’, it is recorded as a set of enumerated values, defined in the body of the <datatype>:
<datatype target="TOKEN">
  <rng:choice>
    <value>A</value>
    <value>B</value>
  </rng:choice>
</datatype>
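The boiling-down of token lists described above can be sketched in a few lines of Python. This is only an illustration, not the task force's conversion script: the function name `classify_token_list` and the lookup sets are assumptions, taken from the examples in the text and the mappings listed in Appendix A.

```python
# Illustrative sketch only. The function name and the lookup sets are
# assumptions based on the examples in the text and on Appendix A.

BOOLEAN_SETS = [{"y", "n"}, {"yes", "no"}, {"true", "false"}]
UBOOLEAN_SETS = [{"y", "n", "u"}, {"y", "n", "unspecified"}]
SEX_SETS = [{"m", "f", "u"}, {"m", "f", "u", "x"}]


def classify_token_list(declared):
    """Return (datatype name, enumerated values) for a DTD token list."""
    # Split 'A | B' into its tokens, ignoring surrounding white space.
    values = [v.strip() for v in declared.split("|") if v.strip()]
    # Compare case-insensitively, so 'Y | N' and 'y | n' are treated alike.
    folded = {v.lower() for v in values}
    if folded in BOOLEAN_SETS:
        return "BOOLEAN", []
    if folded in UBOOLEAN_SETS:
        return "UBOOLEAN", []
    if folded in SEX_SETS:
        return "SEX", []
    # Anything else stays a TOKEN, with its values enumerated in the body.
    return "TOKEN", values


for declared in ("Y|N", "yes | no", "true|false", "A | B"):
    print(declared, "->", classify_token_list(declared))
```

The case folding is a design choice of this sketch; a real conversion would need to record the original spelling so that existing documents can be migrated.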
Datatyping in attributes
The task force is asked to use W3C Schema datatypes in the TEI as much as possible.
- Standard XML datatypes (ID, IDREFS, NMTOKENS, etc)
- Abstract datatypes linked to entities in the Guidelines (there are only 2 or 3 of these)
- Text with no conditions
- Text, but with a fixed set of possibilities
- attributes where the range of possibilities fits a W3C datatype, or it makes sense to at least have a common set of values across the TEI
- attributes which really should have token values
Name | Relax NG representation |
ANYURI | <rng:data type="anyURI"/> |
BOOLEAN | <rng:data type="boolean"/> |
DATE | <rng:data type="date"/> |
DATETIME | <rng:data type="dateTime"/> |
DURATION | <rng:data type="duration"/> |
ENTITIES | <rng:data type="ENTITIES"/> |
ENTITY | <rng:data type="ENTITY"/> |
EXTPTR | <rng:text/> |
FLOAT | <rng:data type="float"/> |
FORMULA | <rng:text/> |
ID | <rng:data type="ID"/> |
IDREF | <rng:data type="IDREF"/> |
IDREFS | <rng:data type="IDREFS"/> |
LANGUAGE | <rng:text/> |
NAME | <rng:data type="NCName"/> |
NMTOKEN | <rng:data type="NMTOKEN"/> |
NMTOKENS | <rng:data type="NMTOKENS"/> |
SEX | <rng:choice> <value>m</value> <value>f</value> <value>u</value> <value>x</value> </rng:choice> |
TEXT | <rng:text/> |
TIME | <rng:data type="time"/> |
TOKEN | <rng:empty/> |
UBOOLEAN | <rng:choice> <value>true</value> <value>false</value> <value>unknown</value> <value>unspecified</value> </rng:choice> |
Table 2 lists some current datatype values and how they map to the new scheme. Table 3 shows 180 attributes which can automatically be given non-text, non-token datatypes.
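As a rough illustration of how the abstract datatype names could be turned into Relax NG attribute patterns, the following Python sketch builds an `<rng:attribute>` element with the standard library. The function name `attribute_pattern`, the choice to fall back to `<rng:text/>` for names outside the lookup, and the use of the `datatypeLibrary` attribute on each `<rng:data>` are assumptions of this sketch; only a subset of the table is included.

```python
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"
XSD_DATATYPES = "http://www.w3.org/2001/XMLSchema-datatypes"
ET.register_namespace("rng", RNG_NS)

# Subset of the table above: TEI datatype name -> W3C Schema datatype name.
XSD_TYPE = {
    "ANYURI": "anyURI",
    "BOOLEAN": "boolean",
    "DATE": "date",
    "DATETIME": "dateTime",
    "FLOAT": "float",
    "IDREF": "IDREF",
    "IDREFS": "IDREFS",
}


def attribute_pattern(att_name, tei_type):
    """Build an <rng:attribute> pattern for one attribute (hypothetical helper)."""
    attribute = ET.Element(f"{{{RNG_NS}}}attribute", {"name": att_name})
    if tei_type in XSD_TYPE:
        ET.SubElement(
            attribute,
            f"{{{RNG_NS}}}data",
            {"type": XSD_TYPE[tei_type], "datatypeLibrary": XSD_DATATYPES},
        )
    else:
        # Assumption: names not in the lookup are treated as plain text.
        ET.SubElement(attribute, f"{{{RNG_NS}}}text")
    return attribute


# 'anchored' on <note> is listed with datatype BOOLEAN in Appendix A.
print(ET.tostring(attribute_pattern("anchored", "BOOLEAN"), encoding="unicode"))
```

Keeping the lookup in one table mirrors the idea of the <entDoc> layer: the mapping to a W3C Schema datatype is stated once and reused by every attribute that names it.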
Character encoding in attributes
- Record which attributes have the extended property of being representable as elements
- When making normal DTDs, only support the ‘traditional’ scheme of attributes
- Allow for special DTDs (from son-of-pizzachef) which support only the element alternative
- When making schemas, support both attribute and element forms
There are over 300 attributes which currently have a text datatype; this includes a good many elements which have a type attribute. The TEI editors will have to decide which of these should be classified as ‘true text’ (see EDW79).
Namespaces and fragment inclusion
- Using fragments of another markup language in TEI XML
- Using fragments of TEI in another markup language
- All existing TEI documents would be invalid, as they would be in an empty namespace. It would be a fairly small fix for each instance to add a namespace declaration to the root element, but that would make it fail with existing DTDs.
- All existing XML processing tools would fail to work with new documents; for instance, XSLT stylesheets which process a current (empty namespace) <TEI.2> would fail to identify the new <TEI.2 xmlns="http://www.tei-c.org/P5"> . It will be possible in XSLT 2.0 to write a stylesheet to work with both old and new TEI documents, but using XSLT 1.0 it will be much harder; all stylesheets will need a large rewrite.
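The tool-compatibility problem is not specific to XSLT. The following Python sketch, using the standard library parser, shows the same effect: a lookup written for unqualified names silently finds nothing in a namespaced document. The two sample documents and both helper functions are invented for illustration; the namespace URI is the one quoted above.

```python
import xml.etree.ElementTree as ET

P5_NS = "http://www.tei-c.org/P5"  # namespace URI quoted in the text above

# Two invented sample documents: one in no namespace, one in the new namespace.
old_doc = ET.fromstring("<TEI.2><text><body><p>old</p></body></text></TEI.2>")
new_doc = ET.fromstring(
    f'<TEI.2 xmlns="{P5_NS}"><text><body><p>new</p></body></text></TEI.2>'
)


def paragraphs_unqualified(root):
    """Typical existing code: matches <p> in the empty namespace only."""
    return [p.text for p in root.iter("p")]


def paragraphs_either(root):
    """Rewritten code: accepts <p> with or without the new namespace."""
    wanted = ("p", f"{{{P5_NS}}}p")
    return [p.text for p in root.iter() if p.tag in wanted]


print(paragraphs_unqualified(old_doc))  # finds the paragraph
print(paragraphs_unqualified(new_doc))  # finds nothing: names do not match
print(paragraphs_either(new_doc))       # finds it after the rewrite
```

Every name test in existing code needs the same treatment, which is why the change is described above as a large rewrite rather than a one-line fix.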
A replacement for the Pizza chef
- Adding elements which do not simply follow the class system, but have arbitrary content models and attribute lists. The problem here is how to ask the user to specify the new material without directly writing schema code. It remains to be seen how many requests we will receive for this feature.
- Changing or limiting the content model of elements which do not follow the class system fully. The correct answer to this may be to revise the TEI so that all elements do use the class system 100%, but in the short term this is unrealistic. It may be possible to devise an interface for editing content models.
- Adding entire classes to the TEI. This is a complex matter, which it is unlikely we can provide in a simple web interface.
Appendix A: Tables
Current | New datatype | (values) |
%ISO-date; | DATE | |
%extPtr; | EXTPTR | |
%formulaNotations; | FORMULA | |
Y | N | BOOLEAN | |
Y | N | U | UBOOLEAN | |
YES | NO | BOOLEAN | |
all | one | none | TOKEN | all, one, none |
all | some | none | TOKEN | all, some, none |
free | unknown | restricted | TOKEN | free, unknown, restricted |
light | sound | prop | block | TOKEN | light, sound, prop, block |
m | f | u | SEX | |
m | f | u | x | SEX | |
none | some | all | TOKEN | |
silent | tags | TOKEN | |
y | n | u | UBOOLEAN | |
yes | no | BOOLEAN | |
Y | N | I | M | F | TOKEN | Y, N, I, M, F |
Y | N | U | UBOOLEAN | |
Y | N | partial | TOKEN | Y, N, partial |
Y | N | BOOLEAN | |
Y | N | BOOLEAN | |
a | m | j | s | u | TOKEN | a, m, j, s, u |
am | pm | 24hour | descriptive | TOKEN | am, pm, 24hour, descriptive |
audio | video | TOKEN | audio, video |
closed | semi | open | TOKEN | closed, semi, open |
composite | uniform | TOKEN | composite, uniform |
data | rend | std | nonstd | unknown | TOKEN | data, rend, std, nonstd, unknown |
eq | ne | TOKEN | eq, ne |
eq | ne | gt | ge | lt | le | TOKEN | eq, ne, gt, ge, lt, le |
eq | ne | lt | le | gt | ge | TOKEN | eq, ne, lt, le, gt, ge |
eq | ne | sb | ns | TOKEN | eq, ne, sb, ns |
eq | ne | sb | ns | lt | le | gt | ge | TOKEN | eq, ne, sb, ns, lt, le, gt, ge |
excl | incl | TOKEN | excl, incl |
fiction | fact | mixed | inapplicable | TOKEN | fiction, fact, mixed, inapplicable |
high | medium | low | unknown | TOKEN | high, medium, low, unknown |
horizontal | vertical | TOKEN | horizontal, vertical |
initial | medial | final | unknown | complete | TOKEN | initial, medial, final, unknown, complete |
internal | external | TOKEN | internal, external |
int | real | TOKEN | int, real |
lexical | punc | lexpunc | digit | space | DL | LD | dia | joiner | other | TOKEN | lexical, punc, lexpunc, digit, space, DL, LD, dia, joiner, other |
location-referenced | double-end-point | parallel-segmentation | TOKEN | location-referenced, double-end-point, parallel-segmentation |
model | atts | both | TOKEN | model, atts, both |
new | update | TOKEN | new, update |
none | partial | complete | inapplicable | TOKEN | none, partial, complete, inapplicable |
pe | ge | TOKEN | pe, ge |
perc | real | TOKEN | perc, real |
req | mwa | rec | rwa | opt | TOKEN | req, mwa, rec, rwa, opt |
role | list | TOKEN | role, list |
root | branches | TOKEN | root, branches |
s | w | ws | sw | m | x | TOKEN | s, w, ws, sw, m, x |
silent | tags | TOKEN | silent, tags |
single | composite | frags | unknown | TOKEN | single, composite, frags, unknown |
single | set | bag | list | TOKEN | single, set, bag, list |
smooth | latching | overlap | pause | TOKEN | smooth, latching, overlap, pause |
tei | iso | national | private | none | TOKEN | tei, iso, national, private, none |
tempo | loud | pitch | tension | rhythm | voice | TOKEN | tempo, loud, pitch, tension, rhythm, voice |
to | from | both | none | TOKEN | to, from, both, none |
unit | set | bag | list | TOKEN | unit, set, bag, list |
y | n | unspecified | UBOOLEAN | |
y | n | BOOLEAN | |
yes | abb | init | TOKEN | yes, abb, init |
yes | no | BOOLEAN | |
yes | no | BOOLEAN | |
CDATA | TOKEN | |
ENTITIES | ENTITIES | |
ENTITY | ENTITY | |
ID | ID | |
IDREF | IDREF | |
IDREFS | IDREFS | |
NAME | NAME | |
NMTOKEN | NMTOKEN | |
NMTOKENS | NMTOKENS | |
element | attribute | datatype |
analysis | ana | typeIDREFS |
declarable | default | typeBOOLEAN |
declaring | decls | typeIDREFS |
dictionaries | location | typeIDREF |
dictionaries | mergedin | typeIDREF |
dictionaries | opt | typeBOOLEAN |
edit | resp | typeIDREF |
formPointers | target | typeIDREF |
global | id | typeID |
global | id | typeID |
global | lang | typeIDREF |
interpret | inst | typeIDREFS |
linking | corresp | typeIDREFS |
linking | synch | typeIDREFS |
linking | sameAs | typeIDREF |
linking | copyOf | typeIDREF |
linking | next | typeIDREF |
linking | prev | typeIDREF |
linking | exclude | typeIDREFS |
linking | select | typeIDREFS |
pointer | targOrder | typeUBOOLEAN |
pointerGroup | domains | typeIDREFS |
readings | hand | typeIDREF |
TEIform | TEIform | typeNAME |
terminology | grpPtr | typeIDREF |
terminology | depPtr | typeIDREF |
timed | start | typeIDREF |
timed | end | typeIDREF |
xPointer | doc | typeENTITY |
xPointer | from | typeEXTPTR |
xPointer | to | typeEXTPTR |
abbr | resp | typeIDREF |
add | resp | typeIDREF |
add | hand | typeIDREF |
addSpan | resp | typeIDREF |
addSpan | hand | typeIDREF |
addSpan | to | typeIDREF |
admin | date | typeDATE |
alt | targets | typeIDREFS |
app | from | typeIDREF |
app | to | typeIDREF |
arc | from | typeIDREF |
arc | to | typeIDREF |
att | tei | typeBOOLEAN |
birth | date | typeDATE |
catRef | target | typeIDREFS |
catRef | scheme | typeIDREF |
cell | rows | typeNONNEGATIVEINTEGER |
cell | cols | typeNONNEGATIVEINTEGER |
certainty | target | typeIDREFS |
classCode | scheme | typeIDREF |
damage | resp | typeIDREF |
damage | hand | typeIDREF |
date | value | typeDATE |
del | resp | typeIDREF |
del | hand | typeIDREF |
delSpan | resp | typeIDREF |
delSpan | hand | typeIDREF |
delSpan | to | typeIDREF |
distance | exact | typeUBOOLEAN |
docDate | value | typeDATE |
eLeaf | value | typeIDREF |
eTree | value | typeIDREF |
event | who | typeIDREF |
event | iterated | typeUBOOLEAN |
expan | resp | typeIDREF |
f | fVal | typeIDREFS |
fAlt | mutExcl | typeBOOLEAN |
figure | entity | typeENTITY |
form | codedCharSet | typeIDREF |
form | entityStd | typeENTITIES |
form | entityLoc | typeENTITIES |
formula | notation | typeFORMULA |
fs | feats | typeIDREFS |
fsdDecl | fsd | typeENTITY |
gap | resp | typeIDREF |
gap | hand | typeIDREF |
gi | tei | typeBOOLEAN |
gloss | target | typeIDREF |
graph | order | typeNONNEGATIVEINTEGER |
graph | size | typeNONNEGATIVEINTEGER |
handShift | new | typeIDREF |
handShift | old | typeIDREF |
handShift | resp | typeIDREF |
iNode | value | typeIDREF |
iNode | children | typeIDREFS |
iNode | parent | typeIDREF |
iNode | ord | typeBOOLEAN |
iNode | follow | typeIDREF |
iNode | outDegree | typeNONNEGATIVEINTEGER |
join | targets | typeIDREFS |
keywords | scheme | typeIDREF |
kinesic | who | typeIDREF |
kinesic | iterated | typeUBOOLEAN |
language | iso639 | typeLANGUAGE |
language | wsd | typeENTITY |
leaf | value | typeIDREF |
leaf | parent | typeIDREF |
leaf | follow | typeIDREF |
link | targets | typeIDREFS |
move | who | typeIDREFS |
move | perf | typeIDREFS |
msr | value | typeFLOAT |
msr | valueTo | typeFLOAT |
nbr | value | typeFLOAT |
nbr | valueTo | typeFLOAT |
node | value | typeIDREF |
node | adjTo | typeIDREFS |
node | adjFrom | typeIDREFS |
node | adj | typeIDREFS |
node | inDegree | typeNONNEGATIVEINTEGER |
node | outDegree | typeNONNEGATIVEINTEGER |
node | degree | typeNONNEGATIVEINTEGER |
note | anchored | typeBOOLEAN |
note | target | typeIDREFS |
note | targetEnd | typeIDREFS |
occupation | scheme | typeIDREF |
occupation | code | typeIDREF |
pause | who | typeIDREF |
person | sex | typeSEX |
personGrp | sex | typeSEX |
ptr | target | typeIDREFS |
q | direct | typeUBOOLEAN |
rate | value | typeFLOAT |
rate | valueTo | typeFLOAT |
ref | target | typeIDREFS |
relation | active | typeIDREFS |
relation | passive | typeIDREFS |
relation | mutual | typeBOOLEAN |
respons | target | typeIDREFS |
restore | resp | typeIDREF |
restore | hand | typeIDREF |
root | value | typeIDREF |
root | children | typeIDREFS |
root | ord | typeBOOLEAN |
root | outDegree | typeNONNEGATIVEINTEGER |
setting | who | typeIDREFS |
shift | who | typeIDREF |
socecStatus | scheme | typeIDREF |
socecStatus | code | typeIDREF |
sound | discrete | typeUBOOLEAN |
sp | who | typeIDREFS |
span | from | typeIDREF |
span | to | typeIDREF |
state | length | typeNONNEGATIVEINTEGER |
step | length | typeNONNEGATIVEINTEGER |
step | from | typeEXTPTR |
step | to | typeEXTPTR |
supplied | hand | typeIDREF |
symbol | terminal | typeBOOLEAN |
table | rows | typeNONNEGATIVEINTEGER |
table | cols | typeNONNEGATIVEINTEGER |
tag | TEI | typeBOOLEAN |
tagUsage | occurs | typeNONNEGATIVEINTEGER |
tagUsage | ident | typeNONNEGATIVEINTEGER |
tagUsage | render | typeIDREF |
tech | perf | typeIDREFS |
teiHeader | date.created | typeDATE |
teiHeader | date.updated | typeDATE |
time | value | typeTIME |
timeline | origin | typeIDREF |
timeRange | from | typeTIME |
timeRange | to | typeTIME |
tree | arity | typeNONNEGATIVEINTEGER |
tree | order | typeNONNEGATIVEINTEGER |
triangle | value | typeIDREF |
u | who | typeIDREFS |
unclear | hand | typeIDREF |
vAlt | mutExcl | typeBOOLEAN |
vocal | who | typeIDREF |
vocal | iterated | typeUBOOLEAN |
when | since | typeIDREF |
witDetail | target | typeIDREFS |
writing | who | typeIDREF |
writing | script | typeIDREF |
writing | gradual | typeUBOOLEAN |
writingSystemDeclaration | date | typeDATE |