Anyone for pizza?
Designing a TEI document type declaration

The T E what?
Originally, a research project within the humanities
Sponsored by ALLC, ACH, ACL
Funded 1990-1994 by US NEH, EU LE Programme et al.
Major influences
  digital libraries and text collections
  language corpora
  scholarly datasets
Now an international membership consortium incorporated Jan 2001

Current TEI activity
Preliminary XML version of DTD now available
Text update in progress
Workgroups under consideration
character set issues
manuscript description
modelling
lexica and termbanks
physical description
resource discovery
Membership will vote by end 2001

Who uses TEI?
digital librarians and archivists
HTI, UVA, CETH, OTA...
Language Engineering projects
EAGLES, BNC, MULTEXT, ECI, Silfide
academic researchers
Women Writers Project, CURIA Project, VWWP, Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library...
http://www.hcu.ox.ac.uk/TEI/Applications/

Goals of the TEI
better interchange and integration of scholarly data
support for all texts, in all languages, from all periods
guidance for the perplexed:  what to encode
assistance for the specialist:  how to encode any information of interest
Hence a loose framework into which unpredictable extensions can be fitted

Legacy of the TEI
a way of looking at what “text” really is
a codification of current scholarly practice
(crucially) a set of shared assumptions and priorities about the digital agenda:
focus on content and function (rather than presentation)
generic solutions (rather than application-specific  ones)

TEI Deliverables
A set of recommendations for text encoding,
covering both generic text structures and some highly specific areas based on (but not limited by) existing practice
A very large collection of element definitions
combined into a very loose document type definition (DTD)
A mechanism for creating multiple views (DTDs) of the foregoing
One such view and associated tutorial: TEI Lite; others exist (e.g. CES, BNC)

The TEI modus operandi...
 identify significant particularities
independent of notation or realization
 avoid controversy, over-delicacy, inadequacy
 seek generalizable solutions, acceptable to a consensus

... and some consequences
 focus on content, not presentation
 descriptive, not prescriptive
 Occam's razor
 modular, extensible dtd

Designing a dtd for the TEI
How can a single markup scheme handle a large variety of requirements?
all texts are alike
every text is different
Learn from the database designers
one construct, many views
each view a selection from the whole

How Many dtds?
How many dtds might the TEI require?
one (the Corporate or WKWBFY approach)
none (the Anarchic or NWEUMP approach)
as many as it takes (the  Mixed Economy or XML approach)
Or a single main dtd with many faces (the British approach)

The TEI solution: modularization
a (very) large number of element and attribute definitions
organized as tagsets (core, base, additional, or auxiliary)
grouped into classes

How to combine Tag Sets…
all tag sets, all the time (the table d'hôte model)
a few pre-selected combinations (the combination plate model)
in completely unconstrained abandon (the smörgåsbord model)
one from column A, two from column B (the Chinese menu model)

The Chicago Pizza Model
<!ENTITY % base "(deepDish | thinCrust | stuffed)" >
<!ENTITY % topping "(pepperoni | mushrooms | sausage | pepper | anchovies | ...)" >
<!ELEMENT pizza - -
( %base;, (tomatoSauce & cheese), (%topping;)* ) >

To build a view of the TEI dtd, take...
the core tagsets
the base of your choice
the toppings of your choice
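
As a rough illustration of this recipe, a DTD subset might select one base and one topping as follows. The entity names TEI.prose and TEI.linking follow the naming used for TEI tagsets. The particular choice of tagsets and the system identifier are arbitrary, so treat this as a sketch rather than a recommended configuration.

```dtd
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!-- the core tagsets come with every view; nothing to declare -->
<!-- the base of your choice: prose -->
<!ENTITY % TEI.prose   "INCLUDE">
<!-- a topping of your choice: linking and alignment -->
<!ENTITY % TEI.linking "INCLUDE">
]>
```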

TEI base tagsets
one only must be selected
defines basic structural components
currently defined:
prose, verse, drama
transcribed speech
dictionaries
terminological databases
mixtures of bases require special treatment
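
One form the special treatment can take is sketched here, on the assumption that the TEI.mixed entity behaves as described in the TEI Guidelines: enable the mixed base alongside each base to be combined. The choice of prose and verse is only an example.

```dtd
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!-- assumed switch for combining bases -->
<!ENTITY % TEI.mixed "INCLUDE">
<!-- the bases being combined -->
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.verse "INCLUDE">
]>
```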

TEI additional tagsets
sets of elements for specialized application areas
can be mixed and matched ad lib
currently provided:
linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism; names and dates; graphs and trees; figures and tables; language corpora...

For an XML DTD
Just add another declaration to the subset
(This is new in TEI P4)
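
Assuming the P4 convention, the extra declaration is a parameter entity named TEI.XML, set alongside the tagset selections. The TEI.prose line is just an example choice of base.

```dtd
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!-- ask for the XML version of the DTD -->
<!ENTITY % TEI.XML   "INCLUDE">
<!-- tagset selection as usual -->
<!ENTITY % TEI.prose "INCLUDE">
]>
```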

How does this work?
Declaring a tagset's parameter entity as INCLUDE enables all declarations within that tagset's marked section in the main TEI dtd
these may include element, attribute, and class definitions

How does this work?
Within the main DTD:
the declarations making up each tagset are enclosed by an IGNORE marked section
the declarations for each element are enclosed by an INCLUDE marked section
this can be over-ridden by your declaration within the DTD subset
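
The tagset-level switch can be sketched as follows. The fragment is a simplified illustration, not the actual text of the TEI dtd, and the element name someProseElement and its content model are placeholders. The override works because a parser uses the first declaration it meets for a given entity, and the DTD subset is read before the main DTD.

```dtd
<!-- In the main DTD: the switch defaults to IGNORE ... -->
<!ENTITY % TEI.prose "IGNORE">

<!-- ... so this marked section is skipped unless the DTD subset
     has already declared TEI.prose as "INCLUDE" -->
<![ %TEI.prose; [
  <!-- placeholder declaration, written in XML syntax; the SGML
       version would add tag-omission indicators -->
  <!ELEMENT someProseElement (#PCDATA)>
]]>
```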

Customizing the TEI DTD
In the DTD subset
  selection of tag sets
  specification of document entities
In TEI.extensions.ent
  renaming of elements
  suppression of elements
  modification of TEI classes
In TEI.extensions.dtd
  definition of new elements

Entity definitions
typically will include entity declarations for embedded graphics etc.
may also invoke special characters etc.

To modify the dtd
Define your modifications in a pair of extension files

In your extension files you can…
rename elements
<!ENTITY % n.p "para" >
undefine elements
<!ENTITY % seg "IGNORE">
The Pizza Chef gives you a list of all the elements available from your chosen tagsets, and generates extension files for you

You can also
supply additional (or replacement) declarations
supply entirely new elements and embed them in the architecture

An example
In the DTD subset we write:
 <!ENTITY % TEI.prose "INCLUDE">
 <!ENTITY % biblStruct "IGNORE">
In the prose tagset it says:
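
A sketch of the kind of declaration this refers to, simplified and not quoted from the TEI dtd: an element's declarations sit inside a marked section keyed on a parameter entity named after the element, which defaults to INCLUDE. The content model shown is a placeholder.

```dtd
<!-- default value; ignored here because the DTD subset
     declared biblStruct first -->
<!ENTITY % biblStruct "INCLUDE">

<![ %biblStruct; [
  <!-- placeholder content model, not the real one -->
  <!ELEMENT biblStruct (#PCDATA)>
]]>
```

With the subset above, %biblStruct; expands to IGNORE, so the element is never declared.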

Finally, the pizza is cooked
The carthage program removes
parameterization in the DTD
unreferenced or inaccessible elements
The Pizza Chef website
http://www.tei-c.org/pizza.html
command line equivalent:
http://www.tei-c.org/maketeidtd/

Element Classes
Most TEI elements are assigned to one or more
model classes, identifying their syntactic properties, or
attribute classes,  identifying their attributes
This provides a (relatively) simple way of
documenting and understanding the DTD
parameterizing content models
facilitating customization
An alternative way of doing architectural forms

Some TEI model classes
divn: structural elements like divisions (<div>,<div1>, <div2>…)
divtop: elements which can appear at the start of a divn element (<head>, <epigraph>, <byline>…)
chunk: paragraph-like elements (<sp>, <p>, <lg>…)
phrase: elements which appear within chunks  (<hi>, <foreign>, <date> …)

Implementation of classes
Each model class is defined as a pair of parameter entities
Reference to class members is always indirect

Class mobility
Each model class is defined as a  parameter entity, containing
a reference to an initially null extension class
a list of members
To add a new member to a class, we redefine the extension class:
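
A minimal sketch of the pattern, using the chunk class as the example. The member list is abbreviated, the new element name myBlock is hypothetical, and the m. prefix for the class entity is an assumption. The x. prefix follows the convention visible in the Lampeter extensions (x.phrase), and the trailing bar in the extension value is what lets it sit in front of the existing members.

```dtd
<!-- In TEI.extensions.ent, which is read first, so this
     declaration takes precedence: add a hypothetical new member -->
<!ENTITY % x.chunk "myBlock |">

<!-- In the main DTD: the extension class is empty by default ... -->
<!ENTITY % x.chunk "">
<!-- ... and the class lists the extension, then its members -->
<!ENTITY % m.chunk "%x.chunk; p | sp | lg">
```

Content models that refer to %m.chunk; then accept the new element without being edited themselves.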

TEI attribute classes
global: attributes which are available to every element (n, lang, id, TEIform)
linking: attributes for elements which have linking semantics (targType, targOrder, evaluate)

The TEIform attribute
protects applications from the effect of element renaming
    <titre TEIform="title">...</titre>
protects applications from the effect of syntactic sugar
     <abc type="xyz"> can be rewritten as
     <xyz TEIform="abc">
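
One way this can work, sketched under two assumptions: that each element's name is held in an n. parameter entity (as in the n.p renaming example), and that TEIform is declared with the canonical element name as its default value. The content model is a placeholder.

```dtd
<!-- In TEI.extensions.ent: rename <title> to <titre> -->
<!ENTITY % n.title "titre">

<!-- In the main DTD (simplified): the default name, which the
     earlier declaration above overrides -->
<!ENTITY % n.title "title">
<!ELEMENT %n.title; (#PCDATA)>
<!-- the canonical name travels with the element as a default -->
<!ATTLIST %n.title; TEIform CDATA "title">
```

A document can then use <titre>...</titre> without writing the attribute, and a validating parser supplies TEIform="title" from the default.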

TEI Auxiliary DTDs
independent dtds for specialized information:
writing system / character set
feature system (for feature-structure notation)
tag set documentation
independent, free-standing TEI header

What can go wrong?
extensions must use SGML syntax
beware of zombie elements
beware of overzealous pruning
remember that some TEI rules are not enforced (or enforceable) by the DTD
You have to know what's on the menu before you can choose from it

A case study: the Lampeter  corpus
Fairly typical requirements for historical language corpora:
light presentational tagging
structural markup for access
detailed information about source text production
small number of tags to ease data capture and validation
Implementation
tagsets: prose base, and tags from four additional sets
some extensions, many exclusions

The Lampeter corpus DTD subset
<!DOCTYPE teiCorpus.2 SYSTEM "tei2.dtd"[
<!ENTITY % TEI.prose  "INCLUDE">
<!ENTITY % TEI.corpus "INCLUDE">
<!ENTITY % TEI.figures "INCLUDE">
<!ENTITY % TEI.transcr "INCLUDE">
<!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd">
]>

The Lampeter corpus extensions.ent
<!ENTITY % analytic 'IGNORE' >
<!ENTITY % biblStruct 'IGNORE' >
<!-- hic desunt multa -->
<!ENTITY % supplied 'IGNORE' >
<!ENTITY % x.phrase      "it|ro|sc|su|bo|go|">
<!ENTITY % x.biblPart      "printer|pubFormat|bookSeller|">
<!ENTITY % x.demographic "socecstatusPat|biogNote|">

The Lampeter corpus  extensions.dtd
<!ELEMENT it      (%phrase.seq;)>
<!ELEMENT printer (%phrase.seq;)>
<!ATTLIST it %a.global; >
<!-- etc. for all other new elements -->

To finish the job
Document your extensions, using the TEI tagset for tagset documentation
Write a manual using the ODD system to generate your DTD fragments

Why bother?
The TEI is a well-known reference point
Using the TEI enables
sharing of data and resources
shared modular software development
lower learning curve and reduced training costs
The TEI is stable, rigorous, and well-documented
The TEI is also flexible, customizable, and extensible  in documented ways
The architectural approach offers the best compromise for practical work.