Anyone for pizza?

Designing a TEI document type declaration

The T E what?

Originally, a research project within the humanities

Sponsored by ALLC, ACH, ACL

Funded 1990-1994 by US NEH, EU LE Programme et al.

Major influences

digital libraries and text collections

language corpora

scholarly datasets

Now an international membership consortium incorporated Jan 2001

Current TEI activity

Preliminary XML version of DTD now available

Text update in progress

Workgroups under consideration

character set issues

manuscript description

modelling

lexica and termbanks

physical description

resource discovery

Membership will vote by end 2001

Who uses TEI?

digital librarians and archivists

HTI, UVA, CETH, OTA...

Language Engineering projects

EAGLES, BNC, MULTEXT, ECI, Silfide

academic researchers

Women Writers Project, CURIA Project, VWWP, Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library...

http://www.hcu.ox.ac.uk/TEI/Applications/

Goals of the TEI

better interchange and integration of scholarly data

support for all texts, in all languages, from all periods

guidance for the perplexed: what to encode

assistance for the specialist: how to encode any information of interest

Hence a loose framework into which unpredictable extensions can be fitted

Legacy of the TEI

a way of looking at what “text” really is

a codification of current scholarly practice

(crucially) a set of shared assumptions and priorities about the digital agenda:

focus on content and function (rather than presentation)

generic solutions (rather than application-specific ones)

TEI Deliverables

A set of recommendations for text encoding,

covering both generic text structures and some highly specific areas based on (but not limited by) existing practice

A very large collection of element definitions

combined into a very loose document type declaration

A mechanism for creating multiple views (DTDs) of the foregoing

One such view and associated tutorial: TEI Lite; others exist (e.g. CES, BNC)

The TEI modus operandi...

identify significant particularities

independent of notation or realization

avoid controversy, over-delicacy, inadequacy

seek generalizable solutions, acceptable to a consensus

... and some consequences

focus on content, not presentation

descriptive, not prescriptive

Occam's razor

modular, extensible dtd

Designing a dtd for the TEI

How can a single markup scheme handle a large variety of requirements?

all texts are alike

every text is different

Learn from the database designers

one construct, many views

each view a selection from the whole

How Many dtds?

How many dtds might the TEI require?

one (the Corporate or WKWBFY approach)

none (the Anarchic or NWEUMP approach)

as many as it takes (the Mixed Economy or XML approach)

Or a single main dtd with many faces (the British approach)

The TEI solution: modularization

a (very) large number of element and attribute definitions

organized as tagsets (core, base, additional, or auxiliary)

grouped into classes

How to combine Tag Sets…

all tag sets, all the time (the table d'hôte model)

a few pre-selected combinations (the combination plate model)

in completely unconstrained abandon (the smörgåsbord model)

one from column A, two from column B (the Chinese menu model)

To build a view of the TEI dtd, take...

the core tagsets

the base of your choice

the toppings of your choice

TEI base tagsets

exactly one must be selected

defines basic structural components

currently defined:

prose, verse, drama

transcribed speech

dictionaries

terminological databases

mixtures of bases require special treatment

TEI additional tagsets

sets of elements for specialized application areas

can be mixed and matched ad lib

currently provided:

linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism; names and dates; graphs and trees; figures and tables; language corpora...
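
As a rough illustration of the recipe, a DTD subset selecting one base and two toppings might look like the sketch below. The choice of the linking and figures tagsets is an arbitrary assumption made for the example, and the system identifier tei2.dtd is borrowed from the Lampeter example later in this talk. The core tagsets need no declaration of their own.

```xml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!-- the base: exactly one base tagset (prose chosen for illustration) -->
  <!ENTITY % TEI.prose   "INCLUDE">
  <!-- the toppings: any number of additional tagsets (illustrative choices) -->
  <!ENTITY % TEI.linking "INCLUDE">
  <!ENTITY % TEI.figures "INCLUDE">
]>
```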

For an XML DTD

Just add another declaration to the subset

(This is new in TEI P4)
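
A minimal sketch of that extra declaration, assuming the parameter entity name TEI.XML used by the P4 DTDs. The surrounding DOCTYPE and the choice of the prose base are included only to make the sketch complete.

```xml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!-- ask for the XML rather than the SGML form of the DTD -->
  <!ENTITY % TEI.XML   "INCLUDE">
  <!-- tagset selection as usual (prose base assumed here) -->
  <!ENTITY % TEI.prose "INCLUDE">
]>
```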

How does this work?

enables all declarations within the tagset marked section defined in the main TEI dtd

these may include element, attribute, and class definitions

How does this work?

Within the main DTD:

the declarations making up each tagset are enclosed by an IGNORE marked section

the declarations for each element are enclosed by an INCLUDE marked section

this can be over-ridden by your declaration within the DTD subset
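
The mechanism can be sketched as follows. This is a simplified illustration, not the actual text of the main DTD: the comments stand in for the real declarations. It relies on two standard rules: the DTD subset is read before the external DTD, and the first declaration of an entity is the one that counts.

```xml
<!-- Simplified sketch of the main DTD, not its actual text -->

<!-- Tagset level: switched off unless the subset has already said INCLUDE -->
<!ENTITY % TEI.prose "IGNORE">
<![ %TEI.prose; [
  <!-- declarations making up the prose tagset go here -->
]]>

<!-- Element level: switched on unless the subset has already said IGNORE -->
<!ENTITY % biblStruct "INCLUDE">
<![ %biblStruct; [
  <!-- element and attribute list declarations for biblStruct go here -->
]]>
```

A subset containing `<!ENTITY % TEI.prose "INCLUDE">` therefore turns the whole tagset on, and one containing `<!ENTITY % biblStruct "IGNORE">` turns that single element off.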

Customizing the TEI DTD

In DTD subset

selection of tag sets

specification of document entities

in TEI.extensions.ent

renaming of elements

suppression of elements

modification of TEI classes

in TEI.extensions.dtd

definition of new elements

Entity definitions

typically will include entity declarations for embedded graphics etc.

may also invoke special characters etc.

To modify the dtd

Define your modifications in a pair of extension files

In your extension files you can…

rename elements

<!ENTITY % n.p "para" >

undefine elements

<!ENTITY % seg "IGNORE">

The pizzaChef gives you a list of all the elements available from your chosen tagsets, and generates extension files for you

You can also

supply additional (or replacement) declarations

supply entirely new elements and embed them in the architecture

An example

In the DTD subset we write:

<!ENTITY % TEI.prose "INCLUDE">

<!ENTITY % biblStruct "IGNORE">

In the prose tagset it says:

Finally, the pizza is cooked

The carthage program removes

parameterization in the DTD

unreferenced or inaccessible elements

The pizzachef website

http://www.tei-c.org/pizza.html

command line equivalent:

http://www.tei-c.org/maketeidtd/

Element Classes

Most TEI elements are assigned to one or more

model classes, identifying their syntactic properties, or

attribute classes, identifying their attributes

This provides a (relatively) simple way of

documenting and understanding the DTD

parameterizing content models

facilitating customization

An alternative way of doing architectural forms

Some TEI model classes

divn: structural elements like divisions (<div>, <div1>, <div2>…)

divtop: elements which can appear at the start of a divn element (<head>, <epigraph>, <byline>…)

chunk: paragraph-like elements (<sp>, <p>, <lg>…)

phrase: elements which appear within chunks (<hi>, <foreign>, <date> …)

Implementation of classes

Each model class is defined as a pair of parameter entities

Reference to class members is always indirect

Class mobility

Each model class is defined as a parameter entity, containing

a reference to an initially null extension class

a list of members

To add a new member to a class, we redefine the extension class:
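
A simplified sketch of the idea, using the phrase class. The entity names x.phrase and m.phrase follow the naming seen in the Lampeter extensions below; the abbreviated member list and the content model for p are assumptions made for illustration, not the real declarations.

```xml
<!-- In TEI.extensions.ent, which is read first, so this declaration wins.
     "it" is a new element; note the trailing bar. -->
<!ENTITY % x.phrase "it |">

<!-- In the main DTD (simplified; member list abbreviated) -->
<!ENTITY % x.phrase "">                                 <!-- null default, ignored if already declared -->
<!ENTITY % m.phrase "%x.phrase; hi | foreign | date">   <!-- extension class, then the members -->

<!-- Content models refer to the class, never to its members directly -->
<!ELEMENT p (#PCDATA | %m.phrase;)*>
```

With the extension in place, m.phrase expands to `it | hi | foreign | date`, so the new element is allowed wherever phrase-level elements are. The new element itself still has to be declared in TEI.extensions.dtd.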

TEI attribute classes

global: attributes which are available to every element (n, lang, id, TEIform)

linking: attributes for elements which have linking semantics (targType, targOrder, evaluate)

The TEIform attribute

protects applications from the effect of element renaming

    <titre TEIform="title">...</titre>

protects applications from the effect of syntactic sugar

     <abc type="xyz"> can be rewritten as

     <xyz TEIform="abc">
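
One way this can work without the encoder typing TEIform everywhere is for the DTD to supply it as a default attribute value. The sketch below is a simplification under that assumption: the n.title entity follows the n.p renaming pattern shown earlier, and the content model is a placeholder.

```xml
<!-- Simplified sketch: n.title holds the element's current name -->
<!ENTITY % n.title "title">                 <!-- an extension file could say "titre" instead -->
<!ELEMENT %n.title; (#PCDATA)>              <!-- placeholder content model -->
<!-- whatever the element is called, its TEIform defaults to the canonical name -->
<!ATTLIST %n.title; TEIform CDATA "title">
```

An application that looks at TEIform rather than the element name sees "title" in either case.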

TEI Auxiliary DTDs

independent dtds for specialized information:

writing system / character set

feature system (for feature-structure notation)

tag set documentation

independent, free-standing TEI header

What can go wrong?

extensions must use SGML syntax

beware of zombie elements

beware of over-zealous pruning

remember that some TEI rules are not enforced (or enforceable) by the DTD

You have to know what's on the menu before you can choose from it

A case study: the Lampeter corpus

Fairly typical requirements for historical language corpora:

light presentational tagging

structural markup for access

detailed information about source text production

small number of tags to ease data capture and validation

Implementation

tagsets: prose base, and tags from four additional sets

some extensions, many exclusions

The Lampeter corpus DTD subset

<!DOCTYPE teiCorpus.2 SYSTEM "tei2.dtd"[

<!ENTITY % TEI.prose "INCLUDE">

<!ENTITY % TEI.corpus "INCLUDE">

<!ENTITY % TEI.figures "INCLUDE">

<!ENTITY % TEI.transcr "INCLUDE">

<!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent">

<!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd">

]>

The Lampeter corpus extensions.ent

<!ENTITY % analytic 'IGNORE' >

<!ENTITY % biblStruct 'IGNORE' >



<!ENTITY % supplied 'IGNORE' >

<!ENTITY % x.phrase "it|ro|sc|su|bo|go|">

<!ENTITY % x.biblPart "printer|pubFormat|bookSeller|">

<!ENTITY % x.demographic "socecstatusPat|biogNote|">

The Lampeter corpus extensions.dtd

<!ELEMENT it (%phrase.seq;)>

<!ELEMENT printer (%phrase.seq;)>

<!ATTLIST it %a.global; >

<!-- etc. for all other new elements -->

To finish the job

Document your extensions, using the TEI tagset for tagset documentation

Write a manual using the ODD system to generate your DTD fragments

Why bother?

The TEI is a well-known reference point

Using the TEI enables

sharing of data and resources

shared modular software development

lower learning curve and reduced training costs

The TEI is stable, rigorous, and well-documented

The TEI is also flexible, customizable, and extensible in documented ways

The architectural approach offers the best compromise for practical work.



	<!ENTITY % base "(deepDish | thinCrust | stuffed)" >
	<!ENTITY % topping "(pepperoni | mushrooms | sausage | pepper | anchovies | ...)" >
	<!ELEMENT pizza - - (%base;, (tomatoSauce & cheese), (%topping;)*) >

