.sr docfile = &sysfnam. ;.sr docversion = 'Draft';.im teigmlp1 .* Document proper begins. Living with the Guidelines: <title>An Introduction to TEI Tagging <author>Lou Burnard <author>C. M. Sperberg-McQueen <docnum>TEI &docfile. <date>&docdate. </titlep> <!> <abstract> This document summarizes some basic issues and problems in the encoding of electronic texts for interchange. It provides a brief introduction to the recommendations of the Text Encoding Initiative (TEI) and shows how these Guidelines may be used in the encoding of a wide variety of commonly encountered textual features, in such a way as to maximize the usability of electronic texts and to facilitate their interchange among scholars using different hardware and software environments. </abstract> <toc> </frontm> <!> <body> <!> <h1>About the TEI and This Document <p> The Text Encoding Initiative is a cooperative effort in the textual research community to develop and disseminate guidelines for the encoding and interchange of machine-readable texts. It is sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. Funding comes in part from the U.S. National Endowment for the Humanities, Directorate XIII of the Commission of the European Communities, and the Andrew W. Mellon Foundation. Equally important has been the donation of time and expertise by the many members of the research community who have served on the TEI's Working Committees and Working Groups. <p> The TEI Guidelines are designed to be used in a broad range of applications; they are flexible, easy to use, and extensible. Because their first goal is to be useful to researchers, however, they necessarily consider some rather esoteric or specialized applications, which many researchers will never need to consider. 
This document introduces the rather small set of textual features which most users of the scheme will probably use most of the time. We recognize however that we are addressing a very diverse community; if the tags described here are not those you need most often, bear with us. The final chapter mentions some aspects of the full TEI scheme not discussed here, and includes pointers to places in the Guidelines where these are discussed. For a more complete and formal treatment of the topics adumbrated in this document, please refer to the Guidelines themselves: TEI document TEI P1, <cit>Guidelines for the Encoding and Interchange of Machine-Readable Texts</cit> ed. C. M. Sperberg-McQueen and Lou Burnard, version 1.1 (Chicago, Oxford: Text Encoding Initiative, 1990), referred to here as <q>the Guidelines</q> or as <q>P1</q>. <!> <h1>What is Markup? <p> By <term>markup</term> we mean all the information contained in a computer file other than the text itself, by means of which computer programs are able to manipulate texts in useful ways; the term is borrowed from the history of printing, where <term>markup</term> referred to the notations made in the margins of a text to guide the compositor in the layout of the text. Markup intended to specify the proper layout or presentation of a text is still the most common type of markup in computer files. <p> Because computers can be used for far more than printing the text out on paper, however, markup can be used to guide processing of any type, not just printing. Markup of texts for research purposes may frequently specify not the proper font and leading for a text but (for example) its rhetorical or syntactic structure. Indeed, <emph>any</emph> aspect of a text of importance to a researcher can be signalled by markup so that software can treat it in an appropriate way. <p> In general, markup in the TEI scheme is not intended as a way of controlling any one piece of software. 
Although convenient, such markup gets in the way as soon as one wishes to use some other program to work on the text. It also makes it difficult to change the way one treats all the pieces of text of a certain type. It is easier to work flexibly with text, and easier to use many different kinds of software with the same machine-readable text, if (a) the markup in a text is clearly distinguishable from the text itself, and (b) the markup specifies not <emph>how to process</emph> the bit of text being marked (<term>procedural markup</term>) but <emph>what it is</emph> (<term>descriptive markup</term>). Given markup which describes the text itself, rather than what a particular program is to do with it, any piece of software can decide for itself how to process the text. A common method is to use a lookup table which associates the generic markup tags of the text with specific processing instructions; by analogy with similar shorthands used in publishing, such tables are often called <term>style sheets</term>. <!> <h1>What is SGML? <p> The Standard Generalized Markup Language (SGML) is a language for defining <term>markup languages</term>, i.e. sets of markup tags with rules defining when they are applicable and how they can interrelate. SGML does not itself define a markup language. It merely allows its users to define one. Using SGML, for example, one may specify that a <tag>novel</tag> must begin with <tag>front</tag> matter, followed by a <tag>body</tag> which consists of a series of <tag>chapter</tag>s. And so on. <p> The TEI encoding scheme uses SGML to define a set of markup tags, and to define how they can be used. It uses English-language documentation like this document and like document TEI P1 to define what the markup means. 
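<p> As a rough illustration of the kind of rule set SGML allows one to write, the following element declarations sketch the <tag>novel</tag> structure described above. They are only an assumed sketch for this example: the element names are those of the preceding paragraph, the content of <tag>front</tag> and <tag>chapter</tag> is simplified to plain character data, and nothing here is taken from the TEI document type definitions. <xmp> <![ CDATA [
<!-- hypothetical declarations, not part of any TEI DTD -->
<!ELEMENT novel   - - (front, body)>  <!-- front matter, then the body -->
<!ELEMENT front   - - (#PCDATA)>      <!-- simplified: character data only -->
<!ELEMENT body    - - (chapter+)>     <!-- a series of one or more chapters -->
<!ELEMENT chapter - - (#PCDATA)>      <!-- simplified: character data only -->
]]> </xmp>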
<p> There are three characteristics of SGML which distinguish it from other markup languages: it is designed for <term>descriptive</term> rather than <term>procedural</term> markup; it allows one to define distinct <term>document types</term> with distinct rules for their structures and the markup they can contain; and it is independent of any one system for representing characters. Procedural and descriptive markup have already been encountered. The notion of document types allows SGML to verify that the markup in a text actually follows the rules laid down (by the user) for that type; equally important, it allows software developers to exploit the knowledge about text structures which is embodied in the rules for different document types, and to create more intelligent software as a result. SGML's independence of specific character sets is important for its role in the interchange of documents among scholars using different types of machines. <p> SGML-based markup languages, including that of the TEI, regard text not as an undifferentiated sequence of words, much less of bytes, but as a consistently arranged hierarchy of many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages. Unlike other markup languages which share this view of text as a complex hierarchical structure, SGML and the TEI allow more than one hierarchical structure to be discerned and marked up in a single text. <p> The technical term used in the SGML standard for a textual unit, viewed as a structural component, is <term>element</term>. Different types of elements are given different names, but SGML provides no formal way of expressing the meaning of a particular type of element, other than its relationship to other element types. 
That is, all one can say about an element called (for instance) <q>blort</q> is that instances of it may (or may not) occur within elements of type <q>farble</q>, and that it may (or may not) be decomposed into elements of type <q>blortette.</q> It should be stressed that the SGML standard is entirely unconcerned with the semantics of textual elements: these are application dependent. It is up to the creators of SGML conformant tag sets (such as the TEI Guidelines) to choose intelligible names for the elements they identify and to document their proper use in text markup. <p> From the need to choose element names indicative of function comes the technical term for the name of an element type, which is <term>generic identifier</term>, or GI. <p> Within a marked up text (a <term>document instance</term>), each element must be explicitly marked or tagged in some way. The standard provides for a variety of different ways of doing this, the most commonly used being to insert a tag at the beginning of the element (a <term>start-tag</term>) and another at its end (an <term>end-tag</term>). The start- and end-tag pair are used to bracket off the element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, an embedded speech element in a text might be tagged as follows: <xmp> <![ CDATA [ ... Rosalind's remarks <speech>This is the silliest stuff that ere I heard of!</speech> clearly indicate ... ]]> </xmp> <!> As this example shows, a start-tag takes the form <tag tei=no>name</tag>, where <q><</q> is a string indicating the start of the start-tag, <q>name</q> is the generic identifier of the element which is being delimited, and <q>></q> is the string indicating the end of a tag. 
An end-tag takes the form <tag tei=no>/name</tag>, where <q></</q> is a string marking the start of an end-tag, <q>name</q> is the generic identifier of the element being closed and, as before, <q>></q> is the string indicating the end of a tag. <p> Other than start-tags and end-tags, only one type of SGML markup need concern us here: SGML <term>entity references</term>. SGML entities are a simple and flexible method of encoding and naming arbitrary strings of characters. An SGML entity has a name and a definition. When an entity is referred to in an SGML document, its name appears in the document; in the output, the SGML processor replaces the name of the entity with its definition. Entity references are thus a convenient way both of including large quantities of text in a document (for example <q>boilerplate text</q> used in several places in one or several documents) and of handling characters needed in a document but not present on the keyboard. For more information on this latter use, see the section on character sets below. <!> <h1>A Short Example <p> In this example, we first demonstrate how a passage of prose might be entered by someone aware of the need to be faithful to typographic appearances, but with little sense of the purpose of mark-up. In an ideal world, such output might be generated by a very accurate optical scanner. It attempts to be faithful to the appearance of the printed text, by retaining the original line breaks, by introducing blanks to represent the layout of the original headings and page breaks, and so forth. Where characters not available on the keyboard are needed (such as the accented letter `a' in `faa``l' or the long dash), it attempts to mimic their appearance. Such tricks are rarely portable and for analytic purposes, their use introduces needless complications. <xmp> CHAPTER 38 READER, I married him. A quiet wedding we had: he and I, the par- son and clerk, were alone present. 
When we got back from church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and John cleaning the knives, and I said -- `Mary, I have been married to Mr Rochester this morning.' The housekeeper and her husband were of that decent, phlegmatic order of people, to whom one may at any time safely communicate a remarkable piece of news without incurring the danger of having one's ears pierced by some shrill ejaculation and subsequently stunned by a torrent of wordy wonderment. Mary did look up, and she did stare at me; the ladle with which she was basting a pair of chickens roasting at the fire, did for some three minutes hang suspended in air, and for the same space of time John's knives also had rest from the polishing process; but Mary, bending again over the roast, said only -- `Have you, miss? Well, for sure!' A short time after she pursued, `I seed you go out with the master, but I didn't know you were gone to church to be wed'; and she basted away. John, when I turned to him, was grinning from ear to ear. `I telled Mary how it would be,' he said: `I knew what Mr Ed- ward' (John was an old servant, and had known his master when he was the cadet of the house, therefore he often gave him his Christian name) -- `I knew what Mr Edward would do; and I was certain he would not wait long either: and he's done right, for aught I know. I wish you joy, miss!' and he politely pulled his forelock. `Thank you, John. Mr Rochester told me to give you and Mary this.' I put into his hand a five-pound note. Without waiting to hear more, I left the kitchen. In passing the door of that sanctum some time after, I caught the words -- `She'll happen do better for him nor ony o' t' grand ladies.' And again, `If she ben't one o' th' handsomest, she's noan faa\l, and varry good-natured; and i' his een she's fair beautiful, onybody may see that.' 
I wrote to Moor House and to Cambridge immediately, to say what I had done: fully explaining also why I had thus acted. Diana and 474 JANE EYRE 475 Mary approved the step unreservedly. Diana announced that she would just give me time to get over the honeymoon, and then she would come and see me. `She had better not wait till then, Jane,' said Mr Rochester, when I read her letter to him; `if she does, she will be too late, for our honey- moon will shine our life long: its beams will only fade over your grave or mine.' How St John received the news I don't know: he never answered the letter in which I communicated it: yet six months after he wrote to me, without, however, mentioning Mr Rochester's name or allud- ing to my marriage. His letter was then calm, and though very serious, kind. He has maintained a regular, though not very frequent correspond- ence ever since: he hopes I am happy, and trusts I am not of those who live without God in the world, and only mind earthly things. </xmp> <!> <p> This transcription suffers from a number of shortcomings: <ul> <li>it was taken from an inexpensive readily available paperback edition, which means its text is reasonable but not authoritative; for the same amount of effort in transcription, a critical text could have been used which would be more useful to others <li>the page numbers and running titles are intermingled with the text in a way which makes it difficult for software to disentangle them <li>no distinction is made between single quotation marks and apostrophe, so it is difficult to know exactly what passages are in direct speech <li>the preservation of the copy text's hyphenation means that simple-minded search programs will not find the broken words <li>the accented letter in <q>faa``l</q> has been rendered by an improvised key sequence which follows no standard pattern and will be processed correctly only if the transcriber remembers to mention it in the documentation (sad experience suggests that he or she 
will quite likely forget) </ul> <p> We now present the same passage, tagged at a minimal level of detail using the tag set recommended by the Guidelines. Paragraph divisions (implied by indented lines in the first example) have been marked explicitly; apostrophes are distinguished from closing quotation marks; and the accented letter has been represented by an entity reference. The long dash, represented above by two consecutive hyphens, has also been rendered by an entity reference. Because we are interested in Brontë's text, not in the printing of one particular edition, the appearance and form of the chapter heading, running titles, etc. have not been transcribed. To make it easier to proofread and to refer to the copy text, its page divisions have been marked with an empty <tag>page.break</tag> tag. To simplify searching and processing, the lineation of the original has not been retained and words broken by typographic accident at the end of a line have been re-assembled without comment. For convenience of proofreading, a new line has been introduced at the start of each paragraph, but the indentation is removed. <xmp> <![ CDATA [ <page.break n='474'> <div1 name=chapter n='38'> <p>Reader, I married him. A quiet wedding we had: he and I, the parson and clerk, were alone present. When we got back from church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and John cleaning the knives, and I said &.dash; <p><q>Mary, I have been married to Mr Rochester this morning.</q> The housekeeper and her husband were of that decent, phlegmatic order of people, to whom one may at any time safely communicate a remarkable piece of news without incurring the danger of having one's ears pierced by some shrill ejaculation and subsequently stunned by a torrent of wordy wonderment. 
Mary did look up, and she did stare at me; the ladle with which she was basting a pair of chickens roasting at the fire, did for some three minutes hang suspended in air, and for the same space of time John's knives also had rest from the polishing process; but Mary, bending again over the roast, said only &.dash; <p><q>Have you, miss? Well, for sure!</q> <p>A short time after she pursued, <q>I seed you go out with the master, but I didn't know you were gone to church to be wed</q>; and she basted away. John, when I turned to him, was grinning from ear to ear. <q>I telled Mary how it would be,</q> he said: <q>I knew what Mr Edward</q> (John was an old servant, and had known his master when he was the cadet of the house, therefore he often gave him his Christian name) &.dash; <q>I knew what Mr Edward would do; and I was certain he would not wait long either: and he's done right, for aught I know. I wish you joy, miss!</q> and he politely pulled his forelock. <p><q>Thank you, John. Mr Rochester told me to give you and Mary this.</q> <p>I put into his hand a five-pound note. Without waiting to hear more, I left the kitchen. In passing the door of that sanctum some time after, I caught the words &.dash; <p><q>She'll happen do better for him nor ony o' t' grand ladies.</q> And again, <q>If she ben't one o' th' handsomest, she's noan fa&.agrave;l, and varry good-natured; and i' his een she's fair beautiful, onybody may see that.</q> <p>I wrote to Moor House and to Cambridge immediately, to say what I had done: fully explaining also why I had thus acted. Diana and <page.break n='475'> Mary approved the step unreservedly. Diana announced that she would just give me time to get over the honeymoon, and then she would come and see me. 
<p><q>She had better not wait till then, Jane,</q> said Mr Rochester, when I read her letter to him; <q>if she does, she will be too late, for our honeymoon will shine our life long: its beams will only fade over your grave or mine.</q> <p>How St John received the news I don't know: he never answered the letter in which I communicated it: yet six months after he wrote to me, without, however, mentioning Mr Rochester's name or alluding to my marriage. His letter was then calm, and though very serious, kind. He has maintained a regular, though not very frequent correspondence ever since: he hopes I am happy, and trusts I am not of those who live without God in the world, and only mind earthly things. ]]> </xmp> <!> <h1>What is a TEI Text? <p> What does it mean to say that a text is <q>TEI conformant</q>? A full answer to this question involves an understanding of the various contexts or environments in which electronic texts may be used. At one extreme, a text may be prepared using a particular version of a particular software package on a particular machine, for use with that software package only. Its users and preparers may never have any intention of sharing the text with others, nor of using any texts prepared elsewhere. At the other extreme, a text may be prepared on many different systems as part of a co-operative data capture exercise, for use by several different people, all with differing objectives and different software systems. Most projects fall between these two extremes, often with different priorities at different times. How does the TEI project help either of them? <p> As we suggested above, encoding a text is fundamentally a process of deciding which textual features should be distinguished by markup of some kind, and of deciding on a suitable markup for them. The TEI Guidelines may be thought of as a codification of the distinctions which have been found helpful by most people most of the time when faced with this task. 
For the most part, these are optional features of a text: clearly, no-one could be expected to make all the distinctions or to capture all the textual features listed in P1 in every text prepared, for no matter how simple a purpose. Equally clearly though, every distinction made by P1 is made because for someone that distinction is important. <!> <h2>The notion of conformance <p> Returning to the question of conformance: if the Guidelines do not require that every distinction they specify be made in encoding a text, what in fact do they require? They say, in effect, <emph>if</emph> you wish to distinguish this feature in your text, then <emph>this</emph> is the tag you should use to identify it, and (possibly) this is the way that this textual feature should be related to other textual features in the text. If for example, you wish to distinguish proper names that are embedded in your text, the Guidelines advise you to use the tag <tag>propname</tag> for the purpose: they do <emph>not</emph> propose that all proper names in a text should be marked however. <p> In the remainder of this tutorial we will be presenting the TEI <q>core set</q>---that is, the collection of tags corresponding with the textual features that we believe are of most use to most people most of the time, and which are therefore recommended by the TEI Guidelines. A TEI-conformant document will probably make all the distinctions we discuss here, but its conformance, or lack of it, has much more to do with the way in which those distinctions are made explicit in the document, than with what they are. <p> A TEI-conformant text must, as a minimum, be parseable by an SGML processor using one or other of the published TEI document type definitions (DTDs). Strict TEI-conformance additionally involves adherence to various formal rules about the way in which SGML is used in a text (see sections 1.1.2 and 2.2 of the Guidelines for a discussion). 
These rules specify, for example, that end-tags must be supplied for every element, that elements must be renamed only by means of the indirection method sketched in chapter 8 of the Guidelines, that only a specific subset of SGML features be used and so forth. For interchange purposes, TEI conformance at present implies the use of a very restricted character set. <p> It should be stressed that the purpose of restricting TEI conformance in this way is to ensure that texts can be interchanged between different machines and operating systems without loss of information. Such restrictions make no sense, and are therefore not required, when texts are to be exchanged between machines of the same kind, or when they are not exchanged at all. To underscore this, the examples in this document often illustrate simple variations on strictly TEI-conformant form; where this is so, brief notes indicate what must be done to make the data strictly TEI conformant. <!> <h2>Conformance in different environments <p> Strict conformance may be desired or required when you are sending files to someone about whose system you know no details, when you are depositing a text in a text archive, or when you are working with software which accepts only fully-conformant TEI texts. <p> In many cases, a less strict adherence to the rules of the TEI scheme may be appropriate. If you have SGML software, for example, then it is unnecessary to limit yourself, in the work you do on your own machine, to the subset of SGML features allowed in strictly TEI-conformant documents, since it is easy to use SGML software to produce a TEI-conformant version of any SGML document which uses the TEI document type declarations. You may use some other software, on the other hand, which accepts most TEI-conforming documents, but places some further restriction on the SGML features which can be accepted. In this case prudence will dictate that you restrict yourself to the SGML features your software can handle. 
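<p> As a small sketch of the difference, the same two paragraphs are shown below first as they might be keyed locally, with the paragraph end-tags omitted, and then in normalized form, with an end-tag supplied for every element as the rules for strict conformance require. The text is borrowed from the short example given earlier; that the local form can be parsed at all is an assumption, since it depends on a document type declaration which permits the <tag>/p</tag> end-tag to be omitted. <xmp> <![ CDATA [
<!-- local form: paragraph end-tags omitted -->
<p><q>Thank you, John. Mr Rochester told me to give you and Mary this.</q>
<p>I put into his hand a five-pound note.

<!-- normalized form: every element explicitly closed -->
<p><q>Thank you, John. Mr Rochester told me to give you and Mary this.</q></p>
<p>I put into his hand a five-pound note.</p>
]]> </xmp>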
<p> If you do not have SGML software, you may wish to use some markup scheme designed around the software you use most: Word Perfect or Nota Bene users might develop a set of Word Perfect styles or Nota Bene styles corresponding to the TEI tags they use most often. As long as the mark-up scheme you use makes at least the same set of distinctions as those recommended by the Guidelines, then it will be simple to translate from your local scheme to the TEI scheme, and back. <p> The construction of a sensible local scheme depends entirely on the hardware and software you are using. What makes sense for a Macintosh user who shuttles constantly between Word and Hypercard, will not necessarily be the best approach for a PC user who seldom leaves Nota Bene, and neither will necessarily be apt for someone using a VAX. We have tried to be non-partisan in our examples of the many possible shortcuts and keyboarding conventions available, but you should remember that these are only examples of techniques we have found useful---your environment will be different, and you will probably find better horses for your courses. <!> <h2>Character Sets and Conformance <p> Character set incompatibilities pose serious problems for the exchange of machine-readable texts among scholars; many common methods of exchanging texts fail for texts which contain characters other than the twenty-six basic letters of the Latin alphabet, the ten Arabic numerals, and some common punctuation marks. Accented characters, braces and brackets, and many other characters may not arrive at all, or may arrive as undecipherable nonsense. The TEI Guidelines define a <q>safe</q> set of characters for interchange using today's systems, and recommend the use of entity references for all other characters. Because the shortcomings of current systems will not (we hope!) 
be with us forever, however, adherence to these restrictions is not a necessary part of TEI-conformance, though it may be highly desirable in certain situations. <p> In your own work on your own machine, however, there is <emph>no reason</emph> not to use all the characters available in your machine's character set. When you wish to exchange texts with users of other systems, you can transform any such characters into SGML entity references, by using a simple global search and replace function for example. Section <hdref refid=chars> discusses the use of SGML entity references in more detail. <p> Just as special purpose programs may be needed to convert from the form in which it is convenient to enter text into a TEI-conformant one, so it is likely that special-purpose programs will be developed to convert a TEI-conformant text into one that can be reliably transported across networks, possibly involving some data compression as well as translation of <q>awkward</q> characters, together with similar programs to do the opposite. Such programs have yet to be written however. <!> <h2>The structure of a TEI text <p> All TEI-conformant texts contain (a) a <term>TEI header</term> and (b) the transcription of the text proper. The TEI header provides information analogous to that provided by the title page of a printed text. It contains a description of the machine-readable text, a description of the way it has been encoded, and a revision history; these are delimited by the <tag>file.description</tag>, the <tag>encoding.declarations</tag>, and the <tag>revision.history</tag> tags, respectively. In the TEI document type declarations, the text transcription is always divided into <tag>front</tag>, <tag>body</tag>, and <tag>back</tag> sections, of which the first and the last are optional. The overall structure of a typical TEI text, therefore, is this: <xmp> <![ CDATA [ <TEI.1> <TEI.Header> <file.description> <!-- ... 
Description of this machine-readable text --> </file.description> <encoding.declarations> <!-- ... Description of the encoding conventions --> </encoding.declarations> <revision.history> <!-- notes on changes to the electronic text --> </revision.history> </TEI.Header> <text> <front> <!-- front matter ... --> </front> <body> <!-- body of text ... --> </body> <back> <!-- back matter ... --> </back> </text> </TEI.1> ]]> </xmp> <p> The remainder of this Guide describes the textual features most commonly distinguished in the transcription of prose texts, and the tags used to define them. Chapter <hdref refid=teihead> describes the basic structure and the tags most commonly used in the TEI header. <!> <h1>Marking Divisions of the Text <p> The overall structure of TEI documents was described above: the <tag>text</tag> portion of the document is divided into front matter, body, and back matter. Front matter and back matter are described below (chapter <hdref refid=fronbac>). <p> The body of a text may be a series of paragraphs (marked with <tag>p</tag> ... <tag>/p</tag>), or it may be divided into chapters, sections, subsections, etc. In the latter case, the <tag>body</tag> is divided into a series of <tag>div1</tag>s, which may be further subdivided, as discussed below. <!> <h2 id=h22>The DIV1 element <p> Major structural divisions within the body of each work are indicated using the tag <tag>div1</tag> (TEI P1 5.2.4). This takes three attributes, <att>name</att>, <att>ID</att> and <att>n</att>. <!> <h3 id=h221>The NAME attribute <p> This indicates the conventional name for this category of text division. Its value will typically be `Book', 'Chapter', 'Poem', etc. Other possible values include `Group' for groups of poems etc. treated as a single unit, `Sonnet', `Speech', and `Song'. 
Note that the name attribute need be supplied only for the first <tag>div1</tag> in a <tag>body</tag> as its value is assumed to apply for all subsequent <tag>div1</tag>s within the same <tag>body</tag>. <!> <h3 id=h222>The ID attribute <p> This specifies a unique identifier for the division, which may be used for cross references and for commentary. It is probably a good idea to provide an <att>ID</att> attribute for every structural unit in a text, and to derive the ID values in some systematic way, for example by appending a section number to a short code for the title of the work in question. Thus <cit>Paradise Lost</cit> Book 10 might have the identifier PL10; `On Time', which is the sixth item in Milton's 1645 <cit>Poems</cit>, could have the identifier PO4506, and so on. <!> <h3 id=h223>The N attribute <p> This specifies a mnemonic short name for the division, which can be used to identify it in preference to the ID. If a conventional form of reference or abbreviation for the parts of a work already exists (such as the book/chapter/verse pattern of Biblical citations), the <att>N</att> attribute is the place to record it. <!> <h2 id=h23>Structural subdivisions (div2, div3 ...) <p> Where structural subdivisions smaller than a <tag>div1</tag> are necessary, the <tag>div1</tag> may be divided, after an introductory series of paragraphs, into a series of <tag>div2</tag> elements. Structural subdivisions of <tag>div2</tag> may be tagged as <tag>div3</tag>, and yet further subdivisions as <tag>div4</tag>. At this point the pre-defined hierarchy of elements ends; if more than four levels are present in a text, one will need to extend the TEI tag set (on which see chapter 8 of the Guidelines). <p> The <tag>div2</tag> and smaller elements have the same attributes as the <tag>div1</tag> element taking similar values. 
It is often good practice for the value of the ID attribute of a <tag>div2</tag> always to be the same as that of the enclosing <tag>div1</tag>, extended by a consistent number of digits. For example, the overall structure of Milton's <q>Arcades</q> might be tagged as follows: <xmp> <![ CDATA [ <div1 id=PO4516 n='Arcades' name='Group'> .... <div2 id=PO45161 n='Look Nymphs' name='Song'> ... <div2 id=PO45162 n='Stay gentle Swains' name='Speech'> ... <div2 id=PO45163 n='Ore the smooth enameld green' name='Song'> ... <div2 id=PO45164 n='Nymphs and Shepherds dance no more'> ... </div1> ]]> </xmp> A different numbering scheme may be used for <att>ID</att> and <att>N</att> attributes: this is often useful where a canonical reference scheme is used which does not tally with the structure of the work. For example, in a novel divided into volumes each containing chapters, where the chapters are numbered sequentially through the whole work, rather than within each volume, one might use a scheme such as the following: <xmp> <![ CDATA [ <div1 id=TS01 n='1' name='Volume'> <div2 id=TS011 n='1' name='Chapter'> ... <div2 id=TS012 n='2'> ... </div1> <div1 id=TS02 n='2' name='Volume'> <div2 id=TS021 n='3' name='Chapter'> ... <div2 id=TS022 n='4'> ... </div1> ]]> </xmp> Here the work has two volumes, each containing two chapters. The chapters are numbered conventionally 1 to 4, but the id values specified allow them to be regarded additionally as if they were numbered 1.1, 1.2, 2.1, 2.2. <!> <h2 id=h24>Paragraphs <p> In prose, paragraphs should be tagged with the <tag>p</tag> tag. If you need to refer to other segments of text not defined as structural units, the Guidelines provide a general purpose segmentation tag <tag>s</tag> (P1 5.3.1).
For verse, P1 recommends the tagging of individual verse lines and, where these are grouped into regular stanzaic patterns, suggests tags such as stanza or couplet (P1 7.3.1), which can be made to bear attributes for metrical or other associated information, though these are not implemented as yet in the published DTDs. <!> <h2 id=h25>Headings and Closings <p> Every <tag>div1</tag>, <tag>div2</tag>, etc. may have a title or heading at its start: this is marked with the <tag>head</tag> tag. Where this title is completely regular (for example 'Chapter 1') or has been used as the value of the <att>N</att> attribute, it may be omitted; where it contains text it should always be included. For example, the start of Hardy's <cit>Under the Greenwood Tree</cit> might be encoded as follows: <xmp> <![CDATA[ <div1 id=UGT1 n='Winter' name='Part'> <div2 id=UGT11 n='1' name='Chapter'> <head>Mellstock-Lane</head> <p>To dwellers in a wood almost every species of tree ... ]]> </xmp> <p> In relatively rare cases, the text of a section will be followed by a closing such as <q>End of Chapter 1</q>; where thought desirable, this may be transcribed as a <tag>trailer</tag>. <!> <h1>Page and Line Numbers <p> The Guidelines provide two methods of marking page and line numbers. The hierarchy of volume, page, (column,) and line can be neatly expressed with a concurrent markup stream separate from the main markup hierarchy (see P1 section 5.6); for data entry purposes, however, the simpler scheme we describe here may be more convenient. After data entry, this markup can be transformed mechanically into that required for a concurrent markup hierarchy, if that is supported by the software in use. <p> Page breaks, column breaks, and line breaks may be marked with the following tags. All are empty elements: they mark a single point in the text, not a span of text, and they take no end-tags. 
On all of them, use <att>n</att> to supply the number of the page, column, or line beginning at the tag, or give only the first number and omit the rest, which can be calculated automatically. If pagination etc. are marked for more than one edition, specify the edition in question using the <att>ed</att> attribute. <gl> <gt><tag>page.break</tag> <gd>marks the beginning of a new page; for early printed books, the <att>sig</att> and <att>catchword</att> attributes can be used to give the signature number and catchword printed on this page, if of interest <gt><tag>line.break</tag> <gd>marks the beginning of a new line <gt><tag>col.break</tag> <gd>marks the beginning of a new column on multi-column pages; omit for single-column pages <gt><tag>milestone</tag> <gd>marks any break-point, using the <att>unit</att> attribute to specify what type of break is tagged (page, column, line, book, poem, canto, etc.) </gl> <p> It is very useful to record the pagination of the copy text, since it simplifies later verification of the text. Retention of line breaks is useful for the same reason, though it is sometimes felt to be less crucial. The <tag>milestone</tag> tag may be used to replace all the others, or the others may be used as a set; they should not be mixed arbitrarily. <!> <h1>Marking Highlighted Phrases <h2>Changes of Typeface etc. <p> Highlighted words or phrases are those made visibly different from the rest of the text, typically by a change of type font, handwriting style, or ink color. They are marked in some way in order to draw the reader's attention to them. <p> It is recommended that highlighted text be tagged with the underlying feature signaled by the highlighting.
The following tags may be used to mark features often realized with highlighting: <gl> <gt><tag>emph</tag> <gd>marks emphatic (stressed) words <gt><tag>foreign</tag> <gd>marks words in a foreign language <gt><tag>cited.word</tag> <gd>marks words mentioned, not used <gt><tag>term</tag> <gd>marks words highlighted as technical terms <gt><tag>title</tag> <gd>marks titles of books and journals </gl> <p> It is not always possible and may not be considered desirable to interpret the changes of rendering of a text in this way. In such cases, the tag <tag>highlighted</tag> may be used to mark highlighted text without making any claim as to its status. <p> The global <att>rendition</att> attribute should be used wherever necessary to specify details of how the highlighting is realized. For example, an emphasized phrase rendered in bold might be tagged <tag>emph rendition='Bold'</tag>, and one in italic <tag>emph rendition='Italic'</tag>. <p> Some features (notably quotations and glosses) may be found in a text either marked by highlighting, or with quotation marks. In either case, the tags <tag>q</tag> and <tag>gloss</tag> (as discussed in the following section) should be used. If the rendition is to be recorded, use the <att>rendition</att> attribute. <p> As an example of the tags defined here, consider the following sentence: <!-- Theodore M. Andersson, Preface to the Nibelungenlied --> <!-- Theodore M. Andersson, Preface to the Nibelungenlied, p. 3 --> <lq> On the one hand the <cit>Nibelungenlied</cit> is associated with the new rise of romance of twelfth-century France, the <ital>romans d'antiquite''</ital>, the romances of Chre''tien de Troyes, and the German adaptations of these works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram von Eschenbach. 
</lq> Interpreting the role of the highlighting, the sentence might look like this: <!> <xmp> <![CDATA[ On the one hand the <title>Nibelungenlied</title> is associated with the new rise of romance of twelfth-century France, the <foreign>romans d'antiquit&eacute;</foreign>, the romances of Chr&eacute;tien de Troyes, ... ]]> </xmp> <p> Describing only the appearance of the original, it might look like this: <xmp> <![CDATA[ On the one hand the <highlighted rendition=italic>Nibelungenlied</highlighted> is associated with the new rise of romance of twelfth-century France, the <highlighted rendition=italic>romans d'antiquit&eacute;</highlighted>, the romances of Chr&eacute;tien de Troyes, ... ]]> </xmp>

Quotations and Related Features

Like changes of typeface, quotation marks are conventionally used to denote several different features within a text. When possible, we recommend that the underlying feature be tagged, rather than the simple fact that quotation marks appear in the text.

The most common and important use of quotation marks is to mark quotations. A quotation is a piece of text attributed by the author or narrator to another. The tag q should be used for a quotation, no matter how it appears in the text. If it is desired to record whether the quotation was printed in-line or set off as a display or block quotation, the rendition attribute should be used.

Quotations embedded within quotations are treated in the same way as ordinary quotations. Interruptions of the quotation by a narrator may be tagged with the tag in.quot.

Quotations may be accompanied by a reference to the source or speaker, using the sp attribute, whether or not the source is given in the text.

Examples:

<![CDATA[
Few dictionary makers are likely to forget Dr. Johnson's description of the lexicographer as <q>a harmless drudge.</q>

<p> <q>Who-e debel you?<in.quot>--he at last said--</in.quot>you no speak-e, damme, I kill-e.</q> And so saying, the lighted tomahawk began flourishing about me in the dark.

<q sp=Wilson>Spaulding, he came down into the office just this day eight weeks with this very paper in his hand, and he says:---<q sp=Spaulding>I wish to the Lord, Mr. Wilson, that I was a red-headed man.</q></q>
]]>

In the second example, the phrase `he at last said' interrupts the direct quotation; in the third, the speaker (Wilson) quotes another speaker (Spaulding), and so we have one q embedded within another.

The creator of the electronic text must decide whether the quotation marks are replaced by the tags or whether the tags are added and the quotation marks kept. If the quotation marks are removed from the text, the rendition attribute may be used to record the way in which they were rendered in the copy text. This attribute is optional; its use or non-use, and the retention or elimination of explicit quotation marks in the content of the file, should be described in the encoding.declarations of the TEI header.

The rendition attribute may be used to supply further details of the way the quotation is rendered, for example the style of quotation marks employed, whether the quoted material is in-line or displayed etc. Specific recommendations are given for these in P1.

Among other features often indicated by quotation marks are the following:

  cited.word   marks words mentioned, not used (e.g. `blort' is a useful neologism)
  so.called    marks words used in a special or ironic sense (e.g. a `just' war)
  title.piece  use for analytic titles, i.e. titles for items such as poems, articles or chapters which are published as part of a larger titled whole (e.g. Milton's sonnet `On his blindness')

Like q, these can all carry the rendition attribute.

As with highlighting, it is not always possible and may not be considered desirable to interpret the function of quotation marks in a text in this way. In such cases, the tag q.mark may be used to mark quoted text without making any claim as to its status. Like the other tags just discussed, it may have a rendition attribute.
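
As a brief sketch of these tags, the illustrative phrases given in the list above might be encoded as follows. The surrounding sentences, and the whole of the q.mark example, are invented for illustration; the quotation marks of the copy text are assumed to have been replaced by the tags.

<![CDATA[
<cited.word>Blort</cited.word> is a useful neologism.
They described it as a <so.called>just</so.called> war.
Milton's sonnet <title.piece>On his blindness</title.piece> ...
The sign offered <q.mark>fresh</q.mark> fish.
]]>

In the last line the encoder records only that the word appeared in quotation marks, without deciding whether it is a quotation, a so-called usage, or something else.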

Foreign words or expressions

Words or phrases which are not in the main language of the text may be tagged as such, at least where the fact is signaled in the text. Where possible, the language shift should be indicated by attaching the attribute lang to an existing element. Where there is no applicable element, the tag foreign may be inserted, again using the lang attribute to indicate the language of the foreign words.

Example:

<![CDATA[
John has real <foreign lang=FR>savoir-faire</foreign>.
Have you read <title lang=GE>Die Dreigroschenoper</title>?
<cited.word lang=FR>Savoir-faire</cited.word> is French for knowhow.
]]>

As the last example shows, the foreign tag should not be used to tag foreign words which are mentioned, rather than used in the text.

Notes

All notes, footnotes, endnotes, marginalia, etc. should be marked using the same tag, note. Where possible, the body of a note should be inserted in the text, at the point at which its identifier or mark first appears. This may not be possible, for example, with marginalia, which may not be anchored to an exact location. In such cases it may be adequate to position marginal notes before the relevant paragraph or other element.

The n attribute may be used to supply the number or identifier of a note if this is required.

The following attributes are available to supply further information about notes:

  type      describes the type of note. Possible values include `annotation', `gloss', `explanation', `preliminary', `temporary' etc.
  source    identifies the author of the note, if different from the author of the text. Suggested values include `ed[itor]', `comp[iler]', `transcriber'. Other values may be used as needed.
  place     specifies where the note appears in the copy text. Suggested values include `foot', `end', `inline', `left' (for notes in left margin), `right' (for notes in right margin). Other values may be used as needed, e.g. `app1', `app2' to distinguish between annotations in separate apparatus.
  anchored  indicates whether the copy text shows the exact place of reference for the note (anchored=yes) or not (anchored=no)

Examples:

<![CDATA[
Collections are ensembles of distinct entities or objects of any sort.
<note place=foot n=1>
We explain below why we use the uncommon term <cited.word rendition='6u 9u'>collection</cited.word> instead of the expected <cited.word rendition='6u 9u'>set</cited.word>. Our usage corresponds to the <cited.word>aggregate</cited.word> of many mathematical writings and to the sense of <cited.word>class</cited.word> found in older logical writings.
</note>
The elements ...

<stanza id=RAM609>
<note place=margin>The curse is finally expiated</note>
<l>And now this spell was snapt: once more
<l>I viewed the ocean green,
<l>And looked far forth, yet little saw
<l>Of what had else been seen &dash;
]]>
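
The type, source, and anchored attributes are not illustrated above. The following is a hedged sketch of how they might be combined on an editorial endnote. The first sentence is taken from the example above; the text of the note is invented, and the attribute values are chosen from those suggested in the list of attributes.

<![CDATA[
Collections are ensembles of distinct entities or objects of any sort.
<note place=end n=2 type=explanation source=ed anchored=yes>
The term is discussed more fully later in the text.
</note>
]]>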

Text Features Not Signalled Typographically

P1 proposes tags for a number of features which have been found to be useful in a variety of analytic contexts, even though they are not generally signalled by any consistent or widely followed typographic convention. These include editorial interventions of a simple kind, dates, names and numbers.

Editorial Interventions

The following five tags are provided for simple editorial interventions. For full textual critical analysis, more detailed tags are described in the Guidelines (but not in this tutorial).

  sic   encloses a passage believed to be erroneous
  corr  encloses a passage corrected from the original
  norm  encloses a passage normalized from the original
  add   encloses a passage added to the original
  del   encloses a passage deleted from the original

Each of the above tags can take one or more of the following attributes:

  ed    specifies who is responsible for the intervention
  sic   gives the state of the text before the intervention
  corr  gives the state of the text after the intervention

For example, the reading

  ... for his nose was as sharp as a pen and a' table of green feelds

might be rendered as

<![CDATA[
... for his nose was as sharp as a pen and <norm sic="a'">he</norm> <corr sic='table' ed=Gifford>babbl'd</corr> of green <norm sic='feelds'>fields</norm>
]]>
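
The same passage could instead be encoded so that the file keeps the reading of the copy text and carries the proposed correction in an attribute. This is a sketch only: it assumes that sic accepts the corr and ed attributes listed above in the way shown, and it leaves the unnormalized spellings untagged.

<![CDATA[
... for his nose was as sharp as a pen and a' <sic corr="babbl'd" ed=Gifford>table</sic> of green feelds
]]>

Which of the two forms is preferable depends on whether the original or the corrected text is to be treated as the primary reading; the choice should be recorded in the encoding declarations of the TEI header.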

Names, dates, numbers, and abbreviations

Some textual features have special interest for particular kinds of research; we mention some important examples here, though the tags described in this section are strictly optional and thus not part of the basic core of TEI tags. The markup described here is of particular interest to computational linguists, because names, dates, numbers, and abbreviations often pose special problems for natural language processing; if these features are explicitly marked, they can be recognized and handled specially. Historians will also find the tagging of dates and proper nouns useful in many contexts, though the simple tags given here will doubtless suffice only for simple cases.

The tags in question, and their attributes (all optional, except as noted), are:

  propname    proper nouns of any type, with attributes as follows
      type        (recommended) suggested values are `person', `place', `institution', `product', `acronym'; other values may be used as needed
      referent    supplies identification for the person or thing named, using some identifier scheme, e.g. a key to a database record
      normalized  gives a normalized form of the name (e.g. for onomastic study)

  date        single dates in any form; attributes (all optional) are
      type        calendar in which the date is expressed; suggested values are `Gregorian', `Julian', `Roman', `Mosaic', `Revolutionary', `Islamic'; other values may be used as needed
      value       the date expressed in the form `yyyy-mm-dd'; for partial dates value may be truncated
      certainty   degree of certainty for the date; possible values are `approx', `circa', `before', `after', etc.

  date.range  phrases or expressions denoting ranges of dates; attributes (all optional) are
      type        same as for date tag
      from        the beginning of the date range, expressed in the form `yyyy-mm-dd'; for partial dates truncate value
      to          the end of the date range, expressed in the form `yyyy-mm-dd'; for partial dates truncate value
      from.cert   degree of certainty for the start date; values same as for certainty of date tag
      to.cert     degree of certainty for the end date; values same as for from.cert attribute

  num         numbers in any form; attributes are
      type        suggested values are `cardinal' (e.g. 21), `ordinal' (e.g. 21st), `fraction' (e.g. 1/2), `percentage' (e.g. 12.5%); other values may be used as needed
      value       value of the number in an application-dependent standard form; the form used should be described in the TEI header's encoding declarations section

  abbrev      abbreviations of any type; attributes are
      full        expanded form of the abbreviation
      type        possible values include `title', `initials', `degree', `acronym', etc.

An example (with perhaps more tags than most scholars would find useful in this particular text):

<![ CDATA [ <propname type='place'>Montaillou</propname> is not a large parish. At the time of the events which led to <propname type='person' referent='Benedict XII, Pope of Avignon (Jacques Fournier)'>Fournier's</propname> investigations, the local population consisted of between <num type='cardinal'>200</num> and <num type='cardinal'>250</num> inhabitants. At <date.range type='Gregorian' from='1380' from.cert='approx' to='1400' to.cert='before'>the end of the fourteenth century</date.range>, after the Black Death and the first effects, direct or indirect, of the English wars, the hearth rolls and census books of the <propname type='place'>Comt&eacute; de Foix</propname> show the same community as consisting of no more than about <num type='cardinal' value='100'>one hundred</num> souls, divided up into <num type='cardinal' value='23'>twenty-three</num> hearths or households. This was the usual drop in population (over <num type=fraction value='0.5'>a half</num>) recorded almost everywhere in southern France after the catastrophes of <date.range from='1348' from.cert=approx to=1400 to.cert='?'>the second part of the fourteenth century</date.range>. ]]>

It should be noted once more that these tags are optional, not required by the TEI.
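
The example above does not use the date or abbrev tags. As a minimal sketch, two phrases quoted elsewhere in this document might be tagged as follows; the attribute values are our own choices, following the descriptions in the list above. Note that the value of the first date is truncated to the year, since nothing more is known.

<![CDATA[
He was a member of Parliament for Warwickshire in <date value='1445'>1445</date>, and died <date value='1470-03-14'>March 14, 1470</date> ...

... <abbrev type='title' full='Doctor'>Dr.</abbrev> <propname type='person'>Johnson</propname>'s description of the lexicographer ...
]]>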

Lists

The tag list is used to mark any kind of list. A list is a sequence of text items, which may be ordered or unordered; the type attribute indicates which (type=ordered in the examples below). Other values of type may be used if needed.

Individual list items are tagged with item. The first item may optionally be preceded by a head, which gives a heading for the list. The numbering of a list may be omitted (if reconstructible), indicated using the n attribute on each item, or (rarely) tagged as content using the enum tag. For example:

<![ CDATA [
<list type=ordered>
<item>First item in list.</item>
<item>Second item in list.</item>
<item>Third item in list.</item>
</list>

<list type=ordered>
<item n=1>First item in list.</item>
<item n=2>Second item in list.</item>
<item n=3>Third item in list.</item>
</list>
]]>

The two styles should not be mixed in the same list.
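
A list may also carry a heading. The sketch below lists the three sponsoring organizations named at the start of this document; the heading text is our own, and the value `unordered' for the type attribute is an assumption based on the description above rather than a value confirmed here.

<![ CDATA [
<list type=unordered>
<head>Sponsors of the TEI</head>
<item>Association for Computers and the Humanities</item>
<item>Association for Computational Linguistics</item>
<item>Association for Literary and Linguistic Computing</item>
</list>
]]>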

Some lists have internal structure. In complex cases, where list items contain many components, the list is better treated as a table, on which see the Guidelines. Alternatively, a simple two-column table may be treated as a glossary list, marked by the tag list.gl. Here, each item comprises a term and a gloss, marked with gl.term and gl.gloss respectively. These correspond to the tags term and gloss, which can occur anywhere in prose text. The two columns may be headed with head.term and head.gloss elements, as below:

<![ CDATA [
<list.gl>
<head>Vocabulary</head>
<head.term>Middle English</head.term>
<head.gloss>New English</head.gloss>
<gl.term>nu <gl.gloss>now
<gl.term>lhude <gl.gloss>loudly
<gl.term>bloweth <gl.gloss>blooms
<gl.term>med <gl.gloss>meadow
<gl.term>wude <gl.gloss>wood
<gl.term>awe <gl.gloss>ewe
<gl.term>lhouth <gl.gloss>lows
<gl.term>sterteth <gl.gloss>bounds, frisks
<gl.term>verteth <gl.gloss><foreign lang=Latin>pedit</foreign>
<gl.term>murie <gl.gloss>merrily
<gl.term>swik <gl.gloss>cease
<gl.term>naver <gl.gloss>never
</list.gl>
]]>

To convert this glossary to TEI-conformant interchange format, explicit end-tags would be provided for all elements.
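
For instance, the first three entries of the glossary above, with every end-tag supplied, might look like this (a sketch only; the remaining entries would be treated in the same way):

<![ CDATA [
<list.gl>
<head>Vocabulary</head>
<head.term>Middle English</head.term>
<head.gloss>New English</head.gloss>
<gl.term>nu</gl.term> <gl.gloss>now</gl.gloss>
<gl.term>lhude</gl.term> <gl.gloss>loudly</gl.gloss>
<gl.term>bloweth</gl.term> <gl.gloss>blooms</gl.gloss>
</list.gl>
]]>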

Bibliographic Citations

Bibliographic citations may be wholly absent from many works transcribed for research; where they occur, however, it is very useful to mark them for special handling. No full treatment of bibliographic references can or need be given here; the need is to mark the reference provided by the text, not to correct it.

The following tags are provided for use in transcribing bibliographic references; in most cases, only the titles of articles, books, and journals need be marked (since they will require special handling---font shifts---to print correctly). The other tags are provided for cases where particular interest attaches to such details, and for use in structured citations (on which see below). Their use is strictly optional.

  author       the author or responsible institution; may contain one or several names of persons or institutions; tag is optional
  editor       the editor, compiler, or other person secondarily responsible for the work; use role attribute to distinguish editors, compilers, translators, etc.; tag is optional
  title        title of the book or journal
  title.piece  title of an article, chapter, poem, or other unit appearing within a book or journal (in library terms, the analytic title)
  series       series name and volume number; tag is optional
  imprint      imprint information (optional), including (all optional):
      publ.city    city or town of publication
      publisher    name of publisher or other agency
      publ.date    date of publication
  citn.detail  page numbers, section, etc.; optional
  comment      any additional comment, e.g. an abstract or annotation; optional

For example, the following editorial note might be transcribed as shown below:

  He was a member of Parliament for Warwickshire in 1445, and died March 14, 1470 (according to Kittredge, Harvard Studies 5. 88ff).

<![ CDATA [ He was a member of Parliament for Warwickshire in 1445, and died March 14, 1470 (according to <citn><author>Kittredge</author>, <title>Harvard Studies</title> 5. 88ff</citn>). ]]>

For the systematic encoding of bibliographies, P1 section 5.5 provides another tag, citn.struct, which requires its contents to be a bibliographic citation well formed according to the rules of the International Standard for Bibliographic Description (ISBD): the author, title, editor, series, imprint, and other information must all be tagged and provided in the order prescribed by ISBD.
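
As an illustration, the citation for the Guidelines given at the start of this document might be encoded with citn.struct roughly as follows. This is a sketch only: it uses the tags listed above in the order in which the preceding paragraph names them, and the order and content actually required should be checked against P1 section 5.5.

<![ CDATA [
<citn.struct>
<title>Guidelines for the Encoding and Interchange of Machine-Readable Texts</title>
<editor>C. M. Sperberg-McQueen and Lou Burnard</editor>
<imprint>
<publ.city>Chicago, Oxford</publ.city>
<publisher>Text Encoding Initiative</publisher>
<publ.date>1990</publ.date>
</imprint>
</citn.struct>
]]>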

Front and Back Matter

Front Matter

For many purposes, particularly in older texts, the preliminary material such as title pages, prefatory epistles etc. may provide very useful additional linguistic or social information. P1 provides a set of recommendations for distinguishing the textual elements most commonly encountered in front matter, which are summarized here.

Title page

The start of a title page should be marked with the tag title.page. All text contained on the page should be transcribed and tagged with the appropriate tag from the following list:

  doc.title    Title of the work as given on the title page.
  doc.author   Author's name as given on the title page (note that this may be embedded within the doc.title)
  doc.imprint  Imprint as given on the title page
  doc.date     Date as given on title page
  epigraph     Quotation etc. from other work on title page. If in verse, this should use the l tag to mark metrical lines. If a citation is given, identifying the source of the epigraph, this should be tagged using the citn tag.
  title.part   Any other distinct part of the title page

Typeface distinctions should be marked with the rendition attribute when necessary, as described above. Very detailed description of the letter spacing and sizing used in ornamental titles is not as yet provided for by the Guidelines. Changes of language should be marked by appropriate use of the lang attribute or the foreign tag, as necessary. Names, wherever they appear, should be tagged using the propname tag, as elsewhere.

Two example title pages follow:

<![CDATA[
<title.page rendition=Roman><doc.title> PARADISE REGAIN'D. A POEM In IV <highlighted>BOOKS</highlighted>. To which is added <title>SAMSON AGONISTES</title>. </doc.title>
<doc.author>The Author <propname>JOHN MILTON</propname></doc.author>
<doc.imprint><propname>LONDON</propname>, Printed by <propname>J.M.</propname> for <propname>John Starkey</propname> at the <propname>Mitre</propname> in <propname>Fleetstreet</propname>, near <propname>Temple-Bar.</propname></doc.imprint>
<doc.date>MDCLXXI</doc.date>
</title.page>

<title.page>
<doc.title>Lives of the Queens of England, from the Norman Conquest; with anecdotes of their courts.</doc.title>
<title.part>Now first published from Official Records and other authentic documents private as well as public.</title.part>
<title.part>New edition, with corrections and additions</title.part>
<doc.author>By Agnes Strickland</doc.author>
<epigraph>The treasures of antiquity laid up in old historic rolls, I opened. <citn>BEAUMONT</citn></epigraph>
<doc.imprint>Philadelphia: Blanchard and Lea</doc.imprint><doc.date>1860.</doc.date>
</title.page>
]]>

Prefatory matter

P1 proposes that at least the following varieties of prefatory matter should be distinguished within the front element:

  foreword    a text addressed to the reader, by the author, editor or publisher, possibly in the form of a letter.
  dedication  a text (often a letter) addressed to someone other than the reader in which the author typically commends the work in hand to the attention of the person concerned.
  abstract    a prose argument summarizing the content of the work
  front.part  any other distinct unit within the front matter
  list.cast   a list of names, of persons represented in a drama, or of actors who played them, or of the two combined (P1 7.3.2.4). Each name should be tagged role or actor as appropriate.

Each of the above major structural units may contain low level structural or non-structural tags as described elsewhere. They will generally begin with a heading or title of some kind which should be tagged using the head tag. Epistles will contain the following additional elements:

  salute     A formulaic salutation at the start of an epistle, for example My Lord, or Sir. This should be distinguished from a longer description of the addressee which occasionally prefixes it, and which is best regarded as part of the head
  signature  A formulaic salutation at the end of an epistle, visually distinct and usually containing a name

Epistles which appear elsewhere in a text will, of course, contain these same elements.

As an example, the dedication at the start of Milton's Comus should be marked up as follows:

<![CDATA[ <dedication> <head>To the Right Honourable <propname>JOHN Lord Viscount BRACLY</propname>, Son and Heir apparent to the Earl of {Bridgewater}, &amp;c.</head> <salute>MY LORD,</salute> <p>THis <highlighted>Poem</highlighted>, which receiv'd its first occasion of Birth from your Self, and others of your Noble Family .... and as in this representation your attendant <propname>Thyrsis</propname>, so now in all reall expression <signature>Your faithfull, and most humble servant <propname>H. LAWES.</propname> </signature></dedication> ]]>

Back Matter

Because of variations in publishing practice, back matter can contain virtually any of the elements listed above for front matter, and the same tags should be used where this is so. Additionally, back matter may contain the following types of matter within the back element. Like the structural divisions of the body, all have an optional leading head and an optional closing trailer. Unlike the structural divisions of the body, these elements have no internal structure: all are defined as sequences of paragraphs or paragraph-level elements like notes or lists.

  appendix      an appendix, with optional leading head, a body formed of a series of paragraphs, and an optional closing tagged with trailer
  glossary      a list of words and definitions, typically in the form of a glossary list, headed by a head and possibly preceded or followed by some paragraphs of text
  notes         a series of notes, optionally preceded by a head
  bibliography  a series of bibliographic references, typically in the form of a special bibliographic-list element list.citn, whose items are individual citn or citn.struct elements
  index         a set of index entries, possibly represented as a structured list or glossary list, with optional leading head and perhaps some paragraphs of introductory or closing text
  colophon      a description at the back of the book describing where, when, and by whom it was printed; in modern books it also often gives production details and identifies the type faces used; by definition the colophon occurs within the back matter---if the same information is given in the front matter, it must be tagged as front.part
  back.part     any other distinct unit within the back matter
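
The overall shape of a back element containing some of these units might be sketched as follows. The headings and comments are placeholders, in the manner of the document skeleton shown earlier; they are not drawn from any particular text, and the order of the units is simply that of the list above.

<![CDATA[
<back>
<appendix>
<head>Appendix</head>
<p> <!-- paragraphs of the appendix ... --> </p>
<trailer>End of Appendix</trailer>
</appendix>
<glossary>
<head>Glossary</head>
<list.gl> <!-- gl.term and gl.gloss pairs ... --> </list.gl>
</glossary>
<bibliography>
<head>Works Cited</head>
<list.citn> <!-- citn or citn.struct elements ... --> </list.citn>
</bibliography>
<colophon>
<p> <!-- where, when, and by whom the book was printed ... --> </p>
</colophon>
</back>
]]>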

Character Sets, Diacritics, etc.

For those working with standard forms of the European languages, the TEI recommendations for character set use are simple. For local use, use whatever character set is supported by your machine and your software. If your software makes direct keyboard entry of special characters difficult, use whatever keyboarding conventions best suit your typing skill (e.g. a'' for á, a`` for à, aE for ä, a^ for â, etc.) and use global search and replace functions to turn these keyboard shorthands into the proper characters. If you work with non-Latin scripts and there is a standard transliteration scheme in your field (e.g. for ancient Greek the beta code of the Thesaurus Linguae Graecae), use it. Any transliteration used should be reversible (this rules out a surprising number of schemes commonly used in normal writing) and will be most usable if it requires no special ligatures, ties, or diacritics (this rules out a surprising number of the remainder).

For interchange of files among systems, use SGML entity references to replace all characters not in the following list of characters which almost always survive electronic interchange intact:

  a b c d e f g h i j k l m n o p q r s t u v w x y z
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  0 1 2 3 4 5 6 7 8 9
  " % & ' ( ) * + , - . / : ; < = > ? _ (space)

Perhaps surprisingly, this list excludes the following characters, which often do not survive transfer across national boundaries or over standard wide-area networks (if you're just going from your Mac to your PC, these characters will probably be safe):

  ! # $ [ \ ] ^ ` { } | ~

(The seventh letter in this set was typed as a circumflex, not as a logical-not symbol, before the file was moved to another system for printing. See what we mean?)

Less surprisingly, entity references must be used for all accented and extended-Latin characters, all non-Latin characters, and all symbols not on conventional computer keyboards.

You may use your own SGML entity names in TEI-conformant files, if you wish and if you provide standard SGML entity declarations for them, but the standard names (though long-winded) have the advantage of clarity; the characters intended are reasonably clear to any speaker of English who recognizes that a character is being named, often even without recourse to any list. This is not true of many other schemes for representing accented characters.

The entity names needed for the characters listed above as unsafe and for the accented characters of some major Western European languages are given below. (For a complete list of public entity sets and their contents, see the formal SGML standard, or the index by Joan Smith and Robert Stutely, or Charles Goldfarb's book on SGML.)

Unsafe characters:

    ! = &excl;    # = &num;      $ = &dollar;   [ = &lsqb;    \ = &bsol;
    ] = &rsqb;    ^ = &circ;     ` = &lsquo;    ` = &grave;   { = &lcub;
    } = &rcub;    | = &verbar;   ~ = &tilde;

Digraphs:

    Uppercase A-E ligature (Æ) = &AElig;    lowercase a-e ligature (æ) = &aelig;
    Uppercase O-E ligature (Œ) = &OElig;    lowercase o-e ligature (œ) = &oelig;

Umlaut and trema:

    Ä = &Auml;    Ë = &Euml;    Ï = &Iuml;    Ö = &Ouml;    Ü = &Uuml;
    ä = &auml;    ë = &euml;    ï = &iuml;    ö = &ouml;    ü = &uuml;

Acute:

    Á = &Aacute;  É = &Eacute;  Í = &Iacute;  Ó = &Oacute;  Ú = &Uacute;
    á = &aacute;  é = &eacute;  í = &iacute;  ó = &oacute;  ú = &uacute;

Grave:

    À = &Agrave;  È = &Egrave;  Ì = &Igrave;  Ò = &Ograve;  Ù = &Ugrave;
    à = &agrave;  è = &egrave;  ì = &igrave;  ò = &ograve;  ù = &ugrave;

Circumflex:

    Â = &Acirc;   Ê = &Ecirc;   Î = &Icirc;   Ô = &Ocirc;   Û = &Ucirc;
    â = &acirc;   ê = &ecirc;   î = &icirc;   ô = &ocirc;   û = &ucirc;

Tildes:

    Ã = &Atilde;  Ẽ = &Etilde;  Õ = &Otilde;
    ã = &atilde;  ẽ = &etilde;  õ = &otilde;

Consonants:

    C with cedilla: Ç = &Ccedil; and ç = &ccedil;
    Ñ = &Ntilde;  ñ = &ntilde;
    German `sharp s' or `ess-zett' (ß) = &szlig;
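The substitution of entity references for unsafe and accented characters is mechanical, and so can be left to a program. What follows is only a minimal sketch in Python, not part of the TEI recommendations: the name `to_entities`, the choice to raise an error on unmapped characters, and the treatment of line ends as safe are all assumptions made for illustration, and the mapping covers only part of the table above.

```python
import string

# Characters listed above as almost always surviving interchange intact.
SAFE = set(string.ascii_letters + string.digits + "\"%&'()*+,-./:;<=>?_ ")

# Partial mapping copied from the table above; extend it as needed.
# The table gives two names for `; "grave" is chosen here (assumption).
ENTITIES = {
    "!": "excl", "#": "num", "$": "dollar", "[": "lsqb", "\\": "bsol",
    "]": "rsqb", "^": "circ", "`": "grave", "{": "lcub", "}": "rcub",
    "|": "verbar", "~": "tilde",
    "Æ": "AElig", "æ": "aelig", "ä": "auml", "ö": "ouml", "ü": "uuml",
    "é": "eacute", "è": "egrave", "ê": "ecirc", "ñ": "ntilde", "ß": "szlig",
}


def to_entities(text):
    """Replace every character outside the safe list by an entity reference."""
    out = []
    for ch in text:
        if ch in SAFE or ch == "\n":  # line ends treated as safe (assumption)
            out.append(ch)
        elif ch in ENTITIES:
            out.append("&" + ENTITIES[ch] + ";")
        else:
            # Hypothetical policy: refuse to guess a name for unknown characters.
            raise ValueError("no entity name known for %r" % ch)
    return "".join(out)
```

For example, `to_entities("café [1]")` gives `caf&eacute; &lsqb;1&rsqb;`. A literal & in the input is passed through unchanged, since it appears in the safe list; whether it needs further treatment depends on the context in which it occurs.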

The electronic title page

One of the few non-optional recommendations in the Guidelines is the provision of a tei.header as a means of documenting the electronic text. Every distinct work must be prefixed by a header, delimited by the tei.header tag (P1 section 4), and divided into the following three parts:

  • a file.description, identifying both the electronic text itself and its source text;
  • an encoding.declarations specifying the encoding scheme used;
  • a revision.history specifying the version of the file and its history.
The discussion below describes only the most generally applicable such tags; others may be added at a later stage, and the existing tags modified for consistency.

The file description

An example file description, using some local variants of the standard element names, follows:

    <file.description>
      <title.statement>
        <title>John Milton: Paradise Lost (1667): a machine readable transcript.
        <responsible><role>funding<name>The Ohio State University
        <responsible><role>capture<name>Roy C Flannagan
        <responsible><role>encoding<name>Thomas N. Corns
        <responsible><role>validation<name>Antony Dogsbody
      <publication.statement>
        <p>Created 1990; this version not for public distribution.
      <source.description>
        <title>Paradise Lost (1667): A Scolar press facsimile
        <publisher>Scolar Press Ltd, Menston, England
        <publ.date>1968
    </file.description>

As shown above, the file description must contain a title.statement element, a publication.statement element, and a source.description element, in that order. The title.statement should contain a title and one or more responsibles, in that order. The latter correspond to the TEI standard statement.of.responsibility.
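The ordering rule just stated can be checked roughly without a full SGML parser. The following Python sketch is an illustration only: the name `check_file_description` is hypothetical, and it assumes that start-tags are written in lower case and without attributes, as in the example above.

```python
import re

# The three parts required in a file description, in the required order.
REQUIRED = ["title.statement", "publication.statement", "source.description"]


def check_file_description(sgml):
    """Return True if the three required start-tags occur once each, in order."""
    # Collect the names of all attribute-free start-tags; end-tags are skipped
    # because they begin with "</".
    tags = re.findall(r"<([a-z][a-z0-9.]*)>", sgml)
    found = [t for t in tags if t in REQUIRED]
    return found == REQUIRED
```

This looks only at the sequence of start-tags; it does not verify that the elements are properly nested, which would need a real parser and the document type declaration.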

Title

This is equivalent to the title element, embedded in the title.statement of the standard tei.header. It contains the title of your encoding, not of the copy text. A useful convention is to construct such titles according to a standard pattern, such as (author): (title of text): a machine readable transcript, for example, Thomas Hardy: Tess of the D'Urbervilles: a machine readable transcript.

Statement.of.responsibility

Each statement.of.responsibility tag contains two tags, role and name. The following roles are envisaged; there may be others:

  • funding (an agency which should be acknowledged)
  • capture (i.e. responsible for the initial capture/preparation of the text)
  • encoding (i.e. responsible for the addition of tags etc)
  • validation (i.e. proof reading, checking automatically etc.)
A single person may have more than one role, of course, in which case the roles should be combined in a single tag, thus:

    <role>encoding and validation<name>Lou Burnard

The same name should always appear in the same form.

Source.description

The source description contains sufficient bibliographic information to identify the facsimile or other edition from which the machine readable text has been derived. The Guidelines allow for a very large number of bibliographic tags, but for most sources the components shown in the example (title, publisher, publ.date), together with author where one is needed, should be adequate.

Title and publisher details should be given in the same form as they appear on the modern title page. Additional information about the source (for example, copy text shelfmark or call number) may be supplied if not included on the title page, tagged using the optional note tag (P1 4.3.6).

The encoding declarations

This section of the header contains a prose description of a number of detailed editorial options. These will be of particular use to anyone forming or maintaining a collection of machine readable texts, but may be unnecessary for texts used only internally.

Information such as any special codes used in the text, the precise significance of the values used as ID or N attribute values and so forth should be documented here.

    <encoding.declarations>
      <p>Transcribed using all TEI recommendations as described in `Living with the Guidelines'. N references on page.break tags give page numbers in the LeFranc edition. End-of-line hyphenation and line breaks of original removed.
    </encoding.declarations>

The Revision History

This is an essential part of the header, in which entries should be made every time the text is altered in any way (P1 4.6). An example, using a local keyboarding convention, might look like the following:

    <revision.history>
      <change>LB : 12 Oct 90 : Various typos in Book 2 removed
      <change>TC : 1 Oct 90 : Added structural tags throughout Book 1 and deleted duplicated material on pp 16-28
      <change>RF : 10 Sep 90 : File made
    </revision.history>

As this example shows, the revision.history tag contains a series of change elements (equivalent to the change.note element of P1 section 4.6), which should always be given with the most recent first. The change element contains three parts, divided in this format by colons only, so as to reduce typing effort. To make the text TEI-conformant, these colons would be expanded to the corresponding elements in P1 (who, date, and what) automatically. These three elements contain the initials of the person making the change, the date of the change, and a brief summary of the change. The change tag must also be expanded to read change.note, and end-tags must be supplied by a parser of some kind to obtain TEI conformance.
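The expansion described here is easily automated. The following Python sketch shows one way it might be done; the function name `expand_change` is hypothetical, and the sketch assumes that each change entry occupies a single line and that explicit end-tags are wanted for all four elements.

```python
def expand_change(line):
    """Expand one colon-delimited <change> line into a change.note element."""
    prefix = "<change>"
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("not a change line: %r" % line)
    # Split on the first two colons only, so that a colon inside the
    # summary of the change is left alone.
    who, date, what = (part.strip() for part in line[len(prefix):].split(":", 2))
    return ("<change.note><who>%s</who><date>%s</date><what>%s</what>"
            "</change.note>" % (who, date, what))
```

Applied to the first entry in the example above, this yields `<change.note><who>LB</who><date>12 Oct 90</date><what>Various typos in Book 2 removed</what></change.note>`. A line with fewer than three parts raises an error rather than producing an incomplete element.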

When a file is first created, that is, when it gains a header, it should be given a change line specifying (as above) the text `file made'.

Other Problems: What is Not in This Document

While it covers most of the tags most users seem likely to need most of the time, this document should not be mistaken for a complete presentation of the TEI encoding scheme. A great many tags exist in the TEI scheme for specific applications, and some of the tags described here may in some contexts be used in ways not described here. This section describes some of the more obvious topics not covered here, on which further information may be found in P1.

Special characters other than those needed for the modern standard forms of Western European languages, and especially non-Latin scripts, are likely to require serious study of P1 chapter 3, on character sets, and of the various de facto, national, and international standards for character sets. Preparation of a formal Writing System Declaration for a new character set (e.g. one you have built yourself with a font program or a non-standard character set acquired from a vendor) will also require study of P1 chapter 3, and some patience. Most users will be able simply to refer to Writing System Declarations defined as part of the TEI system or, eventually, to declarations provided by creators of TEI-aware software.

Archives, libraries, and major projects interested in good bibliographic control will wish to use a fuller version of the TEI header than that described here. The TEI header (described in chapter 4 of P1) allows for full bibliographic descriptions of the machine-readable file and its copy text(s), and a wide range of formalized encoding declarations which can be used to inform potential users of the characteristics of particular data files. Corpus makers and researchers who encode analytic information in their texts should probably use the more formal encoding declarations rather than the informal prose described here.

The basic structural features (front matter, body, back matter) are all described completely or nearly completely in this document; some rare elements in front and back matter may be found in P1 sections 5.2.3 and 5.2.4. Basic structural tags for drama and verse are described in section 7.3, though no formal definition of the structural hierarchy of verse texts or collections of verse has yet been published. Structural tags for dictionaries and office documents are described (again, without formal definition of the document types) in sections 7.4 and 7.5 of P1.

Some less common non-structural tags have been passed over in silence in this document. Tags for generating index entries; a structured variant of the bibliographic citation tag, which allows better validation and might be used for large collections of bibliographic data; cross references within and among texts; and hypertext links are all covered in chapter 5 of P1. The same chapter discusses methods of marking traditional reference schemes for texts (including but not limited to Biblical book/chapter/verse style references) and gives more detail than is found here on tags for describing the physical presentation (typography, or handwriting, and layout) of the copy text. (Really full physical description is a difficult task on which work is still in progress.)

A fuller description of lists, and tags for formulas, tables, and figures are given in sections 5.3.8 and 5.9 of P1. Of some interest for linguistic analysis may be the discussion in section 5.8 of methods for disambiguating punctuation marks (or rather, for recording the results of one's own disambiguation).

Textual variants and the markup of critical apparatus are described in section 5.10 of P1; several methods are given, since no consensus has been reached on which, if any, is to be preferred.

Tags for linguistic analysis, with examples from morphology and syntax, are given in chapter 6 of P1; in order to be as hospitable as possible to widely divergent theories of language, these tags take very little linguistic theory for granted, and as a direct result they are also usable for non-linguistic analysis of texts, though the terminology may feel strange to non-linguists and there are no examples of the tags' use in non-linguistic contexts.

Finally, reference must be had to P1 for a full description of the formal SGML document type declarations provided by the TEI (they are printed in appendix C), the restrictions on SGML features to be used in interchange documents (in chapters 1 and 2 and in appendix B), and the mechanisms provided for changing the TEI's formal SGML declarations (in chapter 8). Note that the extension mechanisms are not implemented in the DTDs provided with P1 version 1; they should be in place for version 2.

Some topics are not covered at all, or covered only cursorily, in version 1 of the Guidelines. Work is in progress on some of these, and the results, if any, will be incorporated in later versions of the Guidelines. Work groups have been formed to study tagging for literary and historical study, further linguistic applications, dictionaries and computational lexica, character set issues, text criticism, hypertext and hypermedia, and the physical description of manuscripts and printed books. Their success is not assured; it will depend in large part upon the willingness of the community to help them by providing advice, criticism, and assistance.

List of tags described

N.B. In descriptions of contents, the term paragraph is used for paragraphs and other paragraph-level elements such as notes, figures, and tables.

All tags described here have the following global attributes:

  • id: Unique identifier for the element; must begin with a letter, can contain letters, digits, hyphens, and periods.
  • n: Name or number of this element; may be any string of characters. Often used for recording traditional reference systems.
  • lang: language of the text in this element; if not specified, language is assumed to be the same as in the surrounding context.
  • rendition: physical realization of the element in the copy text: `italic', `roman', `display block', etc. Value may be any string of characters.

  • abbrev: marks any abbreviation (optional). Occurs anywhere, contains text. Attributes:
      • full: expanded form of the abbreviation
      • type: type of abbreviation (initials, acronym, ...)
  • abstract: abstract. Occurs in front matter, contains head, paragraphs, optional trailer.
  • actor: name of actor who (first) played a role. Occurs in cast list within front matter, contains text.
  • appendix: appendix. Occurs in back matter, contains optional head, paragraphs, optional trailer.
  • author: author of a cited work. Optional. Occurs in citations, contains text.
  • back: Back matter in text. Optional, contains elements such as appendix, glossary, notes, and others. Back elements all include optional head, paragraphs and optional trailer.
  • back.part: Subdivision of back matter not specified in TEI guidelines. Occurs in back, contains text.
  • bibliography: Series of bibliographic references. Occurs in back, contains optional head, optional series of paragraphs, and citation list(s).
  • body: Principal of three major text divisions (front, body, back). Contains further divisions (div1s) or paragraphs, occurs within text.
  • change: Changes made to a text. Occurs in TEI header within revision history element. Contains initials of person making change, date of change, and brief description of change. A non-TEI tag; the strictly TEI conformant equivalent is change.note.
  • change.note: Same as change but text is included in elements who, date and what.
  • cited.word: marks word mentioned, not used. Occurs anywhere, contains text.
  • citn: citations. Occurs in bibliography or in text, contains text and other elements (e.g. title, author).
  • citn.detail: citation's page number or section, etc. Occurs in citations, contains text.
  • citn.struct: formally structured bibliographic citation. Occurs in text, contains author, title, etc.
  • col.break: marks beginning of new column on multi-column pages. Occurs anywhere, contains no text.
  • colophon: description at back of book describing where, when, and by whom printed. Occurs in back, contains optional head and series of paragraphs.
  • comment: any additional information in bibliographic citation not included in other TEI-specified tags. Occurs in bibliographic citations, contains text.
  • date: single dates in any form. Occurs in text, contains text. Attributes (optional):
      • type: calendar in which date is expressed
      • value: date expressed in form 'yyyy-mm-dd'
      • certainty: degree of certainty for date
  • date.range: phrases or expressions denoting ranges of dates. Occurs anywhere, contains text. Attributes (optional):
      • type: same as for date
      • from: beginning of date range
      • to: end of date range
      • from.cert: degree of certainty for start date
      • to.cert: degree of certainty for end date
  • dedication: text commending work to someone other than the reader. Occurs in front, contains text.
  • div1: Major structural division within body of work. Occurs in body, contains optional head, optional series of paragraphs, and optional series of div2s. Attributes:
      • name: Conventional name for text division category (e.g. book, chapter, poem)
      • ID: Unique identifier for division
      • n: Mnemonic short name for division
  • div2: Structural subdivision of div1. Occurs in body, contains head, paragraphs, and series of div3s.
  • div3: Structural subdivision of div2. Occurs in body, contains head, paragraphs, and series of div4s.
  • div4: Structural subdivision of div3. Occurs in body, contains optional head and series of paragraphs.
  • doc.author: Author's name as given on title page. Occurs in title page, contains text.
  • doc.date: Date as given on title page. Occurs in title page, contains text.
  • doc.imprint: Imprint as given on title page. Occurs in title page, contains text.
  • doc.title: Title of work as given on title page. Occurs in title page, contains text.
  • editor: editor, compiler or other person secondarily responsible for text. Occurs in bibliographic citation, contains text.
  • emph: marks emphatic (stressed) words. Occurs anywhere, contains text.
  • encoding.declarations: prose description of detailed editorial options applied to encoding. Occurs in TEI Header, contains paragraphs or special encoding declaration tags not described here.
  • enum: marks an item number or other item label in a list. Occurs in lists, contains text (usually a number).
  • epigraph: Quotation, etc. from another work, on title page of text or at head of a section. Occurs in front or within structural units of body and back matter, contains text.
  • file.description: identifies electronic text itself and source text. Occurs in TEI header, contains at least title.statement, publication.statement, and source.description.
  • foreign: marks a word or phrase not in same language as the surrounding element. Occurs anywhere, contains text. Attributes:
      • lang: language of foreign word
  • foreword: text addressed to reader, by author, editor or publisher. Occurs in front, contains text.
  • front: front matter in text. Optional, contains elements such as title page, foreword, dedication, abstract.
  • front.part: front matter not specified in TEI Guidelines. Occurs in front, contains text.
  • gl.gloss: explanatory word or phrase in glossary list. Occurs in glossary lists, contains text.
  • gl.term: Word being defined or explained in glossary list. Occurs in glossary lists, contains text.
  • gloss: explanatory word or phrase. Occurs anywhere in text, contains text. May use termid to point at term defined.
  • glossary: simple list of words with definitions or phrases. Occurs anywhere, contains elements such as gl.term and gl.gloss.
  • head: heading of a section, list, or other element. Optional, occurs in list or at beginning of any structural unit, contains text.
  • head.gloss: heading over the glosses in a glossary list. Optional, occurs in glossary list, contains text.
  • head.term: heading over the terms in a glossary list. Optional, occurs in glossary list, contains text.
  • highlighted: marks text highlighted in copy text, without identifying reason for the highlighting. Occurs in text, contains text.
  • imprint: publication information on text. Optional, occurs in bibliographic citation, contains elements such as publ.city, publisher, publ.date.
  • in.quot: marks interruptions of quotation by narrator. Occurs within quotations, contains text.
  • index: set of index entries. Occurs in back, contains optional head, paragraphs, and the index, treated as list or glossary list.
  • item: element within a list. Occurs in list, contains text. May use global n attribute to give number of item within the list.
  • l: marks metrical lines of verse. Occurs in epigraph in front, contains text.
  • line.break: marks beginning of a new line. Optional. Occurs anywhere, contains text.
  • list: list. Occurs anywhere, contains a series of items or a series of enum, item pairs. Attributes:
      • type: marks type of list (ordered or unordered)
  • list.cast: list of names of persons represented in a drama, and possibly of actors who played them. Occurs in front matter, contains elements such as role and actor.
  • list.citn: bibliographic-list element. Occurs in text or in bibliography in back, contains series of citn or citn.struct elements.
  • list.gl: marks a glossary list. Occurs in text, contains elements such as gl.term and gl.gloss.
  • milestone: marks any break-point in text. Occurs anywhere, contains no text. Attributes:
      • unit: specifies what type of break is tagged.
      • ed: specifies which edition has this break here.
  • name: name of person responsible for any aspect of encoding of text (e.g. funding, data capture, encoding, validation). Occurs in statement of responsibility, within title statement of file description in TEI header.
  • note: citation note. Occurs anywhere, contains text. Attributes:
      • type: type of note.
      • source: identifies author of note.
      • place: specifies where note appears in copy text.
      • anchored: indicates where copy text shows exact place of reference for note.
  • notes: a series of notes. Occurs in back, contains a series of notes or other paragraph-level elements.
  • num: numbers in any form. Occurs anywhere, contains numerical text. Attributes:
      • type: specifies number type, including cardinal, ordinal, fraction, etc.
      • value: value of number in application-dependent standard form
  • p: paragraph. Occurs within any divN element, contains text.
  • page.break: marks beginning of new page. Occurs anywhere, contains text. Attributes:
      • sig: signature number (in early printed books)
      • catchword: catchword (in early printed books)
  • propname: proper nouns of any type. Occurs anywhere, contains text. Attributes:
      • type: type of proper name.
      • referent: supplies identification for person or thing named
      • normalized: gives a normalized form of name
  • publ.city: city or town of publication. Occurs in bibliographic citation, contains text.
  • publ.date: date of publication. Occurs in bibliographic citation, contains text.
  • publication.statement: details of publication of electronic text. Occurs in TEI Header, contains text.
  • publisher: name of publisher of source text or similar agency. Occurs in bibliographic citation, contains text.
  • q: marks quotation. Occurs anywhere, contains text. Attributes:
      • rendition: can be used to record whether quotation is printed in-line or set off as block quotation. In the former case, may identify what kind of quotation marks are used in the copy text to mark this element.
  • q.mark: delimits text set off by quotation marks in copy text, without specifying reason for the quotation marks. Occurs anywhere, contains text. Attributes:
      • rendition: same as for q.
  • q.tags: explanation of how quotation marks are transcribed. Optional. Occurs in encoding declaration in TEI Header. Contains text.
  • responsible: marks individuals and institutions responsible for funding, capture, encoding and validation of electronic text. Occurs in file description in TEI Header. Contains role and name elements. A non-TEI tag; the TEI-conformant equivalent is statement.of.responsibility.
  • revision.history: specifies version of file and its history. Occurs in TEI Header, contains series of change.note elements.
  • role: marks role of institution or individual in creating electronic text. Occurs in statements of responsibility in file description in TEI Header, contains text.
  • s: marks segments of text other than paragraphs. Occurs in text, contains text.
  • salute: formulaic salutation at start of epistle (e.g. My Lord or Sir). Occurs in front matter or in any text element, contains text.
  • series: series name and volume number. Occurs in bibliographic citation, contains text.
  • signature: formulaic salutation at end of epistle, visually distinct and usually containing name. Occurs in front matter or in any text element, contains text.
  • so.called: marks words used in a special or ironic sense and placed, in the copy text, within scare quotes. Occurs anywhere, contains quoted text.
  • source.description: bibliographic information identifying edition from which machine-readable text has been derived. Occurs in TEI Header, contains same series of elements as file.description, or a bibliographic citation.
  • statement.of.responsibility: Describes individuals and organizations responsible for encoding of text. Occurs in TEI Header, contains role and name elements.
  • tei.header: Detailed documentation associated with encoding of electronic text. Occurs as a distinct part of encoded text, contains file.description, encoding.declarations and revision.history elements.
  • term: marks words highlighted as technical terms. Occurs anywhere, contains text.
  • text: marks the transcription of the text, as opposed to the TEI Header. Required. Contains optional front, body, optional back.
  • title: title of a work published independently. Occurs in text or citations. Contains text.
  • title.page: occurs in front matter, contains doc.title, doc.author, etc.
  • title.part: block of text on title page. Optional. Contains text.
  • title.piece: title of a work published as part of a volume, not independently. Occurs in citation or in free text.
  • title.statement: group of elements giving title and author of machine readable text. Contains at least title, statement.of.responsibility.
  • trailer: explicit closing of a book, chapter, section, etc. Optional. Contains text.
  • what: describes a change made to the electronic document. Occurs in change.note, contains text.
  • who: identifies maker of a change to the file. Occurs within change.note. Contains text.
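The rule given above for the global id attribute (it must begin with a letter and may contain letters, digits, hyphens, and periods) can be tested before a file is sent for interchange. The Python sketch below is illustrative only: the name `is_valid_id` is hypothetical, and it assumes that `letter' means one of the unaccented letters a to z in either case.

```python
import re

# A letter first, then any mixture of letters, digits, periods, and hyphens.
# Restricting "letter" to unaccented a-z and A-Z is an assumption.
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_valid_id(value):
    """Return True if value satisfies the stated rule for id attributes."""
    return ID_PATTERN.fullmatch(value) is not None
```

Under these assumptions `is_valid_id("PL.book-2")` is true, while `is_valid_id("2book")` and `is_valid_id("a_b")` are false. Uniqueness of identifiers within a document is a separate matter, which this check does not address.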