Notes on the TEI Guidelines <title>Basic Characteristics and Design Goals <date>&docdate. </titlep> </frontm> <!> <body> <p> As announced elsewhere in this newsletter, the TEI Guidelines are now available in draft form; over 500 copies have been sent out, and more are being printed. This interest seems to confirm the view of the organizers of the TEI that we need methods for text encoding suitable for multiple uses of the same texts, for exchange of texts among researchers and others interested, for languages other than English and scripts other than Latin, and which will work with all kinds of text, not only the most common. <p> It is perhaps appropriate now to recapitulate the major goals of the TEI, and attempt a preliminary self-critical evaluation of the first public draft against those goals; that is the purpose of this note. <!> <h1>Who is the TEI for? <p> The goals of the TEI are to define a format for encoding texts in a linear data stream which is suitable for the interchange of textual material between researchers, and to provide concrete recommendations, for those who can use them, as to what features of texts should usually be recorded. As the letterhead puts it, the TEI is an "Initiative for Text Encoding Guidelines and a Common Interchange Format for Literary and Linguistic Data". Note some non-obvious points: <ol> <li>The TEI came out of the community of those using computers to do research on or with texts, and they are our primary constituency. That is: literary scholars, linguists, computational linguists, historians, philosophers, theologians, philologists, people working on machine translation, ... all the people who find their interests reflected in ACH and our sister sponsoring organizations. The publishing industry, database vendors, software developers, and others with commercial interests in electronic text are interested in the TEI, and many are sharing their expertise with us, but they are not the primary constituency. 
If research and publishing were to turn out to require different things, the TEI would go with the needs of researchers. <p> It's important to note that this is mostly an imaginary issue: so far the requirements of all these groups seem astonishingly close to identical. Very concretely: we have not encountered a single problem faced by humanists which does not have an analogue in a problem faced by linguists, and one in a problem faced by publishers or commercial database vendors--and vice versa. Sometimes the problems look different, but so far most differences have proven superficial. We believe that what will work for researchers must work for other applications as well. So in a real sense, though researchers are the primary constituency, the real intended constituency is everyone who works with electronic text in *any* way, and wants to be able (a) to move the text from system to system without information loss, or (b) to use the text for more than one thing. <li> One major intended use for the Guidelines is as a specification for an interchange format. Transfer between researchers, machines, programs, networks would use such a format very simply: as a description of what a text will look like when it passes from one researcher's hands to another's. An interchange format does not tell anyone what to encode, any more than the ASCII code tells us how to write novels or manuals. What is encoded is the intellectual responsibility of the researcher; no one can take that responsibility away. <li> The other major intended use is as a guide for those encoding texts for general use (and one hopes that that includes most of those encoding texts). The Guidelines should provide a sample set of textual features that many people have found useful in textual work, together with ways of encoding those features. 
No one is required to encode all those textual features, but the list should (if we do our work right) be taken seriously as a checklist of what the community as a whole tends to find useful. </ol> <p> Software developers should also benefit from the guidelines in both these ways: as a definition of an export-import format (or as an internal file format, if you wish!) and also as a checklist of textual features commonly thought important. Every reader of this newsletter will have seen software which suffered from its makers' sometimes unconsciously narrow conception of the kinds of texts it would be used for -- the Guidelines should be useful as a sort of brain-storming, concept-broadening tool for developers. <!> <h2>Basic requirements <p> The basic requirements for a text encoding scheme have been stated in the NEH proposals for TEI funding. (We should acknowledge once more the debt of the TEI -- and the ACH, ACL, and ALLC -- to the NEH, the EEC, and the Mellon Foundation for their funding. Without them, the project would not be feasible.) <p> An <term>encoding scheme</term> is any (systematic) method of representing or encoding textual data in machine-readable form. Typically, an encoding scheme must include: <ol> <li>methods for recording the characters in the text (including diacritics, special symbols, non-Roman alphabets, etc.) <li>conventions for rendering a text in a single linear sequence (specifying how footnotes, end-notes, critical apparatus, parallel texts, and other non-linear complications are handled) <li>methods for recording logical divisions of texts (e.g. book, chapter, paragraph; act, scene, speech, line; ...) 
<li>methods for recording analytic information like literary or linguistic analysis <li>conventions for delimiting in-line comments and other ancillary material <li>conventions for identifying the text being encoded and those responsible for encoding it </ol> <p> To create a single encoding scheme suitable for common use, the TEI first formulated (in the original planning conference in 1987 and in working papers since) the following requirements for the scheme to be developed: <ol> <li>It should specify a common interchange format. <li>It should provide a set of recommendations for encoding new textual materials. <li>It should document the existing major schemes and investigate the feasibility of developing a metalanguage in which to describe them. <li>It must be a set of guidelines, not a set of rigid requirements. <li>It must be extensible. <li>It should be device- and software-independent. <li>It should be language-independent. <li>It should be application-independent. </ol> As design goals, it was specified that the guidelines should: <ol> <li>suffice to represent the textual features needed for research <li>be simple, clear, and concrete <li>be easy for researchers to use without special-purpose software <li>allow the rigorous definition and efficient processing of texts <li>provide for user-defined extensions <li>conform to existing and emergent standards </ol> <p> <!> <h2>How we stand <p> The current draft, to no one's surprise, does not wholly solve all these problems or wholly fulfill all of the design goals. It wasn't expected to -- some of the hard problems were intentionally saved for the second cycle. Here is a preliminary checklist of where we stand with respect to the goals listed above. <ul> <li>The current draft (version 1.0) does specify both an interchange format and recommendations, though perhaps not as explicitly as one might have expected. It may need to become more explicit in defining the interchange format. 
<li>It does not document any existing encoding schemes, though work is continuing on that topic. <li>The metalanguage and syntax committee did consider the formulation of a metalanguage for defining existing schemes, but decided against it. Descriptions will take the form of prose and of algorithms for translating from a given scheme into the TEI scheme, using a variety of existing software tools (e.g. sed scripts, Rexx execs, Snobol programs, or even yacc and lex code). <li>It is certainly a set of guidelines rather than requirements, and device- and software-independent. It is also, however, not fully implemented in software -- this has the advantage that the design is not unduly biased by implementation issues, but it makes it hard to demonstrate or validate the scheme. <li>It is extensible, but the mechanisms for specifying extensions need work to be usable without heavy-duty knowledge of SGML. <li>It has no bias that we have consciously put there in favor of any one language, but the TEI has not addressed, let alone solved, the problems of languages other than those already most effectively covered by international data-processing standards. The current draft is silent on topics where people need the most guidance: older forms of languages not covered by ISO standards, Asian scripts, treatment of bidirectional text (e.g. Hebrew and English), and so on. We expect to work on these in the next two years, but for some issues there is little we can do but document and call attention to existing methods of handling these problems (e.g. ISO 10646 or the Unicode effort -- two unfortunately incompatible approaches to handling Chinese and other Asian scripts). <li>It does provide what we think is an adequate basis for handling all the known needs of research; it probably needs extension in many areas to provide not just the basis for the required solutions, but some version of the solutions themselves. 
<li>It's as simple and clear as we could make it, but we expect to hear about lots of obscurities in the draft. (Those with complaints or, better still, suggestions, are encouraged to contact the editors and file a response form.) There have been numerous requests for a short document introducing only the most basic tags, which would be less intimidating; we hope to be able to provide such a document at some point in the near future. <li>It can be used without special software, at least at the simpler levels. A lot of work is needed, however, before we have something we can hand to the average literary scholar who uses Nota Bene or WordPerfect or Microsoft Word and wants to create a TEI-conformant file. <li>So far, at least, the Guidelines can be used as specified in the ISO standard which defines SGML. There are some technical reasons which mean that the TEI Guidelines may not be definable as a <q>conforming application</q> of SGML -- these mostly relate to syntactic freedoms of SGML which are forbidden by the current version of the Guidelines. </ul> <p> There remains, as the list indicates, a great deal to do toward making the Guidelines all they should be. The work of the next two years must include a great deal of revision, extension, and above all testing of the Guidelines on real text in realistic quantities. All who are interested in assisting are urged to contact the editors, whose addresses are given in the publication announcement elsewhere in this issue. </body> </gdoc
