.* TEI Document No: EDW7 .* Title: Text in the E-Age: Text Encoding And Literary Analysis .* Drafted: .* Revised: .* .im gmlpaper ;.* Use GMLPAPER or GMLGUIDE (or -MLA) .sr docfile = &sysfnam. ;.sr docversion = 'Post-MLA' .im teigml .* Document proper begins. <frontm> <titlep> <title>Text in the Electronic Age: Text Encoding and Literary Analysis <author>C. M. Sperberg-McQueen <address>Text Encoding Initiative <aline>University of Illinois at Chicago </address> <date>MLA conference talk, 28 December 1989 </titlep> </frontm> <body> <p>I'd like to acknowledge the financial support of the NEH, the EEC, and the Mellon Foundation for the work of the Text Encoding Initiative, from which these remarks come, and the intellectual contribution of Lou Burnard (whose examples I have pilfered shamelessly) and the members of the TEI working committees, who however should not be held responsible.<fn>This written version of these remarks reproduces my speaking notes; some paragraphs were omitted in the oral presentation so as to keep within my allotted time.</fn> <p>Texts cannot be put into computers. Neither can numbers. Computers can contain and operate on patterns of electronic charges, but not numbers, and not texts. <p>As a result, computers never process anyone's data. They process representations of our data. We know of representations that they are inevitably partial, inevitably interested, and inevitably reveal their authors' conscious and unconscious judgements and biases. We know that they obscure what they do not reveal, and that without them nothing can be revealed at all. In designing representations of texts inside computers, we should reveal what is relevant, and obscure only what we think negligible. <p>As we work more intimately with computers, we want the electronic texts we use to help us in our work, to make easy for us the kinds of work we do with them. As we work more intimately with computers, we will do the kinds of things that our electronic texts make easy for us to do.
Tools always shape the hand that wields them, technology always shapes the minds that use it. Reason enough to think about what forms electronic texts should take. <p>The representation of a text within a computer inevitably expresses an opinion about what is important in that text. It is a theory of that one text. The design of an adequate general-purpose representation of texts in machine-readable form, such as the TEI is attempting, embodies in turn a thesis about the kinds of things that are or can be important in texts. <p>This morning I want to talk about the relationship between theory of texts and the design of electronic text markup: how I think they are related and within what theoretical framework I think the work of the Text Encoding Initiative must proceed. I'll present a number of theoretical axioms, with commentary; to begin with (1) some discussion of why text encoding has both theoretical and pragmatic importance; then (2) why the task of the TEI is essentially unbounded and unboundable. Finally, I'll discuss (3) some features of texts which require consideration in any serious markup language. Along the way I will mention some technical implications for electronic markup schemes, including the tag set being developed by the TEI, but my main focus will be on presenting a theoretical position powerful enough to motivate the work, and not on describing the mechanisms of the scheme. There will be a workshop in June in Siegen where the mechanisms will be laid out in loving detail. <h1>I. Theoretical and Pragmatic Significance My axioms are these. <h2>1. Markup reflects a theory of the text. Markup reflects a theory of the text. A markup language reflects a theory of texts in general. <p>Markup is the information in a document other than the <q>contents</q> of the document itself, viewed as a stream of characters. 
All structural information; all relationships among text, apparatus, and notes; all analytic or interpretive information we wish to include in an electronic text is by definition expressed as markup. <p>One occasionally hears pleas for texts to be distributed free of all markup. <q>Pure ASCII texts</q>, it is claimed, are the best, because they are free of special codes, require no special equipment, and restrict themselves to the objective facts of the text, without subjective analysis or interpretation. This view misses three boats. <p>First, no text is entirely free of markup (in the broad sense) with the possible exceptions of some older Hebrew and Greek manuscripts written in scriptio continua (without word breaks), so the search for a markup-free text is chimerical. <p>Second, no clear boundary can be drawn, for all texts, between the <q>facts</q> of a text and our interpretation. Word boundaries are interpretive if the source is in scriptio continua. Vowels are interpretive if the source is unpointed Hebrew or Arabic. Verse boundaries are interpretive in all editions of <cit>Beowulf</cit>. And yet all these markups can be expressed in ASCII-only texts. <p>Third, the ASCII-only approach -- effectively a restriction to procedural markup readily executed by a teletype machine (tabs, carriage returns, line feeds, and backspaces) -- represents a misguided and inadequate theory of texts, in effect claiming that the only essential part of a text is its sequence of graphemes. No electronic tools working with such text can possibly know anything interesting about it. There is no representation for chapter divisions, or sentences, and so no ability to search for words co-occurring within those contexts. No representation of dialect or language shifts, so no ability to distinguish English 'the' from older German or French 'The'. And of course there is no real way to represent French or German texts, because there is no publicly documented method of representing diacritics.
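To make the contrast concrete, here is a sketch of what a descriptively marked-up fragment might look like; the tag names are invented for illustration and belong to no actual scheme. Everything the ASCII-only view discards -- the chapter division, the sentence boundaries, the shift of language, even the diacritic (written as an entity reference) -- is recorded using nothing but ASCII characters:

```sgml
<chapter n=1>
<p><s>She set down her cup.</s>
<s>The waiter, she noticed, pronounced
<foreign lang=fr>th&eacute;</foreign> with some care.</s></p>
</chapter>
```

A program searching such a text can confine a search to one chapter, count co-occurrences within sentences, and tell the French word from its English homograph -- none of which the bare character stream allows.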
<p>The goals of simplicity and system-independence are good. The impoverishment of our electronic texts and the tools we build to work with them is not good. If we wish to create serious electronic texts, we must deal seriously with the characteristics of markup. <p>To do that, my first axiom says, implies a theory of texts. This may offend those of us who would prefer to assign theory to a more rarefied atmosphere, or to banish it entirely. But every act has theoretical implications. Markup is no exception. <p>If markup is to express our understanding of a text, a complete markup language must allow us to express any thesis we can formulate about a text. Our markup language therefore presupposes and embodies a theory of the types of theses we can develop about texts. We may even formulate programmatically the rule that if one cannot express a critical idea in markup, the markup language is incomplete. (A crucial precondition here, of course, is that the critical idea be concretely expressible. That is, it cannot be ineffable or impenetrable.) <p>A markup language provides a set of tags, which are used to mark features of the text. The tag set provided by a markup language expresses the conviction of the language-makers that a corresponding set of textual features exists and is worth marking. <!> <h2>2. Our understanding of texts is worth sharing. <!> <p>This, like the underlying assumption that individual texts are worth studying, hardly makes a discussable substantive claim. But as a claim about values, this presupposition is worth making explicit, because it appears not to be universal, despite the fact that we are all here engaged in an annual ritual based on this same assumption. <!> <h1>II. Unboundedness of the Task So much for the basic ideas that markup has both theoretical and pragmatic importance. What characteristics must our markup language have in order to model correctly our universe of texts? <!> <h2>3. No finite markup language can be complete. 
No finite markup language can be complete, for <sl compact=2> <li>a. there is no finite set of textual features worth marking <li>b. there is no finite set of texts to be tagged <li>c. there is no finite set of uses to which texts may be put </sl> <p>I am claiming that there is no single view of texts which answers all questions. I do not adopt this radical eclecticism merely for practical reasons. I embrace it as a positive theoretical basis: there is no orthodoxy and no heresy in textual study. There is also no canon in textual study: what is written may be studied. This seems hard for some people to accept, but I think it is neither revolutionary nor negotiable. No kind of text is irrelevant to the tasks of textual study. There is, finally, no disciplinary purity: no textual feature may be ruled out of a possible markup language on the grounds that <q>that's not really literary study, that's linguistics,</q> or <q>we don't have to worry about that, it's not really syntax, it's pragmatics.</q> <p>The implication for markup is that any adequate markup language must be extensible, to allow someone working on a new kind of text, or with a new theory of textual features, to add new tags to the language. <p>While according to axiom 3 we cannot finally exclude any topic or field from legitimacy, it is possible at least to include some topics as necessary to any serious markup language. <!> <h1>III. Textual Characteristics with Implications for Markup Before I discuss those topics, I need to clear up a potentially serious misconception. <p>The misconception: everyone is going to have to use every tag proposed by the TEI. If the TEI provides tags for marking coffee stains on authors' manuscripts, it's because the TEI is going to require everyone to mark every coffee stain, every watermark, every piece of broken type, and every allusion to Milton. <p>That's not what I have in mind. Such an approach would be a disaster. No one would use the markup scheme. 
When I say, therefore, that we have to tag X or Y, I mean not that everyone must tag X and Y, but that any general-purpose markup scheme must provide tags to enable us, when we wish to, to tag X and Y. My basic premise is that a fully adequate tagging scheme allows us to express our understanding of the text. Each component of that understanding must be potentially expressible, though it need not always be expressed in fact. <p>So much for my digression into a common misconception. Back to my axioms. <!> <h2>4. Texts are linguistic objects. <!> <p>We must, therefore, provide tags to describe the linguistic organization of the text. Lexical information (dictionary-form of words), morphological analysis including part of speech, and surface phrase structure of the sentence are obvious minimum requirements. <p>For phonetic studies [Figure 1: London-Lund corpus] like this extract from the London-Lund corpus of spoken English, we'll also need mechanisms for keeping the phonetic or phonemic transcription in synch with the orthographic transcription. There may also be multiple orthographic transcriptions (as in the Pfeffer Corpus of spoken German). <p>Linguistic analysis provides a necessary underpinning for literary study: [Figure 2: Mörike] Burrows's stylistic analysis of Jane Austen relies crucially upon his disambiguation of common words like <q>to</q> and <q>want</q>, and many interpretations of poems rely crucially upon facts like the purity or impurity of rhymes (which represent phonetic hypotheses). <p>The well-known discussion between Emil Staiger and Martin Heidegger over this poem by Eduard Mörike centers ultimately upon their different linguistic understandings of the verb <q>scheint</q> in the last line: Staiger interpreting it as <q>seems</q> and Heidegger as <q>shines</q>.
Leo Spitzer then pointed out that in the context of the dialect form <q>in ihm selbst</q> (for High German <q>in sich selbst</q>) in the same line, the verb <q>scheint</q> ought perhaps to be taken in its (Swabian) dialect sense of <q>is beautiful</q>. If we take seriously our goal of allowing the expression, in markup, of these literary interpretations, we see at once that linguistic tagging is a necessary prerequisite of any literary tag set. <p>Linguistic analysis may also require marking words separated in the text as belonging together, for example the lexical item <q>get out</q> in the sentence <q>get the hell out of here</q>. Also required will be markup for ambiguity and for multiple levels of a hierarchy (phonology, morphology, syntax). <h2>5. Texts occur in / are realized by physical objects. <p>This seems to me to have several implications. First, [Figure 3: Caedmon] the physical facts of text transmission can affect the text itself, for example by making it hard to read, as in this text of Caedmon's hymn, or, notoriously, in the singed edges of the leaves of the Nowell Codex in which <cit>Beowulf</cit> is preserved. We may need, then, to represent irregular features like <q>the passage obscured by this coffee stain</q> in order to make clear the different text-critical status of the text obscured by stain but partly legible. That is: some textual features are discontiguous and must be so marked. <p>(While we're looking at this page, note that the physical organization of the page does not mirror its literary structure: the text of Caedmon's Hymn is not set off as verse. Note too that the manuscript stain runs across prose and verse, completely independent both of the physical hierarchy of leaf, side, and line, and of the logical structure of prose with embedded verse. Here too we have multiple non-nesting hierarchies.)
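One conceivable way of handling such cases -- and I offer it only as a sketch, with tag names invented for the occasion -- is to let one hierarchy, say the logical one, supply the nesting elements, while the boundaries of the physical hierarchy are recorded as empty milestone tags dropped into the stream, and a discontiguous feature like the stain is marked by paired start and end tags that point at one another:

```sgml
<prose>... text in which the stain begins <stain.start id=st1>
here, and runs <pb folio='129v'><lb n=1> across the page break ...
<verse>
<l>a verse line, still under the stain<lb n=2></l>
<l>a verse line <stain.end start=st1> legible once more</l>
</verse></prose>
```

Because the milestone tags and the stain markers have no content of their own, they can cross the prose/verse structure freely; no one hierarchy is forced to contain the others.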
<p>[Figure 4: Prosser] The importance of the technology of text transmission is also visible in these pages from the First Folio of Shakespeare. I realize you can't read the text, but you can see the crucial points. In the page on the left, the compositor has got more text than he has room for: he compresses white space, wraps lines up instead of down, puts stage directions on the same line as text, and presumably modifies spellings to make more words fit. In the right-hand page, the opposite problem occurs. To stretch the text to fill the page, he swaddles the scene break in white space, gives each stage direction its own line, adds unnecessary interword space at the bottom of the first column, to force a line wrap. Our views on Shakespeare's spelling, and our views as to the relative value of First Folio variants, necessarily are affected by these facts. We must, therefore, be able to record symptoms of crowding or space-filling, as well as (obviously) the page breaks of the Folio, in order to work critically with the Folio text. Also required, of course, are facilities for textual variants and parallel texts. <p>[Figure 5: Ministry of Voices] Some texts vigorously exploit the fact that they will be realized on paper. You're all familiar with Herbert's <q>Easter Wings,</q> which fits in a tradition reaching back to Theocritus and forward to this century. The acrostics (not to mention the images on the page) of these figured texts need to be taken account of in any fully adequate encoding scheme. The page shown here is a modern novel's verbal and graphical representation of sexual intercourse. (Parenthetically, I'll mention that capturing such figured texts poses a thorny problem for markup, not as yet satisfactorily solved.) <p>Also, since we are so used to texts on paper, it seems reasonable to ask that any markup scheme allow us to encode enough information that we can reproduce the text on paper in a recognizable and acceptable form. 
What is acceptable depends of course on what we want to do with it. <p>[Figure 6: Abraham and Isaac] This can cause problems if we are not ready to handle the graphic images often mixed with our texts. We need some way of noting the interleaving of text and graphics. Fortunately, intercalation of graphics is conceptually fairly simple. <h2>6. Texts are both linear and hierarchical. Texts are (at one level) linear. Their internal segmentation gives them a hierarchical structure as well. <p>The linear stream of characters is the least problematic for machine representation. [Figure 7: Milton, <q>Nativity</q>] The linear text gains structure from its segmentation. We need, therefore, markup to show us boundaries between paragraphs or chapters or sections or poems. This is not always unproblematic, as this edition of Milton shows: from the layout, one might guess that <q>I. The Hymn</q> is a new poem -- most editors make it part of the poem on Christ's Nativity. <p>Canonical referencing schemes are based on the idea of a canonical segmentation of various versions of a text. They aren't always successful, as any Protestant knows who has tried to find a Psalm in the Vulgate. And yet, a uniform system of reference is essential for successful cross-reference within a text or between texts. <p>The hierarchical structure of texts is obvious: chapters are subdivided into sections, subsections, paragraphs and sentences. Plays are divided into acts, scenes, and speeches. Verse is divided into (stanzas,) lines, half-lines, feet, partial-feet. Note the neat hierarchy of each of these. Note, too, that these hierarchies can co-exist and cross each other's boundaries. <p>There are indefinite numbers of specialized hierarchies in our texts. We cannot hope to define all of them in advance. But we must define in advance mechanisms to allow specification of new hierarchies. <p>The visible formal structures of texts are tied to their genre.
Markup must therefore be sensitive to genre as an organizing principle of texts. We must take care not to define genres too narrowly: genres also mix and we may find structural elements in unexpected places or unexpected forms as in this epistolary preface to Milton's poetry. [Figure 8: Milton preface] <h2>7. Textual cross-references form a structure. Explicit and implicit cross-reference in texts gives them a network structure. <p>When texts refer to themselves and to each other, they create a more complex network of links among texts. [Figure 9: Comenius] Images and text refer to each other. The term for this non-linear characteristic of texts is hypertext. As you see here, it's a concept with a respectable antiquity. (Note too the parallel text in English and Latin. Another use for text synchronization.) Markup can realize such structures with link or reference tags. <h2>8. Texts refer to objects in a real or fictive universe. <p>We need to be able to mark the objects referred to in texts, e.g. place names and personal names. This may be required for stylistic study (to distinguish Mr. Brown from the color brown) or historical study (to see who knew the Paston family) or for subject indexing. <h2>9. Texts are cultural and therefore historical objects. <p>[Figure 10: Moby Dick] As historical objects texts evolve. We may not wholly understand them. In this passage from <cit>Moby Dick</cit>, for example, we know the displayed text represents the sailors' memorial tablets, that all caps signals a name, italics a date -- but small caps don't seem to have a clear interpretation. So we must have recourse to a simple-minded description of their appearance, when that is all we know. Texts appear in several manuscripts or editions; they are revised; they share sources with other texts. So we must be able to account for the historical evolution of texts: textual variation, parallel texts, different recensions of the same subject matter.
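How such variation might be recorded is suggested by this sketch -- the tag names again invented purely for illustration -- in which the readings of several witnesses are embedded at the point of variation, so that the text of any single witness can be recovered mechanically from the one encoded text:

```sgml
the king
<app>
<rdg wit='MS A'>stood</rdg>
<rdg wit='MS B'>stode</rdg>
<rdg wit='MS C'>stod</rdg>
</app>
and spoke
```

The shared bulk of the text appears once; only the points of divergence branch. This is precisely the distinctly non-linear structure I mentioned: neither a single naked text nor a typographic page with apparatus at the foot.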
This page of Comenius shows a simple parallel bilingual text. Those of you who have read polyglot Bibles know that the problem can be more complex. Those who have seen the publications list of the EEC know that it exists today. <p>[Figure 11: Rhine-Delta] Textual variants differ from translated versions in that the various manuscripts of a work may share the vast bulk of their readings. If we try to realize this sharing structurally, we may end with a distinctly non-linear structure as unlike the simple naked single text as it is unlike the usual typographic page of text with critical apparatus. <p>Perhaps the most striking fact of textual historicity is the accretion of annotation on canonical texts. The reception of a text may become more important, for us, than the text itself. [Figure 12: Glossa ordinaria] When this idea is radically pursued, as in this page of the glossa ordinaria or in the Talmud, the result is a very powerful image of the text as a cultural object. <p>As we struggle to adapt the technology of the computer screen to the representation of the textual universe, it is pleasant to observe this page, with its evidence that we are not the first to face these issues. In this page, I think, we see the result of an ultimately successful struggle to find an expression using earlier technology, that of the printed page, for the same universe of textual phenomena I have been discussing here. Let us hope that we will succeed so elegantly. </body> </gdoc>