Trip Report:
American Society for Information Science
C. M. Sperberg-McQueen
25 October 1993
The annual ASIS meeting runs from last Saturday (23 October, with a
pre-conference workshop on "Crossing the Internet Threshold") through Thursday
of this week, in Columbus, Ohio, but I attended only one day (25 October), in
order to get back to work on TEI P2 (so this report will be brief).
Attempting to be a good soldier, I attended a continental
breakfast for newcomers to the ASIS conference on Monday morning; arriving a
little late, I found ample fruit and croissants, but no coffee. (I am beginning
to think I have somehow incurred the wrath of the world's coffee gods, and am
fated, until just retribution is exacted, to see the coffee run out just before
I reach the head of the line. If anyone knows what I may have done, please let
me know.) The breakfast was marked with an earnest friendliness on the part of
the organizers which reminded me a little uneasily of Rotary Club meetings, but
another attendee, who is working on the 'packaging and distribution' of
environmental and econometric data for a consortium of (I gathered)
quasi-governmental organizations, did tell me that she had heard of the TEI
Header and thought it might have some relevance to her work. This almost made up
for the coffee. I'm not sure I saw her, however, at the TEI session.
A professor at the University of North Carolina Graduate
School in Library and Information Science, Stephanie Haas, had organized a
session on SGML and the TEI, which took place from 8:30 to 10:00. I gave a rapid
introduction to SGML and its notation; Susan Hockey gave an introduction to the
TEI, its goals, and its organization; and I outlined the contents of TEI P2 and
showed a couple of simple examples. Questions from the audience of between 80
and 120 people included:
- You mentioned the possibility of exchanging TEI Headers between sites as a
means of providing holdings information and a substitute for a catalog. What
relationship do you foresee between such headers and MARC records in the local
catalog? (A minimal sketch of such a header follows this list of questions.)
- I am concerned that the methods of analysis and annotation seem so
oriented toward things like morphological analysis. In my experience on Esprit
and other projects, I have come to believe that semantic annotation is the
crucial task; can your tags for annotation handle things like case-grammar
analysis?
- Who is going to do all this tagging?
- The problem with SGML is that until DSSSL is completed, you cannot
describe the physical appearance of the page. If you lose the format
information, however, you lose the archival resilience of the material; how do
you address that problem?
- (from someone whose name tag said he was from the International Atomic
Energy Commission) As a database provider whose data are now in MARC format,
who is looking toward the future, I am interested in considering SGML. But,
although you did not mention it, CCF is also a strong candidate as a release
format for our data. What are the relative strengths of using a specialized
bibliographic format like CCF, compared with a general-purpose format like
SGML?
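(For readers who have not met the TEI header invoked in the first question: a
minimal instance looks roughly like the sketch below. The element names follow
the current drafts, but the content is invented for illustration.)

    <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Imaginary Text: an electronic transcription</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished; for illustration only.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Transcribed from a wholly imaginary printed edition.</p>
      </sourceDesc>
    </fileDesc>
    </teiHeader>

A fuller header also records encoding practices and revision history; the open
question, as the questioner saw, is how such headers should coexist with MARC
records in a local catalog.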
Both SH and I were impressed by the quality of the questions,
and by how many of them were new to us.
After the session and quiet discussion with numerous auditors,
SH and I had coffee with Annalies Hoogcarspel of CETH, Elaine Brennan of the
Brown Women Writers Project, Daniel Pitti of Berkeley (who is running a Finding
Aids project and wanted to know whether CONCUR would solve the problem of
encoding finding aids; we decided CONCUR could handle finding aids, but would
not really be necessary), and three people from the Getty Art History
Information Program (Deborah Wilde, Jennifer Trant -- an external consultant
working for AHIP -- and a fellow whose name I lost).
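(For those who have not seen CONCUR in the flesh: with CONCUR enabled in the
SGML declaration, each tag may be qualified with the name of the document type
it belongs to, so that two hierarchies -- say the archival arrangement and the
physical register -- can be marked up over the same text. A purely illustrative
fragment, with invented names:)

    <(FINDAID)file><(REG)page n=1>
    Correspondence, 1920-1925
      <(FINDAID)item>Letter, Jones to Smith, 4 May 1921</(FINDAID)item>
    </(REG)page><(REG)page n=2>
      <(FINDAID)item>Letter, Smith to Jones, 9 May 1921</(FINDAID)item>
    </(REG)page></(FINDAID)file>

Since the only thing the second hierarchy buys us here is the page boundary, an
empty page-break element within a single hierarchy records the same
information, which is roughly why we felt that CONCUR, though workable, was not
really necessary.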
After clearing up the problem of CONCUR for Daniel Pitti, the
discussion moved to how SGML could be applied to the problems of cataloguing or
describing art works and their related materials. JT, in particular, had been
struck by the possibility that an art work -- e.g. a print, or a concrete poem
-- could be encoded directly as a text in TEI form, in which case the TEI header
would need to contain a description of the work. She was concerned, however,
about the problem of establishing and documenting relationships between the work
itself, preparatory materials (e.g. a script for a happening), artifactual
traces of the work (e.g. objects used during the happening), textual or other
surrogates for the work (e.g. a description of the happening written during its
performance, or a videotape of a performance piece), and curatorial descriptions
of a work or of any of its associated materials. I encouraged them to consider
the model of HyTime, with the hub document providing links among an arbitrary
collection of other objects and their surrogates. [They should have no trouble
with this, since David Durand is working on DTD development for the project.]
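(A sketch of the kind of hub document I had in mind, with invented element
names: the fixed HyTime attribute ties the linking element to the
independent-link architectural form, and in a real application the link ends
would more likely be location addresses pointing out to the objects themselves
rather than IDs local to the hub:)

    <!ELEMENT hub     - - (object+, clink*)>
    <!ELEMENT object  - - (#PCDATA)>
    <!ELEMENT clink   - O EMPTY>
    <!ATTLIST clink   HyTime    NAME    #FIXED ilink
                      linkends  IDREFS  #REQUIRED>

    <hub>
    <object id=work>The work itself: the happening</object>
    <object id=script>Preparatory script for the happening</object>
    <object id=props>Objects used during the happening</object>
    <object id=video>Videotape made during one performance</object>
    <object id=curdesc>Curatorial description of the work</object>
    <clink linkends="work script props video curdesc">
    </hub>

The hub itself stays small; the objects and their surrogates live wherever they
live, and a new surrogate or description is accommodated simply by adding
another object and another link.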
Since curatorial description, at least, is often a well structured activity, the
Getty people were also interested in defining a canonical form for such
descriptions; since the form varies, however, with the nature of the art work,
they were also interested in techniques like the pizza model or like the
parallel options for bibliographic citations and for dictionary entries, which
allow an encoding scheme to capture the regularity of the majority of instances
of a form, while still accommodating the outliers and eccentrics. On the whole,
the Getty people were clearly much more open to discussion with others and to
learning from other projects than their reputation had led me to expect; they
remarked themselves that the atmosphere at the institution had changed and it
was more open to the outside than before.
After lunch, we (SH and I) attended the meeting of the National
Information Standards Organization (NISO), to hear Eric van Herwijnen speak on
"Standards as Tools for Integrating Information and Technology: the Impact and
Implications of SGML." He began with a brief demo of the DynaText version of his
book Practical SGML, and then spoke about electronic delivery of information as
it is currently practiced (at CERN, for example, *all* publications are now
prepared in electronic form, 95% of them in TeX, and most authors desire to
exert full control over details of page layout), and how it might develop now
that storage is cheap (disk space now costs less than $1 per Mb, if purchased in
1 Gb quantities) and "network speed is no longer a problem." He described how
the development of preprint distribution bulletin boards at Los Alamos and SLAC
-- which now contain 60% of the new articles in physics, at the rate of 6000 to
7000 articles per year -- has democratized access to preprints.
Internet discovery tools like WAIS, Gopher, and WWW (or V-6 as he ironically
dubbed it) provide transparent access to documents across the network without
requiring the user to know their actual addresses; this allows documents to be
left at their point of origin instead of being mirrored across the network,
which causes synchronization problems. In the future, he said, we will all have
the entirety of the world's documents on our desktop. This will lead to a
further worsening of the information glut already induced by publish-or-perish
promotion rules and the geometric growth of publication in modern times (the
number of articles published in humanistic and scientific journals in 1992, he
said, topped a million). Information glut, in turn, leads to the need for
intelligent information retrieval. Within a discipline, however, information
retrieval needs only very minimal intelligence: 90% of searches in the physics
databases are done not with the elaborate subject indices and relevance
measures on which we spend so much work, but with authors' names! Those working in
a field know who else is working there, and know which of their colleagues
produce work worth reading. For this reason, he predicted that in the long run
peer-reviewed journals would decline, since the young Turks who actually do the
work of physics do not need the sanction of their established elders (who do the
work of reviewing rather than the work of physics) in order to recognize good
work and bad and to reward it accordingly. For work within a discipline, that
is, the current
system of preprint databases with primitive information retrieval techniques
might well suffice. It is for work between fields and on the borders of
established areas that really intelligent systems are required; by 'really
intelligent' systems, he explained, he meant systems which can answer questions
without being asked; which know the environment of the problem area and can
recognize important information in adjacent fields; which know what the user
really needs; which answer the questions the users *should* have asked; which
open a dialog with the user to get the information they need (instead of
spacing out for an indefinite period to look for the answer on the net, without
giving the user the ability to interrupt them and bring them back); which have
access to more information than the user even knows exists; which can affect
the result of a query instead of simply deriving it; and which can see the
logical connections between groups of disparate facts and can thus have an
opinion of their own. The pre-requisite for building such intelligent systems
is to be able to
describe the semantics of documents, and to perform rhetorical analysis to
enable the system to determine what is actually useful information. Standards
are critical for this development: standards for the interchange of information
among documents and databases, for the expression of links between documents and
databases (HyTime), and for the encoding of documents (TEI, ISO 12083). In
conclusion, he argued, structured documents are important, important enough to
force us to rethink the way we structure documents and to argue in favor of
teaching SGML to school children. Information retrieval needs to focus on
interdisciplinary research. And, owing to market forces, whatever happens can be
reliably expected to be both easy and fun.
The final session of the afternoon was organized by Annalies
Hoogcarspel on the topic "Electronic Texts in the Humanities: Markup and Access
Techniques." Since Elli Mylonas had been unable to come to Columbus, AH had
asked me to substitute for her and speak for a few minutes on SGML markup for
hypertext, using the Comenius example familiar from Georgetown (and elsewhere).
Following my remarks, Elaine Brennan gave an overview of the Women Writers
Project: the initial naivete of the project's plan (the original budget did not
even foresee any need for a programmer), its discovery of SGML, and the problems
posed by the paucity of SGML software directly usable in the project for
document management, for printing, or for delivery to end users. Michael Neuman
of Georgetown described the Peirce Telecommunity Project and the problems of
dealing with the complex posthumous manuscripts of C. S. Peirce. He defined four
levels of tagging for the project: a basic level, capturing the major structures
and the pagination; a literal level, capturing interlinear and marginal
material, insertions, and deletions; an editorial level including emendations and
annotations; and an interpretive level including analyses and commentary. He
showed sample encodings, including an attempt at an SGML encoding (mercifully,
in type too small for most of the audience to read in detail). Conceptually, I
was pleased to note, there was nothing in the running transcription of the text
which is not handled by the current draft of chapter PH, though the problem of
documenting Peirce's eccentric compositional techniques remains beyond the scope
of P2.
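(To give the flavor of the literal level, and of what chapter PH already
provides: a made-up sentence -- not Peirce's -- with a cancellation, an
interlinear insertion, and an illegible word might be transcribed roughly
thus:)

    <p>The <del rend=overstrike>conception</del>
    <add place=supralinear>meaning</add> of a
    <unclear reason=faded>sign</unclear> lies in its
    conceivable practical bearings.</p>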
Susan Hockey acted as respondent and did an excellent job of pulling the
session together with a list of salient points, which my notes show thus:
- SGML is clearly the key to making texts accessible in useful ways
- The texts studied by humanists can be extremely complex, can take almost
any form, and can deal with almost any topic
- The structure of these texts is extremely various and complex; overlapping
hierarchies are prominent in many examples, as are texts in multiple versions
- All markup implies some interpretation; it is essential, therefore, to be
able to document what has been marked and why
- The TEI provides useful methods of saying who marked what, where, and why
- The reuse of data (and hence SGML) is important for the longevity of our
electronic texts
- Ancillary information (of the sort recorded in the TEI header) is critical
- Encoding can be performed at various levels, in an incremental process: it
need not all be done at once.
- We need software to help scholars do what they need to do with these
texts: the development of this software must similarly be an incremental,
iterative process
- All of this work is directly relevant to the development of digital
libraries: the provision of images is good, but the fact is that transcripts,
with markup, must also be provided in order to make texts really useful for
scholarly work.
In the question session, Mike Lesk asked Elaine Brennan about
the relative cost of SGML markup vis-à-vis basic keystroking without SGML. She
replied that no clear distinction could be made, for the WWP, since markup is
inserted at the time of data capture, as well as at proofreading time and later.
Someone asked her whether the WWP had ever thought of extending their terminus
ad quem in order to include more modern material, like Sylvia Plath, and what
copyright issues might be involved. EB replied that the WWP had a fairly full
plate already, with the four to five thousand texts written by women from 1330
to 1830, without asking for more. SH noted that copyright issues had been a
thorn in the side of all work with electronic texts for thirty years or more,
and that they needed to be tackled if we were ever to get beyond the frequent
practice of using any edition, no matter how bad, as long as it is out of
copyright, instead of being able to use the best edition, even if it is in
copyright. Michael Neuman was asked how, if the Peirce Project actually invited
the help of the community at large in solving some of the puzzles of Peirce's
work, it could avoid massive quality control and post-editing problems. MN
hedged his answer manfully, conceding that quality control would be a serious
issue and hoping at the same time that public participation in the markup of
Peirce would be a useful, productive activity, and managing to elude completely
the tricky question of how those two principles could be reconciled.
This was the last session of the day, and for me the last of
the conference; SGML is somewhat less prominent in the program after the first
day, but I think we should regard its visibility on Day 1 as a good sign and a
development to be encouraged.
C. M. Sperberg-McQueen
Chicago
26 Oct 1993